Lecture 28: Outro#
STATS 60 / STATS 160 / PSYCH 10
Concepts and Learning Goals:
Overview of what you learned this quarter!
tl;dr: three themes
What’s next if you’re stats-curious?
Remembering our journey#
Unit 1: Thinking about scale#
In statistics and data science, we are trying to use numbers to
Describe our observations
Quantify how confident we are
Numbers are only meaningful in context.
Is $10 billion a lot of money?
Forbes' real-time billionaires' list
GDP heat map, in billions
Unit 1 was about building tools for contextualizing numbers and thinking critically about scale:
Three questions for contextualizing numbers:
What type of number is this?
What can I compare this number to? Is it large or small compared to other similar values?
What would I have expected this number to be?
Ballpark estimates for estimating a number
Set up a simple model to compute the quantity approximately by breaking the estimate up into small parts
How many visitors go on guided tours at Stanford per year?
(# visitors / year) = (# days / year) × (# tours / day) × (# visitors / tour)
Approximate parts up to a factor of 10
Cost-benefit analysis: a simple model helps us make difficult decisions
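The Stanford tour estimate from this unit can be sketched in a few lines of Python. The three input values below are illustrative guesses (assumptions, not numbers from the lecture); each only needs to be right up to a factor of 10.

```python
# Ballpark estimate: guided-tour visitors at Stanford per year.
# All three inputs are hypothetical guesses, not figures from the lecture.
days_per_year = 300      # assumed: tours run on most days of the year
tours_per_day = 10       # assumed
visitors_per_tour = 20   # assumed


def visitors_per_year(days, tours, visitors):
    """Multiply the small parts together to get the approximate total."""
    return days * tours * visitors


estimate = visitors_per_year(days_per_year, tours_per_day, visitors_per_tour)
print(f"Roughly {estimate:,} visitors per year")
```

Changing any one guess by a factor of 10 changes the answer by a factor of 10, which is why the goal is the right order of magnitude rather than an exact count.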
Unit 2: Probability#
A mathematical framework for modeling uncertain scenarios.
Is an observed pattern meaningful, or just random noise?
The probability we learned was essential for:
Hypothesis testing
Confidence intervals for estimation
Understanding selection bias
Topics covered:
Probabilistic experiments
Formalism: sample spaces, outcomes, events, probabilities
Modeling almost everything with coinflips, dice, and bags of marbles
Drawing a marble from the bag.
Coincidences?
Even if an event is rare, it is likely to happen when you repeat an experiment many times
Example: the birthday paradox
Example: winning streak
Example from unit 4: multiple testing!
Computing probabilities
Coinflips and binomial coefficients
Later used in computing \(p\)-values
Law of the complement
Conditional probability
“Zooming in”
Zooming in on \(B\). Image credit to Blitzstein and Hwang, chapter 2.
Bayes’ rule
Common mistakes in conditional probability:
Base rate fallacy: the conditional probability is not informative by itself (male-dominated sports)
\(\Pr[A \mid B] \neq \Pr[B \mid A]\) (distracted driving, gateway drugs)
Failing to condition on important information (OJ Simpson)
Generalizing from a biased sample or failing to realize you have conditioned (hot guys are jerks, selection bias)
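Going back to the coincidences topic in this unit, the birthday paradox can be checked with the law of the complement. This is a minimal sketch, assuming birthdays are independent and equally likely on each of 365 days; the function name is just for illustration.

```python
def prob_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday,
    assuming independent birthdays, uniform over `days` days."""
    prob_all_distinct = 1.0
    for i in range(n):
        # Person i must avoid the i birthdays already taken.
        prob_all_distinct *= (days - i) / days
    # Law of the complement: Pr[some match] = 1 - Pr[no match].
    return 1.0 - prob_all_distinct


for n in (10, 23, 50):
    print(n, round(prob_shared_birthday(n), 3))
```

With 23 people the probability is already just over one half, even though any particular pair sharing a birthday is rare.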
Unit 3: Exploratory Data Analysis#
The entire dataset is TMI. How can we extract useful information?
Topics Covered:
Data visualization
Common graphic representations: pie chart, bar chart, time series, histogram, scatterplot
Best practices
Misleading and uninformative charts
Summaries of center: what is the one number that best summarizes the data?
mean, median
Variability: how similar are the different datapoints in the dataset?
Would you rather be given \(\$150\) or flip a coin for \(\$300\)?
Variance and standard deviation
Quantiles and gaps between quantiles
Correlation and correlation coefficient
The usefulness of a summary statistic depends on the data!
outliers
skew
multi-modal data
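A small sketch of how one outlier changes the summaries from this unit. The dataset is made up for illustration, and `pstdev` (the population standard deviation) is one of two common conventions; the course may have used the other.

```python
import statistics

# Hypothetical dataset: nine similar values plus one large outlier.
data = [10, 11, 12, 12, 13, 13, 14, 15, 16, 500]
without_outlier = data[:-1]


def summarize(values):
    """Return the center and spread summaries for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.pstdev(values),
    }


print("with outlier:   ", summarize(data))
print("without outlier:", summarize(without_outlier))
```

The median is the same in both cases, while the mean and standard deviation move a lot. Which summary is more useful depends on the data.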
Unit 4: Correlation and Experiments#
How to analyze data for trends, and how to design experiments.
Effect of sample size on estimates
Sample vs. population
Sample size matters for estimation!
The standard deviation of the sample mean is \(\frac{\sigma}{\sqrt{n}}\)
To get \(10\) times more accurate, you need \(100\) times more samples.
Normal Approximation for the sample mean
Confidence intervals
68-95-99.7 rule
Selection bias dramatically affects estimates!
Gettysburg address experiment
Common sources of selection bias
Hypothesis testing: is my observation a real trend, or just noise?
Null hypothesis: a probability model for “just noise”
\(p\)-value: the chance we observed our outcome, or something more extreme, under the null hypothesis
Significance level, false positive rate, and multiple testing
Using simulation to compute \(p\)-values:
Testing for correlation
Potential outcomes model
Randomized controlled experiments
Drawbacks of observational studies
Correlation vs. causation and confounding/hidden variables
Effect of selection bias
Potential outcomes model
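Returning to the hypothesis testing items in this unit, here is a sketch of a \(p\)-value computed two ways: exactly with binomial coefficients, and approximately by simulating the null hypothesis. The scenario (15 heads in 20 flips, null hypothesis of a fair coin, one-sided "at least this many heads") is a hypothetical example, not one taken from the lecture.

```python
import math
import random


def exact_p_value(n_flips, observed_heads):
    """Chance of at least `observed_heads` heads in `n_flips` fair coin flips."""
    favorable = sum(
        math.comb(n_flips, k) for k in range(observed_heads, n_flips + 1)
    )
    return favorable / 2 ** n_flips


def simulated_p_value(n_flips, observed_heads, n_sims=20_000, seed=0):
    """Estimate the same chance by repeating the experiment under the null."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_sims):
        heads = sum(rng.random() < 0.5 for _ in range(n_flips))
        if heads >= observed_heads:
            extreme += 1
    return extreme / n_sims


print("exact:    ", exact_p_value(20, 15))
print("simulated:", simulated_p_value(20, 15))
```

The two numbers should agree closely. Simulation is handy when the null model is too complicated for an exact formula.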
Unit 5: Machine Learning and Regression#
Statistics is often concerned with making predictions.
On observation \(x\), predict outcome \(y\).
\(x\) is symptoms/test results, \(y\) is diagnosis
\(x\) is SAT score, \(y\) is first-year GPA
\(x\) is weather now, \(y\) is weather later
We construct a simple model \(f\) so that \(f(x) = \hat{y}\), with the goal that \(\hat{y}\) is as close to \(y\) as possible.
Building models
It is easier to learn from examples than to build a model by hand
“training” data
How to evaluate models: “training” vs. “testing”
Types of prediction problems:
regression
classification
text generation (next word prediction)
Examples of models
Linear and quadratic regression
\(k\)-nearest-neighbors
Markov text generators
Training data is everything!
Selection bias in training data leads to biased models
If \(x\) is far from all training examples, \(f(x)\) is probably not that accurate
More (good) data and better coverage improve performance
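To make "training" vs. "testing" concrete, here is a minimal sketch of linear regression fit by least squares on training data and then evaluated on held-out test data. The data points and function names are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y ≈ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared gap between the prediction f(x) and the true y."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: the model only ever sees the training examples.
train_x = [1, 2, 3, 4, 5]
train_y = [2.1, 3.9, 6.2, 7.8, 10.1]
test_x = [6, 7]
test_y = [12.2, 13.8]

slope, intercept = fit_line(train_x, train_y)
print("train error:", mean_squared_error(train_x, train_y, slope, intercept))
print("test error: ", mean_squared_error(test_x, test_y, slope, intercept))
```

Here the test points sit close to the training data, so the test error stays small. For an \(x\) far from every training example there is no such guarantee.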
tl;dr: three themes#
The three major ideas that I want you to take away from this class.
Theme 1: Insight from simple models#
The world is complicated.
Answering a question exactly is overwhelming and often impossible.
Strategy: construct a simple model of the situation.
At least within the simple model, we have the power to answer questions precisely and often quantitatively.
Ballpark estimates and cost-benefit analysis.
Hypothesis testing.
Machine learning and prediction.
Decision making in sports
With great power comes great responsibility.
Know the strengths and limitations of your model.
Theme 2: Conditioning matters#
We might understand an uncertain situation well, but everything can change if we condition!
Common mistakes in conditional probability
False positives for medical tests, distracted driving, OJ Simpson, male-dominated sports
Selection bias
Hot guys are jerks
Biased estimates
Biased ML predictions from biased training data
Multimodal data affects interpretation of summary statistics
Male vs. female penguin body mass
Does generic medical advice apply to you?
Salt, hypertension, men vs. women
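As one concrete instance of the conditional probability mistakes in this theme, here is Bayes' rule applied to a medical test. The prevalence, sensitivity, and false positive rate are hypothetical numbers chosen for illustration.

```python
def posterior_given_positive(prevalence, sensitivity, false_positive_rate):
    """Pr[disease | positive test], computed with Bayes' rule."""
    true_positive = prevalence * sensitivity
    false_positive = (1 - prevalence) * false_positive_rate
    return true_positive / (true_positive + false_positive)


# Hypothetical test: 1% of people have the disease, the test catches 95%
# of true cases, and it wrongly flags 5% of healthy people.
p = posterior_given_positive(
    prevalence=0.01, sensitivity=0.95, false_positive_rate=0.05
)
print(f"Pr[disease | positive] = {p:.3f}")
```

Even though \(\Pr[\text{positive} \mid \text{disease}]\) is 95% in this made-up setup, \(\Pr[\text{disease} \mid \text{positive}]\) comes out under 20% because the base rate is so low.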
Theme 3: Critical thinking is essential#
Once you specify the model, statistics can give precise answers.
Is our model good? Does it fit the situation?
Think critically! Don’t calculate blindly.
“When means mislead”
Usefulness of fundamental summary statistics (mean, median, standard deviation) depends on data (outliers, skew)
Correlation vs. causation
Confounding variables
Experimental design
Misleading graphs and figures
Multiple testing and \(p\)-hacking
We did a lot of “what does this mean in plain English?” exercises.
Thinking like this is important—I do this in my research and in my daily life constantly.
Even though a concept is formal and/or technical, we can and should try to really understand it.
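As a small illustration of the multiple testing item in this theme: if every null hypothesis is true and the tests are independent (both are assumptions of this sketch), the chance of at least one false positive grows quickly with the number of tests.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of `n_tests` independent tests of true null
    hypotheses comes out significant at level `alpha`."""
    # Law of the complement: 1 - Pr[no test is a false positive].
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 20, 100):
    print(n_tests, round(prob_at_least_one_false_positive(n_tests), 3))
```

In plain English: run 20 independent tests on pure noise at the 5% level, and more likely than not at least one of them will look "significant".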
Feedback#
I want to make STATS 60 great!
Please take a couple of minutes to give some feedback on the course this quarter.
I’m stats-curious. What’s next?#
If you like exploratory data analysis#
STATS 32: Introduction to R
Learn the basics of programming in R
Example/application-focused class
DATASCI 112: Principles of Data Science
Deeper dive into data visualization and data analysis
More machine learning: how to train and evaluate ML models
Practice programming basics
If you like experiments and hypothesis testing#
STATS 191: Introduction to Applied Statistics
Deeper dive into methods for data analysis and prediction
Applications to biology and social sciences
After taking probability theory:
STATS 200: Introduction to Theoretical Statistics
Hypothesis testing
Estimation and confidence intervals
Bayesian methods
Some theory of machine learning
If you like probability#
STATS 117: Introduction to probability theory
Dive into probability theory
Simple discrete models (coinflips, bags of marbles)
Continuous models (Normal)
STATS 118: Probability theory for statistical inference
Deeper dive into probability theory
Theory behind the Normal approximation
Math behind some popular hypothesis tests
If you like machine learning#
CS 106EA: Exploring artificial intelligence
Training and evaluating ML models
How do neural networks work?
Challenges in ML: overfitting, bias, distribution shift
After taking MATH 51 and CS 106:
CS 129: Applied Machine Learning
More ML models:
logistic regression
support vector machines
deep learning
“Unsupervised learning”: clustering and feature discovery