Lecture 28: Outro#

STATS 60 / STATS 160 / PSYCH 10

Concepts and Learning Goals:

Overview of what you learned this quarter!
tl;dr: three themes
What’s next if you’re stats-curious?

Remembering our journey#

Unit 1: Thinking about scale#

In statistics and data science, we are trying to use numbers to

Describe our observations
Quantify how confident we are

Numbers are only meaningful in context.

Is $\$10$ billion a lot of money?

Forbes' real-time billionaires' list

GDP heat map, in billions

Unit 1 was about building tools for contextualizing numbers and thinking critically about scale:

Three questions for contextualizing numbers:
1. What type of number is this?
2. What can I compare this number to? Is it large or small compared to other similar values?
3. What would I have expected this number to be?
Ballpark estimates for estimating a number
1. Set up a simple model to compute the quantity approximately by breaking the estimate up into small parts
  - How many visitors go on guided tours at Stanford per year?
    
    (# visitors / year) = (# days/ year) x (# tours/ day) x (# visitors / tour)
2. Approximate parts up to a factor of 10
Cost-benefit analysis: a simple model helps us make difficult decisions
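
As a minimal sketch of the ballpark recipe above, the guided-tour estimate can be written as a product of rough factors. The three input values below are placeholder guesses chosen only for illustration (they are not figures from the lecture), and each is meant to be right only up to a factor of 10.

```python
# Ballpark estimate: break the quantity into simple factors,
# guess each factor roughly, and multiply.
# All three inputs are illustrative assumptions, not real Stanford figures.

def visitors_per_year(days_per_year, tours_per_day, visitors_per_tour):
    """(# visitors / year) = (# days / year) x (# tours / day) x (# visitors / tour)."""
    return days_per_year * tours_per_day * visitors_per_tour


estimate = visitors_per_year(
    days_per_year=300,     # assumed: tours run on most days of the year
    tours_per_day=5,       # assumed
    visitors_per_tour=20,  # assumed
)
print(f"Roughly {estimate:,} visitors per year")
```

Because each factor is only trusted up to a factor of 10, the useful part of the answer is its order of magnitude (tens of thousands here), not the exact digits.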

Unit 2: Probability#

A mathematical framework for modeling uncertain scenarios.

Is an observed pattern meaningful, or just random noise?

The probability we learned was essential for:

Hypothesis testing
Confidence intervals for estimation
Understanding selection bias

Topics covered:

Probabilistic experiments
- Formalism: sample spaces, outcomes, events, probabilities
- Modeling almost everything with coinflips, dice, and bags of marbles
  
  Drawing a marble from the bag.
Coincidences?
- Even if an event is rare, it is likely to happen when you repeat an experiment many times
- Example: the birthday paradox
- Example: winning streak
- Example from unit 4: multiple testing!
Computing probabilities
- Coinflips and binomial coefficients
  - Later used in computing $p$-values
- Law of the complement
Conditional probability
- “Zooming in”
  
  Zooming in on $B$. Image credit to Blitzstein and Hwang, chapter 2.
- Bayes’ rule
- Common mistakes in conditional probability:
  - Base rate fallacy: the conditional probability is not informative by itself (male-dominated sports)
  - $\Pr[A \mid B] \neq \Pr[B \mid A]$ (distracted driving, gateway drugs)
  - Failing to condition on important information (OJ Simpson)
  - Generalizing from a biased sample or failing to realize you have conditioned (hot guys are jerks, selection bias)
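
Two of the computations listed above fit in a few lines of code. The sketch below assumes 365 equally likely birthdays for the birthday paradox, and the test accuracy and base rate passed to the Bayes' rule function are made-up illustrative values. The function names are hypothetical, not from the course materials.

```python
from math import prod


def birthday_match_probability(n_people, n_days=365):
    """P(at least two of n_people share a birthday), assuming equally likely days."""
    if n_people > n_days:
        return 1.0
    # Law of the complement: 1 - P(all birthdays are distinct).
    p_all_distinct = prod((n_days - i) / n_days for i in range(n_people))
    return 1.0 - p_all_distinct


def bayes_posterior(prior, p_pos_given_a, p_pos_given_not_a):
    """Bayes' rule: Pr[A | positive] from Pr[positive | A] and the base rate Pr[A]."""
    p_pos = p_pos_given_a * prior + p_pos_given_not_a * (1.0 - prior)
    return p_pos_given_a * prior / p_pos


# With 23 people, a shared birthday is already more likely than not.
print(birthday_match_probability(23))

# Illustrative numbers: 1% base rate, 90% true positive rate, 9% false positive rate.
# Pr[A | positive] comes out near 9%, far from Pr[positive | A] = 90%.
print(bayes_posterior(prior=0.01, p_pos_given_a=0.90, p_pos_given_not_a=0.09))
```

The second call shows why $\Pr[A \mid B] \neq \Pr[B \mid A]$ matters in practice: when the base rate is low, most positives are false positives even for a fairly accurate test.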

Unit 3: Exploratory Data Analysis#

The entire dataset is TMI. How can we extract useful information?

Flight data

Topics covered:

Data visualization
- Common graphic representations: pie chart, bar chart, time series, histogram, scatterplot
- Best practices
- Misleading and uninformative charts
Summaries of center: what is the one number that best summarizes the data?
- mean, median
Variability: how similar are the different datapoints in the dataset?
- Would you rather be given $\$150$ or flip a coin for $\$300$?
- Variance and standard deviation
- Quantiles and gaps between quantiles
Correlation and correlation coefficient
The usefulness of a summary statistic depends on the data!
- outliers
- skew
- multi-modal data
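
The summaries above can be tried out with Python's standard `statistics` module. This is a small sketch: the coin-flip question is from the list above, while the dataset with an outlier is made up for illustration.

```python
from statistics import mean, median, pstdev

# Would you rather be given $150, or flip a fair coin for $300?
sure_thing = [150, 150]  # two equally likely outcomes, both $150
coin_flip = [0, 300]     # two equally likely outcomes: lose or win

# Same center (mean 150), very different variability (std dev 0 vs 150).
for name, outcomes in [("sure $150", sure_thing), ("coin flip", coin_flip)]:
    print(name, "mean:", mean(outcomes), "std dev:", pstdev(outcomes))

# Made-up data with one outlier: the mean is pulled up to 22,
# while the median stays at 3.
data = [1, 2, 3, 4, 100]
print("mean:", mean(data), "median:", median(data))
```

The two options have the same mean, so only a measure of variability distinguishes them. The outlier example is the reverse situation: the two summaries of center disagree, and which one is more useful depends on the data.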

Unit 4: Correlation and Experiments#

How to analyze data for trends, and how to design experiments.

Effect of sample size on estimates
- Sample vs. population
- Sample size matters for estimation!
  - The standard deviation of the sample mean is $\frac{\sigma}{\sqrt{n}}$
  - To get $10$ times more accurate, you need $100$ times more samples.
Normal Approximation for the sample mean
- Confidence intervals
- 68-95-99.7 rule
Selection bias dramatically affects estimates!
- Gettysburg address experiment
- Common sources of selection bias
Hypothesis testing: is my observation a real trend, or just noise?
- Null hypothesis: a probability model for “just noise”
- $p$-value: the chance we observed our outcome, or something more extreme, under the null hypothesis
- Significance level, false positive rate, and multiple testing
- Using simulation to compute $p$-values:
  - Testing for correlation
  - Potential outcomes model
Randomized controlled experiments
- Drawbacks of observational studies
- Correlation vs. causation and confounding/hidden variables
- Effect of selection bias
- Potential outcomes model
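
Three of the ideas above can be checked numerically. The sketch below computes the standard deviation of the sample mean, the Normal coverage probabilities behind the 68-95-99 rule (more precisely 68-95-99.7), and a simulated $p$-value. The $p$-value uses a label-shuffling (permutation) test, which is one standard simulation approach and not necessarily the exact procedure used in lecture. The function names and the two small datasets are made up for illustration.

```python
import random
from math import erf, sqrt


def standard_error(sigma, n):
    """Standard deviation of the mean of n independent samples: sigma / sqrt(n)."""
    return sigma / sqrt(n)


def normal_coverage(k):
    """P(|Z| <= k) for a standard Normal Z (within k standard deviations)."""
    return erf(k / sqrt(2))


def permutation_p_value(treated, control, n_shuffles=10_000, seed=0):
    """Simulated one-sided p-value for the difference in group means.

    Null hypothesis ("just noise"): the group labels do not matter, so we
    shuffle the labels and count how often the shuffled difference is at
    least as large as the observed one.
    """
    rng = random.Random(seed)
    n_treated = len(treated)
    observed = sum(treated) / n_treated - sum(control) / len(control)
    pooled = list(treated) + list(control)
    n_control = len(pooled) - n_treated
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diff = sum(pooled[:n_treated]) / n_treated - sum(pooled[n_treated:]) / n_control
        if diff >= observed - 1e-12:  # small tolerance for floating-point error
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles


# 100 times more samples gives a 10 times smaller standard error.
print(standard_error(10, 100), standard_error(10, 10_000))

# Coverage within 1, 2, and 3 standard deviations.
for k in (1, 2, 3):
    print(k, round(normal_coverage(k), 4))

# Made-up measurements for two groups.
treated = [12, 15, 14, 16, 13, 17]
control = [10, 11, 9, 12, 10, 11]
print(permutation_p_value(treated, control))
```

The simulated $p$-value is itself an estimate: more shuffles make it more accurate, for the same $\frac{\sigma}{\sqrt{n}}$ reason as any other sample average.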

Unit 5: Machine Learning and Regression#

Statistics is often concerned with making predictions.

On observation $x$, predict outcome $y$.

$x$ is symptoms/test results, $y$ is diagnosis
$x$ is SAT score, $y$ is first-year GPA
$x$ is weather now, $y$ is weather later

We construct a simple model $f$ so that $f(x) = \hat{y}$, with the goal that $\hat{y}$ is as close to $y$ as possible.

Building models
- It is easier to learn from examples than build a model by hand
- “training” data
- How to evaluate models: “training” vs. “testing”
Types of prediction problems:
- regression
- classification
- text generation (next word prediction)
Examples of models
- Linear and quadratic regression
- $k$-nearest-neighbors
- Markov text generators
Training data is everything!
- Selection bias in training data leads to biased models
- If $x$ is far from all training examples, $f(x)$ is probably not that accurate
- More (good) data and better coverage improve performance
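
A minimal sketch of the train/test workflow with two of the models listed above, linear regression and $k$-nearest-neighbors. The data are synthetic, generated from $y = 2x + 1$ purely for illustration, and the helper names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares line: returns (slope, intercept) minimizing squared error."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def knn_predict(x, train_xs, train_ys, k=2):
    """k-nearest-neighbors regression: average y over the k closest training x's."""
    nearest = sorted(zip(train_xs, train_ys), key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mean_squared_error(predict, xs, ys):
    """Average squared gap between predictions and true outcomes."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data from y = 2x + 1 (made up for illustration).
train_xs = [0, 1, 2, 3, 4, 5]
train_ys = [2 * x + 1 for x in train_xs]
test_xs = [0.5, 2.5, 4.5]  # held-out points, not used for fitting
test_ys = [2 * x + 1 for x in test_xs]

slope, intercept = fit_line(train_xs, train_ys)


def line(x):
    return slope * x + intercept


def knn(x):
    return knn_predict(x, train_xs, train_ys, k=2)


print("test MSE, line:", mean_squared_error(line, test_xs, test_ys))
print("test MSE, 2-NN:", mean_squared_error(knn, test_xs, test_ys))

# Far from all training examples the two models disagree sharply:
# 2-NN can only average the nearest training outcomes it has seen.
print("x = 100:", "line", line(100), "2-NN", knn(100), "truth", 2 * 100 + 1)
```

Both models do well on test points that lie among the training examples. At $x = 100$ the nearest-neighbor prediction is far from the truth because no training example is nearby, while the line extrapolates well only because the synthetic data really are linear.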

tl;dr: three themes#

The three major ideas that I want you to take away from this class.

Theme 1: Insight from simple models#

The world is complicated.

Answering a question exactly is overwhelming and often impossible.

Strategy: construct a simple model of the situation.

At least within the simple model, we have the power to answer questions precisely and often quantitatively.

Ballpark estimates and cost-benefit analysis.
Hypothesis testing.
Machine learning and prediction.
Decision making in sports.

With great power comes great responsibility.

Know the strengths and limitations of your model.

Theme 2: Conditioning matters#

We might understand an uncertain situation well, but everything can change if we condition!

Common mistakes in conditional probability
- False positives for medical tests, distracted driving, OJ Simpson, male-dominated sports
Selection bias
- Hot guys are jerks
- Biased estimates
- Biased ML predictions from biased training data
Multimodal data affects interpretation of summary statistics
- Male vs. female penguin body mass
- Does generic medical advice apply to you?
  - Salt, hypertension, men vs. women

Theme 3: Critical thinking is essential#

Once you specify the model, statistics can give precise answers.

Is our model good? Does it fit the situation?

Think critically! Don’t calculate blindly.

“When means mislead”
- Usefulness of fundamental summary statistics (mean, median, standard deviation) depends on data (outliers, skew)
Correlation vs. causation
- Confounding variables
- Experimental design
Misleading graphs and figures
Multiple testing and $p$-hacking
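
To see why multiple testing calls for care, here is a short sketch of how quickly false positives accumulate. It assumes every null hypothesis is true and that the tests are independent, which is a simplification; the function name is hypothetical.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """P(at least one false positive) across n_tests independent tests of true nulls."""
    # Law of the complement: 1 - P(no test is a false positive).
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 20, 100):
    print(n_tests, round(chance_of_false_positive(n_tests), 3))
```

At significance level $0.05$, running 20 such tests gives roughly a 64% chance of at least one "significant" result from pure noise. That is the mechanism behind $p$-hacking, and the same principle as a rare event becoming likely when an experiment is repeated many times.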

We did a lot of “what does this mean in plain English?” exercises.

Thinking like this is important—I do this in my research and in my daily life constantly.

Even though a concept is formal and/or technical, we can and should try to really understand it.

Feedback#

I want to make STATS60 great!

Please take a couple of minutes to give some feedback on the course this quarter.

I’m stats-curious. What’s next?#

If you like exploratory data analysis#

STATS 32: Introduction to R
- Learn the basics of programming in R
- Example/application-focused class
DATASCI 112: Principles of Data Science
- Deeper dive into data visualization and data analysis
- More machine learning: how to train and evaluate ML models
- Practice programming basics

If you like experiments and hypothesis testing#

STATS 191: Introduction to Applied Statistics
- Deeper dive into methods for data analysis and prediction
- Applications to biology and social sciences

After taking probability theory:

STATS 200: Introduction to Theoretical Statistics
- Hypothesis testing
- Estimation and confidence intervals
- Bayesian methods
- Some theory of machine learning

If you like probability#

STATS 117: Introduction to probability theory
- Dive into probability theory
- Simple discrete models (coinflips, bags of marbles)
- Continuous models (Normal)
STATS 118: Probability theory for statistical inference
- Deeper dive into probability theory
- Theory behind the Normal approximation
- Math behind some popular hypothesis tests

If you like machine learning#

CS 106EA: Exploring artificial intelligence
- Training and evaluating ML models
- How do neural networks work?
- Challenges in ML: overfitting, bias, distribution shift

After taking MATH 51 and CS 106:

CS 129: Applied Machine Learning
- More ML models:
  - logistic regression
  - support vector machines
  - deep learning
- “Unsupervised learning”: clustering and feature discovery

Thanks for a great quarter!#