Lecture 28: Outro#

STATS 60 / STATS 160 / PSYCH 10

Concepts and Learning Goals:

  • Overview of what you learned this quarter!

  • tl;dr: three themes

  • What’s next if you’re stats-curious?

Remembering our journey#

Unit 1: Thinking about scale#

In statistics and data science, we are trying to use numbers to

  1. Describe our observations

  2. Quantify how confident we are

Numbers are only meaningful in context.

  • Is $ 10 billion a lot of money?

    Forbes' real-time billionaires' list

    GDP heat map, in billions

Unit 1 was about building tools for contextualizing numbers and thinking critically about scale:

  • Three questions for contextualizing numbers:

    1. What type of number is this?

    2. What can I compare this number to? Is it large or small compared to other similar values?

    3. What would I have expected this number to be?

  • Ballpark estimates: approximating a number

    1. Set up a simple model that computes the quantity approximately by breaking the estimate into small parts

      • How many visitors go on guided tours at Stanford per year?

        (# visitors / year) = (# days/ year) x (# tours/ day) x (# visitors / tour)

    2. Approximate each part to within a factor of 10

  • Cost-benefit analysis: a simple model helps us make difficult decisions
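The guided-tour estimate above can be sketched in a few lines. All the inputs below are my own hypothetical guesses, each only claimed to be right to within a factor of 10:

```python
# Ballpark estimate of annual visitors on guided tours at Stanford.
# Every input is a rough, hypothetical guess (factor-of-10 accuracy).
days_per_year = 365
tours_per_day = 3        # guess: a few tours run each day
visitors_per_tour = 30   # guess: a large group per tour

# (# visitors / year) = (# days / year) x (# tours / day) x (# visitors / tour)
visitors_per_year = days_per_year * tours_per_day * visitors_per_tour
print(visitors_per_year)  # 32850, i.e. roughly 30,000
```

The point is not the exact number but the decomposition: each factor is easy to guess, and multiplying the guesses gives an order-of-magnitude answer.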

Unit 2: Probability#

A mathematical framework for modeling uncertain scenarios.

Is an observed pattern meaningful, or just random noise?

The probability we learned was essential for:

  • Hypothesis testing

  • Confidence intervals for estimation

  • Understanding selection bias

Topics covered:

  1. Probabilistic experiments

    • Formalism: sample spaces, outcomes, events, probabilities

    • Modeling almost everything with coinflips, dice, and bags of marbles

      Drawing a marble from the bag.

  2. Coincidences?

    • Even if an event is rare, it is likely to happen when you repeat an experiment many times

    • Example: the birthday paradox

    • Example: winning streak

    • Example from unit 4: multiple testing!

  3. Computing probabilities

    • Coinflips and binomial coefficients

      • Later used in computing \(p\)-values

    • Law of the complement

  4. Conditional probability

    • “Zooming in”

      Zooming in on $B$. Image credit to Blitzstein and Hwang, chapter 2.

    • Bayes’ rule

    • Common mistakes in conditional probability:

      • Base rate fallacy: the conditional probability is not informative by itself (male-dominated sports)

      • \(\Pr[A \mid B] \neq \Pr[B \mid A]\) (distracted driving, gateway drugs)

      • Failing to condition on important information (OJ Simpson)

      • Generalizing from a biased sample or failing to realize you have conditioned (hot guys are jerks, selection bias)
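The birthday paradox ties together two of the ideas above: rare-seeming coincidences become likely with many chances, and the law of the complement makes the calculation easy. A minimal sketch, assuming 365 equally likely birthdays:

```python
import math

def p_shared_birthday(k):
    """Exact probability that at least two of k people share a birthday,
    assuming 365 equally likely birthdays."""
    # Law of the complement: Pr[some match] = 1 - Pr[all k birthdays distinct]
    p_all_distinct = math.prod((365 - i) / 365 for i in range(k))
    return 1 - p_all_distinct

print(round(p_shared_birthday(23), 3))  # 0.507 -- better than a coin flip!
```

With only 23 people, a shared birthday is already more likely than not, even though any *particular* pair matching is rare (probability 1/365).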

Unit 3: Exploratory Data Analysis#

The entire dataset is TMI. How can we extract useful information?

Flight data

Topics Covered:

  1. Data visualization

    • Common graphic representations: pie chart, bar chart, time series, histogram, scatterplot

    • Best practices

    • Misleading and uninformative charts

  2. Summaries of center: what is the one number that best summarizes the data?

    • mean, median

  3. Variability: how similar are the different datapoints in the dataset?

    • Would you rather be given \(\$150\) or flip a coin for \(\$300\)?

    • Variance and standard deviation

    • Quantiles and gaps between quantiles

  4. Correlation and correlation coefficient

  5. The usefulness of a summary statistic depends on the data!

    • outliers

    • skew

    • multi-modal data
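A tiny made-up dataset shows why the usefulness of a summary statistic depends on the data. The numbers below are hypothetical; the point is the effect of a single outlier:

```python
import statistics

data = [2, 3, 3, 4, 5, 6, 100]  # hypothetical dataset with one outlier

print(statistics.mean(data))    # about 17.6 -- dragged up by the outlier
print(statistics.median(data))  # 4 -- robust to the outlier
print(statistics.stdev(data))   # about 36.4 -- dominated by the outlier
```

Here the median describes a "typical" datapoint far better than the mean, and the standard deviation mostly measures the outlier rather than the spread of the bulk of the data.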

Unit 4: Correlation and Experiments#

How to analyze data for trends, and how to design experiments.

  1. Effect of sample size on estimates

    • Sample vs. population

    • Sample size matters for estimation!

      • The standard deviation of the sample mean is \(\frac{\sigma}{\sqrt{n}}\)

      • To get \(10\) times more accurate, you need \(100\) times more samples.

  2. Normal Approximation for the sample mean

    • Confidence intervals

    • 68-95-99.7 rule

    • Image from Wikipedia.

  3. Selection bias dramatically affects estimates!

    • Gettysburg address experiment

    • Common sources of selection bias

  4. Hypothesis testing: is my observation a real trend, or just noise?

    • Null hypothesis: a probability model for “just noise”

    • \(p\)-value: the chance of observing our outcome, or something more extreme, under the null hypothesis

    • Significance level, false positive rate, and multiple testing

    • Using simulation to compute \(p\)-values:

      • Testing for correlation

      • Potential outcomes model

  5. Randomized controlled experiments

    • Drawbacks of observational studies

    • Correlation vs. causation and confounding/hidden variables

    • Effect of selection bias

    • Potential outcomes model
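Simulation-based \(p\)-values work exactly as described above: simulate the null hypothesis many times and count how often you see something at least as extreme as the observation. A toy one-sided sketch, with hypothetical numbers (60 heads in 100 flips of a possibly biased coin):

```python
import random

# Null hypothesis ("just noise"): the coin is fair.
# Observation: 60 heads in 100 flips. Could a fair coin plausibly do this?
rng = random.Random(0)
observed = 60
n, trials = 100, 20_000

more_extreme = 0
for _ in range(trials):
    heads = sum(rng.random() < 0.5 for _ in range(n))  # one simulated experiment
    if heads >= observed:  # at least as extreme as what we saw
        more_extreme += 1

p_value = more_extreme / trials
print(p_value)  # roughly 0.03: unlikely, but not impossible, under the null
```

At a significance level of 0.05 we would reject the null; at 0.01 we would not. And remember multiple testing: run twenty such tests on noise and one will likely "succeed" by chance.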

Unit 5: Machine Learning and Regression#

Statistics is often concerned with making predictions.

On observation \(x\), predict outcome \(y\).

  • \(x\) is symptoms/test results, \(y\) is diagnosis

  • \(x\) is SAT score, \(y\) is first-year GPA

  • \(x\) is weather now, \(y\) is weather later

We construct a simple model \(f\) so that \(f(x) = \hat{y}\), with the goal that \(\hat{y}\) is as close to \(y\) as possible.

  1. Building models

    • It is easier to learn from examples than build a model by hand

    • “training” data

    • How to evaluate models: “training” vs. “testing”

  2. Types of prediction problems:

    • regression

    • classification

    • text generation (next word prediction)

  3. Examples of models

    • Linear and quadratic regression

    • \(k\)-nearest-neighbors

    • Markov text generators

  4. Training data is everything!

    • Selection bias in training data leads to biased models

    • If \(x\) is far from all training examples, \(f(x)\) is probably not that accurate

    • More (good) data and better coverage improves performance
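A minimal sketch of a 1-nearest-neighbor classifier, as a toy illustration of "learning from examples." The training data here is entirely made up:

```python
def nearest_neighbor(train, x):
    """Predict the label of the training example closest to x.
    train is a list of (feature, label) pairs; x is a number."""
    closest = min(train, key=lambda pair: abs(pair[0] - x))
    return closest[1]

# Hypothetical training data: test scores labeled pass/fail.
train = [(1000, "fail"), (1100, "fail"), (1250, "pass"), (1400, "pass")]

print(nearest_neighbor(train, 1300))  # "pass": nearest example is 1250
print(nearest_neighbor(train, 900))   # "fail", but 900 is far from every
                                      # training point, so trust this less
```

The second call illustrates the caveat above: the model happily returns a prediction for \(x = 900\), but since it is far from all training examples, that prediction deserves little confidence.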

tl;dr: three themes#

The three major ideas that I want you to take away from this class.

Theme 1: Insight from simple models#

The world is complicated.

Answering a question exactly is overwhelming and often impossible.

Strategy: construct a simple model of the situation.

At least within the simple model, we have the power to answer questions precisely and often quantitatively.

  1. Ballpark estimates and cost-benefit analysis.

  2. Hypothesis testing.

  3. Machine learning and prediction.

  4. Decision making in sports

With great power comes great responsibility.

Know the strengths and limitations of your model.

Theme 2: Conditioning matters#

We might understand an uncertain situation well, but everything can change if we condition!

  1. Common mistakes in conditional probability

    • False positives for medical tests, distracted driving, OJ Simpson, male-dominated sports

  2. Selection bias

    • Hot guys are jerks

    • Biased estimates

    • Biased ML predictions from biased training data

  3. Multimodal data affects interpretation of summary statistics

    • Male vs. female penguin body mass

    • Does generic medical advice apply to you?

      • Salt, hypertension, men vs. women

Theme 3: Critical thinking is essential#

Once you specify the model, statistics can give precise answers.

Is our model good? Does it fit the situation?

Think critically! Don’t calculate blindly.

  1. “When means mislead”

    • Usefulness of fundamental summary statistics (mean, median, standard deviation) depends on data (outliers, skew)

  2. Correlation vs. causation

    • Confounding variables

    • Experimental design

  3. Misleading graphs and figures

  4. Multiple testing and \(p\)-hacking

We did a lot of “what does this mean in plain English?” exercises.

Thinking like this is important—I do this in my research and in my daily life constantly.

Even when a concept is formal or technical, we can and should try to really understand it.

Feedback#

I want to make STATS60 great!

Please take a couple of minutes to give some feedback on the course this quarter.

I’m stats-curious. What’s next?#

If you like exploratory data analysis#

  • STATS 32: Introduction to R

    • Learn the basics of programming in R

    • Example/application-focused class

  • DATASCI 112: Principles of Data Science

    • Deeper dive into data visualization and data analysis

    • More machine learning: how to train and evaluate ML models

    • Practice programming basics

If you like experiments and hypothesis testing#

  • STATS 191: Introduction to Applied Statistics

    • Deeper dive into methods for data analysis and prediction

    • Applications to biology and social sciences

After taking probability theory:

  • STATS 200: Introduction to Theoretical Statistics

    • Hypothesis testing

    • Estimation and confidence intervals

    • Bayesian methods

    • Some theory of machine learning

If you like probability#

  • STATS 117: Introduction to probability theory

    • Dive into probability theory

    • Simple discrete models (coinflips, bags of marbles)

    • Continuous models (Normal)

  • STATS 118: Probability theory for statistical inference

    • Deeper dive into probability theory

    • Theory behind the Normal approximation

    • Math behind some popular hypothesis tests

If you like machine learning#

  • CS 106EA: Exploring artificial intelligence

    • Training and evaluating ML models

    • How do neural networks work?

    • Challenges in ML: overfitting, bias, distribution shift

After taking MATH 51 and CS 106:

  • CS 129: Applied Machine Learning

    • More ML models:

      • logistic regression

      • support vector machines

      • deep learning

    • “Unsupervised learning”: clustering and feature discovery

Thanks for a great quarter!#