## Lecture 20: Multiple Testing

STATS 60 / STATS 160 / PSYCH 10


**Concepts and Learning Goals:**

- False positive \& False negative rates
    - Significance *level* of a test
- Multiple testing: testing multiple hypotheses at once
    - $p$-hacking
    - Family-wise error rate
    - Bonferroni correction


## Hypothesis testing recap

We gather data and seem to observe a trend.
How likely is it that this trend is *real*? Could it just be the result of *random noise*? 


Hypothesis testing paradigm:

1. State a **null hypothesis**:
    - A model for how the data could have been generated if it was just random noise

2. Calculate a $p$-value: 
    - The probability of seeing an outcome/trend at least as extreme as ours, if the data was just random noise (if the null hypothesis were true).

3. Decide if you should reject the null hypothesis.
    - Choose a *level* $\alpha$ at which to reject the null hypothesis, say $\alpha = 0.05$.
    - If the $p$-value is smaller than $\alpha$, we "reject the null hypothesis."
    - If the data really was random noise, we'd expect to see an outcome at least as extreme as ours at most an $\alpha$ fraction of the time.

## False positives

Suppose I classify Monday's $8$ images as AI or not by randomly guessing.


If I get lucky and guess $7$ or more correct, then my $p$-value will be $< 5\%$.

In [None]:
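import math

# p-value for "at least t correct" out of 8 random guesses: P(X >= t), X ~ Binomial(8, 1/2)
for t in range(9):
    p_value = sum(math.comb(8, k) for k in range(t, 9)) / 2**8
    print(f"p-value for >= {t} correct guesses is {p_value:.4f}")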

What if I make a lot of repeat attempts?


In 1000 attempts, I expect to get 7 or more correct about 35 times (since $1000 \cdot \frac{9}{256} \approx 35$)!

In [None]:
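from scipy.stats import binom

# Simulate 1000 attempts at once; each attempt is 8 random guesses
num_correct = binom.rvs(n=8, p=0.5, size=1000)
num_lucky = int((num_correct >= 7).sum())  # attempts with 7 or more correct
print(f"The number of lucky attempts was {num_lucky} out of 1000")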

In my hypothesis test, this would cause a **false positive:** we falsely conclude that I am probably good at deciding if images are AI/not.

## Questions

1. Suppose our data really is just random noise (the null hypothesis is true), but we perform the same experiment $100$ times.

    If we set our level for rejecting the null hypothesis at $\alpha = 0.05 = \frac{1}{20}$, how many times would we expect to end up rejecting the null hypothesis?

    In other words, out of 100 trials, how many *false positives* do we expect to see?

    - Answer: we expect $100 \cdot \alpha = \frac{100}{20} = 5$ false positives.

2. Is the $p$-value a random quantity?

    - Remember that the $p$-value is the chance we saw *our data* or something more extreme, under the null hypothesis.

    - Answer: yes, it is a random quantity.
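Both answers can be checked with a small simulation. The sketch below is illustrative only: it assumes each experiment draws 30 observations from a standard normal distribution (so the null hypothesis "the mean is 0" is true by construction) and uses a two-sided z-test with known standard deviation. The function names and the sample size are choices made for this example, not part of the lecture.

```python
import math
import random

def null_p_value(rng, n=30):
    """Two-sided z-test p-value for 'mean = 0' on n draws of pure noise.

    Assumption for this sketch: data are N(0, 1) with known variance 1,
    so the null hypothesis is true by construction.
    """
    sample_mean = sum(rng.gauss(0, 1) for _ in range(n)) / n
    z = sample_mean * math.sqrt(n)           # standardized sample mean
    return math.erfc(abs(z) / math.sqrt(2))  # P(|Z| >= |z|) for Z ~ N(0, 1)

def count_false_positives(num_experiments, alpha=0.05, seed=0):
    """Repeat the null experiment and count how often p-value < alpha."""
    rng = random.Random(seed)
    p_values = [null_p_value(rng) for _ in range(num_experiments)]
    return sum(p < alpha for p in p_values), p_values

num_rejections, p_values = count_false_positives(100)
print("First five p-values:", [round(p, 3) for p in p_values[:5]])
print("False positives in 100 null experiments:", num_rejections)
```

The p-values differ from one experiment to the next even though the data-generating process never changes, and the number of rejections should land near 5, varying with the seed.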

## False positives

We observed a seeming trend, and performed a hypothesis test.


A **false positive** occurs when the null hypothesis was secretly true, but based on the outcome of our hypothesis test, we decide that the trend is likely real. 


This happens because:

1. There is variability in our data (it comes from a random sample)

2. Hence, there is variability in our $p$-value (the $p$-value is a function of the data)

3. Even if the data is generated according to the null hypothesis, we'll see data that causes a very small $p$-value some of the time.


## Significance level

The *level* or *significance level* $\alpha$ at which we reject the null hypothesis controls the fraction of false positives.

The *level* is the threshold $\alpha$ that we set for the $p$-value:

- If the $p$-value is less than $\alpha$, we reject the null hypothesis.

- When we design our test, we *choose* the level. A common choice of level is $\alpha = 0.05$.


**Question:** If you increase the level, will the fraction of false positives increase or decrease?

- Increase. If the level is $\alpha$, we expect to get an $\alpha$-fraction of false positives, so a larger level means more false positives.


**Question:** In scientific publication, it is standard to require the level $\alpha = 0.05$. What percent of published statistically significant trends do you expect are actually random noise?

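- Of the tests where the null hypothesis is true, about $5\% = 1/20$ will still come out statistically significant, because $5\%$ is the level required for publication.
- The fraction of *published* significant trends that are random noise can be even higher than $5\%$: it also depends on how many of the tested trends were real to begin with.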


**Question:** Why shouldn't I just set my level $\alpha = 1/100000$ or even $\alpha = 0$? Then I will never get a false positive.

- A small level $\alpha$ means we are extremely skeptical that any trend we observe is statistically significant.
- If we set the level too small, we'll get a lot of false negatives.

- If $\alpha = 0$, we will literally never reject the null hypothesis.
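The trade-off can be made concrete with the 8-image guessing test from earlier. The sketch below assumes a hypothetical person who really is good at the task and labels each image correctly with probability 0.8 (a number chosen purely for illustration). For each level $\alpha$ it finds how many correct answers are needed to reject the null hypothesis, then computes the chance that this person falls short of that bar, which is the false negative rate.

```python
import math

N_IMAGES = 8  # number of images in the guessing test

def tail_prob(t, p, n=N_IMAGES):
    """P(X >= t) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(t, n + 1))

def false_negative_rate(alpha, true_accuracy=0.8):
    """Chance that a skilled guesser is NOT detected at level alpha.

    true_accuracy=0.8 is a hypothetical value chosen for this sketch.
    """
    # Scores whose p-value (under random guessing) falls below alpha.
    rejecting_scores = [t for t in range(N_IMAGES + 1) if tail_prob(t, 0.5) < alpha]
    if not rejecting_scores:
        return 1.0  # no score is extreme enough, so we never reject
    threshold = min(rejecting_scores)
    return 1 - tail_prob(threshold, true_accuracy)

for alpha in (0.05, 0.005, 0.001, 0.0):
    print(f"alpha = {alpha}: false negative rate = {false_negative_rate(alpha):.3f}")
```

With only 8 images, the smallest possible $p$-value is $1/256$, so any level below that can never be met and the skilled guesser is missed every time.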

## Multiple testing

Can you think of any situations where you would naturally want to do a lot of different experiments in parallel?

- A biology/physics/chemistry lab has multiple experiments going at once

- A pharmaceutical company might be developing multiple drugs at once

- A company might be testing out multiple versions of the same product

- Many different contestants trying to guess something (like a lottery number)


If we do a lot of experiments in parallel, the chances of getting at least one false positive increase.

## Are you psychic?



<div style="display: flex; justify-content: left; flex-direction: column; align-items: left;">
  <div>
    <a href="https://psychicscience.org/esp3"><img src="https://tselilschramm.org/introstats/figures/psychic-qr.png" style="width:200px;"/></a>
    <p style="font-size: smaller; text-align: left; margin-top: 4px;">Psychic Science</p>
  </div>
</div>



Choose the following procedure:

- 25 cards

- Clairvoyance

- Open deck

- Cards seen





How many did you get right?

<div style="display: flex; justify-content: center; flex-direction: column; align-items: center;">
  <div>
    <img src="https://tselilschramm.org/introstats/figures/ESP-poll.png" style="width:200px;"/>
    <p style="font-size: smaller; text-align: center; margin-top: 4px;"></p>
  </div>
</div>

## Hypothesis testing 

Do you have ESP? Formulate a hypothesis test.

1. What is your null hypothesis?

2. What is the $p$-value (in plain English)? How would you compute it?

3. The ESP applet gave you a $p$-value. Do you reject the null hypothesis? 


## Class data



## Psychics in STATS 60? {.smaller}

Do you believe that there are psychics among us?


In a class of 50 with no psychics, we would expect $.05 \cdot 50 = 2.5$ false positives.


In fact, we can model the number of false positives using coin flips!


The number of false positives is like the number of heads we get if we flip $n = 50$ coins, each with heads probability $0.05$.


**Question:** what is the chance that, as a class of $n = 50$, we get at least one false positive?

- Using the probability for coinflips, it should be $$1 - \Pr[ 0 \text{ heads}] = 1 - \binom{50}{0} \left(0.95\right)^{50} \approx 0.92$$


The "overall" false positive rate is over 90%!
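The coin-flip model is easy to simulate. The sketch below assumes the 50 tests are independent and that every null hypothesis is true, so each student is flagged as a false positive with probability $0.05$; the helper name `simulate_class` is made up for this example. It compares the simulated chance of at least one false positive with the exact value $1 - 0.95^{50}$.

```python
import random

def simulate_class(rng, n_students=50, alpha=0.05):
    """Number of false positives in one class where nobody is psychic.

    Assumes independent tests: each student is flagged with probability alpha.
    """
    return sum(rng.random() < alpha for _ in range(n_students))

rng = random.Random(0)
num_classes = 20_000
counts = [simulate_class(rng) for _ in range(num_classes)]

average_false_positives = sum(counts) / num_classes
share_with_any = sum(c >= 1 for c in counts) / num_classes
exact_fwer = 1 - (1 - 0.05) ** 50

print(f"Average false positives per class: {average_false_positives:.2f}")
print(f"Simulated P(at least one false positive): {share_with_any:.3f}")
print(f"Exact value 1 - 0.95^50: {exact_fwer:.3f}")
```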

# Comic Relief


##

This comic comes from [xkcd](https://xkcd.com/882/).

![](https://tselilschramm.org/introstats/figures/xkcd1.png)

##

<div style="display: flex; justify-content: center; flex-direction: column; align-items: center;">
  <div>
    <img src="https://tselilschramm.org/introstats/figures/xkcd2.png" style="width:70%;"/>
    <p style="font-size: smaller; text-align: center; margin-top: 4px;"></p>
  </div>
</div>


<div style="display: flex; justify-content: center; flex-direction: column; align-items: center;">
  <div>
    <img src="https://tselilschramm.org/introstats/figures/xkcd3.png" style="width:70%;"/>
    <p style="font-size: smaller; text-align: center; margin-top: 4px;"></p>
  </div>
</div>


##

<div style="display: flex; justify-content: center; flex-direction: column; align-items: center;">
  <div>
    <img src="https://tselilschramm.org/introstats/figures/xkcd4.png" style="width:70%;"/>
    <p style="font-size: smaller; text-align: center; margin-top: 4px;"></p>
  </div>
</div>


<div style="display: flex; justify-content: center; flex-direction: column; align-items: center;">
  <div>
    <img src="https://tselilschramm.org/introstats/figures/xkcd5.png" style="width:70%;"/>
    <p style="font-size: smaller; text-align: center; margin-top: 4px;"></p>
  </div>
</div>

##

![](https://tselilschramm.org/introstats/figures/xkcd6.png)


# Multiple Testing

What to do when you want to test multiple hypotheses


## Multiple Testing

In both examples today, we were testing _multiple hypotheses_:

- whether each person in the room has ESP (about 30 tests)

- whether each color of jellybean causes acne (20 tests)


If we do $m > 1$ hypothesis tests, each at level $\alpha$, then the
"overall" probability of having at least one false positive is larger than $\alpha$.

- This can lead to (accidental) "$p$-hacking," a phenomenon wherein the reported $p$-value of a scientific experiment understates how likely such a result is under the null hypothesis.

- This usually happens when scientists test multiple hypotheses at once and fail to account for it in their statistical analysis.


This "overall" probability is called the **family-wise error rate**.

$$\text{FWER} = P(\text{at least one false positive}). $$


What if we wanted to guarantee that $\text{FWER} \leq \alpha$?


## Bonferroni Correction

![Carlo Bonferroni (1892-1960)](https://tselilschramm.org/introstats/figures/bonferroni.jpeg)

Instead of testing each of the $m$ hypotheses at level $\alpha$, test 
each hypothesis at level $\frac{\alpha}{m}$.

#### Why Does This Work?

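Suppose the null hypothesis is true for
$m_0$ of the $m$ hypotheses. Each of those $m_0$ tests gives a false positive with probability $\frac{\alpha}{m}$, and the chance that at least one of them does is at most the sum of these probabilities.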

$$
\begin{align*}
\text{FWER} &= P(\text{at least one false positive}) \\
&\leq m_0 \cdot P(\text{false positive}) \\
&= m_0 \cdot \frac{\alpha}{m} \\
&\leq \alpha
\end{align*}
$$
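To see the guarantee numerically, the sketch below compares the family-wise error rate with and without the correction. It assumes the tests are independent and that all $m$ null hypotheses are true, in which case the FWER is exactly $1 - (1 - \text{level})^m$; the Bonferroni bound itself does not need independence. The function name `fwer_independent` is an illustrative choice.

```python
def fwer_independent(m, per_test_level):
    """Exact FWER for m independent tests when every null hypothesis is true."""
    return 1 - (1 - per_test_level) ** m

alpha = 0.05
for m in (1, 20, 30, 64):
    uncorrected = fwer_independent(m, alpha)
    bonferroni = fwer_independent(m, alpha / m)
    print(f"m = {m:2d}: uncorrected FWER = {uncorrected:.3f}, "
          f"Bonferroni FWER = {bonferroni:.4f}")
```

Without the correction the FWER climbs quickly as $m$ grows, while the corrected value stays just under $\alpha$ for every $m$.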



## Applying the Bonferroni Correction

If we do 64 tests for ESP, then the Bonferroni correction says: 

- If you want to guarantee that the FWER is at most $.05$,
- test each hypothesis at level $\frac{.05}{64} \approx .0008$.
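As a rough illustration of what the corrected level demands, the sketch below finds the smallest score that would be statistically significant at each level. It assumes each of the 25 cards is one of 5 equally likely symbols, so a non-psychic guesses each card correctly with probability $1/5$, independently of the others. This is an assumption about the applet, not something stated above.

```python
import math

N_CARDS = 25
P_GUESS = 1 / 5  # assumed chance of guessing one card correctly (5 symbols)

def p_value(num_correct):
    """P(X >= num_correct) for X ~ Binomial(N_CARDS, P_GUESS)."""
    return sum(
        math.comb(N_CARDS, k) * P_GUESS**k * (1 - P_GUESS) ** (N_CARDS - k)
        for k in range(num_correct, N_CARDS + 1)
    )

def min_significant_score(level):
    """Smallest number of correct cards whose p-value is below the level."""
    for score in range(N_CARDS + 1):
        if p_value(score) < level:
            return score
    return None  # no score is extreme enough at this level

num_tests = 64
print("Score needed at level 0.05:", min_significant_score(0.05))
print("Score needed at level 0.05/64:", min_significant_score(0.05 / num_tests))
```

Under these assumptions, the bar for calling someone psychic rises by several cards once the level is divided by the number of tests.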


Is anyone still psychic now?


## The Dark Side of Bonferroni

![](https://tselilschramm.org/introstats/figures/bonferroni_dark.jpeg)

- The Bonferroni correction makes it much harder to reject the null hypothesis.
- This keeps the false positive rate under control.
- But if we rarely reject the null hypothesis, we risk missing real effects: we get a lot of false negatives.


In general, the Bonferroni correction increases the false negative rate.

- So if there is a psychic among us, we are less likely to discover them.

## Recap

- The significance level of a test controls the false positive rate
    - If we perform a test of level $\alpha$, we expect false positives an $\alpha$-fraction of the time
    - If we make the level super small, we'll get a lot of false negatives.

- Multiple testing: when we want to test many hypotheses at once
    - The *family-wise error rate* is the chance of at least one false positive
    - $p$-hacking
    - Bonferroni correction for multiple testing