Lecture 16: Sample size matters#
STATS 60 / STATS 160 / PSYCH 10
Concepts and Learning Goals:
Estimating an unknown quantity by sampling
Larger sample improves accuracy
Population vs. sample distribution
Quantifying how accuracy improves with sample size
standard deviation of the sample mean
Welcome to Unit 4: Correlation and Experiments!#
A burning question#
You want to know what fraction of people on campus think that a hot dog is a type of sandwich; call this unknown fraction $\mu$.
The population is large enough that you cannot afford to ask each person—it’s too time consuming, and too hard to ensure you really asked everyone.
What do you do?
![The cube rule of food.](https://tselilschramm.org/introstats/figures/cuberule.png)
The cube rule of food. Image credit: wikihow.
Estimating from a sample#
You estimate the fraction $\mu$ from a random sample:
Sample a uniformly random subset of $n$ people (with vs. without replacement doesn’t really matter if the population is large enough).
Suppose $k$ out of the $n$ people sampled think a hot dog is a sandwich.
You take $\hat{\mu} = k/n$ as an estimate for $\mu$.
Question: Can you model this probabilistically with either a bag of marbles or with coinflips?
Question: Are we guaranteed that our estimate $\hat{\mu}$ is equal to $\mu$?
How good is our estimate?#
Our poll is basically an experiment whose goal is to measure the unknown fraction $\mu$.
Our estimate $\hat{\mu}$ is a random, noisy measurement of $\mu$.
The value of $\hat{\mu}$ depends on the sample. If we repeat the experiment, we could get a different value of $\hat{\mu}$.
It’s important for us to know if we can trust our estimate.
We can use probability (modeling with coinflips) to calculate, directly, the probability that $\hat{\mu}$ is far from $\mu$.
We can also approximate this probability in a smarter way, using a Normal approximation. More on Wednesday.
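As a rough sketch of the direct calculation: assume each of the n sampled people independently says "sandwich" with probability mu (the coinflip model, which matches sampling with replacement). Then the number of yeses is Binomial(n, mu), and we can add up the probabilities of every outcome whose estimate k/n lands more than a tolerance away from mu. The function name `prob_estimate_far`, the tolerance `eps`, and the values mu = 0.4 and eps = 0.07 are illustrative assumptions, not numbers from the lecture.

```python
from math import comb

def prob_estimate_far(n, mu, eps):
    """P(|k/n - mu| > eps) when k ~ Binomial(n, mu), i.e. n independent coinflips."""
    total = 0.0
    for k in range(n + 1):
        if abs(k / n - mu) > eps:
            # Probability of seeing exactly k "yes" answers out of n
            total += comb(n, k) * mu**k * (1 - mu)**(n - k)
    return total

# Illustrative values only: mu = 0.4 and eps = 0.07 are assumptions.
for n in (10, 40, 160):
    print(f"n = {n:3d}: P(estimate is more than 0.07 from mu) = "
          f"{prob_estimate_far(n, mu=0.4, eps=0.07):.3f}")
```

Under these assumptions the printed probability should shrink as n grows, which is the pattern the rest of the lecture quantifies.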
Another scenario#
We want to determine the concentration of microplastics in Palo Alto tap water.
The concentration of microplastics is a fixed quantity, $\mu$.
How can we estimate $\mu$?
We take $n$ independent water samples, measure the concentration of microplastics in each, and produce a set of measurements $X_1, X_2, \ldots, X_n$.
We estimate $\mu$ using the sample mean, $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$.
How accurate is our estimate $\bar{X}$?
The estimate is random. If we repeat our experiment, we could get a different value of $\bar{X}$.
Can we compute the probability that the error $|\bar{X} - \mu|$ is big?
In this case, it is not clear how to model this scenario with coins or marbles; we don’t know how the error of our measurements behaves.
We can still approximate this probability using a Normal approximation! More on Wednesday.
Size matters#
Before we get into understanding the probability that our estimate is accurate, let’s make one thing clear: Sample size matters.
Question: What do you expect to happen to our estimate as the sample size $n$ increases?
Since
DIY poll#
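From the class survey, we know $\mu = 43/107 \approx 0.40$ for our class.
What if we try to estimate $\mu$ by polling only $n$ students?
For each $n \in \{2, 4, 8, 16, 32, 64\}$:
We’ll conduct $T$ polls, each of $n$ students from our class.
a. Each of the $T$ polls produces $n$ responses.
b. We’ll compute a separate estimate $\hat{\mu}$ based on each poll.
c. The $\hat{\mu}$’s form a dataset with $T$ samples.
d. We’ll plot the histogram of the dataset.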
But we’ll simulate it with some code, to save time :)
import matplotlib.pyplot as plt
import random

T = 50000      # Number of simulated polls per sample size
mu = 43/107    # True fraction in the class survey
Class = [1] * 43 + [0] * (107 - 43) # Make a list of "students" in the class; 1's are hot dog = sandwich people
Estimates = [0]*T

def trial(n):
    for t in range(T): # Run T independent polls
        X = random.sample(Class, n) # Survey n distinct students each time
        Estimates[t] = sum(X)/n # Record the estimate from this poll
    plt.hist(Estimates, bins=10) # Plot a histogram of the T estimates
    # Raw strings keep the backslash in \mu from being read as an escape sequence
    plt.xlabel(r'Estimate of $\mu$')
    plt.title(r'Variability in estimate of $\mu$, n = ' + str(n))
    plt.axvline(x=mu, color='red', linestyle='--', linewidth=1) # Mark the true value of mu
    plt.xlim(0, 1)
    plt.show()
Smallest sample, $n = 2$.#
trial(2)

Small sample, $n = 4$.#
trial(4)

Medium-small sample, $n = 8$.#
trial(8)

Medium sample, $n = 16$.#
trial(16)

Largeish sample, $n = 32$.#
trial(32)

Large sample, $n = 64$.#
trial(64)

What do you notice?#
Question: As we increase the sample size $n$, what happens to:
1. The variability of the dataset of our estimates?
2. The shape of the histogram?
Answers:
1. Larger sample size leads to decreased variability and increased accuracy of the estimate: it is more likely to be close to $\mu$.
2. The shape of the histogram looks more and more like a bell curve.
We’ll focus on 1 today, return to 2 on Wednesday.
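To put numbers on the first observation, here is a small sketch that reruns the same kind of poll simulation without plotting and reports the standard deviation of the estimates for each sample size. It reuses the class roster from the simulation code (43 ones out of 107); the helper name `spread_of_estimates`, the fixed seed, and the smaller number of polls (2000 rather than 50000) are assumptions made to keep the example quick and repeatable.

```python
import random
import statistics

random.seed(0)  # assumption: fixed seed so reruns print the same numbers
CLASS = [1] * 43 + [0] * (107 - 43)  # same roster as the lecture's simulation
NUM_POLLS = 2000  # assumption: fewer polls than the lecture's T, for speed

def spread_of_estimates(n, num_polls=NUM_POLLS):
    """Standard deviation of the poll estimates across num_polls polls of size n."""
    estimates = [sum(random.sample(CLASS, n)) / n for _ in range(num_polls)]
    return statistics.pstdev(estimates)

for n in (2, 4, 8, 16, 32, 64):
    print(f"n = {n:2d}: SD of the estimates is about {spread_of_estimates(n):.3f}")
```

Because each poll samples without replacement from a class of only 107 students, the spread here should shrink somewhat faster than it would for a very large population.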
Bigger samples are better#
The tl;dr of this lecture is that sample size matters, and bigger is better.
A larger sample size leads to a more accurate estimate.
But now, we want to quantify how much more accurate the estimate becomes as the sample size increases.
For that, we formalize the concept of the population vs. the sample.
Population vs. sample#
Our experiments are all versions of the following meta-experiment:
There is a variable $X$ which describes members of a “population”. Our goal is to estimate the population mean, $\mu$.
We take $n$ independent samples from the population, forming a sample dataset
$$X_1, X_2, \ldots, X_n.$$
We form an estimate of $\mu$ using the sample mean
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i.$$
The sample mean is a random variable; its value depends on the randomness of our sample.
Our hope is that it is usually close to $\mu$.
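The population-versus-sample recipe can be sketched in a few lines of code. This is only an illustration: the helper `sample_mean` and the 0/1 population with 40% ones are hypothetical, and the samples are drawn with replacement so that they are independent.

```python
import random

def sample_mean(population, n, rng=random):
    """Draw n independent samples (with replacement) and return their average."""
    samples = rng.choices(population, k=n)
    return sum(samples) / n

# Hypothetical population of 0/1 answers: 40 ones out of 100, so the mean is 0.4.
population = [1] * 40 + [0] * 60
population_mean = sum(population) / len(population)

# Repeating the experiment gives a different sample mean each time.
for attempt in range(3):
    print(f"attempt {attempt}: sample mean = {sample_mean(population, n=25):.2f}, "
          f"population mean = {population_mean:.2f}")
```

The population mean is a fixed number, while the sample mean changes from one run of the experiment to the next.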
Hotdog poll as “population vs. sample”#
We conduct a poll to figure out what fraction of Stanford students think that a hot dog is a sandwich.
The “population” is the students on campus.
There is a variable $X$ which describes members of the “population”: for each student, $X$ takes value $1$ if the student thinks a hot dog is a sandwich, and $0$ otherwise.
The population mean, $\mu$, is exactly the fraction of students who think a hot dog is a sandwich.
We take $n$ independent samples from the population, forming a dataset
$$X_1, X_2, \ldots, X_n.$$
Each student’s yes/no answer can be recorded as $X_i = 1$ if the student said yes, $X_i = 0$ if they said no.
We estimate $\mu$ using the sample mean $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$.
Microplastics as “population vs. sample”#
We take $n$ water samples to estimate the concentration of microplastics in Palo Alto tap water.
The “population” is all possible water samples.
There is a variable $X$ which describes members of the “population”: in this experiment, $X$ is the concentration of microplastics in a particular water sample.
The population mean, $\mu$, is the average concentration of microplastics in a water sample.
We take $n$ independent samples from the population, forming a dataset
$$X_1, X_2, \ldots, X_n.$$
Each water sample gives us a measurement $X_i$, the concentration of microplastics in that particular water sample.
We estimate $\mu$ using the sample mean $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$.
Now you try#
A medical researcher has come up with a new drug.
The drug has a side effect: a headache that lasts anywhere from
The researcher recruits a group of
What is the population?
What is the variable $X$ that describes members of the “population”?
What is the population mean $\mu$?
What are the samples?
Statistics of the sample vs. the population#
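Suppose the population mean for the variable $X$ is $\mu$, and the population standard deviation is $\sigma$.
The sample mean $\bar{X}$ is a random variable.
Different samples could lead to different values of $\bar{X}$.
Assuming our $n$ samples are independent:
The expectation of the sample mean is equal to the population mean,
$$\mathbb{E}[\bar{X}] = \mu.$$
The standard deviation of the sample mean is
$$\mathrm{SD}(\bar{X}) = \frac{\sigma}{\sqrt{n}}.$$
That’s a factor $\sqrt{n}$ smaller than the standard deviation of $X$!
Quantifying “bigger is better”: As $n$ increases, the standard deviation of the sample mean shrinks in proportion to $1/\sqrt{n}$.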
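For readers who want to see where the $\sqrt{n}$ comes from, here is a short derivation sketch. It assumes the samples $X_1, \ldots, X_n$ are independent, each with mean $\mu$ and standard deviation $\sigma$, and writes $\bar{X}$ for their average (this notation is assumed here). It uses two facts: expectation is linear, and variances of independent random variables add.

$$
\begin{aligned}
\mathbb{E}[\bar{X}] &= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[X_i] = \frac{1}{n} \cdot n\mu = \mu, \\
\mathrm{Var}(\bar{X}) &= \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}(X_i) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}, \\
\mathrm{SD}(\bar{X}) &= \sqrt{\mathrm{Var}(\bar{X})} = \frac{\sigma}{\sqrt{n}}.
\end{aligned}
$$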
Larger samples give more accurate estimates#
A large sample size reduces the standard deviation of the sample mean.
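This is just a feature of the distribution of the sample mean $\bar{X}$, but it matters for accuracy: a random quantity is unlikely to land many standard deviations away from its expectation, so a small standard deviation means $\bar{X}$ is likely to be close to the population mean $\mu$.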
(As we discussed in lectures 14&15, this is guaranteed by something called Chebyshev’s inequality).
So to get a standard deviation that is $k$ times smaller, we need a sample that is $k^2$ times larger.
We’ll get more precise on Wednesday.
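As a back-of-the-envelope sketch, the relationship "standard deviation of the sample mean = population standard deviation divided by the square root of the sample size" can be inverted to ask how many independent samples are needed to reach a target standard deviation. The function name `samples_needed` and the target values are hypothetical; the population standard deviation in the example is the one for a 0/1 variable with mean 43/107, namely sqrt(mu * (1 - mu)).

```python
from math import ceil, sqrt

def samples_needed(sigma, target_sd):
    """Smallest n with sigma / sqrt(n) <= target_sd, assuming independent samples."""
    return ceil((sigma / target_sd) ** 2)

mu = 43 / 107                 # fraction from the class survey
sigma = sqrt(mu * (1 - mu))   # standard deviation of a 0/1 variable with mean mu

# Halving the target standard deviation roughly quadruples the required sample size.
for target in (0.1, 0.05, 0.025):
    print(f"target SD {target}: need about {samples_needed(sigma, target)} samples")
```

This ignores the finite size of the population, so for a small group like a single class it should be read as a rough guide rather than an exact requirement.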
Recap#
We introduced the strategy of estimating an unknown feature of a population by random sampling.
We introduced the paradigm of the population vs. sample mean.
Sample size matters! Bigger is better.
The standard deviation of the sample mean is a factor of $\sqrt{n}$ smaller than the population standard deviation, where $n$ is the sample size.