The Central Limit Theorem — Definition, Experiments, and Uses
What is the Central Limit Theorem?
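The Central Limit Theorem (CLT) states that when the sample size is large enough, the distribution of the sample mean approaches a normal distribution regardless of the shape of the population distribution, provided the observations are independent, identically distributed, and have finite variance.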
Figure 1. Sample mean distribution (as n grows)
Even when $X$ is drawn from an arbitrary distribution with mean $\mu$ and variance $\sigma^2$, the distribution of the sample mean approaches a normal distribution as the sample size $n$ grows. For large $n$, the sample mean is approximately distributed as:
$$\bar{X} \sim N\!\left(\mu,\ \dfrac{\sigma^2}{n}\right)$$
Eq. 1. Distribution of the sample mean
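Eq. 1 can be checked numerically. The sketch below is only an illustration, not the code behind the figures: the Uniform(0, 1) population, the number of repetitions `m`, and the seed are all assumptions chosen for convenience. It compares the mean and variance of simulated sample means with $\mu = 0.5$ and $\sigma^2/n = (1/12)/n$.

```python
import numpy as np

MU, SIGMA2 = 0.5, 1.0 / 12.0  # exact mean and variance of Uniform(0, 1)


def sample_mean_moments(n, m=10_000, seed=0):
    """Draw m samples of size n from Uniform(0, 1) (an assumed population)
    and return the mean and variance of the m sample means."""
    rng = np.random.default_rng(seed)
    means = rng.uniform(0.0, 1.0, size=(m, n)).mean(axis=1)
    return means.mean(), means.var(ddof=1)


for n in (1, 30, 300):
    mean_hat, var_hat = sample_mean_moments(n)
    print(f"n={n:4d}  mean={mean_hat:.4f} (theory {MU})  "
          f"var={var_hat:.6f} (theory {SIGMA2 / n:.6f})")
```

The simulated variance should shrink roughly in proportion to $1/n$, which is the $\sigma^2/n$ term in Eq. 1.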
Figure 2. Sampling from the population
Let $M$ be the number of repeated samplings and $N$ the number of observations drawn each time (the sample size, written $n$ in Eq. 1). With $M=500$ fixed, I increased $N$ through $1, 30, 300, 3000$ and sampled from five distributions.
The 5 distributions tested
- Uniform
- Gaussian
- F distribution
- Chi-squared
- Bernoulli
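The setup above can be sketched roughly as follows. This is a reconstruction for illustration, not the original experiment code: the distribution parameters (degrees of freedom for F and chi-squared, the Bernoulli probability) are not given in the post, so the values below are assumptions, as are the names `SAMPLERS` and `sample_means`.

```python
import numpy as np

# Parameter choices are illustrative assumptions, not values from the post.
SAMPLERS = {
    "uniform":     lambda rng, size: rng.uniform(0.0, 1.0, size),
    "gaussian":    lambda rng, size: rng.normal(0.0, 1.0, size),
    "f":           lambda rng, size: rng.f(5, 10, size),
    "chi-squared": lambda rng, size: rng.chisquare(3, size),
    "bernoulli":   lambda rng, size: rng.binomial(1, 0.3, size),
}


def sample_means(name, N, M=500, seed=0):
    """Repeat M times: draw N observations from the named distribution
    and take their mean. Returns an array of M sample means."""
    rng = np.random.default_rng(seed)
    draws = SAMPLERS[name](rng, (M, N))
    return draws.mean(axis=1)


for N in (1, 30, 300, 3000):
    spreads = {name: sample_means(name, N).std(ddof=1) for name in SAMPLERS}
    summary = ", ".join(f"{name}={s:.4f}" for name, s in spreads.items())
    print(f"N={N:5d}  std of sample means: {summary}")
```

Passing the array returned by `sample_means` to a histogram function is one way to produce plots like the ones in the next section.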
CLT Experiments
Figure 3. M=500, N=1
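At $N=1$ each sample mean is just a single draw, so the histograms reproduce the populations themselves: the uniform, F, and Bernoulli cases look very different from a Gaussian. (Chi-squared is relatively similar, and the Gaussian case is normal by construction.)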
Figure 4. N=30 (M=500)
At $N=30$, the uniform, F, and Bernoulli cases have also shifted noticeably toward a Gaussian shape.
The meaning of sample size 30: as a rule of thumb, with a sample size of about 30 or more, the sample mean is approximately normally distributed for most population shapes. Strongly skewed or heavy-tailed populations can need a larger sample.
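A short worked example shows why 30 is a guideline rather than a guarantee. For independent draws from a population with skewness $\gamma_1$, the skewness of the sample mean is $\gamma_1/\sqrt{N}$. Taking a chi-squared population with $k=1$ degree of freedom (an assumed, deliberately skewed case, with $\gamma_1 = \sqrt{8/k}$):

$$\gamma_1(\bar{X}) = \dfrac{\gamma_1}{\sqrt{N}}, \qquad \gamma_1 = \sqrt{8} \approx 2.83$$

$$N=30:\ \dfrac{2.83}{\sqrt{30}} \approx 0.52, \qquad N=300:\ \dfrac{2.83}{\sqrt{300}} \approx 0.16, \qquad N=3000:\ \dfrac{2.83}{\sqrt{3000}} \approx 0.05$$

So at $N=30$ the sample mean of this population is still visibly skewed, and it takes a few hundred observations before the asymmetry becomes small. A symmetric population such as the uniform has $\gamma_1 = 0$ and gets there much sooner.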
Figure 5. N=300 (M=500)
Figure 6. N=3000 (M=500)
As $N$ grows further, the distribution of the sample mean matches the normal distribution ever more closely.
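One way to put a number on "more closely" is the Kolmogorov-Smirnov distance: the largest gap between the empirical CDF of the standardized sample means and the standard normal CDF. The sketch below is an illustration under assumptions of my choosing (a chi-squared population with 1 degree of freedom, seed 0, and the helper name `ks_distance_to_normal`); it is not how the figures were made.

```python
import math

import numpy as np


def ks_distance_to_normal(values):
    """Largest gap between the empirical CDF of the standardized values
    and the standard normal CDF."""
    z = np.sort((values - values.mean()) / values.std(ddof=1))
    normal_cdf = np.array(
        [0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in z]
    )
    m = len(z)
    upper = np.arange(1, m + 1) / m - normal_cdf  # ECDF just after each point
    lower = normal_cdf - np.arange(0, m) / m      # ECDF just before each point
    return float(max(upper.max(), lower.max()))


rng = np.random.default_rng(0)
M = 500
for N in (1, 30, 300, 3000):
    # Assumed population: chi-squared with 1 degree of freedom (strongly skewed).
    means = rng.chisquare(1, size=(M, N)).mean(axis=1)
    print(f"N={N:5d}  KS distance = {ks_distance_to_normal(means):.3f}")
```

With only $M=500$ sample means the distance cannot fall much below a few hundredths because of sampling noise, so the improvement flattens out at large $N$.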
So Where Is the CLT Used?
Figure 7. Uses of the Central Limit Theorem
One important goal of statistics is to estimate the mean, the representative value of a group. In practice the shape of the population distribution is usually unknown, but the CLT tells us that the sample mean approaches a normal distribution as the sample size $N$ grows, and that is what makes the analysis tractable. With a large enough $N$, the sample mean is approximately normal, which enables parameter estimation and confidence intervals. However, if the population standard deviation $\sigma$ is unknown, we estimate it with the sample standard deviation and use the t-distribution instead of the normal.
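To see why an unknown $\sigma$ matters, the sketch below builds 95% intervals of the form $\bar{x} \pm z \cdot s/\sqrt{N}$, plugging the sample standard deviation $s$ into the normal formula, and counts how often they contain the true mean. The standard normal population, the trial count, and the function names are assumptions made for this illustration.

```python
import numpy as np

Z_975 = 1.959964  # 97.5% quantile of the standard normal distribution


def z_interval(sample):
    """95% interval for the mean using the normal critical value and
    the sample standard deviation in place of the unknown sigma."""
    half_width = Z_975 * sample.std(ddof=1) / np.sqrt(len(sample))
    center = sample.mean()
    return center - half_width, center + half_width


def coverage(n, trials=5_000, seed=0):
    """Fraction of intervals that contain the true mean (0 for the
    assumed standard normal population)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        low, high = z_interval(rng.normal(0.0, 1.0, size=n))
        hits += low <= 0.0 <= high
    return hits / trials


for n in (5, 30, 100):
    print(f"N={n:3d}  coverage of the nominal 95% interval: {coverage(n):.3f}")
```

For small $N$ the coverage falls short of 95% because $s$ is itself a noisy estimate. Replacing `Z_975` with the matching t critical value is what corrects for this.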
Figure 8. The relationship between the t-distribution and the z (normal) distribution
The t-distribution is built on the sample standard deviation, so it can be computed from sample information alone. The larger the degrees of freedom (df $= N-1$), the closer it gets to the Z distribution, becoming nearly identical at around df=30 ($N=31$). So for a test on the mean, if $N>30$ you may use either the t- or the normal distribution. Below 30, the t-test relies on the population being approximately normal, and when that assumption is doubtful a nonparametric test is recommended.
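The convergence of t toward Z can be checked by simulation. The sketch below estimates the two-sided 95% critical value of the t-distribution from random draws and compares it with the normal value of about 1.96. The draw count, seed, and function name are assumptions; an exact quantile function from a statistics library would be the more usual tool.

```python
import numpy as np

Z_975 = 1.959964  # 97.5% quantile of the standard normal distribution


def t_quantile_975(df, draws=1_000_000, seed=0):
    """Monte Carlo estimate of the 97.5% quantile of the t-distribution
    with the given degrees of freedom."""
    rng = np.random.default_rng(seed)
    return float(np.quantile(rng.standard_t(df, size=draws), 0.975))


for df in (2, 5, 10, 30, 100):
    q = t_quantile_975(df)
    print(f"df={df:3d}  t critical = {q:.3f}  (excess over z: {q - Z_975:+.3f})")
```

At df=30 the critical value is roughly 2.04 against 1.96 for the normal, a gap of about 4%, which is why the two are treated as interchangeable from that point on.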
Conclusion
If $N$ is 30 or more, the distribution of the sample mean is approximately normal by the CLT, but since the population standard deviation $\sigma$ is usually unknown, we cannot use the Z distribution directly and use the t-distribution instead. When $N \geq 30$, the t-distribution is itself close to normal, so using it is fine. In other words, the normality assumption about the population matters when the sample is small, but as the sample grows the CLT makes the t-distribution sufficient.
References
- Basic Statistics (Introduction) — Jayu Academy (textbook)
📦 Migrated from my own Korean blog (my own writing). Original: taehyuklee.tistory.com/25