The Central Limit Theorem (CLT): Definition, Experiments & Uses

2024.11.03 ·#statistics #central-limit-theorem #CLT #normal-distribution #t-distribution

What is the central limit theorem?
An experiment on the central limit theorem (run in Python)
The meaning of the central limit theorem (its uses)

What is the central limit theorem?

The central limit theorem states that, when the sample size is sufficiently large, the distribution of the sample mean approaches a normal distribution regardless of the shape of the population distribution, provided the population has a finite variance.

Figure 1. sample mean distribution (as n grows)

Figure 1 shows that whichever distribution the data X are drawn from, the distribution of the sample mean approaches a normal distribution as the sample size n grows. For a population with mean μ and variance σ², the sample mean is approximately distributed as follows.

$$ \bar{X} \sim N\!\left(\mu,\; \frac{\sigma^2}{n}\right) $$

Formula 1. The distribution of the sample mean
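Formula 1 can be checked numerically. The following is a minimal sketch, not the experiment file from this post: it assumes a Uniform(0, 1) population (so μ = 1/2 and σ² = 1/12), and the helper name `sample_means`, the seed, and the values of `n` and `m` are all chosen here for illustration.

```python
import random
import statistics

def sample_means(draw, n, m, rng):
    """Return m sample means, each computed from n draws of draw(rng)."""
    return [statistics.fmean([draw(rng) for _ in range(n)]) for _ in range(m)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed population: Uniform(0, 1), with mu = 1/2 and sigma^2 = 1/12.
mu, sigma2 = 0.5, 1.0 / 12.0
n, m = 30, 5000  # sample size and number of samplings (illustrative values)

means = sample_means(lambda r: r.random(), n, m, rng)

empirical_mean = statistics.fmean(means)
empirical_var = statistics.pvariance(means)
theoretical_var = sigma2 / n  # the sigma^2 / n term of Formula 1

print(f"mean of sample means:     {empirical_mean:.4f} (theory {mu})")
print(f"variance of sample means: {empirical_var:.6f} (theory {theoretical_var:.6f})")
```

The two printed pairs should agree closely, and the agreement tightens as `m` grows because more sample means are available to estimate their own mean and variance.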

Let's look at the sampling process in detail.

Figure 2. Sampling from the population

Let M be the number of times we draw a sample from the population, and let N be the size of each sample (the number of observations drawn at once). N plays the role of n in Formula 1.

Fixing the number of samplings at M=500 and increasing the sample size from N=1 to N=30, 300, and 3000, I sampled from the following five distributions.

The five distributions experimented on

Uniform distribution
Gaussian distribution
F distribution
Chi-squared distribution
Bernoulli distribution
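The sampling scheme can be sketched with the Python standard library as follows. This is an illustration rather than the file used for the figures: the post does not state the parameters of each distribution, so the degrees of freedom and the Bernoulli probability below are assumptions, and N=3000 is left out only to keep the pure-Python run short.

```python
import random
import statistics

# Hypothetical parameters: the post does not say which ones it used.
CHI2_DF = 4
F_DF1, F_DF2 = 5, 10
BERNOULLI_P = 0.3

def chi2(rng, df):
    # A chi-squared(df) variable is Gamma(shape=df/2, scale=2).
    return rng.gammavariate(df / 2.0, 2.0)

def f_dist(rng, d1, d2):
    # Ratio of two independent chi-squared variables, each divided by its df.
    return (chi2(rng, d1) / d1) / (chi2(rng, d2) / d2)

DISTRIBUTIONS = {
    "uniform": lambda rng: rng.random(),
    "gaussian": lambda rng: rng.gauss(0.0, 1.0),
    "f": lambda rng: f_dist(rng, F_DF1, F_DF2),
    "chi-squared": lambda rng: chi2(rng, CHI2_DF),
    "bernoulli": lambda rng: 1.0 if rng.random() < BERNOULLI_P else 0.0,
}

def sample_means(draw, n, m, rng):
    """Draw m samples of size n and return the m sample means."""
    return [statistics.fmean([draw(rng) for _ in range(n)]) for _ in range(m)]

rng = random.Random(42)
M = 500  # number of samplings, as in the post
spread = {}
for name, draw in DISTRIBUTIONS.items():
    for N in (1, 30, 300):
        means = sample_means(draw, N, M, rng)
        center = statistics.fmean(means)
        spread[(name, N)] = statistics.stdev(means)
        print(f"{name:12s} N={N:4d}  mean={center:7.4f}  sd={spread[(name, N)]:.4f}")
```

For every distribution the standard deviation of the sample means shrinks roughly like 1/√N, which is the σ²/n variance in Formula 1 at work. Plotting a histogram of `means` for each pair would give a picture of the same kind as the figures in this post.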

Central Limit Theorem experiment

Figure 3. N=1 (M=500)

In Figure 3, with N=1, each sample mean is just a single observation, so each histogram simply reproduces the shape of its population. The uniform, F, and Bernoulli distributions differ greatly from the Gaussian shape, while the chi-squared distribution already looks fairly similar to it. (The Gaussian case is, of course, normal from the start.)

Figure 4. N=30 (M=500)

In Figure 4, with N=30, the sample means drawn from the uniform, F, and Bernoulli distributions are already taking on a shape much closer to the Gaussian distribution.

📌 The meaning of sample size 30 in the central limit theorem
As a rule of thumb, once the sample size reaches about 30, the sample mean tends to be approximately normally distributed for most population shapes. Strongly skewed or heavy-tailed populations can need a larger sample.

Figure 5. N=300 (M=500)

Figure 6. N=3000 (M=500)

Figures 5 and 6 show that as the sample size grows, the distribution of the sample mean fits the normal distribution ever more closely.
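One way to put a number on "fits the normal distribution more closely" is the skewness of the sample means, which is 0 for a normal distribution. The sketch below is an illustration under assumptions made here, not part of the original experiment: the population is chi-squared with df=2 (skewness 2), M is raised to 2000 to reduce noise, and `skewness` is a small helper written for this example. In theory the skewness of the mean of N independent draws is the population skewness divided by √N.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: mean of the cubed standardized deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])

def sample_means(n, m, rng):
    # Assumed population: chi-squared with df=2, drawn as Gamma(shape=1, scale=2).
    return [
        statistics.fmean([rng.gammavariate(1.0, 2.0) for _ in range(n)])
        for _ in range(m)
    ]

rng = random.Random(1)
M = 2000  # more samplings than the post's 500, to steady the estimate
skew_by_n = {}
for N in (1, 30, 300):
    skew_by_n[N] = skewness(sample_means(N, M, rng))
    theory = 2.0 / N ** 0.5  # population skewness 2, divided by sqrt(N)
    print(f"N={N:4d}  skewness of sample means = {skew_by_n[N]:6.3f}  (theory {theory:.3f})")
```

The estimated skewness should fall toward 0 as N grows, which matches the histograms becoming more symmetric.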

The Python file for the experiment is uploaded on GitHub. (Feel free to run it if you need it.)

So, where and how can the central limit theorem be used?

(This is probably one of the most curious yet important parts.)

Figure 7. The uses of the central limit theorem

One of the central goals of statistics is to estimate the mean, the representative value of a group. Whether we use point estimation or interval estimation, understanding the mean is at the core. This is where the central limit theorem matters. In practice, the shape of the distribution we are investigating is usually unknown. The central limit theorem nevertheless tells us that the distribution of the sample mean approaches a normal distribution as the sample size N grows, which is very encouraging for analysis.

If N is large enough, the sample mean is approximately normally distributed, and that makes parameter estimation and confidence intervals possible. When the population standard deviation σ is unknown, however, we substitute the sample standard deviation for it and use the t-distribution instead of the normal distribution.
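Why the t-distribution is needed when σ is unknown can be seen from a small coverage simulation. This is a sketch under assumptions made here: the population is Normal(10, 2), the helper `coverage` and the seed are illustrative, and the two t critical values are the usual two-sided 95% table values for df = n - 1. Each interval is x̄ ± c·s/√n, where s is the sample standard deviation.

```python
import math
import random
import statistics

Z_CRIT = statistics.NormalDist().inv_cdf(0.975)  # about 1.96
# Two-sided 95% critical values of t with df = n - 1, from a standard t table.
T_CRIT = {5: 2.776, 30: 2.045}

def coverage(n, crit, trials, rng, mu=10.0, sigma=2.0):
    """Fraction of intervals x_bar +/- crit * s / sqrt(n) that contain mu."""
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = statistics.fmean(sample)
        half_width = crit * statistics.stdev(sample) / math.sqrt(n)
        if abs(x_bar - mu) <= half_width:
            hits += 1
    return hits / trials

rng = random.Random(7)
trials = 4000
report = {}
for n in (5, 30):
    z_cov = coverage(n, Z_CRIT, trials, rng)
    t_cov = coverage(n, T_CRIT[n], trials, rng)
    report[n] = (z_cov, t_cov)
    print(f"n={n:2d}  z-based coverage={z_cov:.3f}  t-based coverage={t_cov:.3f}")
```

With n=5 the z-based interval should cover the true mean noticeably less often than 95%, while the t-based interval stays close to 95%. With n=30 the two are nearly the same.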

Figure 8. The relationship between the t-distribution and the z (normal) distribution

The t-distribution describes the standardized sample mean when the standard error is computed from the sample standard deviation, so it can be used with information from the sample alone. As Figure 8 shows, the t-distribution approaches the Z-distribution (standard normal distribution) as its degrees of freedom (df) increase. At df=30 (that is, N=31 for a one-sample mean), for example, the two are almost identical. For this reason, in a test of the mean with n>30, either the t-distribution or the normal distribution can be used. With fewer than 30 observations, the t-test relies on the population itself being roughly normal; if that assumption is doubtful, a nonparametric test is the safer choice.
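The convergence of the t-distribution to the Z-distribution can also be checked directly from the two density functions. The sketch below uses only the `math` module; the grid from -5 to 5 and the list of df values are arbitrary choices made for this illustration.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (
        math.lgamma((df + 1) / 2.0)
        - math.lgamma(df / 2.0)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2.0 * math.log1p(x * x / df))

def z_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

grid = [i / 100.0 for i in range(-500, 501)]  # x from -5 to 5 in steps of 0.01
gaps = {}
for df in (1, 5, 30, 100):
    # Largest vertical distance between the two density curves on the grid.
    gaps[df] = max(abs(t_pdf(x, df) - z_pdf(x)) for x in grid)
    print(f"df={df:3d}  max |t_pdf - z_pdf| = {gaps[df]:.4f}")
```

The gap shrinks steadily as df increases, and by df=30 the two curves are hard to tell apart by eye.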

To summarize: whichever distribution you sample from, the distribution of the sample mean approaches a normal distribution once the sample size is about 30 or more. (The number of samplings M only affects how clearly a histogram shows this.) Because the population standard deviation is usually unknown, we do interval estimation with confidence intervals based on the t-distribution.

Conclusion

When N is 30 or more, the central limit theorem makes the distribution of the sample mean approximately normal, but because the population standard deviation σ is usually unknown we cannot use the Z-distribution directly, so we use the t-distribution. With N of 30 or more the t-distribution is itself close to the normal distribution, so this causes no real problem. In short, the normality assumption matters when the sample is small, but as the sample grows the central limit theorem takes over, and the choice between the t-distribution and the normal distribution makes little difference.

References

Basic (introductory) statistics course — Jayu Academy (textbook)

📦 Migrated from the Tistory blog I used to run. Original: taehyuklee.tistory.com/25
