What Are Degrees of Freedom (DOF) & Population-Variance Estimation and Degrees of Freedom (n-1)

2024.09.30 ·#statistics #degrees-of-freedom #variance #population-variance #estimation

A side note

While recently building a data-analysis platform, I looked back on and organized the data analysis I did as an undergraduate intern in 2019–2020 and during grad school, and this post is part of that. It also organizes the concept of degrees of freedom, which I hadn't understood well before.

Goal of this post

In this post I want to talk about population-variance estimation and degrees of freedom in parametric statistics.

After reading this, the points you should understand can be organized as follows:

The concept of degrees of freedom
Why we divide by n-1 in the population-variance estimator from a sample
- Explanation from the underestimation perspective: why we divide by n-1, from the population distribution
- Explanation from the degrees-of-freedom perspective (understanding the essential meaning of variance): why the variance estimator from a sample divides by the degrees of freedom, and the essence behind it

Main body

Formula 1 and Formula 2 below give, respectively, the sample variance (the estimate of the population variance computed from a sample) and the population variance.

$$ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2 $$ Formula 1. Sample variance estimate

$$ \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(X_i - \mu\right)^2 $$ Formula 2. Population variance

The concept around variance or standard deviation can be interpreted intuitively as follows.

0) To express the degree of uncertainty — to quantify the fact that, each time you run the same trial, a different result appears due to uncertainty.
1) The average distance each data point lies from the mean — to express uncertainty, it expresses, as a distance, how far the data lie from the mean that represents the group.
2) It can be called the precision or explanatory power regarding the sample mean. (implication)

From here, I'll continue the story by raising the questions I had on my own.

1. Why divide by n-1

Deviation and variance are essentially distance concepts, yet the sample variance, our estimator of the population variance, divides not by the number of data points (n) but by that number minus one (n-1). Why on earth do we divide by n-1?

Answer 1) If you simply wanted to express how far the data inside the sample lie from the mean, dividing by n would be right. But that formula is an estimator that tries to estimate the population from the sample. In other words, the sample's own statistic isn't what matters — what matters is how well it estimates the population parameter.

2. What's the problem if we divide by n?

What happens if we divide by n instead of n-1 in the variance estimator?

Answer 2) The answer is simple: the estimate becomes biased, and specifically it underestimates. That is, on average it comes out smaller than the true population variance.
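To see the underestimation numerically, here is a minimal simulation sketch. It assumes a normal population with standard deviation 2 (so the true variance is 4) and samples of size 5. These values, the seed, and the function name `average_variance_estimates` are illustrative choices, not something from the original analysis.

```python
import random

def average_variance_estimates(n=5, sigma=2.0, trials=20000, seed=0):
    """Average the divide-by-n and divide-by-(n-1) estimates over many samples."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total_div_n = 0.0
    total_div_n_minus_1 = 0.0
    for _ in range(trials):
        # Draw one sample of size n from the assumed normal population.
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(sample) / n
        # Sum of squared deviations from the sample mean.
        ss = sum((x - mean) ** 2 for x in sample)
        total_div_n += ss / n
        total_div_n_minus_1 += ss / (n - 1)
    return total_div_n / trials, total_div_n_minus_1 / trials

avg_div_n, avg_div_n_minus_1 = average_variance_estimates()
print(f"true variance       : {2.0 ** 2:.3f}")
print(f"average of SS/n     : {avg_div_n:.3f}")
print(f"average of SS/(n-1) : {avg_div_n_minus_1:.3f}")
```

Under these assumptions, the divide-by-n average should settle near (n-1)/n times the true variance, while the divide-by-(n-1) average should settle near the true variance itself.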

3. Why does underestimation (biased estimation) happen?

Why does dividing by the sample count as-is cause underestimation (biased estimation)?

Figure 1. Sampling from the population (denser intervals are reflected more)

Answer 3) * This answer includes, to some degree, my own opinion and thinking.

Figure 1 above shows the sample distribution obtained by sampling from the population. The interval marked "Interval 1" in the population has a high probability of being sampled, so the sample reflects the denser intervals of the population more heavily. Compared to the population's overall distribution, the sampled data can therefore be more tightly packed. On top of that, the sample mean is computed from those same data points, so the data lie closer to the sample mean than they do to the population mean. This hints that computing the variance by simply dividing the squared deviations by the number of samples leads to underestimation (biased estimation).

Therefore, when estimating the population variance using the sample variance, a correction is needed. For this, instead of simply dividing by the sample size n, we divide by n−1 to correct the variance.
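One way to check this intuition numerically is to compare the squared deviations measured from the sample mean with those measured from the true population mean. The sketch below assumes a hypothetical normal population (mean 10, standard deviation 3) and a sample of size 5. The helper name `sum_sq_about` is made up for illustration.

```python
import random

def sum_sq_about(values, center):
    """Sum of squared deviations of the values from a given center."""
    return sum((x - center) ** 2 for x in values)

rng = random.Random(1)
mu, sigma, n = 10.0, 3.0, 5  # hypothetical population parameters and sample size

sample = [rng.gauss(mu, sigma) for _ in range(n)]
x_bar = sum(sample) / n

ss_sample_mean = sum_sq_about(sample, x_bar)  # what we can compute in practice
ss_true_mean = sum_sq_about(sample, mu)       # needs the unknown population mean
gap = n * (x_bar - mu) ** 2                   # amount by which the first falls short

print(f"SS about sample mean: {ss_sample_mean:.4f}")
print(f"SS about true mean  : {ss_true_mean:.4f}")
print(f"n * (x_bar - mu)^2  : {gap:.4f}")
```

The sum of squares about the sample mean is never larger than the one about the true mean, and the shortfall is n times the squared distance between the two means. That shortfall is what the n−1 correction compensates for on average.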

What it means to divide by n-1

In fact, the larger n grows, the closer the sample gets to the population, so the choice between n and n−1 matters less and less: as n grows, n−1 approaches n and the effect of the 1 nearly disappears. When n is small, however, subtracting 1 changes the divisor by a relatively large amount, so the correction has its biggest influence exactly where underestimation is worst. In that sense, n−1 seems appropriate. However, another question, like #4 below, can arise here.
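As a rough illustration of how quickly the correction fades, the sketch below prints the ratio n/(n-1), which is the factor by which the divide-by-(n-1) estimate exceeds the divide-by-n estimate. The function name `correction_factor` and the chosen sample sizes are illustrative assumptions.

```python
def correction_factor(n):
    """Ratio n / (n - 1): how much larger SS/(n-1) is than SS/n."""
    if n < 2:
        raise ValueError("need at least two data points")
    return n / (n - 1)

for n in (2, 5, 10, 100, 1000):
    print(f"n = {n:>4}: n/(n-1) = {correction_factor(n):.4f}")
```

For n = 2 the corrected estimate is twice the uncorrected one, while for n = 1000 the two differ by about 0.1%.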

4. Has it been proven mathematically?

I get that underestimation happens so we correct with n-1, but I still don't understand why we have to divide by n-1 specifically. Has this part been proven?

Answer 4) The mathematical proof for this is as follows.

$$ E\!\left[\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2\right] = (n-1)\,\sigma^2 \;\;\Longrightarrow\;\; E\!\left[\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2\right] = \sigma^2 $$ Figure 2. Showing that the correction is n-1: dividing by n-1 is exactly what makes the expectation equal to the population variance σ² (an unbiased estimator).

The formula above states the result. So the idea pondered in the 3rd question is not only an intuitive understanding; it is also backed mathematically.
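For readers who want the intermediate steps, here is one standard way to reach the first equality. It is a sketch that assumes the X_i are independent draws from a population with mean μ and variance σ², an assumption the post does not state explicitly. First, rewrite the deviations from the sample mean in terms of deviations from μ:

$$ \sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2 = \sum_{i=1}^{n}\left(X_i-\mu\right)^2 - n\left(\bar{X}-\mu\right)^2 $$

Then take the expectation of each term on the right-hand side:

$$ E\!\left[\sum_{i=1}^{n}\left(X_i-\mu\right)^2\right] = n\sigma^2, \qquad E\!\left[n\left(\bar{X}-\mu\right)^2\right] = n\,\mathrm{Var}\!\left(\bar{X}\right) = n\cdot\frac{\sigma^2}{n} = \sigma^2 $$

Subtracting the two gives:

$$ E\!\left[\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2\right] = n\sigma^2 - \sigma^2 = (n-1)\,\sigma^2 $$

The single σ² that is lost is the variability of the sample mean itself around μ.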

5. They call n-1 the degrees of freedom — what is the concept of degrees of freedom?

You've probably often seen the n-1 in the divisor called the degrees of freedom. Let's first look at what degrees of freedom are.

📌 What are degrees of freedom in statistics? It's explained as the number of data points that can actually vary independently when we compute some estimate.

Let's think about what the above means.

As in Figure 3 above, suppose we drew 5 samples from the population. And suppose the sample mean is 20.

What number should X1 be drawn as?

It can be drawn randomly from the population distribution. It's not fixed to a particular value and exists independently as a random variable. Put differently, random sampling from the population distribution is possible. (Any number is fine.)

What number should X2 be drawn as?

X2, like X1, is freely sampled according to the population distribution.

Likewise, X3 and X4 are also drawn independently from the population. These data can vary freely.

But what number should X5 be drawn as?

The last data point, X5, is not randomly sampled from the population; it is automatically fixed according to the values of X1 through X4. This is because of the constraint that the sample mean is 20. Therefore the last data point, X5, cannot be determined freely and is automatically fixed by the remaining values.

For example, if X1 = 25, X2 = 10, X3 = 15, and X4 = 40 came out, then X1 through X4 sum to 90. For the sample mean to be 20, the five values must sum to 100, so X5 is fixed at 10. Thus the last data point can no longer be determined randomly and becomes a fixed value.

Generalization

When you draw n data points, n-1 of them are drawn freely and randomly, but the last, n-th, data point is determined by the constraint of the sample mean — by the remaining n-1 values. Here, the number of data points that can vary independently is n-1, and this is called the degrees of freedom.
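The constraint can be written as a one-line calculation. The sketch below uses a hypothetical helper, `last_value`, which returns the value the final data point is forced to take once the other n-1 values and the sample mean are given.

```python
def last_value(free_values, sample_mean):
    """Value the final data point must take so that the mean equals sample_mean."""
    n = len(free_values) + 1          # the free values plus the one fixed value
    return n * sample_mean - sum(free_values)

free = [25, 10, 15, 40]               # X1..X4, drawn freely
x5 = last_value(free, 20)             # X5 is forced by the mean constraint
print(x5)                             # 10
print(sum(free + [x5]) / 5)           # 20.0
```

Whatever values are chosen for the first n-1 points, the function returns exactly one admissible value for the last one, which is why only n-1 points count as free.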

(Aside) Degrees of freedom in mechanical engineering?

Degrees of freedom isn't a concept used only in statistics. In kinematics, for example, if an object can translate along the three axes x, y, and z, we say its DOF (Degrees Of Freedom) is 3. If translation along x is restricted, it can move only along y and z, so the DOF is 2. In both fields, the essential concept of degrees of freedom is the number of elements that can change.

6. When computing variance, what on earth does degrees of freedom mean? Why do we divide by the degrees of freedom? (essence)

Then, from the degrees-of-freedom perspective, what relationship does it have with the population-variance estimator?

Answer 5) Above, we explained that the justification for dividing by n-1 when estimating from a sample is to correct underestimation (biased estimation). This time, I want to explain variance from the degrees-of-freedom perspective.

The flow of questions (Story)

1. Here we can ask a fundamental question. Why must the sample mean be fixed? As you sample, the sample mean could be 25 rather than 20, so the reason for fixing it can be hard to understand.

In parametric statistics, since we estimate the parameter based on the data we sampled, we always go by the sample data. If you sample more data, you just include the new samples, recompute the sample mean, and re-estimate the sample variance.

It's true the sample mean can differ each time you sample, but understand it as a situation where we estimate the parameter with sampling already finished. That's why the sample mean is fixed.

2. You said sampling is already finished and the sample mean is fixed, yet in variance estimation the n-1 data points have variability — isn't that contradictory?

Let's recall that the statistic called variance, before it expresses how far the data lie from the mean, is a basic statistic that came about to quantify uncertainty.

Before reading the main point, I recommend reading the post on uncertainty below.

Uncertainty, Variability, and Variance (feat. the Nature of Probability)

Connecting the nature of probability through three concepts — uncertainty, variability, and variance — including the intuition for why variance is defined as 'distance from the mean.'

taystudios.com/blog

⚠️ A point of confusion

If you focus on the "distance from the mean" concept, you get trapped in the question "obviously I computed distances for n points, so shouldn't I divide by n?" and can't get out. Let me say it again: the intuition that variance is the average of distances is a consequence; uncertainty is the prior concept. In trying to assess uncertainty, we ended up computing the difference between the mean and each data point, and as a result the concept of how far each data point lies from the mean came about.

A point to think about

Figure 4. n-1 points of variability

Please think based on the figure above. If you consider the sample mean already fixed, then only n-1 points have variability. Put another way, only n-1 points can be randomly drawn from the population; the remaining one is fixed by those n-1, so it doesn't follow the population distribution. Once the n-1 are determined, no randomness is left in the last one. That's why the representative value called variance arises when you divide the sum of squared deviations (data − sample mean), which expresses the variability over all the data, by the n-1 data points that carry variability. This is a more essential way to approach variance than the average-distance view.
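A small numerical check of this view, reusing the five values from the earlier example (25, 10, 15, 40, 10): the deviations from the sample mean always sum to zero, so only n-1 of them are free, and dividing the sum of squares by n-1 matches Python's `statistics.variance`. The variable names are illustrative.

```python
import statistics

sample = [25.0, 10.0, 15.0, 40.0, 10.0]
n = len(sample)
x_bar = sum(sample) / n                      # 20.0

deviations = [x - x_bar for x in sample]     # [5.0, -10.0, -5.0, 20.0, -10.0]
print(sum(deviations))                       # 0.0: the constraint that removes one degree of freedom

ss = sum(d ** 2 for d in deviations)         # 650.0
var_n_minus_1 = ss / (n - 1)                 # divide by the degrees of freedom
var_n = ss / n                               # divide by the number of points

print(var_n_minus_1, statistics.variance(sample))   # 162.5 162.5
print(var_n, statistics.pvariance(sample))          # 130.0 130.0
```

Here `statistics.variance` is the standard-library sample variance (divisor n-1) and `statistics.pvariance` is the population variance (divisor n), so the two hand-computed values line up with the two definitions.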

3. Since the population also has degrees of freedom up to N-1, shouldn't variance be divided by N-1, not N?

For the population, the parameter is computed from values that are already exactly known, so nothing is being estimated and no probability is involved (it's computation, not estimation). In a sample, probability still exists because the sample mean stands in for the unknown population mean, so degrees of freedom exist.

References

12 Math — YouTube — https://www.youtube.com/watch?v=TckEM-6tdrc
Introduction to Statistics — Jayu Academy (textbook)
School course materials — Introduction to Data Analysis
My own thoughts and notes (feat. GPT)

📦 Migrated from the Tistory blog I used to run. Original: taehyuklee.tistory.com/14
