Uncertainty, Variability, and Variance (feat. the Nature of Probability)

2024.09.30 ·#statistics #variance #uncertainty #variability #probability

The goal of this post is to explain the nature of probability through the concepts of uncertainty, variability, and variance. I thought hard about these concepts while taking a university statistics course, and this post distills the notes I organized back then.

1. What is uncertainty?

Uncertainty is the underlying cause of getting a different result each time I perform the exact same action, because some variable we can't control is involved.

(You could see it as the fundamental cause that allows probability to exist.)

For example, say I throw a paper airplane 10 times with the same force in the same direction. Will it land at exactly the same spot all 10 times? No. Variables we can't control, such as wind, come into play, so the result differs every time.
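The paper airplane example can be simulated as a rough sketch. Everything here is an assumption made for illustration: the `throw_airplane` function, the rule that distance is twice the force, and modelling wind as Gaussian noise are not taken from any real physics. The point is only that the controlled part stays fixed while the uncontrolled part changes the result.

```python
import random

def throw_airplane(force, rng):
    """Landing distance (m) for one throw in a toy, hypothetical model."""
    controlled = 2.0 * force        # identical on every throw with the same force
    wind = rng.gauss(0.0, 0.5)      # uncontrollable part, differs on every throw
    return controlled + wind

rng = random.Random(0)              # seeded only so the run can be repeated
landings = [throw_airplane(5.0, rng) for _ in range(10)]

for i, distance in enumerate(landings, start=1):
    print(f"throw {i:2d}: {distance:.2f} m")
```

The force never changes, yet the ten printed distances are all different, which is exactly what uncertainty means here.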

2. What is variability?

This is where the concept of variability comes in. Variability is the property that the result can differ each time I perform the action. The term is generally applied to the individual data points of a sample (each point such as X1, X2).

Figure 1. A figure illustrating variability

Let me explain with the figure above. Because every point is sampled at random from the population (uncertainty), X1 could be 3, could be 5, could be 10, or in an extreme case even 100000. This property, that the value can differ from one draw to the next, is called variability.
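As a small sketch of this idea, the code below draws a sample several times from the same population and prints X1 each time. The population values and the `draw_sample` helper are made up for illustration; they are not the data behind Figure 1.

```python
import random

# Hypothetical population, including one extreme value
population = [3, 5, 10, 7, 4, 6, 8, 100000]

def draw_sample(population, n, rng):
    """Draw n points X1..Xn at random, with replacement, from the population."""
    return [rng.choice(population) for _ in range(n)]

rng = random.Random(42)             # seeded only so the run can be repeated
for trial in range(1, 4):
    sample = draw_sample(population, 5, rng)
    print(f"trial {trial}: X1 = {sample[0]}, sample = {sample}")
```

Nothing about the procedure changes between trials, but X1 is free to take a different value each time.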

3. What is variance?

How can the uncertainty or variability above be expressed mathematically? One method is to measure how far each data point lies from the expected value (the value that represents the group), square those distances, and average them. This is variance: the average squared distance from the expected value.

Why was variance defined by the following formula?

$$ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2 \qquad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(X_i - \mu\right)^2 $$

Formula 1. Sample variance (left), population variance (right)
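To make the two formulas concrete, here is a minimal sketch that computes both by hand on a small made-up data set. The function names `sample_variance` and `population_variance` are my own; Python's standard `statistics.variance` and `statistics.pvariance` compute the same quantities and are used only as a cross-check.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]     # made-up values; their mean is 5

def sample_variance(xs):
    """s^2: squared deviations from the sample mean, divided by n - 1."""
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)

def population_variance(xs):
    """sigma^2: squared deviations from the mean, divided by N."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

print(f"sample variance     = {sample_variance(data):.4f}")      # 4.5714
print(f"population variance = {population_variance(data):.4f}")  # 4.0000
print(f"statistics.variance = {statistics.variance(data):.4f}")  # 4.5714
```

The squared deviations sum to 32 in both cases; the only difference is the divisor, 7 for the sample variance and 8 for the population variance.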

Figure 2. Why variance is the statistic that represents uncertainty

Think of the mean as the value that represents the group, the single value used to explain all the others. The farther each data point lies from the mean, however, the weaker the mean's explanatory power becomes. In other words, uncertainty increases.

Looking at Figure 2 above: in case (a), the mean lies close to X1 through X5, so it explains most of the values well. In case (b), the distances from the mean are large, so it is hard to explain each data point with the mean alone.

In other words, in case (a) it is fine to use the mean as the sample's representative value, since each data point's uncertainty (the degree to which it deviates from the mean) is low. In case (b), some data points may sit near the mean, but many have high uncertainty, so using the mean as the representative value may be a stretch.

So uncertainty can be quantified as the degree of deviation from the mean, that is, the distance from the mean. The higher the variance, the more the results deviate when you perform the same action, so you never know when a stray data point far from the mean might pop out.
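The contrast between cases (a) and (b) can be sketched with two made-up samples that share the same mean. These numbers are hypothetical and are not the values plotted in Figure 2; they are chosen only so that the means match while the spreads differ.

```python
import statistics

case_a = [9, 10, 10, 11, 10]        # points clustered around the mean
case_b = [1, 25, 4, 18, 2]          # points scattered far from the mean

for name, xs in (("a", case_a), ("b", case_b)):
    mean = statistics.mean(xs)
    var = statistics.variance(xs)   # sample variance, divides by n - 1
    print(f"case ({name}): mean = {mean}, sample variance = {var}")
```

Both samples have a mean of 10, but the sample variance is 0.5 for case (a) and 117.5 for case (b). The mean alone cannot tell the two apart; the variance is what shows how much trust the mean deserves as a representative value.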

Summary

The above can be summarized as follows.

Uncertainty — a variable / cause we can't control

Variability — the property that the result differs each time you perform the action, due to uncontrollable variables (i.e., due to uncertainty)

Variance — the statistic that mathematically quantifies uncertainty / variability


📦 Migrated from the Tistory blog I used to run. Original: taehyuklee.tistory.com/13