Uncertainty, Variability & Variance — Understanding the Nature of Probability

The goal of this post is to explain the nature of probability through the concepts of uncertainty, variability, and variance. It organizes ideas I thought hard about in a university statistics course.

1. What is uncertainty?

Uncertainty is the root cause that makes the same action produce different results each time: variables we can't control. (You could say it is the reason probability exists at all.)

For example, if you throw a paper airplane 10 times with exactly the same force and direction, will it land in the same spot all 10 times? No. Uncontrollable variables such as wind come into play, so the result keeps changing.
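The paper airplane example can be sketched as a small simulation. This is only an illustration under made-up assumptions: the function `throw_airplane`, the linear distance model, and the Gaussian wind term are all hypothetical, chosen just to show a fixed input combined with an uncontrolled one.

```python
import random

def throw_airplane(force, rng):
    """One throw: a controlled part (force) plus an uncontrolled wind term."""
    wind = rng.gauss(0.0, 0.5)   # hypothetical wind effect, in meters
    return 2.0 * force + wind    # hypothetical model: distance grows with force

rng = random.Random(0)           # seeded only so the run can be repeated

# Same force on every throw, yet the landing distance differs each time.
landings = [throw_airplane(force=5.0, rng=rng) for _ in range(10)]
for i, distance in enumerate(landings, start=1):
    print(f"throw {i:2d}: {distance:.2f} m")
```

The force never changes between throws; only the wind term does, and that alone is enough to make every landing distance different.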

2. What is variability?

This is where variability comes in. It's the property that the result can differ each time you perform the action, and it applies to each data point of a sample ($X_1, X_2, \dots$).

Figure 1. A figure to explain variability (random sampling of data points)

When sampling each data point, since every point is randomly sampled from the population (uncertainty), $X_1$ could be 3, or 5, or 10, or even an extreme 100000. We call this property variability.
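As a rough sketch of the sampling described above, the snippet below draws $X_1$ several times from a small population. The population values are invented for illustration (they reuse the 3, 5, 10, and 100000 mentioned above), and `draws_of_x1` is just a name chosen here.

```python
import random

rng = random.Random(42)  # seeded only so the run can be repeated

# Hypothetical population: mostly small values plus one extreme value.
population = [3, 5, 10, 7, 4, 6, 100000]

# Draw the first data point X_1 several times, as if repeating the sampling.
draws_of_x1 = [rng.choice(population) for _ in range(8)]
print(draws_of_x1)
```

Each draw is free to land on any value in the population, including the extreme one, which is what variability means for a single data point.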

3. What is variance?

To express uncertainty and variability mathematically, you measure how far each data point lies from the mean (the expected value that represents the group), square that distance, and average the squares. The result is variance, which you can read as the "average squared distance" from the mean.

Why is variance defined with this formula?

$$s^2 = \dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \qquad\qquad \sigma^2 = \dfrac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

Sample variance ($s^2$, left) · population variance ($\sigma^2$, right)
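To make the two formulas concrete, here is a minimal sketch that computes both by hand and checks them against Python's standard `statistics` module. The data values are made up for illustration. Treating the same list once as a sample and once as a whole population is only a device to show the $n-1$ versus $N$ divisor.

```python
import math
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up values
n = len(data)
mean = sum(data) / n  # 5.0

# Squared distance of each data point from the mean.
squared_devs = [(x - mean) ** 2 for x in data]

sample_var = sum(squared_devs) / (n - 1)  # s^2: divide by n - 1
population_var = sum(squared_devs) / n    # sigma^2: divide by N

print(f"sample variance     = {sample_var:.3f}")      # 4.571
print(f"population variance = {population_var:.3f}")  # 4.000

# The standard library gives the same numbers.
assert math.isclose(sample_var, statistics.variance(data))
assert math.isclose(population_var, statistics.pvariance(data))
```

The only difference between the two results is the divisor, so the sample variance is always slightly larger than the population variance computed from the same numbers.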

Figure 2. Why variance is the statistic that represents uncertainty (comparing the explanatory power of the mean)

The mean is the value representing the group. But the farther each data point is from the mean, the less the mean explains — i.e., uncertainty grows.

In Figure 2, the data points $X_1$–$X_5$ in (a) lie close to the mean, so the mean describes them well; in (b) the data points lie far from the mean, so the mean alone describes them poorly. (a) has low uncertainty, so using the mean as a representative value is fine; (b) has high uncertainty, so using the mean as a representative value is a stretch.

In other words, uncertainty can be quantified by how far the data deviate from the mean, that is, by their distance from the mean. The higher the variance, the more the results of the same action spread out, and the higher the chance of getting an unusual data point far from the mean.

Summary

  • Uncertainty — variables/causes we can't control
  • Variability — the property that results differ each time due to uncertainty
  • Variance — the statistic that mathematically quantifies uncertainty/variability


📦 Migrated from my own Korean blog (my own writing). Original: taehyuklee.tistory.com/13
