Testing Means: the t-test [One, Two (Student's & Welch's), Paired Samples t-test]

2024.11.03 ·#statistics #t-test #mean-test #homoscedasticity #normality

The rationale for testing means (why do we need it? feat. uncertainty/variability)
Definition and understanding of testing means
Assumptions of testing means (normality, equal variance: Two Samples t-test)
Types of mean tests and actual worked examples

If you want the part where this is implemented directly in Python with the SciPy library, I've linked the post below.

Implementing the t-test (t test) [One, Two, Paired Samples t-test: Python feat. SciPy & Statsmodels]

Turning the t-test theory from the previous post into SciPy code — ttest_1samp, ttest_ind (equal_var toggle), ttest_rel, with axis options. Verifying the hand-computed t and p match SciPy.

taystudios.com/blog

Introduction (an overview that began from a question)

When you compare the means of two groups, does a difference in the numbers really mean they're different? Couldn't the sample means differ simply due to variability (uncertainty)?

To understand testing means, and hypothesis testing more broadly, you first need to understand what that second question is really asking.

Let me explain the above simply.

Figure 1. Two independent groups (same mean and variance)

Suppose we want to compare the effects of drug 1 and drug 2, so we administer each to a group of people with similar physical characteristics. Assume that, in truth, the two drugs have the same mean effect and the same variance. We never see that truth directly; we only draw a sample, administer the drug, and observe the result. Suppose the sample mean of the group given drug 1 lands at position ① in the figure, and the sample mean of the group given drug 2 at position ②.

In this case, can we conclude that the two drugs' effects differ just because their sample means differ? No. Here the two drugs' effects are actually the same, and it is sampling variability alone that made the two sample means come out different.
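The situation in Figure 1 can be mimicked with a small simulation. This is only an illustrative sketch: the population mean (120), standard deviation (5), sample size (10), and random seed are assumed values, not numbers taken from the figure.

```python
import numpy as np

# Assumed values for illustration only (not taken from Figure 1).
rng = np.random.default_rng(seed=0)
true_mean, true_sd, n = 120.0, 5.0, 10

# Both "drugs" share the same true mean and the same variance.
drug1 = rng.normal(true_mean, true_sd, size=n)
drug2 = rng.normal(true_mean, true_sd, size=n)

mean1, mean2 = drug1.mean(), drug2.mean()
print(f"sample mean, drug 1: {mean1:.2f}")
print(f"sample mean, drug 2: {mean2:.2f}")
print(f"difference:          {mean1 - mean2:.2f}")
```

Even though both samples come from the same population, the two printed sample means will not coincide. That gap is the variability the question above is asking about.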

Figure 2. Significance level 5% : adding a drug-3 example

Therefore, the core of a mean test is judging whether an observed mean falls within the ordinary range (the central 95%) that sampling variability alone would produce, or in the extreme region (the outer 5%) that variability alone rarely reaches. The procedure for doing so, computing a test statistic and obtaining a p-value from it, was already covered in the hypothesis-testing post.

In Figure 2 I additionally drew the known distribution of X3. Reading the figure intuitively, the sample means of X1 and X2 can be regarded as the same, while the sample mean of X3 is different.

In practice, though, we cannot lay out every group's distribution like this and judge by eye whether two means are the same or significantly different. Comparing means through a test statistic instead is exactly what we call "testing means."

Roughly speaking, the idea is to collapse the two groups into a single test statistic (a random variable) and judge that one number. (I hope this gives you an intuitive feel.)
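To make "one test statistic per comparison" concrete, here is a hedged sketch that repeats the same-population experiment many times and counts how often the statistic lands in the extreme 5% region. The population parameters, sample size, number of repetitions, and seed are all assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
n_experiments, n = 10_000, 10  # assumed values for illustration

# Every experiment draws both groups from the SAME population,
# so the null hypothesis is true by construction.
group1 = rng.normal(120.0, 5.0, size=(n_experiments, n))
group2 = rng.normal(120.0, 5.0, size=(n_experiments, n))

# One test statistic (and one p-value) per experiment, computed row by row.
result = stats.ttest_ind(group1, group2, axis=1)

# Share of experiments whose statistic lands in the extreme 5% region.
extreme_share = np.mean(result.pvalue < 0.05)
print(f"share of experiments with p < 0.05: {extreme_share:.3f}")
```

Because the two groups never truly differ here, the printed share should land close to 5%: that is how often variability alone pushes the statistic into the extreme region.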

💡 As an aside: (based on Figure 2) what if, when you draw a sample from distribution 3, it happens to fall within the 95% region of distributions 1 and 2? Conversely, what if groups 1 and 2 are actually the same, but by chance one sample is found in the 5% region? If you have such questions — congratulations, you're ready to understand Type I and Type II errors. But that's not the topic of this post, so I'll move on for now.

I previously explained hypothesis testing using a One-Sample t-test (mean test) as an example, so I recommend referring to that content first.

Understanding Hypothesis Testing: Test Statistic, Null Hypothesis, Alternative Hypothesis, p-value, Significance Level, Critical Value — Concepts (the essence)

From the intuition of hypothesis testing to the meaning of the test statistic, plus null/alternative hypotheses, p-value, significance level, and critical value — all connected through a single example.

taystudios.com/blog

Main body (about testing means)

1. What is testing means?

Testing means is a statistical technique for judging whether a group's mean differs from a reference value, or whether the means of two groups differ from each other, by more than sampling variability alone would explain.

2. Assumptions of testing means

In hypothesis testing, an assumption is a condition on which the theory behind the test was built. The test's conclusions are only trustworthy when that condition holds, so the technique you use must change depending on whether the assumptions are met.

2.1 Normality assumption

The t-test is performed under the assumption that the population follows a normal distribution. This is an especially important assumption when the sample size is small.

If the sample size is sufficiently large (n > 30 is a common rule of thumb), then by the central limit theorem the distribution of the sample mean approaches a normal distribution, so the t-test can be used even if the population itself is not exactly normal. Please refer to the central limit theorem post.

The Central Limit Theorem (CLT): Definition, Experiments & Uses

As the sample size grows, the sample mean converges to a normal distribution for any population — verified with five distributions, plus why the CLT matters for estimation and confidence intervals.

taystudios.com/blog

Question: By the central limit theorem (CLT), the sample mean approaches a normal distribution as the sample size grows. So why use the t-distribution rather than the normal (Z) distribution?

Answer: To use the Z-distribution you would need to know the population standard deviation, which we usually don't. Estimating it with the sample standard deviation adds uncertainty, and the t-distribution accounts for that. As the degrees of freedom grow, the t-distribution approaches the Z-distribution anyway, so little is lost by using it.

Reference figure: as the sample size (and with it the degrees of freedom) grows, the t-distribution approaches the Z-distribution (at df = 30 it is almost identical).
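The convergence can also be checked numerically. The sketch below compares the two-sided 5% critical value of the t-distribution with that of the Z-distribution; the particular list of degrees of freedom is my own choice for illustration.

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)  # two-sided 5% critical value of Z
print(f"Z:       {z_crit:.3f}")

t_crits = {}
for df in (2, 4, 10, 30, 100, 1000):
    # Same quantile, taken on a t-distribution with df degrees of freedom.
    t_crits[df] = stats.t.ppf(0.975, df)
    gap = t_crits[df] - z_crit
    print(f"df={df:>4}: t = {t_crits[df]:.3f}  (gap to Z: {gap:.3f})")
```

The gap shrinks steadily as df grows, which is why the choice between t and Z matters most for small samples.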

When normality does not hold

If the sample is small (or its size is borderline) and the data do not appear to follow a normal distribution, we use a nonparametric test.

Nonparametric test: a method for testing the difference between two groups that doesn't require the normality assumption. Examples include the Mann-Whitney U test (for independent samples) and the Wilcoxon signed-rank test (for paired samples).

For reference, I have only used nonparametric tests such as the Mann-Whitney U test and the Wilcoxon signed-rank test in practice; I haven't studied their theory yet and plan to do so going forward.
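For orientation only, this is roughly how the two tests named above can be called in SciPy. It is a minimal sketch that reuses the data from Table 2 and Table 3 further down in this post; it shows the calls, not the theory, and with samples this small rank-based tests cannot produce very small p-values.

```python
from scipy import stats

# Independent samples (data from Table 2): Mann-Whitney U test
dosed = [115, 118, 116]
not_dosed = [122, 124]
u_result = stats.mannwhitneyu(dosed, not_dosed, alternative="two-sided")
print(f"Mann-Whitney U: U = {u_result.statistic}, p = {u_result.pvalue:.3f}")

# Paired samples (data from Table 3): Wilcoxon signed-rank test
before = [130, 128, 135, 132, 129]
after = [120, 119, 125, 123, 121]
w_result = stats.wilcoxon(before, after)
print(f"Wilcoxon signed-rank: W = {w_result.statistic}, p = {w_result.pvalue:.3f}")
```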

2.2 Equal-variance (homoscedasticity) assumption

This assumption applies only to the independent-samples t-test (Two Sample t-test) among the types of mean tests below. The reason is that the independent-samples t-test compares two mutually independent groups, each with its own variance, and Student's version of the test pools the two sample variances into a single estimate. That pooling is only justified when the two population variances are equal.

A thought based on my understanding

The One Sample t-test compares a single sample against a reference value, so there is no second group whose variance could be compared. The Paired Sample t-test works on the differences between two measurements of the same subjects, which again form a single sample, so checking equal variance isn't necessary.

Question: What if you run Student's (pooled) Two Sample t-test even after confirming that the two independent groups' variances differ?

Answer: Suppose we ran an independent-samples t-test with two groups to compare a new blood-pressure drug's efficacy with an existing drug, and the data from the experimental group taking the new drug turned out to have considerably larger variance. If the t-test nevertheless concluded that the mean difference between the two groups is significant, this can cause several problems.

Statistically, the pooled variance no longer describes either group well, so the standard error of the mean difference is misestimated and the actual false-positive rate can drift away from the nominal significance level, particularly when the group sizes differ. Welch's t-test (section 3.2 b) exists for exactly this case.

Practically, the new drug may lower blood pressure on average, but large variance means the effect differs a lot from patient to patient. For some patients the new drug may have no effect at all, or may even raise blood pressure. That reduces the drug's reliability and makes it harder to predict the effect for any individual patient. In the end, the treatment may lack consistency, which can seriously affect patient care.
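The statistical side of this can be seen in a simulation. The sketch below is built on assumed numbers (group sizes 5 and 25, standard deviations 3 and 1, 5,000 repetitions): both groups share the same true mean, so every "significant" result is a false positive, and the two tests are compared on how often they produce one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
n_experiments = 5_000

# Same true mean in both groups (the null hypothesis is true), but the
# small group has the larger spread. All numbers are assumed for illustration.
small_noisy = rng.normal(120.0, 3.0, size=(n_experiments, 5))
large_quiet = rng.normal(120.0, 1.0, size=(n_experiments, 25))

pooled = stats.ttest_ind(small_noisy, large_quiet, axis=1, equal_var=True)
welch = stats.ttest_ind(small_noisy, large_quiet, axis=1, equal_var=False)

pooled_rate = np.mean(pooled.pvalue < 0.05)
welch_rate = np.mean(welch.pvalue < 0.05)
print(f"false-positive rate, Student's (pooled): {pooled_rate:.3f}")
print(f"false-positive rate, Welch's:            {welch_rate:.3f}")
```

In this setup the pooled test should flag far more than 5% of the experiments, while Welch's test should stay much closer to the nominal level. Other combinations of sizes and variances will behave differently, so treat this as one example rather than a general rule.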

3. Types of mean tests

3.1 One Sample t-test

Purpose: to check whether one sample's mean is significantly different from a particular reference value (a known population mean).
Example: checking whether a new drug has an effect compared to a baseline blood pressure.

[Table1. Patient number & blood pressure measured after drug administration]

Patient	Blood pressure after dosing (mm Hg)
Patient 1	118
Patient 2	121
Patient 3	119
Patient 4	117
Patient 5	120

Test: test whether this group's mean blood pressure is significantly different from the baseline (120 mm Hg).

Test statistic and degrees of freedom

$$ t = \frac{\bar{X} - \mu_0}{\,s/\sqrt{n}\,} $$ Formula 1. One Sample t-test test statistic

The test statistic for a mean test has an intuitive meaning. The numerator is the difference between the sample mean and the reference value: the farther it is from 0, the more the two differ; the closer to 0, the more alike they are. Dividing by the standard error s/√n standardizes that difference, expressing it in units of the sample mean's own variability. Under the null hypothesis this statistic follows the t-distribution with n − 1 degrees of freedom. (Please refer to the earlier hypothesis-testing post.)

For the example above, the sample mean is 119, the sample standard deviation is s ≈ 1.58, and the reference value is 120, which gives t ≈ -1.41.

Figure 3. One sample t-test hypothesis test (t-distribution)

Since the test statistic follows the t-distribution, I computed the p-value on a t-distribution with 4 degrees of freedom (n − 1 = 5 − 1). Because we are asking whether the means are the same or different, not which one is greater, the test must be two-sided: the 5% significance level is split into 0.025 in each tail. The two-sided p-value, the total area in both tails beyond |t| = 1.41, comes out to about 0.23. Since this is larger than 0.05 (equivalently, the one-tail area of about 0.115 is larger than 0.025), the null hypothesis cannot be rejected, and we judge that there is no statistically significant difference. (Strictly speaking, we fail to reject the null hypothesis rather than accept it.)
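The calculation can be reproduced in a few lines. This is a sketch that applies Formula 1 to the Table 1 data by hand and then cross-checks it against `scipy.stats.ttest_1samp`; the variable names are my own.

```python
import numpy as np
from scipy import stats

bp = np.array([118, 121, 119, 117, 120])  # Table 1
mu0 = 120                                  # baseline (reference value)

n = bp.size
s = bp.std(ddof=1)                         # sample standard deviation
t_manual = (bp.mean() - mu0) / (s / np.sqrt(n))   # Formula 1
df = n - 1
p_manual = 2 * stats.t.sf(abs(t_manual), df)      # two-sided p-value

# Cross-check with SciPy (two-sided by default).
t_scipy, p_scipy = stats.ttest_1samp(bp, popmean=mu0)

print(f"by hand: t = {t_manual:.4f}, df = {df}, p = {p_manual:.4f}")
print(f"SciPy:   t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```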

3.2 Two Sample t-test

Purpose: to check whether the means of two mutually independent groups are significantly different.
Example: comparing the mean blood pressure of a group that took the drug and a group that didn't.

[Table2. Patient number, group classification & blood pressure measured by dosing status]

Patient	Group	Blood pressure (mm Hg)
Patient 1	Dosed	115
Patient 2	Dosed	118
Patient 3	Dosed	116
Patient 4	Not dosed	122
Patient 5	Not dosed	124

Test: test whether the mean blood pressure of the dosed and not-dosed groups is significantly different. (The two groups' sample sizes don't have to be equal.)

Test statistic and degrees of freedom

The test between two groups splits into two cases: the two groups have equal variance, or they don't. (The reason was covered in the assumptions section above.)

*Note: given how small and unbalanced the two samples in this example are, Welch's t-test is the more appropriate choice. For the sake of explaining the concept, though, I'll compute both Student's t-test (pooled t-test) and Welch's t-test.

a) When the two groups' variances can be treated as equal (equal variance holds): Student's t-test (Pooled t-test)

$$ t = \frac{\bar{X}_1 - \bar{X}_2}{\,s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}\,}, \qquad s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2} $$ Formula 2. Two Samples t-test test statistic

For the example above, the group means are about 116.33 (dosed) and 123 (not dosed), the pooled variance is about 2.22, and the degrees of freedom are n1 + n2 − 2 = 3, which gives t ≈ -4.90.

Figure 4. Student's t-test hypothesis test (t-distribution)

Likewise, the hypothesis test is done on the t-distribution, here with 3 degrees of freedom. The two-sided p-value, the total area in both tails beyond |t| = 4.90, comes out to about 0.016. Since this is smaller than the 5% significance level (0.05), the null hypothesis is rejected in favor of the alternative hypothesis. That is, we judge that the two groups' means differ.
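As a check on the arithmetic, the sketch below applies Formula 2 to the Table 2 data by hand and compares the result with `scipy.stats.ttest_ind` called with `equal_var=True`. The variable names are my own.

```python
import numpy as np
from scipy import stats

dosed = np.array([115, 118, 116])   # Table 2, dosed group
not_dosed = np.array([122, 124])    # Table 2, not-dosed group

n1, n2 = dosed.size, not_dosed.size
var1, var2 = dosed.var(ddof=1), not_dosed.var(ddof=1)  # sample variances

# Pooled variance and test statistic (Formula 2).
sp2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
t_manual = (dosed.mean() - not_dosed.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
df = n1 + n2 - 2
p_manual = 2 * stats.t.sf(abs(t_manual), df)  # two-sided p-value

# Cross-check with SciPy's pooled (Student's) version.
t_scipy, p_scipy = stats.ttest_ind(dosed, not_dosed, equal_var=True)

print(f"by hand: t = {t_manual:.4f}, df = {df}, p = {p_manual:.4f}")
print(f"SciPy:   t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```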

b) When the two groups' variances differ (equal variance does not hold): Welch's t-test

$$ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1}+\dfrac{s_2^2}{n_2}}} $$ Formula 3. Welch's t-test test statistic

For the example above, the sample variances are about 2.33 (dosed, n1 = 3) and 2 (not dosed, n2 = 2), so the denominator is √(2.33/3 + 2/2) ≈ 1.33 and the test statistic is t ≈ -6.67/1.33 = -5.00.

Figure 5. Welch's t-test

Welch's test statistic also approximately follows a t-distribution, with degrees of freedom given by the Welch-Satterthwaite approximation (about 2.43 here), and the hypothesis test is performed on that basis. The two-sided p-value for t = -5.00 comes out to about 0.025. Since this is smaller than the 5% significance level (0.05), the null hypothesis is rejected in favor of the alternative hypothesis. Therefore we judge that the two groups' means differ.
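The Welch calculation can be recomputed directly from Table 2. The sketch below applies Formula 3 by hand, uses the Welch-Satterthwaite approximation for the degrees of freedom (a standard formula that this post does not spell out), and cross-checks against `scipy.stats.ttest_ind` with `equal_var=False`.

```python
import numpy as np
from scipy import stats

dosed = np.array([115, 118, 116])   # Table 2, dosed group
not_dosed = np.array([122, 124])    # Table 2, not-dosed group

n1, n2 = dosed.size, not_dosed.size
# Per-group squared standard errors: s_i^2 / n_i
se1 = dosed.var(ddof=1) / n1
se2 = not_dosed.var(ddof=1) / n2

# Test statistic (Formula 3).
t_manual = (dosed.mean() - not_dosed.mean()) / np.sqrt(se1 + se2)

# Welch-Satterthwaite degrees of freedom (generally not an integer).
df = (se1 + se2) ** 2 / (se1**2 / (n1 - 1) + se2**2 / (n2 - 1))
p_manual = 2 * stats.t.sf(abs(t_manual), df)  # two-sided p-value

# Cross-check with SciPy's Welch version.
t_scipy, p_scipy = stats.ttest_ind(dosed, not_dosed, equal_var=False)

print(f"by hand: t = {t_manual:.4f}, df = {df:.4f}, p = {p_manual:.4f}")
print(f"SciPy:   t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```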

3.3 Paired Sample t-test

Purpose: to compare the mean difference between two corresponding (paired) measurements taken on the same group.
Example: comparing the change in blood pressure before and after drug administration. Use it when you want to see how the same subjects change, on average, before and after a particular event (before/after a process treatment, before/after a clinical trial, before/after PT, etc.).

[Table3. Patient number & blood pressure before/after drug administration]

Patient	Before dosing (mm Hg)	After dosing (mm Hg)
Patient 1	130	120
Patient 2	128	119
Patient 3	135	125
Patient 4	132	123
Patient 5	129	121

Test: test whether the mean blood-pressure difference before and after dosing is significant.

Test statistic and degrees of freedom

$$ t = \frac{\bar{D}}{\,s_D/\sqrt{n}\,} $$ Formula 4. Paired Samples t-test test statistic

To compute it, take the difference of each corresponding pair, D(i) = before − after, then compute the mean and the sample standard deviation of those differences. Here the differences are 10, 9, 10, 9, 8, so their mean is 9.2 and their standard deviation is about 0.84. With n − 1 = 4 degrees of freedom, the test statistic comes out to t ≈ 24.59.

Figure 6. Paired t-test

The two-sided p-value for t ≈ 24.59 is nearly 0 (on the order of 10⁻⁵). Since this is far smaller than the 5% significance level (0.05), we reject the null hypothesis and accept the alternative hypothesis: mean blood pressure before and after dosing differs.
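The paired calculation can be recomputed from Table 3 in the same way. This sketch applies Formula 4 to the before/after differences by hand and cross-checks against `scipy.stats.ttest_rel`; the variable names are my own.

```python
import numpy as np
from scipy import stats

before = np.array([130, 128, 135, 132, 129])  # Table 3, before dosing
after = np.array([120, 119, 125, 123, 121])   # Table 3, after dosing

d = before - after                  # per-patient differences (before - after)
n = d.size
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(n))   # Formula 4
df = n - 1
p_manual = 2 * stats.t.sf(abs(t_manual), df)         # two-sided p-value

# Cross-check with SciPy's paired test.
t_scipy, p_scipy = stats.ttest_rel(before, after)

print(f"differences: {d.tolist()}")
print(f"by hand: t = {t_manual:.4f}, df = {df}, p = {p_manual:.2e}")
print(f"SciPy:   t = {t_scipy:.4f}, p = {p_scipy:.2e}")
```

Equivalently, a paired test is a one-sample test on the differences against 0, which is why only a single standard deviation appears in the formula.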

Closing (conclusion)

Testing means is a technique for judging whether a difference between means (between two groups, or between one group and a reference value) is due to mere variability (uncertainty) or reflects means that actually differ. We looked at the assumptions the theory is built on, at the tests that follow from those assumptions, and at how each test is computed through worked examples. Next time, I plan to look at how to run these tests with Python libraries.


📦 Migrated from the Tistory blog I used to run. Original: taehyuklee.tistory.com/24
