Hypothesis Testing
Hypothesis testing gives you a principled framework for making decisions from noisy data. It is used in A/B testing, clinical trials, product analytics, and scientific research. Unfortunately, it is also one of the most widely misunderstood concepts in quantitative work. This lesson covers both the mechanics and the mindset needed to apply it correctly.
The Null Hypothesis Framework
Every hypothesis test involves two competing explanations:
- Null hypothesis (H₀): The default assumption — there is no effect, no difference, no relationship. We assume H₀ is true until evidence overwhelmingly suggests otherwise.
- Alternative hypothesis (H₁): The claim we want to support — there is an effect, difference, or relationship.
The test does not prove H₁. It asks: "How unlikely is the data I observed, assuming H₀ is true?" If the data is very unlikely under H₀, we reject H₀ in favour of H₁.
Type I and Type II Errors
Two types of mistakes are possible:
| | H₀ is True | H₀ is False |
|---|---|---|
| Reject H₀ | Type I Error (False Positive) — α | Correct (Power = 1 − β) |
| Fail to Reject H₀ | Correct | Type II Error (False Negative) — β |
- α (significance level): The probability of a Type I error you are willing to accept. Conventionally set at 0.05.
- β: The probability of a Type II error. Power = 1 − β; a common target is power ≥ 0.80.
- Reducing α (being more conservative) increases β — there is a fundamental tradeoff.
P-Values: What They Are and Are Not
The p-value is the probability of observing data as extreme as (or more extreme than) what you observed, given that H₀ is true.
A p-value of 0.03 means: "If H₀ were true, there is a 3% chance of seeing a result this extreme or more extreme by random variation alone."
What a p-value is not:
- The probability that H₀ is true.
- A measure of effect size or practical significance.
- Proof of anything.
The Independent Samples t-Test
Use the t-test when you want to compare the means of two groups drawn from approximately normal distributions:
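A minimal sketch using SciPy's `scipy.stats.ttest_ind` on simulated data (the group values here are invented for illustration):

```python
import numpy as np
from scipy import stats

# Simulated data for illustration: two groups with a true mean difference of 1
rng = np.random.default_rng(42)
group_a = rng.normal(loc=10.0, scale=2.0, size=50)
group_b = rng.normal(loc=11.0, scale=2.0, size=50)

# Independent samples t-test; pass equal_var=False for Welch's t-test
# when the equal-variance assumption is doubtful
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

The test returns the t-statistic and a two-sided p-value; compare the p-value against your pre-chosen α to decide whether to reject H₀.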
Check assumptions before interpreting results:
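One way to check them, sketched here with SciPy's Shapiro-Wilk test (normality) and Levene's test (equal variances) on simulated example data:

```python
import numpy as np
from scipy import stats

# Simulated example groups
rng = np.random.default_rng(0)
group_a = rng.normal(10.0, 2.0, 50)
group_b = rng.normal(11.0, 2.0, 50)

# Normality: Shapiro-Wilk on each group (low p suggests non-normal data)
_, p_norm_a = stats.shapiro(group_a)
_, p_norm_b = stats.shapiro(group_b)

# Equal variances: Levene's test (low p suggests unequal variances)
_, p_levene = stats.levene(group_a, group_b)

print(f"normality p-values: {p_norm_a:.3f}, {p_norm_b:.3f}")
print(f"equal-variance p-value: {p_levene:.3f}")
```

If normality looks violated, consider the Mann-Whitney U test; if variances look unequal, use Welch's t-test (`equal_var=False`).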
The Chi-Square Test
Use the chi-square test for independence when both variables are categorical:
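A minimal sketch with `scipy.stats.chi2_contingency` on a hypothetical 2×2 contingency table (the counts are invented for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = group (control, variant),
# columns = outcome (converted, did not convert)
observed = np.array([[30, 70],
                     [45, 55]])

# Returns the chi-square statistic, p-value, degrees of freedom,
# and the expected counts under independence
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}, dof = {dof}")
```

A small p-value suggests the two categorical variables are associated; the `expected` table shows what the counts would look like if they were independent.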
Choosing the Right Test
| Situation | Recommended Test |
|---|---|
| Compare means of 2 independent groups (normal data) | Independent samples t-test |
| Compare means of 2 paired/repeated groups | Paired t-test |
| Compare means of 3+ groups | One-way ANOVA |
| Test association between 2 categorical variables | Chi-square test of independence |
| Compare 2 groups with non-normal distributions | Mann-Whitney U test |
| Test correlation between 2 continuous variables | Pearson correlation (normal) / Spearman (rank) |
Common Pitfalls
P-hacking (data dredging): Running many tests and reporting only the ones with p < 0.05. If you run 20 tests at α = 0.05, you expect 1 false positive by chance. Correct for multiple comparisons using Bonferroni correction (divide α by the number of tests) or the Benjamini-Hochberg procedure.
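The Bonferroni correction can be sketched in a few lines (the p-values here are invented for illustration):

```python
# Hypothetical p-values from 5 separate tests
p_values = [0.003, 0.021, 0.047, 0.19, 0.64]
alpha = 0.05

# Bonferroni: divide alpha by the number of tests
bonferroni_alpha = alpha / len(p_values)  # 0.05 / 5 = 0.01
significant = [p < bonferroni_alpha for p in p_values]
print(significant)
```

Note that 0.021 and 0.047 would each pass an uncorrected α = 0.05, but only 0.003 survives the corrected threshold of 0.01.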
Ignoring effect size: A p-value of 0.001 with n = 100,000 may indicate a trivially small, practically meaningless difference. Always report effect size alongside p-values:
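A common effect size for a two-group mean comparison is Cohen's d; a minimal sketch (the sample arrays are invented for illustration):

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
d = cohens_d(a, b)
print(f"Cohen's d = {d:.3f}")
```

As a rough convention, |d| ≈ 0.2 is a small effect, 0.5 medium, and 0.8 large; report d (or another effect size) alongside the p-value.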
Stopping early: Peeking at A/B test results and stopping as soon as p < 0.05 inflates Type I error. Use sequential testing methods (e.g., always-valid p-values) or pre-register your sample size using a power analysis.
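A simple power analysis for a two-sample test can be sketched with the standard normal approximation (the helper function name and the chosen effect size are illustrative):

```python
from scipy.stats import norm

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided two-sample test
    (normal approximation to the t-test)."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for two-sided alpha
    z_beta = norm.ppf(power)            # z corresponding to desired power
    return 2 * ((z_alpha + z_beta) / effect_size) ** 2

# Medium effect (Cohen's d = 0.5) at alpha = 0.05, power = 0.80
n = sample_size_per_group(0.5)
print(f"~{n:.0f} participants per group")
```

Fix this sample size before the experiment starts, and only evaluate the p-value once that sample size is reached (unless you are using a sequential testing method designed for repeated looks).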
Summary
- The null hypothesis assumes no effect; the test asks how improbable the observed data is under that assumption.
- α (Type I error rate) and β (Type II error rate) are in fundamental tension; set both before collecting data.
- A p-value is the probability of the data given H₀ — not the probability that H₀ is true.
- Use the t-test for comparing two group means (check normality and equal variance); use the chi-square test for associations between categorical variables.
- P-hacking, ignoring effect size, and early stopping are the three most common ways hypothesis testing goes wrong in practice.