Statistical Analysis, Hypothesis Testing & A/B Testing
Why Statistical Rigour Matters
Data analysts work with samples, not populations. Every metric you compute has uncertainty. Every comparison you make could be the result of random variation rather than a true difference. Statistical analysis is the discipline of quantifying that uncertainty and drawing conclusions that are defensible against the charge of seeing patterns in noise.
This lesson is not a probability theory course. It is a practitioner's guide to applying the right statistical tool to the right question, interpreting results correctly, and avoiding the most common mistakes that undermine analytical credibility.
Descriptive vs Inferential Statistics
Descriptive statistics summarise what is in your sample. Inferential statistics generalise from your sample to a broader population or use probability to decide whether an observed difference is real.
The distinction matters because the appropriate tool depends on which question you are answering:
- "What is the average order value in our 2023 data?" → Descriptive. Just compute it.
- "Is the average order value of Enterprise customers significantly higher than SMB customers?" → Inferential. You need a hypothesis test.
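The two questions call for different code. A minimal Python sketch with hypothetical order values (SciPy assumed available for the inferential part):

```python
import statistics
from scipy import stats

# Hypothetical order values (illustrative numbers, not real data).
enterprise = [5200, 4800, 6100, 5900, 5400, 6300, 5100, 5800]
smb        = [1200, 1500, 1100, 1700, 1300, 1600, 1400, 1250]

# Descriptive: just compute the sample statistic.
print("mean Enterprise order value:", statistics.mean(enterprise))
print("mean SMB order value:       ", statistics.mean(smb))

# Inferential: is the difference larger than random variation would explain?
# Welch's t-test does not assume equal variances between the groups.
t_stat, p_value = stats.ttest_ind(enterprise, smb, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```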
Distributions That Matter
Confidence Intervals
A confidence interval expresses the range within which the true population parameter is likely to fall, given the observed sample. A 95% CI does not mean "there is a 95% probability the true value is in this range" — it means "if we repeated this sampling process 100 times, approximately 95 of the resulting CIs would contain the true value."
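A quick normal-approximation example with hypothetical order values (for a sample this small a t critical value would be more accurate than z = 1.96, but the mechanics are the same, and only the standard library is needed):

```python
import math
import statistics

# Hypothetical daily order values (illustrative numbers).
sample = [102, 98, 115, 87, 109, 95, 121, 99, 104, 110]

n = len(sample)
mean = statistics.mean(sample)
sem = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean

# 95% CI using the normal approximation (z = 1.96).
lo, hi = mean - 1.96 * sem, mean + 1.96 * sem
print(f"mean = {mean:.1f}, 95% CI = ({lo:.1f}, {hi:.1f})")
```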
Hypothesis Testing Framework
Before running any test, state the hypotheses explicitly:
- H₀ (null hypothesis): The default assumption — no difference, no effect.
- H₁ (alternative hypothesis): What you are trying to detect.
- α (significance level): Probability of falsely rejecting H₀ when it is true (typically 0.05).
- p-value: Probability of observing a result at least as extreme as yours if H₀ were true.
Decision rule: If p < α, reject H₀. If p ≥ α, fail to reject H₀ (not the same as proving H₀ is true).
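The full loop, from hypotheses to decision, with hypothetical checkout-time samples (SciPy assumed):

```python
from scipy import stats

# Hypothetical checkout times (minutes) for two flows (illustrative numbers).
# H0: the two flows have the same mean checkout time.
# H1: the means differ (two-sided).
control   = [12.1, 11.8, 12.5, 13.0, 11.9, 12.4, 12.2, 12.8]
treatment = [11.2, 11.5, 10.9, 11.8, 11.1, 11.4, 11.6, 11.0]

alpha = 0.05
t_stat, p_value = stats.ttest_ind(control, treatment, equal_var=False)

if p_value < alpha:
    print(f"p = {p_value:.4g} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.4g} >= {alpha}: fail to reject H0")
```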
Choosing the Right Test
The Complete Hypothesis Testing Toolkit
Two-Sample Tests: t-test and Mann-Whitney U
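A minimal sketch of both tests on the same hypothetical, right-skewed samples (SciPy assumed); the t-test compares means and assumes approximate normality, while Mann-Whitney U compares rank distributions and is robust to outliers:

```python
from scipy import stats

# Hypothetical page load times in seconds, each group with one large outlier.
group_a = [1.2, 1.4, 1.1, 1.3, 1.6, 1.2, 5.8, 1.5, 1.3, 1.4]
group_b = [2.1, 2.4, 2.2, 2.0, 2.6, 2.3, 2.5, 2.2, 9.1, 2.4]

# Welch's t-test: compares means, robust to unequal variances,
# but sensitive to outliers and skew.
t_stat, p_t = stats.ttest_ind(group_a, group_b, equal_var=False)

# Mann-Whitney U: rank-based, preferred when normality is questionable.
u_stat, p_u = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"t-test p = {p_t:.4f}, Mann-Whitney p = {p_u:.4f}")
```

With these illustrative numbers the single large outlier in each group inflates the variance enough that the t-test fails to reject while the rank-based test detects the shift: exactly the situation where the non-parametric alternative earns its keep.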
Chi-Square Test of Independence
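A sketch with a hypothetical 2×2 table of plan tier vs churn (SciPy assumed); Cramér's V is computed alongside the p-value because a significance result alone says nothing about the strength of the association:

```python
from math import sqrt
from scipy.stats import chi2_contingency

# Hypothetical contingency table: plan tier (rows) vs churn status (columns).
#                churned  retained
observed = [[ 30,  120],   # SMB
            [ 60,   90]]   # Enterprise

# SciPy applies Yates' continuity correction to 2x2 tables by default.
chi2, p, dof, expected = chi2_contingency(observed)

# Effect size: Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1))).
n = sum(sum(row) for row in observed)
cramers_v = sqrt(chi2 / (n * 1))  # min(2, 2) - 1 = 1 for a 2x2 table
print(f"chi2 = {chi2:.2f}, p = {p:.4g}, Cramér's V = {cramers_v:.2f}")
```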
One-Way ANOVA and Kruskal-Wallis
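Both omnibus tests on hypothetical satisfaction scores for three support channels (SciPy assumed):

```python
from scipy import stats

# Hypothetical satisfaction scores (1-10) by support channel (illustrative).
email = [7.1, 6.8, 7.4, 7.0, 6.9, 7.2, 7.3, 6.7]
chat  = [8.0, 8.3, 7.9, 8.1, 8.4, 7.8, 8.2, 8.0]
phone = [6.2, 6.5, 6.0, 6.4, 6.1, 6.3, 6.6, 6.2]

# Parametric: one-way ANOVA (assumes normality and equal variances).
f_stat, p_anova = stats.f_oneway(email, chat, phone)

# Non-parametric alternative: Kruskal-Wallis (rank-based).
h_stat, p_kw = stats.kruskal(email, chat, phone)

print(f"ANOVA p = {p_anova:.3g}, Kruskal-Wallis p = {p_kw:.3g}")
```

A significant omnibus result only says that at least one group differs; identifying which pairs differ requires post-hoc pairwise tests with a multiple-testing correction.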
A/B Testing: Full Experiment Workflow
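The pre-experiment sample-size step can be sketched with the standard normal-approximation formula for comparing two proportions (the helper name and numbers are hypothetical; the standard library's NormalDist supplies the z quantiles):

```python
import math
from statistics import NormalDist

def sample_size_per_group(baseline, mde, alpha=0.05, power=0.80):
    """Approximate n per group for a two-proportion z-test
    (normal approximation; hypothetical helper name)."""
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde ** 2)

# 5% baseline conversion, minimum detectable effect of +1 percentage point.
n = sample_size_per_group(0.05, 0.01)
print(f"required sample size per group: {n}")
```

Dedicated power-analysis tooling will give the same quantity with fewer approximations, but the formula makes the trade-off visible: halving the minimum detectable effect roughly quadruples the required sample size.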
Avoiding Early Stopping
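Why early stopping is dangerous can be demonstrated by simulation: an A/A test (no true effect) in which the analyst peeks after every batch and stops at the first p < 0.05. A hypothetical sketch (SciPy assumed; the simulation parameters are illustrative):

```python
import random
from scipy import stats

random.seed(0)  # for reproducibility of this illustration

def peeking_false_positive_rate(n_sims=500, batches=10, batch_size=50):
    """Fraction of A/A simulations that 'find' an effect when the
    analyst stops at the first peek with p < 0.05."""
    hits = 0
    for _ in range(n_sims):
        a, b = [], []
        for _ in range(batches):
            # Both arms draw from the same distribution: H0 is true.
            a += [random.gauss(0, 1) for _ in range(batch_size)]
            b += [random.gauss(0, 1) for _ in range(batch_size)]
            if stats.ttest_ind(a, b).pvalue < 0.05:
                hits += 1
                break
    return hits / n_sims

rate = peeking_false_positive_rate()
print(f"false positive rate with peeking: {rate:.1%}")  # nominal rate is 5%
```

With repeated peeking the realised false positive rate lands well above the nominal 5%, which is exactly why the sample size must be fixed in advance (or a sequential testing procedure used).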
Multiple Testing Correction
When you test multiple hypotheses on the same data, the probability that at least one test produces a false positive increases dramatically. At α = 0.05 with 20 tests whose null hypotheses are all true, you expect 1 false positive on average, and the probability of at least one is 1 − 0.95^20 ≈ 0.64.
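Both standard corrections are short enough to sketch from scratch (pure Python; the p-values are hypothetical):

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H0_i if p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with
    p_(k) <= (k / m) * alpha (controls the false discovery rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected

# Probability of at least one false positive across 20 independent tests.
fwer = 1 - (1 - 0.05) ** 20  # ≈ 0.64

# Hypothetical p-values from five tests on the same dataset.
pvals = [0.001, 0.012, 0.021, 0.035, 0.2]
print("Bonferroni rejections:", sum(bonferroni(pvals)))
print("BH FDR rejections:    ", sum(benjamini_hochberg(pvals)))
```

On these illustrative p-values the more conservative Bonferroni correction rejects fewer hypotheses than Benjamini-Hochberg, which is the power difference the takeaways refer to.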
Simpson's Paradox
Simpson's paradox occurs when a trend appears in aggregated data but reverses (or disappears) when the data is segmented. It is one of the most dangerous misinterpretations in descriptive analysis.
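A self-contained illustration with hypothetical conversion counts: variant A wins inside every segment yet loses on the aggregate, because most of its traffic came from the harder segment:

```python
# Hypothetical (conversions, visitors) per variant and visitor segment.
data = {
    "returning": {"A": (95, 100),   "B": (900, 1000)},
    "new":       {"A": (500, 1000), "B": (40, 100)},
}

# Segment-level view: A beats B in both segments.
for segment, variants in data.items():
    ra = variants["A"][0] / variants["A"][1]
    rb = variants["B"][0] / variants["B"][1]
    print(f"{segment:9s}  A = {ra:.1%}  B = {rb:.1%}  -> A wins: {ra > rb}")

# Aggregate view: the trend reverses because A's traffic mix is skewed
# toward the low-converting "new" segment.
conv_a = sum(v["A"][0] for v in data.values())
tot_a  = sum(v["A"][1] for v in data.values())
conv_b = sum(v["B"][0] for v in data.values())
tot_b  = sum(v["B"][1] for v in data.values())
print(f"overall    A = {conv_a / tot_a:.1%}  B = {conv_b / tot_b:.1%}")
```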
Bootstrapping
When parametric assumptions don't hold and sample sizes are small, bootstrapping provides distribution-free confidence intervals.
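A percentile-bootstrap sketch for the median, a statistic with no convenient closed-form CI (pure Python; the sample is hypothetical):

```python
import random
import statistics

random.seed(42)  # for reproducibility of this illustration

# Hypothetical right-skewed sample, e.g. session durations in minutes.
sample = [2, 3, 3, 4, 5, 5, 6, 8, 9, 12, 15, 22, 40]

def bootstrap_ci(data, stat, n_boot=10_000, ci=0.95):
    """Percentile bootstrap CI for an arbitrary statistic: resample with
    replacement, recompute the statistic, and take the empirical quantiles."""
    boots = sorted(stat(random.choices(data, k=len(data))) for _ in range(n_boot))
    lo_idx = int((1 - ci) / 2 * n_boot)
    hi_idx = int((1 + ci) / 2 * n_boot) - 1
    return boots[lo_idx], boots[hi_idx]

lo, hi = bootstrap_ci(sample, statistics.median)
print(f"median = {statistics.median(sample)}, 95% bootstrap CI = ({lo}, {hi})")
```

Swapping `statistics.median` for any other statistic (a trimmed mean, a 90th percentile) works unchanged, which is what makes the method so broadly useful.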
Key Takeaways
- Descriptive statistics describe your sample; inferential statistics generalise to a population. Know which question you are answering before choosing your tool.
- State H₀ and H₁ explicitly before running any test. Never formulate a hypothesis after seeing the data (HARKing — Hypothesizing After Results are Known) and report it as confirmatory.
- The right test depends on the number of groups, the variable type, whether samples are paired, and whether normality holds. Use the decision matrix and default to non-parametric tests when normality is questionable.
- Statistical significance (p < α) is not practical significance. Always compute and report effect size — Cohen's d for t-tests, Cramér's V for chi-square, eta-squared for ANOVA. A tiny effect can be statistically significant with a large sample.
- A/B testing requires pre-computing sample size based on baseline rate, minimum detectable effect, and desired power. Running a test without sample size planning and stopping when you see p < 0.05 is p-hacking.
- The multiple testing problem is real: with 20 tests at α=0.05, you expect 1 false positive by chance. Apply Bonferroni (conservative) or Benjamini-Hochberg FDR (more powerful) correction whenever you run more than one test on the same dataset.
- Simpson's paradox — where aggregated trends reverse when segmented — is one of the most consequential misinterpretations in business analysis. Always check for confounding before making comparative claims.
- Bootstrapping provides distribution-free confidence intervals for any statistic, making it the go-to method for non-normal data, small samples, and statistics like median or percentiles that lack closed-form CIs.