Probability is the language in which every machine learning model is written. Whether you are reading a logistic regression output, tuning a Naive Bayes classifier, or interpreting a Bayesian A/B test, you are speaking probability. This lesson makes that language fluent — not with abstract measure theory, but with the distributions you will encounter repeatedly in real data science work, implemented directly in Python.
Probability Axioms and Sample Spaces
A sample space S is the set of all possible outcomes of an experiment. An event A is a subset of S. Probability assigns a number P(A) to each event satisfying three axioms:
Non-negativity: P(A) >= 0 for all events A
Normalization: P(S) = 1
Additivity: for mutually exclusive events A and B, P(A or B) = P(A) + P(B)
Everything in probability — every distribution, every theorem — follows from these three axioms. The practical implication is that probabilities must sum (or integrate) to 1, which constrains how models can output scores and how we interpret them.
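To make the normalization constraint concrete, here is a minimal sketch (using a hypothetical three-class score vector, not any particular model's output) that converts raw scores into a valid probability vector and applies the additivity axiom:

```python
import numpy as np

# Hypothetical raw scores for three mutually exclusive classes
scores = np.array([2.0, 1.0, 0.5])

# Softmax-style normalization: non-negative entries that sum to 1
probs = np.exp(scores) / np.exp(scores).sum()
print(probs, probs.sum())  # e.g. [0.629 0.231 0.140], sum = 1.0

# Additivity for mutually exclusive events: P(class 0 or class 1)
print(probs[0] + probs[1])
```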
Conditional probability captures how probability changes when we gain information. P(A|B) — the probability of A given that B occurred — is defined as:
P(A|B) = P(A and B) / P(B)
This is not just a formula; it is a belief-updating operation. When B is observed, the sample space shrinks to B, and we renormalize.
Independence means observing B gives us no information about A: P(A|B) = P(A), or equivalently P(A and B) = P(A) * P(B). Independence is stronger than mutual exclusivity — two events are mutually exclusive when P(A and B) = 0, meaning they cannot both occur. Mutually exclusive events with nonzero probability are always dependent (knowing one occurred rules out the other).
```python
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

rng = np.random.default_rng(42)

# Demonstrate conditional probability with a coin flip experiment
# P(second flip = H | first flip = H) for a fair coin = 0.5 (independent)
n_trials = 100_000
flips = rng.integers(0, 2, size=(n_trials, 2))  # 0=T, 1=H
first_heads = flips[:, 0] == 1
both_heads = (flips[:, 0] == 1) & (flips[:, 1] == 1)

p_second_given_first = both_heads.sum() / first_heads.sum()
print(f"P(second=H | first=H) ≈ {p_second_given_first:.4f}")  # ~0.5 — independent
```
Bernoulli Distribution
The simplest distribution: a single trial with success probability p.
PMF: P(X = 1) = p, P(X = 0) = 1 - p
Mean: p
Variance: p(1 - p)
It models one click, one email open, one patient outcome. Every row of a binary target column in a classification dataset is a Bernoulli draw.
```python
from scipy.stats import bernoulli

p = 0.3
dist = bernoulli(p)
print(f"PMF P(X=1) = {dist.pmf(1):.2f}")
print(f"Mean = {dist.mean():.2f}")
print(f"Variance = {dist.var():.4f}")

# Fit to observed binary data (MLE: p_hat = mean of observations)
data = rng.choice([0, 1], size=1000, p=[0.7, 0.3])
p_hat = data.mean()
print(f"MLE estimate of p: {p_hat:.3f}")
```
Binomial Distribution
Extend Bernoulli to n independent trials. The Binomial(n, p) distribution counts the number of successes k in n trials.
PMF: P(X = k) = C(n, k) * p^k * (1-p)^(n-k)
Mean: n * p
Variance: n * p * (1 - p)
Use it to model: how many of 1000 emails are opened, how many of 200 patients respond to treatment, how many of 50 A/B test conversions occur in a bucket.
```python
from scipy.stats import binom

n, p = 20, 0.3
dist = binom(n, p)
k_values = np.arange(0, n + 1)
pmf_values = dist.pmf(k_values)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].bar(k_values, pmf_values, color='steelblue', alpha=0.8, edgecolor='white')
axes[0].axvline(dist.mean(), color='red', linestyle='--', label=f'Mean = {dist.mean():.1f}')
axes[0].set_title('Binomial(n=20, p=0.3) PMF')
axes[0].set_xlabel('k (number of successes)')
axes[0].set_ylabel('P(X = k)')
axes[0].legend()

# CDF: probability of at most k successes
axes[1].step(k_values, dist.cdf(k_values), where='post', color='steelblue')
axes[1].set_title('Binomial CDF')
axes[1].set_xlabel('k')
axes[1].set_ylabel('P(X <= k)')

plt.tight_layout()
plt.savefig('binomial_distribution.png', dpi=150, bbox_inches='tight')
plt.show()

# Practical: probability that exactly 8 of 20 emails are opened
print(f"P(X=8) = {dist.pmf(8):.4f}")
# Probability that at most 5 emails are opened
print(f"P(X<=5) = {dist.cdf(5):.4f}")
# Probability that more than 10 are opened
print(f"P(X>10) = {1 - dist.cdf(10):.4f}")
```
When n is large and p is small (with np moderate), the Binomial is well approximated by a Poisson with lambda = np. A common rule of thumb: n >= 20 and p <= 0.05.
Poisson Distribution
The Poisson(lambda) distribution counts rare events in a fixed interval — requests per second, bugs per thousand lines of code, customer arrivals per hour.
PMF: P(X = k) = e^(-lambda) * lambda^k / k!
Mean: lambda
Variance: lambda (mean equals variance — a diagnostic check for Poisson data)
```python
from scipy.stats import poisson

lam = 4.5  # average events per interval
dist = poisson(lam)
k_values = np.arange(0, 20)

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(k_values, dist.pmf(k_values), color='coral', alpha=0.8, edgecolor='white')
ax.axvline(lam, color='red', linestyle='--', label=f'lambda = {lam}')
ax.set_title(f'Poisson(lambda={lam}) PMF')
ax.set_xlabel('k')
ax.set_ylabel('P(X = k)')
ax.legend()
plt.tight_layout()
plt.savefig('poisson_distribution.png', dpi=150, bbox_inches='tight')
plt.show()

# Fit Poisson to data: MLE estimate is simply the sample mean
count_data = rng.poisson(lam=4.5, size=500)
lambda_hat = count_data.mean()
print(f"True lambda: {lam}, MLE estimate: {lambda_hat:.3f}")
print(f"Sample variance: {count_data.var():.3f}")  # Should be close to lambda_hat

# When to use Poisson vs Binomial
# Binomial: n=10000 trials, p=0.00045 → lambda = n*p = 4.5
n_large, p_small = 10_000, 0.00045
binom_approx = binom(n_large, p_small)
print(f"\nP(X=5) via Binomial: {binom_approx.pmf(5):.6f}")
print(f"P(X=5) via Poisson:  {dist.pmf(5):.6f}")  # Nearly identical
```
Normal Distribution
The Normal(mu, sigma) distribution is the workhorse of statistics. Its ubiquity comes from the Central Limit Theorem — sums and averages of many independent random variables (with finite variance) are approximately Normal, regardless of the original distribution.
The empirical rule: 68% of data falls within 1 sigma of the mean, 95% within 2 sigma, 99.7% within 3 sigma. These cutoffs are everywhere in outlier detection, confidence interval construction, and hypothesis testing.
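As a quick check of those numbers, this minimal sketch (assuming a standard Normal) computes the coverage of the 1, 2, and 3 sigma intervals with scipy:

```python
from scipy.stats import norm

std_normal = norm(loc=0, scale=1)
for k in (1, 2, 3):
    coverage = std_normal.cdf(k) - std_normal.cdf(-k)
    print(f"P(|X - mu| <= {k} sigma) = {coverage:.4f}")
# ~0.6827, 0.9545, 0.9973: the 68/95/99.7 rule
```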
When a variable is the product of many small positive factors, its logarithm is Normal — and the variable itself is Log-Normal. Income distributions, asset prices, system latency, and biological measurements are commonly log-normal.
Key property: if log(X) ~ Normal(mu, sigma), then X ~ LogNormal(mu, sigma).
```python
from scipy.stats import lognorm

# lognorm in scipy: parameterised by s=sigma, scale=exp(mu)
mu_underlying, sigma_underlying = 2.0, 0.5
dist = lognorm(s=sigma_underlying, scale=np.exp(mu_underlying))
x = np.linspace(0.01, 30, 400)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(x, dist.pdf(x), color='purple', linewidth=2)
axes[0].set_title('Log-Normal PDF (right-skewed)')
axes[0].set_xlabel('x')

# Taking log transforms a log-normal back to normal
log_data = rng.normal(mu_underlying, sigma_underlying, size=1000)
raw_data = np.exp(log_data)
axes[1].hist(raw_data, bins=50, density=True, alpha=0.6, color='purple', label='Raw data')
axes[1].plot(x, dist.pdf(x), 'r-', linewidth=2, label='Fitted PDF')
axes[1].set_title('Empirical log-normal data')
axes[1].legend()
plt.tight_layout()
plt.savefig('lognormal_distribution.png', dpi=150, bbox_inches='tight')
plt.show()

# Fit log-normal to data
s_hat, loc_hat, scale_hat = lognorm.fit(raw_data, floc=0)
print(f"Fitted sigma: {s_hat:.3f}, fitted mu: {np.log(scale_hat):.3f}")
```
Exponential Distribution
The Exponential(lambda) distribution models the time between events in a Poisson process — waiting time for the next server request, time between customer arrivals, time until component failure.
Its defining property is memorylessness: P(X > s + t | X > s) = P(X > t). Having waited already does not change the remaining wait time. This is often unrealistic for failure times (machines wear out), which is why Weibull distributions are used in reliability engineering.
```python
from scipy.stats import expon

# scipy parameterises Exponential via scale = 1/lambda
lam = 0.5  # rate
dist = expon(scale=1/lam)
x = np.linspace(0, 10, 300)

fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(x, dist.pdf(x), color='green', linewidth=2, label=f'Exp(lambda={lam})')
ax.fill_between(x, dist.pdf(x), alpha=0.2, color='green')
ax.set_title('Exponential PDF')
ax.set_xlabel('Time until next event')
ax.legend()
plt.tight_layout()
plt.savefig('exponential_distribution.png', dpi=150, bbox_inches='tight')
plt.show()

# Fit to inter-arrival times
inter_arrivals = rng.exponential(scale=2.0, size=300)
scale_hat = inter_arrivals.mean()  # MLE for exponential is the sample mean
print(f"MLE lambda: {1/scale_hat:.4f} (true: 0.5)")
```
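The memorylessness property stated above can also be checked directly; this is a minimal sketch comparing the two conditional survival probabilities for an assumed rate of 0.5 and arbitrary waiting times s = 3 and t = 2:

```python
from scipy.stats import expon

lam = 0.5
dist = expon(scale=1/lam)
s, t = 3.0, 2.0

# Memorylessness: P(X > s + t | X > s) equals P(X > t)
lhs = dist.sf(s + t) / dist.sf(s)  # sf(x) = P(X > x), the survival function
rhs = dist.sf(t)
print(f"P(X > s+t | X > s) = {lhs:.4f}")
print(f"P(X > t)           = {rhs:.4f}")  # identical (~0.3679)
```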
Beta Distribution
The Beta(alpha, beta) distribution lives on [0, 1], making it the natural model for probabilities, proportions, and rates. In Bayesian A/B testing, it is the conjugate prior for the Binomial — the posterior after observing successes and failures is still Beta.
Interpretation: Beta(alpha, beta) can be thought of as representing a state of belief after seeing (alpha - 1) successes and (beta - 1) failures from a uniform prior.
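As an illustration of that conjugate update, here is a minimal sketch for a conversion rate, assuming hypothetical counts of 42 conversions out of 500 visitors and a uniform Beta(1, 1) prior:

```python
from scipy.stats import beta

# Hypothetical A/B test data: 42 conversions out of 500 visitors
successes, failures = 42, 458

# Beta(1, 1) prior + Binomial likelihood -> Beta(1 + successes, 1 + failures) posterior
posterior = beta(1 + successes, 1 + failures)
print(f"Posterior mean: {posterior.mean():.4f}")
low, high = posterior.interval(0.95)
print(f"95% credible interval: ({low:.4f}, {high:.4f})")
```

The posterior concentrates around the observed rate while retaining uncertainty appropriate to the sample size, which is exactly what a Bayesian A/B test reports.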
When you have no information favouring any value in a range, use Uniform(a, b). It is the maximum entropy distribution over a bounded interval and serves as a non-informative prior in Bayesian analysis.
```python
from scipy.stats import uniform

dist = uniform(loc=2, scale=8)  # Uniform on [2, 10], scale = b - a
print(f"Mean: {dist.mean():.1f}")
print(f"Variance: {dist.var():.4f}")
print(f"P(X < 5): {dist.cdf(5):.3f}")  # (5-2)/(10-2) = 0.375
```
Maximum Likelihood Estimation
MLE finds the parameter values that maximise the probability of observing the data. scipy's continuous distributions provide a .fit() method that runs MLE under the hood.
```python
from scipy.stats import norm, gamma as gamma_dist

# MLE for multiple distributions on the same dataset
true_data = rng.gamma(3.0, 2.0, size=500)  # shape=3, scale=2

# Try fitting Normal and Gamma
mu_hat, sigma_hat = norm.fit(true_data)
a_hat, loc_hat, scale_hat = gamma_dist.fit(true_data, floc=0)

print("Fitting gamma-distributed data:")
print(f"  Normal fit: mu={mu_hat:.2f}, sigma={sigma_hat:.2f}")
print(f"  Gamma fit:  a={a_hat:.2f}, scale={scale_hat:.2f}")

# Log-likelihood comparison (higher is better)
ll_norm = norm.logpdf(true_data, mu_hat, sigma_hat).sum()
ll_gamma = gamma_dist.logpdf(true_data, a_hat, loc_hat, scale_hat).sum()
print(f"\nLog-likelihood Normal: {ll_norm:.2f}")
print(f"Log-likelihood Gamma:  {ll_gamma:.2f}")  # Should be higher
```
QQ Plots for Normality Checking
A QQ plot (quantile-quantile plot) compares the quantiles of your data against the theoretical quantiles of a Normal distribution. If the data is Normal, the points fall on a straight line. Deviations from the line reveal skew, heavy tails, or truncation.
```python
from scipy.stats import probplot

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
datasets = [
    (rng.normal(0, 1, 300), 'Normal data (should be linear)'),
    (rng.exponential(1, 300), 'Exponential data (right-skewed)'),
    (rng.standard_t(df=3, size=300), 't(df=3) — heavy tails'),
]
for ax, (data, title) in zip(axes, datasets):
    probplot(data, dist='norm', plot=ax)
    ax.set_title(title)
    ax.get_lines()[1].set_color('red')
plt.tight_layout()
plt.savefig('qq_plots.png', dpi=150, bbox_inches='tight')
plt.show()
```
Interpreting QQ plots:
Points curving upward, with the right tail rising above the line: right skew (a log-transform candidate; see the sketch after this list)
Left-tail points falling below the line and right-tail points rising above it (a stretched S-shape): heavy tails
Left-tail points sitting above the line and right-tail points falling below it: light tails (platykurtic)
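To see the log-transform remedy in practice, this sketch (using an illustrative log-normal sample and a separate random generator) draws QQ plots of the same right-skewed data before and after taking logs:

```python
from scipy.stats import probplot
import numpy as np
import matplotlib.pyplot as plt

demo_rng = np.random.default_rng(0)
skewed = demo_rng.lognormal(mean=0.0, sigma=0.8, size=500)  # right-skewed sample

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
probplot(skewed, dist='norm', plot=axes[0])
axes[0].set_title('Raw data: right tail curves above the line')
probplot(np.log(skewed), dist='norm', plot=axes[1])
axes[1].set_title('log(data): approximately linear')
plt.tight_layout()
plt.show()
```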
The Central Limit Theorem
The CLT states that the sampling distribution of the sample mean converges to Normal as n increases, regardless of the population distribution. This is why Normal distributions appear everywhere — not because data is Normal, but because averages are.
```python
from scipy.stats import norm

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
population_samples = rng.exponential(scale=1, size=100_000)
sample_sizes = [1, 5, 10, 30, 100, 500]

for ax, n in zip(axes.flatten(), sample_sizes):
    sample_means = [
        rng.choice(population_samples, size=n).mean() for _ in range(2000)
    ]
    ax.hist(sample_means, bins=40, density=True, color='steelblue',
            alpha=0.7, edgecolor='white')
    # Overlay theoretical Normal
    mu_theory = 1.0
    sigma_theory = 1.0 / np.sqrt(n)
    x_range = np.linspace(min(sample_means), max(sample_means), 200)
    ax.plot(x_range, norm.pdf(x_range, mu_theory, sigma_theory), 'r-', linewidth=2)
    ax.set_title(f'n = {n}')
    ax.set_xlabel('Sample mean')

plt.suptitle('CLT: Sampling Distribution of Mean (from Exponential population)',
             fontsize=13, y=1.01)
plt.tight_layout()
plt.savefig('central_limit_theorem.png', dpi=150, bbox_inches='tight')
plt.show()
```
By n=30 the sampling distribution looks approximately Normal even for the heavily skewed Exponential population. This is why statistical tests that assume Normality of means often work in practice even when the raw data is not Normal — provided n is large enough.
KL Divergence — Measuring Distribution Differences
The Kullback-Leibler divergence measures how much one probability distribution differs from another. It is the expected log-ratio of probabilities under distribution P.
KL(P || Q) = sum over x of P(x) * log(P(x) / Q(x))
KL divergence is not symmetric (KL(P||Q) != KL(Q||P)) and equals 0 when the distributions are identical. It appears in variational inference, information theory, and as the theoretical foundation of cross-entropy loss in neural networks.
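A minimal sketch with two hypothetical three-outcome distributions shows the computation and the asymmetry; scipy.stats.entropy computes KL(P || Q) when given two arguments:

```python
import numpy as np
from scipy.stats import entropy

p = np.array([0.1, 0.4, 0.5])  # hypothetical "true" distribution P
q = np.array([0.2, 0.3, 0.5])  # hypothetical approximation Q

kl_pq = np.sum(p * np.log(p / q))  # direct formula
kl_qp = np.sum(q * np.log(q / p))
print(f"KL(P || Q) = {kl_pq:.4f}")
print(f"KL(Q || P) = {kl_qp:.4f}")           # different value: KL is not symmetric
print(f"scipy check: {entropy(p, q):.4f}")   # matches KL(P || Q)
```

The table below summarises which distribution to reach for in common modelling situations.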
| Situation | Distribution |
|---|---|
| Binary outcome (click / no click) | Bernoulli(p) |
| Count of successes in n trials | Binomial(n, p) |
| Count of rare events per interval | Poisson(lambda) |
| Sums and averages of many variables | Normal(mu, sigma) |
| Multiplicative processes, income, latency | Log-Normal |
| Time between Poisson events | Exponential(lambda) |
| A probability / proportion as random variable | Beta(alpha, beta) |
| No information over bounded range | Uniform(a, b) |
Key Takeaways
Probability axioms (non-negativity, normalization, additivity) constrain every model output; conditional probability P(A|B) = P(A and B) / P(B) is the engine of inference.
Bernoulli and Binomial model binary and count outcomes; Poisson handles rare events where n is large and p is small, with the diagnostic that mean equals variance.
The Normal distribution's dominance stems from the CLT — it is the limit of sample means, not necessarily of raw data, which is why normality assumptions about averages are often safe even with non-normal data.
Log-Normal distributions arise from multiplicative processes; income, latency, and prices are better modelled (and transformed with log1p) using log-normal assumptions.
MLE via scipy.stats.dist.fit() returns the parameter values maximising the probability of the observed data; always compare log-likelihoods across candidate distributions.
QQ plots are the fastest diagnostic for distribution shape; deviations from the reference line indicate skew, heavy tails, or truncation that will violate model assumptions downstream.
The Beta distribution is the natural model for probabilities-as-unknowns and is the conjugate prior for Bernoulli/Binomial data, enabling closed-form Bayesian updating of conversion rates.
KL divergence quantifies the cost of approximating one distribution with another; it is asymmetric, zero when distributions are identical, and is the theoretical foundation of cross-entropy loss.