Degree 1: high bias (can't fit the cubic), low variance
Degree 3: balanced — roughly matches the true function
Degree 15: low bias, very high variance — memorises each dataset
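This trade-off is easy to reproduce with a quick sketch: fit polynomials of degree 1, 3, and 15 to noisy cubic data and compare cross-validated MSE. The dataset and seed here are illustrative assumptions, not the figure's exact setup:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 3 - 2 * X[:, 0] + rng.normal(0, 3, size=200)  # cubic + noise

cv_mse = {}
for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # 5-fold cross-validated MSE (scikit-learn returns negated MSE)
    cv_mse[degree] = -cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_squared_error"
    ).mean()
    print(f"degree {degree:>2}: CV MSE = {cv_mse[degree]:.1f}")
```

Degree 1 cannot represent the cubic term, so its CV error is far above the noise floor; degree 3 sits near it.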
Learning Curves
Learning curves plot training and validation error as a function of training set size. They are the most direct way to diagnose whether you have a bias or variance problem.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=20, noise=25, random_state=42)

def plot_learning_curve(model, X, y, title=""):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.05, 1.0, 15),
        cv=5, scoring="neg_mean_squared_error", n_jobs=-1,
    )
    train_mse = -train_scores.mean(axis=1)
    val_mse = -val_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_std = val_scores.std(axis=1)

    fig, ax = plt.subplots(figsize=(8, 5))
    ax.plot(train_sizes, train_mse, "o-", label="Train MSE", color="#7c3aed")
    ax.plot(train_sizes, val_mse, "o-", label="Val MSE", color="#06b6d4")
    ax.fill_between(train_sizes, train_mse - train_std, train_mse + train_std,
                    alpha=0.15, color="#7c3aed")
    ax.fill_between(train_sizes, val_mse - val_std, val_mse + val_std,
                    alpha=0.15, color="#06b6d4")
    ax.set_xlabel("Training set size")
    ax.set_ylabel("MSE")
    ax.set_title(title)
    ax.legend()
    plt.tight_layout()
    plt.show()

# High-bias model (underfit)
plot_learning_curve(Ridge(alpha=1e6), X, y, "High Bias: Both errors converge HIGH")

# Good model
plot_learning_curve(Ridge(alpha=1.0), X, y, "Good Fit: Low val error, small gap")
```
Reading the Curves
| Pattern | Diagnosis | Fix |
|---------|-----------|-----|
| Both curves converge HIGH | High bias (underfitting) | More complex model, better features |
| Large gap between train and val | High variance (overfitting) | More data, regularisation, simpler model |
| Val curve still decreasing at right edge | Need more data | Collect more samples |
| Both converge LOW, small gap | Good fit | You're done |
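To read these numbers off a concrete model without plotting, `cross_validate` with `return_train_score=True` gives the endpoints of both curves in one call. A sketch reusing the synthetic data from above; the two Ridge settings are illustrative stand-ins for an underfit and a well-fit model:

```python
from sklearn.model_selection import cross_validate
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=20, noise=25, random_state=42)

results = {}
for name, model in [("underfit", Ridge(alpha=1e6)), ("good", Ridge(alpha=1.0))]:
    cv = cross_validate(model, X, y, cv=5,
                        scoring="neg_mean_squared_error",
                        return_train_score=True)
    train_mse = -cv["train_score"].mean()
    val_mse = -cv["test_score"].mean()
    results[name] = (train_mse, val_mse)
    print(f"{name:>8}: train MSE {train_mse:9.1f} | val MSE {val_mse:9.1f}")
```

The over-regularised model lands in the "both curves converge HIGH" row of the table; the well-tuned one lands in the last row.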
Validation Curves
Validation curves show how training and validation error vary with a single hyperparameter — they reveal the sweet spot between under- and over-fitting.
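scikit-learn ships a `validation_curve` helper for exactly this. A minimal sketch sweeping Ridge's `alpha`; the dataset mirrors the synthetic one used in the Ridge example below, and the sweep range is an assumption:

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=20, random_state=42)

alphas = np.logspace(-3, 5, 20)
train_scores, val_scores = validation_curve(
    Ridge(), X, y,
    param_name="alpha", param_range=alphas,
    cv=5, scoring="neg_mean_squared_error",
)
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)

# The sweet spot is the alpha with the lowest validation error
best_alpha = alphas[np.argmin(val_mse)]
print(f"Best alpha by validation curve: {best_alpha:.3f}")
```

Plotting `train_mse` and `val_mse` against `alphas` on a log x-axis shows the characteristic U-shape of the validation error: underfitting on the right (too much regularisation), overfitting on the left.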
L2 Regularisation (Ridge)
Ridge adds a penalty proportional to the sum of squared coefficients to the loss function:
Loss = MSE + α × Σβᵢ²
The penalty shrinks all coefficients toward zero but never to exactly zero. It mitigates multicollinearity by spreading the coefficient mass across correlated features instead of letting any single coefficient blow up.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=20, random_state=42)

# Always scale before Ridge — penalty is affected by feature scale
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge()),
])

# RidgeCV finds the best alpha via cross-validation
alphas = np.logspace(-3, 5, 50)
ridge_cv = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", RidgeCV(alphas=alphas, cv=5, scoring="neg_mean_squared_error")),
])
ridge_cv.fit(X, y)
best_alpha = ridge_cv.named_steps["ridge"].alpha_
print(f"Best alpha: {best_alpha:.4f}")

# Coefficient shrinkage path
coefs = []
for alpha in alphas:
    pipe.set_params(ridge__alpha=alpha)
    pipe.fit(X, y)
    coefs.append(pipe.named_steps["ridge"].coef_)
coefs = np.array(coefs)

plt.figure(figsize=(10, 5))
for i in range(coefs.shape[1]):
    plt.plot(alphas, coefs[:, i], alpha=0.4, linewidth=1)
plt.axvline(best_alpha, ls="--", color="red", label=f"Best α={best_alpha:.2f}")
plt.xscale("log")
plt.xlabel("α (regularisation strength)")
plt.ylabel("Coefficient value")
plt.title("Ridge Coefficient Shrinkage Path")
plt.legend()
plt.tight_layout()
plt.show()
```
L1 Regularisation (Lasso)
Lasso adds a penalty proportional to the absolute values of coefficients:
Loss = MSE + α × Σ|βᵢ|
The key difference: Lasso drives some coefficients to exactly zero, performing automatic feature selection. This happens because the L1 penalty has corners at zero — the constrained optimum tends to land exactly on an axis, zeroing out unimportant features.
```
L2 (Ridge) constraint:  β₁² + β₂² ≤ r²   (circle)
L1 (Lasso) constraint:  |β₁| + |β₂| ≤ r  (diamond / rhombus)

The MSE contours are ellipses. The constrained minimum is where
the first ellipse touches the constraint boundary.

For L2: the circle has no corners — the touching point is rarely at an axis.
For L1: the diamond HAS corners at (β₁=0, β₂≠0) and (β₁≠0, β₂=0).
        The touching point is very often at a corner → sparse solution.
```
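A minimal sketch of that sparsity in action with `LassoCV`. The dataset mirrors the earlier Ridge example; the expectation that many of the 40 uninformative features are zeroed is typical on data like this, not guaranteed:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=20, random_state=42)

# Scale before Lasso for the same reason as Ridge: the penalty is scale-sensitive
lasso = Pipeline([
    ("scaler", StandardScaler()),
    ("lasso", LassoCV(cv=5, random_state=42)),
])
lasso.fit(X, y)

coef = lasso.named_steps["lasso"].coef_
n_zero = int(np.sum(coef == 0))
print(f"{n_zero} of {coef.size} coefficients are exactly zero")
```

Compare this with the Ridge shrinkage path above: Ridge coefficients approach zero asymptotically, while Lasso's hit zero and stay there.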
Regularisation for Neural Networks: Dropout
Dropout randomly zeroes a fraction p of activations on each training forward pass. In the original formulation, all neurons are active at inference and the weights are scaled by 1−p; modern frameworks (including PyTorch) use "inverted" dropout instead, scaling the surviving activations by 1/(1−p) during training so that inference needs no scaling at all. Either way, it prevents neurons from co-adapting — no neuron can rely on any specific other neuron being present.
```python
import torch
import torch.nn as nn

class RegularisedNet(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, dropout_p=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=dropout_p),  # <-- zero 30% of neurons during training
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=dropout_p),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, x):
        return self.net(x)

# Dropout is automatically disabled during model.eval()
model = RegularisedNet(20, 64, 1)
model.train()   # Dropout active
_ = model(torch.randn(32, 20))
model.eval()    # Dropout disabled
_ = model(torch.randn(32, 20))
```
Early Stopping
For iterative methods (gradient descent, boosting), early stopping halts training when validation error stops improving:
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# X, y from the earlier make_regression example
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

gbm = GradientBoostingRegressor(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=4,
    subsample=0.8,
    random_state=42,
)
gbm.fit(X_tr, y_tr)

# Staged prediction lets us see error at each tree
val_errors = [
    mean_squared_error(y_val, pred)
    for pred in gbm.staged_predict(X_val)
]
best_n = np.argmin(val_errors) + 1
print(f"Optimal n_estimators: {best_n} (min val MSE: {val_errors[best_n-1]:.2f})")

plt.figure(figsize=(9, 4))
plt.plot(val_errors, color="#06b6d4", linewidth=1.5, label="Val MSE")
plt.axvline(best_n - 1, ls="--", color="#f87171", label=f"Best={best_n}")
plt.xlabel("Number of trees")
plt.ylabel("Val MSE")
plt.title("Early Stopping — Gradient Boosting")
plt.legend()
plt.tight_layout()
plt.show()
```
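scikit-learn can also perform this early stopping internally: `GradientBoostingRegressor` accepts `n_iter_no_change` and `validation_fraction`, which hold out a slice of the training data and stop adding trees once the validation score stops improving. A sketch; the hyperparameter values are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=20, random_state=42)

gbm = GradientBoostingRegressor(
    n_estimators=1000,           # upper bound, not the final count
    learning_rate=0.05,
    max_depth=4,
    validation_fraction=0.1,     # 10% of training data held out internally
    n_iter_no_change=10,         # stop after 10 rounds without improvement
    tol=1e-4,
    random_state=42,
)
gbm.fit(X, y)

# n_estimators_ is the number of trees actually fitted
print(f"Stopped after {gbm.n_estimators_} trees (of {gbm.n_estimators} allowed)")
```

This avoids the manual `staged_predict` scan, at the cost of giving up some training data to the internal validation split.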
Measure train error and val error:

- High train error + high val error → high BIAS (underfitting). Fix: increase model capacity, add features, reduce regularisation.
- Low train error + high val error → high VARIANCE (overfitting). Fix: add regularisation, reduce complexity, get more data, use dropout/early stopping.
- Both errors HIGH and converging → high bias; more data won't help. Fix: better features, more complex model.
- Both LOW, large gap → overfitting. Fix: L1/L2 regularisation, early stopping, more data.
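The checklist above can be wrapped in a small helper. The thresholds here (a 25% relative gap, a caller-supplied baseline error) are illustrative assumptions, not canonical values:

```python
def diagnose(train_err, val_err, baseline_err, gap_ratio=0.25):
    """Rough bias/variance diagnosis from scalar errors.

    baseline_err: what you'd consider an acceptable error for the task.
    gap_ratio: relative train/val gap above which we call it overfitting.
    """
    high_train = train_err > baseline_err
    large_gap = (val_err - train_err) > gap_ratio * train_err
    if high_train and not large_gap:
        return "high bias (underfitting)"
    if large_gap:
        return "high variance (overfitting)"
    return "good fit"

print(diagnose(train_err=0.9, val_err=1.0, baseline_err=0.5))   # high bias
print(diagnose(train_err=0.1, val_err=0.5, baseline_err=0.5))   # high variance
print(diagnose(train_err=0.1, val_err=0.12, baseline_err=0.5))  # good fit
```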