Model Evaluation, Validation & Metrics That Matter
26 min
A model is only as good as the evidence you have that it works. Evaluation is not the last step of a project — it is the mechanism that tells you whether you have a solution at all. The wrong metric, a leaky validation set, or a naive train/test split can produce results that look impressive in a notebook and fail immediately in production. This lesson establishes the rigorous evaluation framework that separates credible work from misleading work.
Why Accuracy Fails
Accuracy (correct predictions / total predictions) is the most intuitive metric and the most dangerous one to use uncritically. Consider a dataset with 99% of samples belonging to class 0. A model that predicts class 0 for everything achieves 99% accuracy. It has learned nothing — it cannot identify a single class-1 case.
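The trap is easy to reproduce. A minimal sketch (the dataset and the DummyClassifier baseline here are illustrative, not part of the lesson's running example):
python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# ~99% of samples in class 0, ~1% in class 1
X, y = make_classification(n_samples=5000, weights=[0.99], random_state=42)

# A baseline that always predicts the majority class
dummy = DummyClassifier(strategy='most_frequent').fit(X, y)
y_pred_dummy = dummy.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred_dummy):.3f}")  # ~0.99, looks great
print(f"Recall:   {recall_score(y, y_pred_dummy):.3f}")    # 0.0, finds no positives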
The confusion matrix breaks binary predictions into four cells:
True Positive (TP): predicted positive, actually positive
False Positive (FP): predicted positive, actually negative (Type I error)
True Negative (TN): predicted negative, actually negative
False Negative (FN): predicted negative, actually positive (Type II error)
Different problems weight these errors differently. In fraud detection, FN (missed fraud) is catastrophic. In spam filtering, FP (blocking legitimate email) is unacceptable. The confusion matrix makes all four error types visible.
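A quick sketch of reading those four counts out of sklearn's confusion matrix (the label arrays are illustrative placeholders):
python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels; substitute your own y_true / y_pred
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# sklearn convention: rows = actual class, columns = predicted class
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")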
Precision = TP / (TP + FP): of all predicted positives, how many were actually positive? High precision means few false alarms.
Recall (sensitivity) = TP / (TP + FN): of all actual positives, how many did the model find? High recall means few misses.
F1 = 2 * Precision * Recall / (Precision + Recall): the harmonic mean, punishes extreme imbalance between precision and recall.
F-beta generalises F1. With beta > 1, recall is weighted more heavily (beta = 2 means recall counts twice as much as precision — use for medical diagnosis where missing a case is worse than a false alarm). With beta < 1, precision is weighted more heavily.
python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, f1_score
from sklearn.model_selection import train_test_split

# y_te_i / y_pred: test labels and predictions from the imbalanced-classification
# example earlier in the lesson
for beta in [0.5, 1.0, 2.0]:
    fb = fbeta_score(y_te_i, y_pred, beta=beta)
    print(f"F{beta:.1f}: {fb:.4f} "
          f"({'Precision-heavy' if beta < 1 else 'Recall-heavy' if beta > 1 else 'Balanced'})")

# Multi-class metrics: macro vs weighted vs micro
X_mc, y_mc = make_classification(n_samples=1000, n_classes=4, n_informative=8,
                                 n_features=10, weights=[0.4, 0.3, 0.2, 0.1],
                                 random_state=42)
X_tr_mc, X_te_mc, y_tr_mc, y_te_mc = train_test_split(
    X_mc, y_mc, test_size=0.2, random_state=42)
lr_mc = LogisticRegression(max_iter=500).fit(X_tr_mc, y_tr_mc)
y_pred_mc = lr_mc.predict(X_te_mc)

print("\nMulti-class F1 averaging strategies:")
for avg in ['macro', 'weighted', 'micro']:
    f1_mc = f1_score(y_te_mc, y_pred_mc, average=avg)
    print(f"  {avg:10s}: {f1_mc:.4f}")
print("macro:    unweighted average across classes (treats all classes equally)")
print("weighted: average weighted by support (class frequency)")
print("micro:    aggregate TP/FP/FN across all classes (equals accuracy for single-label problems)")
ROC Curve and AUC
The ROC (Receiver Operating Characteristic) curve plots True Positive Rate (Recall) against False Positive Rate at every possible classification threshold. AUC (Area Under the Curve) summarises model discrimination in a single number:
AUC = 0.5: random classifier (diagonal line)
AUC = 1.0: perfect classifier
AUC = P(model scores a random positive higher than a random negative)
This last interpretation is powerful: it is a measure of ranking quality that does not depend on threshold selection.
python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# y_te_i / y_proba: test labels and predicted probabilities from the earlier example
fpr, tpr, thresholds = roc_curve(y_te_i, y_proba)
auc_roc = roc_auc_score(y_te_i, y_proba)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(fpr, tpr, color='steelblue', linewidth=2, label=f'AUC = {auc_roc:.4f}')
axes[0].plot([0, 1], [0, 1], 'k--', alpha=0.5, label='Random classifier')
axes[0].fill_between(fpr, tpr, alpha=0.1, color='steelblue')
axes[0].set_xlabel('False Positive Rate (FPR)')
axes[0].set_ylabel('True Positive Rate (TPR = Recall)')
axes[0].set_title('ROC Curve')

# Mark the operating point closest to threshold = 0.5
idx_balanced = np.argmin(np.abs(thresholds - 0.5))
axes[0].scatter([fpr[idx_balanced]], [tpr[idx_balanced]], color='red', s=100,
                zorder=5, label='Threshold = 0.5')
axes[0].legend()

# Annotate what AUC means operationally
axes[1].axis('off')
axes[1].text(0.1, 0.7, f"AUC = {auc_roc:.4f}", fontsize=20, transform=axes[1].transAxes)
axes[1].text(0.1, 0.5, "Interpretation:", fontsize=13, transform=axes[1].transAxes)
axes[1].text(0.1, 0.4,
             f"A randomly chosen positive sample\nranks higher than a randomly chosen\n"
             f"negative sample {auc_roc:.1%} of the time.",
             fontsize=10, transform=axes[1].transAxes)
plt.tight_layout()
plt.savefig('roc_curve.png', dpi=150, bbox_inches='tight')
plt.show()
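The ranking interpretation can be checked by brute force. A small sketch on synthetic scores (the data is illustrative, and roc_auc_score does not compute AUC this way internally, but the two numbers agree):
python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=200)
scores = rng.random(200) + 0.5 * y_true   # positives tend to score higher

pos, neg = scores[y_true == 1], scores[y_true == 0]
# Fraction of (positive, negative) pairs where the positive outranks the negative
pairwise = (pos[:, None] > neg[None, :]).mean()

print(f"Pairwise P(pos > neg): {pairwise:.4f}")
print(f"roc_auc_score:         {roc_auc_score(y_true, scores):.4f}")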
Precision-Recall Curve and PR-AUC
When positive examples are rare, ROC curves can be optimistic because they include TN in the FPR denominator — a very large pool of negatives makes FPR artificially low. The PR curve plots Precision against Recall and is more informative for imbalanced problems.
PR-AUC (also called Average Precision) equals the area under the PR curve. A random classifier achieves PR-AUC equal to the fraction of positives in the dataset (here ~1%). Prefer PR-AUC over ROC-AUC for severely imbalanced classification.
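A minimal sketch of plotting the PR curve for the running example (it assumes the y_te_i and y_proba arrays from the earlier blocks):
python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score

precision, recall, _ = precision_recall_curve(y_te_i, y_proba)
ap = average_precision_score(y_te_i, y_proba)
baseline = y_te_i.mean()  # a random classifier's PR-AUC = fraction of positives

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(recall, precision, color='steelblue', label=f'AP = {ap:.4f}')
ax.axhline(baseline, color='k', linestyle='--', label=f'Random (AP = {baseline:.3f})')
ax.set_xlabel('Recall')
ax.set_ylabel('Precision')
ax.set_title('Precision-Recall Curve')
ax.legend()
plt.tight_layout()
plt.show()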
Probability Calibration
A well-calibrated model produces probabilities that match empirical frequencies: when the model says P(positive) = 0.7, approximately 70% of those cases should actually be positive. Reliability diagrams (calibration curves) reveal miscalibration.
python
from sklearn.calibration import calibration_curve
from sklearn.ensemble import RandomForestClassifier

# Compare calibration of different models
# (X_tr_is, y_tr_i, X_te_is, y_te_i, y_proba come from the earlier imbalanced example)
rf_cls = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
rf_cls.fit(X_tr_is, y_tr_i)
rf_proba = rf_cls.predict_proba(X_te_is)[:, 1]

fig, ax = plt.subplots(figsize=(7, 5))
for proba_vals, model_name, color in [
    (y_proba, 'Logistic Regression', 'steelblue'),
    (rf_proba, 'Random Forest', 'coral'),
]:
    frac_pos, mean_pred = calibration_curve(y_te_i, proba_vals, n_bins=10)
    ax.plot(mean_pred, frac_pos, 'o-', label=model_name, color=color)
ax.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')
ax.set_xlabel('Mean predicted probability')
ax.set_ylabel('Fraction of positives')
ax.set_title('Calibration Curves (Reliability Diagrams)')
ax.legend()
plt.tight_layout()
plt.savefig('calibration_reliability.png', dpi=150, bbox_inches='tight')
plt.show()

print("Logistic regression is well-calibrated by design (probabilistic model).")
print("Random forests tend to produce over-confident probabilities (too extreme).")
print("Fix: use CalibratedClassifierCV with method='isotonic' or 'sigmoid'.")
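A sketch of that fix, reusing the training arrays assumed from the block above:
python
from sklearn.calibration import CalibratedClassifierCV

# Wraps the forest: internal CV fits the model, then learns a monotonic
# (isotonic) mapping from raw scores to calibrated probabilities.
# Prefer method='sigmoid' (Platt scaling) when calibration data is scarce.
calibrated_rf = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42),
    method='isotonic', cv=5)
calibrated_rf.fit(X_tr_is, y_tr_i)
cal_proba = calibrated_rf.predict_proba(X_te_is)[:, 1]  # re-plot the reliability curve with these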
Log Loss: The Proper Scoring Rule
Log loss penalises confident wrong predictions harshly: the penalty is the negative log of the probability assigned to the true class, which grows without bound as that probability approaches zero. A prediction of 0.99 for the wrong class is far worse than a prediction of 0.6 for the wrong class. It is a proper scoring rule — the minimum expected log loss is achieved only by predicting the true probabilities.
Log Loss = -(1/N) * Σ [y_i * log(p_i) + (1 - y_i) * log(1 - p_i)]
python
from sklearn.metrics import log_loss

# Log loss examples: the penalty for a single prediction
print("Log loss by prediction confidence:")
for true_label, predicted_prob in [(1, 0.9), (1, 0.6), (1, 0.4), (1, 0.1)]:
    ll = -np.log(predicted_prob)          # true class is 1: penalty = -log(p)
    print(f"  True=1, Predicted={predicted_prob}: log loss = {ll:.4f}")
print()
for true_label, predicted_prob in [(0, 0.9), (0, 0.6), (0, 0.4), (0, 0.1)]:
    ll = -np.log(1 - predicted_prob)      # true class is 0: penalty = -log(1 - p)
    print(f"  True=0, Predicted(pos)={predicted_prob}: log loss = {ll:.4f}")

print(f"\nModel log loss: {log_loss(y_te_i, y_proba):.4f}")
print(f"Baseline (constant base-rate prediction) log loss: "
      f"{log_loss(y_te_i, np.full_like(y_proba, y_te_i.mean())):.4f}")
Regression Metrics
For continuous prediction targets, a different set of metrics applies. Each makes different assumptions about the cost of errors.
python
# Generate regression data
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_absolute_percentage_error, r2_score)
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()
X_h, y_h = housing.data, housing.target
X_tr_h, X_te_h, y_tr_h, y_te_h = train_test_split(X_h, y_h, test_size=0.2, random_state=42)

scaler_h = StandardScaler()
X_tr_hs = scaler_h.fit_transform(X_tr_h)
X_te_hs = scaler_h.transform(X_te_h)

ridge_h = Ridge().fit(X_tr_hs, y_tr_h)
y_pred_h = ridge_h.predict(X_te_hs)

mae = mean_absolute_error(y_te_h, y_pred_h)
mse = mean_squared_error(y_te_h, y_pred_h)
rmse = np.sqrt(mse)
r2 = r2_score(y_te_h, y_pred_h)
adj_r2 = 1 - (1 - r2) * (len(y_te_h) - 1) / (len(y_te_h) - X_te_hs.shape[1] - 1)

# MAPE — avoid if y can be near zero
nonzero_mask = y_te_h > 0.1
mape = mean_absolute_percentage_error(y_te_h[nonzero_mask], y_pred_h[nonzero_mask])

print("Regression Metrics — California Housing Prices:")
print(f"  MAE:         {mae:.4f}  (average absolute error, same units as y)")
print(f"  RMSE:        {rmse:.4f}  (penalises large errors more than MAE)")
print(f"  R²:          {r2:.4f}  (fraction of variance explained)")
print(f"  Adjusted R²: {adj_r2:.4f}  (penalises extra features — use for model comparison)")
print(f"  MAPE:        {mape:.4f}  (percentage error — fails when y near zero)")

print("\nWhen to use each:")
print("  MAE:    when all errors are equally costly")
print("  RMSE:   when large errors are disproportionately costly (outliers matter)")
print("  R²:     to compare against a baseline (mean prediction); 0 = baseline, 1 = perfect")
print("  Adj R²: comparing models with different numbers of features")
print("  MAPE:   when proportional error matters more than absolute error")
Cross-Validation Strategies
A single train/test split is fragile — results depend on which samples happened to end up in each set. Cross-validation uses multiple splits and averages the metric, giving more stable estimates.
python
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X_cv, y_cv = make_classification(n_samples=800, n_features=10, n_informative=5, random_state=42)
model_cv = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)

# K-Fold (not stratified — not ideal for classification)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
kf_scores = cross_val_score(model_cv, X_cv, y_cv, cv=kf, scoring='roc_auc')

# Stratified K-Fold (maintains class proportions in each fold — always use for classification)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
skf_scores = cross_val_score(model_cv, X_cv, y_cv, cv=skf, scoring='roc_auc')

print("Cross-Validation AUC Scores:")
print(f"  KFold:            {kf_scores.round(4)}  mean={kf_scores.mean():.4f} ± {kf_scores.std():.4f}")
print(f"  Stratified KFold: {skf_scores.round(4)}  mean={skf_scores.mean():.4f} ± {skf_scores.std():.4f}")
Time Series Cross-Validation
For time series data, random splitting leaks future information into training. TimeSeriesSplit always trains on past data and validates on future data.
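A minimal sketch of how TimeSeriesSplit produces expanding-window folds (the sequential data here is illustrative):
python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Synthetic sequential data; in practice, rows must already be in time order
X_ts = np.arange(100).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X_ts)):
    print(f"Fold {fold}: train [{train_idx[0]}..{train_idx[-1]}] "
          f"-> test [{test_idx[0]}..{test_idx[-1]}]")
# Each fold trains strictly on the past and validates on the future;
# the training window grows while the test window slides forward.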
Data Leakage
Data leakage occurs when information from the validation or test set flows into the model during training. It produces optimistically biased evaluation results and models that fail in production. The most common form is fitting preprocessing steps (scalers, encoders, imputers) on the full dataset, outside the cross-validation fold.
python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_leak, y_leak = make_classification(n_samples=1000, n_features=20, n_informative=5,
                                     random_state=42)

# --- WRONG: leaky implementation ---
# StandardScaler is fit on ALL data (including validation folds)
scaler_leaky = StandardScaler()
X_leak_scaled_wrong = scaler_leaky.fit_transform(X_leak)  # uses test-fold statistics!

lr_leak = LogisticRegression(max_iter=500)
wrong_scores = cross_val_score(
    lr_leak, X_leak_scaled_wrong, y_leak,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='roc_auc')
print("LEAKY implementation (scaler fit on all data before CV):")
print(f"  AUC scores: {wrong_scores.round(4)}  mean={wrong_scores.mean():.4f}")

# --- CORRECT: scaler inside a Pipeline (re-fit on the training fold each time) ---
correct_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=500)),
])
correct_scores = cross_val_score(
    correct_pipeline, X_leak, y_leak,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='roc_auc')
print("\nCORRECT implementation (scaler inside Pipeline):")
print(f"  AUC scores: {correct_scores.round(4)}  mean={correct_scores.mean():.4f}")

print(f"\nBias from leakage: {wrong_scores.mean() - correct_scores.mean():.4f}")
print("Always put ALL preprocessing steps inside a sklearn Pipeline!")
Nested Cross-Validation for Model Selection
Using the same data for both hyperparameter tuning and final evaluation produces optimistically biased estimates. Nested CV separates these concerns: the inner loop selects hyperparameters; the outer loop evaluates the selected model on held-out data.
python
from sklearn.model_selection import GridSearchCV, cross_val_score

X_nc, y_nc = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=42)
pipe_nc = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=42)),
])
param_grid = {
    'clf__n_estimators': [50, 100],
    'clf__max_depth': [3, 5, None],
}

# Naive (wrong): selecting the best params on all the data and then evaluating
# on that same data makes test performance optimistic.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

gs = GridSearchCV(pipe_nc, param_grid, cv=inner_cv, scoring='roc_auc', n_jobs=-1)

# Correct nested CV: inner CV selects params, outer CV evaluates
nested_scores = cross_val_score(gs, X_nc, y_nc, cv=outer_cv, scoring='roc_auc', n_jobs=-1)

print(f"Nested CV AUC: {nested_scores.round(4)}")
print(f"Mean: {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")
print("\nThis estimate is unbiased: the outer fold data was never used to tune params.")
Choosing the Right Metric
python
print("""Metric Selection Guide:---Classification: Balanced classes + probabilities matter: ROC-AUC + Log Loss Imbalanced classes, positives are rare: PR-AUC (Average Precision) Need a single decision threshold: F1 (balanced) / F2 (recall-heavy) / F0.5 (precision-heavy) Need calibrated probabilities: Log Loss + calibration curveRegression: Outliers in target are meaningful: RMSE All errors equally costly: MAE Proportional errors matter, y != 0: MAPE Comparing models / baselines: R² or Adjusted R²Universal rules: - Use stratified splits for classification - Use TimeSeriesSplit for sequential data - Put ALL preprocessing inside Pipeline to prevent leakage - Use nested CV when performing model selection - Never report a single metric — report multiple complementary metrics""")
Key Takeaways
Accuracy is misleading on imbalanced datasets; a model predicting the majority class achieves high accuracy while finding zero positives — always report precision, recall, and F1 alongside accuracy.
The confusion matrix makes all four error types (TP/FP/TN/FN) explicit; the relative cost of FP vs FN is problem-specific and should drive threshold choice via the precision-recall curve.
ROC-AUC measures ranking quality — the probability that a randomly chosen positive scores higher than a random negative; PR-AUC is more informative when positives are rare.
Calibration curves (reliability diagrams) reveal whether predicted probabilities match empirical frequencies; logistic regression is naturally calibrated; trees and SVMs often require explicit recalibration.
Stratified K-Fold is the standard for classification; TimeSeriesSplit is mandatory for sequential data where future information must never appear in training; use 5-10 folds for stable estimates.
Data leakage — fitting preprocessing on data that includes the validation fold — produces optimistically biased evaluation; always use sklearn Pipeline to ensure preprocessing is fit only on training folds.
Nested cross-validation uses an inner loop for hyperparameter selection and an outer loop for unbiased evaluation; using a single CV loop for both selection and evaluation over-estimates generalisation performance.
For regression, choose MAE when all errors are equally costly, RMSE when large errors are disproportionately bad, R² for baseline comparison, and adjusted R² when comparing models with different feature counts.