Model Evaluation, Validation & Metrics That Matter
26 min
A model is only as good as the evidence you have that it works. Evaluation is not the last step of a project — it is the mechanism that tells you whether you have a solution at all. The wrong metric, a leaky validation set, or a naive train/test split can produce results that look impressive in a notebook and fail immediately in production. This lesson establishes the rigorous evaluation framework that separates credible work from misleading work.
Why Accuracy Fails
Accuracy (correct predictions / total predictions) is the most intuitive metric and the most dangerous one to use uncritically. Consider a dataset with 99% of samples belonging to class 0. A model that predicts class 0 for everything achieves 99% accuracy. It has learned nothing — it cannot identify a single class-1 case.
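The trap is easy to reproduce. A minimal sketch (the dataset and the DummyClassifier baseline here are illustrative, not part of the lesson's running example):
python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# ~99% of samples in class 0, ~1% in class 1
X, y = make_classification(n_samples=5000, weights=[0.99], random_state=42)

# A baseline that always predicts the majority class
dummy = DummyClassifier(strategy='most_frequent').fit(X, y)
y_pred_dummy = dummy.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred_dummy):.3f}")  # ~0.99, looks great
print(f"Recall:   {recall_score(y, y_pred_dummy):.3f}")    # 0.0, finds no positives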
The confusion matrix breaks binary predictions into four cells:
True Positive (TP): predicted positive, actually positive
False Positive (FP): predicted positive, actually negative (Type I error)
True Negative (TN): predicted negative, actually negative
False Negative (FN): predicted negative, actually positive (Type II error)
Different problems weight these errors differently. In fraud detection, FN (missed fraud) is catastrophic. In spam filtering, FP (blocking legitimate email) is unacceptable. The confusion matrix makes all four error types visible.
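A quick sketch of reading those four counts out of sklearn's confusion matrix (the label arrays are illustrative placeholders):
python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels; substitute your own y_true / y_pred
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# sklearn convention: rows = actual class, columns = predicted class
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")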
Precision = TP / (TP + FP): of all predicted positives, how many were actually positive? High precision means few false alarms.
Recall (sensitivity) = TP / (TP + FN): of all actual positives, how many did the model find? High recall means few misses.
F1 = 2 * Precision * Recall / (Precision + Recall): the harmonic mean, punishes extreme imbalance between precision and recall.
F-beta generalises F1. With beta > 1, recall is weighted more heavily (beta = 2 means recall counts twice as much as precision — use for medical diagnosis where missing a case is worse than a false alarm). With beta < 1, precision is weighted more heavily.
python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, f1_score
from sklearn.model_selection import train_test_split

# y_te_i / y_pred: test labels and predictions from the imbalanced-classification
# example earlier in the lesson
for beta in [0.5, 1.0, 2.0]:
    fb = fbeta_score(y_te_i, y_pred, beta=beta)
    print(f"F{beta:.1f}: {fb:.4f} "
          f"({'Precision-heavy' if beta < 1 else 'Recall-heavy' if beta > 1 else 'Balanced'})")

# Multi-class metrics: macro vs weighted vs micro
X_mc, y_mc = make_classification(n_samples=1000, n_classes=4, n_informative=8,
                                 n_features=10, weights=[0.4, 0.3, 0.2, 0.1],
                                 random_state=42)
X_tr_mc, X_te_mc, y_tr_mc, y_te_mc = train_test_split(
    X_mc, y_mc, test_size=0.2, random_state=42)
lr_mc = LogisticRegression(max_iter=500).fit(X_tr_mc, y_tr_mc)
y_pred_mc = lr_mc.predict(X_te_mc)

print("\nMulti-class F1 averaging strategies:")
for avg in ['macro', 'weighted', 'micro']:
    f1_mc = f1_score(y_te_mc, y_pred_mc, average=avg)
    print(f"  {avg:10s}: {f1_mc:.4f}")
print("macro:    unweighted average across classes (treats all classes equally)")
print("weighted: average weighted by support (class frequency)")
print("micro:    aggregate TP/FP/FN across all classes (equals accuracy for single-label problems)")
ROC Curve and AUC
The ROC (Receiver Operating Characteristic) curve plots True Positive Rate (Recall) against False Positive Rate at every possible classification threshold. AUC (Area Under the Curve) summarises model discrimination in a single number:
AUC = 0.5: random classifier (diagonal line)
AUC = 1.0: perfect classifier
AUC = P(model scores a random positive higher than a random negative)
This last interpretation is powerful: it is a measure of ranking quality that does not depend on threshold selection.
python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# y_te_i / y_proba: test labels and predicted probabilities from the earlier example
fpr, tpr, thresholds = roc_curve(y_te_i, y_proba)
auc_roc = roc_auc_score(y_te_i, y_proba)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(fpr, tpr, color='steelblue', linewidth=2, label=f'AUC = {auc_roc:.4f}')
axes[0].plot([0, 1], [0, 1], 'k--', alpha=0.5, label='Random classifier')
axes[0].fill_between(fpr, tpr, alpha=0.1, color='steelblue')
axes[0].set_xlabel('False Positive Rate (FPR)')
axes[0].set_ylabel('True Positive Rate (TPR = Recall)')
axes[0].set_title('ROC Curve')

# Mark the operating point closest to threshold = 0.5
idx_balanced = np.argmin(np.abs(thresholds - 0.5))
axes[0].scatter([fpr[idx_balanced]], [tpr[idx_balanced]], color='red', s=100,
                zorder=5, label='Threshold = 0.5')
axes[0].legend()

# Annotate what AUC means operationally
axes[1].axis('off')
axes[1].text(0.1, 0.7, f"AUC = {auc_roc:.4f}", fontsize=20, transform=axes[1].transAxes)
axes[1].text(0.1, 0.5, "Interpretation:", fontsize=13, transform=axes[1].transAxes)
axes[1].text(0.1, 0.4,
             f"A randomly chosen positive sample\nranks higher than a randomly chosen\n"
             f"negative sample {auc_roc:.1%} of the time.",
             fontsize=10, transform=axes[1].transAxes)
plt.tight_layout()
plt.savefig('roc_curve.png', dpi=150, bbox_inches='tight')
plt.show()
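The ranking interpretation can be checked by brute force. A small sketch on synthetic scores (the data is illustrative, and roc_auc_score does not compute AUC this way internally, but the two numbers agree):
python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=200)
scores = rng.random(200) + 0.5 * y_true   # positives tend to score higher

pos, neg = scores[y_true == 1], scores[y_true == 0]
# Fraction of (positive, negative) pairs where the positive outranks the negative
pairwise = (pos[:, None] > neg[None, :]).mean()

print(f"Pairwise P(pos > neg): {pairwise:.4f}")
print(f"roc_auc_score:         {roc_auc_score(y_true, scores):.4f}")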
Precision-Recall Curve and PR-AUC
When positive examples are rare, ROC curves can be optimistic because they include TN in the FPR denominator — a very large pool of negatives makes FPR artificially low. The PR curve plots Precision against Recall and is more informative for imbalanced problems.
PR-AUC (also called Average Precision) equals the area under the PR curve. A random classifier achieves PR-AUC equal to the fraction of positives in the dataset (here ~1%). Prefer PR-AUC over ROC-AUC for severely imbalanced classification.
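A minimal sketch of plotting the PR curve for the running example (it assumes the y_te_i and y_proba arrays from the earlier blocks):
python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score

precision, recall, _ = precision_recall_curve(y_te_i, y_proba)
ap = average_precision_score(y_te_i, y_proba)
baseline = y_te_i.mean()  # a random classifier's PR-AUC = fraction of positives

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(recall, precision, color='steelblue', label=f'AP = {ap:.4f}')
ax.axhline(baseline, color='k', linestyle='--', label=f'Random (AP = {baseline:.3f})')
ax.set_xlabel('Recall')
ax.set_ylabel('Precision')
ax.set_title('Precision-Recall Curve')
ax.legend()
plt.tight_layout()
plt.show()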
Probability Calibration
A well-calibrated model produces probabilities that match empirical frequencies: when the model says P(positive) = 0.7, approximately 70% of those cases should actually be positive. Reliability diagrams (calibration curves) reveal miscalibration.
python
from sklearn.calibration import calibration_curve
from sklearn.ensemble import RandomForestClassifier

# Compare calibration of different models
# (X_tr_is, y_tr_i, X_te_is, y_te_i, y_proba come from the earlier imbalanced example)
rf_cls = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
rf_cls.fit(X_tr_is, y_tr_i)
rf_proba = rf_cls.predict_proba(X_te_is)[:, 1]

fig, ax = plt.subplots(figsize=(7, 5))
for proba_vals, model_name, color in [
    (y_proba, 'Logistic Regression', 'steelblue'),
    (rf_proba, 'Random Forest', 'coral'),
]:
    frac_pos, mean_pred = calibration_curve(y_te_i, proba_vals, n_bins=10)
    ax.plot(mean_pred, frac_pos, 'o-', label=model_name, color=color)
ax.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')
ax.set_xlabel('Mean predicted probability')
ax.set_ylabel('Fraction of positives')
ax.set_title('Calibration Curves (Reliability Diagrams)')
ax.legend()
plt.tight_layout()
plt.savefig('calibration_reliability.png', dpi=150, bbox_inches='tight')
plt.show()

print("Logistic regression is well-calibrated by design (probabilistic model).")
print("Random forests tend to produce over-confident probabilities (too extreme).")
print("Fix: use CalibratedClassifierCV with method='isotonic' or 'sigmoid'.")
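A sketch of that fix, reusing the training arrays assumed from the block above:
python
from sklearn.calibration import CalibratedClassifierCV

# Wraps the forest: internal CV fits the model, then learns a monotonic
# (isotonic) mapping from raw scores to calibrated probabilities.
# Prefer method='sigmoid' (Platt scaling) when calibration data is scarce.
calibrated_rf = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42),
    method='isotonic', cv=5)
calibrated_rf.fit(X_tr_is, y_tr_i)
cal_proba = calibrated_rf.predict_proba(X_te_is)[:, 1]  # re-plot the reliability curve with these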
Log Loss: The Proper Scoring Rule
Log loss penalises confident wrong predictions harshly: the penalty is the negative log of the probability assigned to the true class, which grows without bound as that probability approaches zero. A prediction of 0.99 for the wrong class is far worse than a prediction of 0.6 for the wrong class. It is a proper scoring rule — the minimum expected log loss is achieved only by predicting the true probabilities.
Log Loss = -(1/N) * Σ [y_i * log(p_i) + (1 - y_i) * log(1 - p_i)]
python
from sklearn.metrics import log_loss

# Log loss examples: the penalty for a single prediction
print("Log loss by prediction confidence:")
for true_label, predicted_prob in [(1, 0.9), (1, 0.6), (1, 0.4), (1, 0.1)]:
    ll = -np.log(predicted_prob)          # true class is 1: penalty = -log(p)
    print(f"  True=1, Predicted={predicted_prob}: log loss = {ll:.4f}")
print()
for true_label, predicted_prob in [(0, 0.9), (0, 0.6), (0, 0.4), (0, 0.1)]:
    ll = -np.log(1 - predicted_prob)      # true class is 0: penalty = -log(1 - p)
    print(f"  True=0, Predicted(pos)={predicted_prob}: log loss = {ll:.4f}")

print(f"\nModel log loss: {log_loss(y_te_i, y_proba):.4f}")
print(f"Baseline (constant base-rate prediction) log loss: "
      f"{log_loss(y_te_i, np.full_like(y_proba, y_te_i.mean())):.4f}")
Regression Metrics
For continuous prediction targets, a different set of metrics applies. Each makes different assumptions about the cost of errors.
python
# Generate regression data
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_absolute_percentage_error, r2_score)
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()
X_h, y_h = housing.data, housing.target
X_tr_h, X_te_h, y_tr_h, y_te_h = train_test_split(X_h, y_h, test_size=0.2, random_state=42)

scaler_h = StandardScaler()
X_tr_hs = scaler_h.fit_transform(X_tr_h)
X_te_hs = scaler_h.transform(X_te_h)

ridge_h = Ridge().fit(X_tr_hs, y_tr_h)
y_pred_h = ridge_h.predict(X_te_hs)

mae = mean_absolute_error(y_te_h, y_pred_h)
mse = mean_squared_error(y_te_h, y_pred_h)
rmse = np.sqrt(mse)
r2 = r2_score(y_te_h, y_pred_h)
adj_r2 = 1 - (1 - r2) * (len(y_te_h) - 1) / (len(y_te_h) - X_te_hs.shape[1] - 1)

# MAPE — avoid if y can be near zero
nonzero_mask = y_te_h > 0.1
mape = mean_absolute_percentage_error(y_te_h[nonzero_mask], y_pred_h[nonzero_mask])

print("Regression Metrics — California Housing Prices:")
print(f"  MAE:         {mae:.4f}  (average absolute error, same units as y)")
print(f"  RMSE:        {rmse:.4f}  (penalises large errors more than MAE)")
print(f"  R²:          {r2:.4f}  (fraction of variance explained)")
print(f"  Adjusted R²: {adj_r2:.4f}  (penalises extra features — use for model comparison)")
print(f"  MAPE:        {mape:.4f}  (percentage error — fails when y near zero)")

print("\nWhen to use each:")
print("  MAE:    when all errors are equally costly")
print("  RMSE:   when large errors are disproportionately costly (outliers matter)")
print("  R²:     to compare against a baseline (mean prediction); 0 = baseline, 1 = perfect")
print("  Adj R²: comparing models with different numbers of features")
print("  MAPE:   when proportional error matters more than absolute error")
Cross-Validation Strategies
A single train/test split is fragile — results depend on which samples happened to end up in each set. Cross-validation uses multiple splits and averages the metric, giving more stable estimates.
python
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X_cv, y_cv = make_classification(n_samples=800, n_features=10, n_informative=5, random_state=42)
model_cv = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)

# K-Fold (not stratified — not ideal for classification)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
kf_scores = cross_val_score(model_cv, X_cv, y_cv, cv=kf, scoring='roc_auc')

# Stratified K-Fold (maintains class proportions in each fold — always use for classification)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
skf_scores = cross_val_score(model_cv, X_cv, y_cv, cv=skf, scoring='roc_auc')

print("Cross-Validation AUC Scores:")
print(f"  KFold:            {kf_scores.round(4)}  mean={kf_scores.mean():.4f} ± {kf_scores.std():.4f}")
print(f"  Stratified KFold: {skf_scores.round(4)}  mean={skf_scores.mean():.4f} ± {skf_scores.std():.4f}")
Time Series Cross-Validation
For time series data, random splitting leaks future information into training. TimeSeriesSplit always trains on past data and validates on future data.
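A minimal sketch of how TimeSeriesSplit produces expanding-window folds (the sequential data here is illustrative):
python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Synthetic sequential data; in practice, rows must already be in time order
X_ts = np.arange(100).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X_ts)):
    print(f"Fold {fold}: train [{train_idx[0]}..{train_idx[-1]}] "
          f"-> test [{test_idx[0]}..{test_idx[-1]}]")
# Each fold trains strictly on the past and validates on the future;
# the training window grows while the test window slides forward.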
Data Leakage
Data leakage occurs when information from the validation or test set flows into the model during training. It produces optimistically biased evaluation results and models that fail in production. The most common form is fitting preprocessing steps (scalers, encoders, imputers) on the full dataset, outside the cross-validation fold.
python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_leak, y_leak = make_classification(n_samples=1000, n_features=20, n_informative=5,
                                     random_state=42)

# --- WRONG: leaky implementation ---
# StandardScaler is fit on ALL data (including validation folds)
scaler_leaky = StandardScaler()
X_leak_scaled_wrong = scaler_leaky.fit_transform(X_leak)  # uses test-fold statistics!

lr_leak = LogisticRegression(max_iter=500)
wrong_scores = cross_val_score(
    lr_leak, X_leak_scaled_wrong, y_leak,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='roc_auc')
print("LEAKY implementation (scaler fit on all data before CV):")
print(f"  AUC scores: {wrong_scores.round(4)}  mean={wrong_scores.mean():.4f}")

# --- CORRECT: scaler inside a Pipeline (re-fit on the training fold each time) ---
correct_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=500)),
])
correct_scores = cross_val_score(
    correct_pipeline, X_leak, y_leak,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='roc_auc')
print("\nCORRECT implementation (scaler inside Pipeline):")
print(f"  AUC scores: {correct_scores.round(4)}  mean={correct_scores.mean():.4f}")

print(f"\nBias from leakage: {wrong_scores.mean() - correct_scores.mean():.4f}")
print("Always put ALL preprocessing steps inside a sklearn Pipeline!")
Nested Cross-Validation for Model Selection
Using the same data for both hyperparameter tuning and final evaluation produces optimistically biased estimates. Nested CV separates these concerns: the inner loop selects hyperparameters; the outer loop evaluates the selected model on held-out data.
python
from sklearn.model_selection import GridSearchCV, cross_val_score

X_nc, y_nc = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=42)
pipe_nc = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=42)),
])
param_grid = {
    'clf__n_estimators': [50, 100],
    'clf__max_depth': [3, 5, None],
}

# Naive (wrong): selecting the best params on all the data and then evaluating
# on that same data makes test performance optimistic.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

gs = GridSearchCV(pipe_nc, param_grid, cv=inner_cv, scoring='roc_auc', n_jobs=-1)

# Correct nested CV: inner CV selects params, outer CV evaluates
nested_scores = cross_val_score(gs, X_nc, y_nc, cv=outer_cv, scoring='roc_auc', n_jobs=-1)

print(f"Nested CV AUC: {nested_scores.round(4)}")
print(f"Mean: {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")
print("\nThis estimate is unbiased: the outer fold data was never used to tune params.")
Choosing the Right Metric
python
print("""Metric Selection Guide:---Classification: Balanced classes + probabilities matter: ROC-AUC + Log Loss Imbalanced classes, positives are rare: PR-AUC (Average Precision) Need a single decision threshold: F1 (balanced) / F2 (recall-heavy) / F0.5 (precision-heavy) Need calibrated probabilities: Log Loss + calibration curveRegression: Outliers in target are meaningful: RMSE All errors equally costly: MAE Proportional errors matter, y != 0: MAPE Comparing models / baselines: R² or Adjusted R²Universal rules: - Use stratified splits for classification - Use TimeSeriesSplit for sequential data - Put ALL preprocessing inside Pipeline to prevent leakage - Use nested CV when performing model selection - Never report a single metric — report multiple complementary metrics""")
Key Takeaways
Accuracy is misleading on imbalanced datasets; a model predicting the majority class achieves high accuracy while finding zero positives — always report precision, recall, and F1 alongside accuracy.
The confusion matrix makes all four error types (TP/FP/TN/FN) explicit; the relative cost of FP vs FN is problem-specific and should drive threshold choice via the precision-recall curve.
ROC-AUC measures ranking quality — the probability that a randomly chosen positive scores higher than a random negative; PR-AUC is more informative when positives are rare.
Calibration curves (reliability diagrams) reveal whether predicted probabilities match empirical frequencies; logistic regression is naturally calibrated; trees and SVMs often require explicit recalibration.
Stratified K-Fold is the standard for classification; TimeSeriesSplit is mandatory for sequential data where future information must never appear in training; use 5-10 folds for stable estimates.
Data leakage — fitting preprocessing on data that includes the validation fold — produces optimistically biased evaluation; always use sklearn Pipeline to ensure preprocessing is fit only on training folds.
Nested cross-validation uses an inner loop for hyperparameter selection and an outer loop for unbiased evaluation; using a single CV loop for both selection and evaluation over-estimates generalisation performance.
For regression, choose MAE when all errors are equally costly, RMSE when large errors are disproportionately bad, R² for baseline comparison, and adjusted R² when comparing models with different feature counts.