Classification Algorithms — From Logistic Regression to Gradient Boosting
Classification is the task of assigning inputs to discrete categories. The field is rich with algorithms that make radically different assumptions, have different strengths, and fail in different ways. A practitioner who understands why each algorithm works — not just how to call .fit() — can choose the right tool for each problem and diagnose failures when they occur.
Decision Trees
A decision tree recursively partitions the feature space by choosing splits that maximise information gain. At each node, the algorithm evaluates every feature and every split point, selecting the one that most reduces impurity in the resulting child nodes.
Entropy and Information Gain
Entropy measures the impurity (disorder) of a set S:
H(S) = -sum over classes c of p_c * log2(p_c)
A pure node (all one class) has H = 0. Maximum entropy for a binary problem is H = 1 (equal class proportions).
Information Gain for a split on feature A:
IG(S, A) = H(S) - sum over values v of (|S_v| / |S|) * H(S_v)
The algorithm selects the split that maximises IG.
Gini impurity is an alternative to entropy:
Gini(S) = 1 - sum over c of p_c^2
Gini is slightly faster to compute (no logarithm) and is the default in sklearn.
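The formulas above are short enough to verify by hand. A minimal sketch (using NumPy and toy label lists rather than the chapter's dataset) computes entropy, Gini, and the information gain of a split:

```python
import numpy as np

def entropy(labels):
    # H(S) = -sum_c p_c * log2(p_c); zero-probability classes never appear in counts
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)) + 0.0)

def gini(labels):
    # Gini(S) = 1 - sum_c p_c^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def information_gain(parent, left, right):
    # IG = H(parent) minus the size-weighted entropy of the children
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)

print(entropy([1, 1, 1, 1]), gini([1, 1, 1, 1]))  # pure node: 0.0 0.0
print(entropy([0, 0, 1, 1]), gini([0, 0, 1, 1]))  # 50/50 binary: 1.0 0.5
print(information_gain([0, 0, 1, 1], [0, 0], [1, 1]))  # perfect split: 1.0
```

A perfect split drives both child entropies to zero, so its information gain equals the parent's entropy — the maximum achievable.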
```python
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt

# Unregularised tree — overfits badly
tree_deep = DecisionTreeClassifier(random_state=42)
tree_deep.fit(X_train, y_train)
print(f"\nDeep tree depth: {tree_deep.get_depth()}")
print(f"Train accuracy: {tree_deep.score(X_train, y_train):.4f}")
print(f"Test accuracy: {tree_deep.score(X_test, y_test):.4f}")

# Regularised tree
tree_reg = DecisionTreeClassifier(
    max_depth=6,
    min_samples_leaf=20,
    min_samples_split=40,
    class_weight='balanced',  # address imbalance
    random_state=42,
)
tree_reg.fit(X_train, y_train)
print(f"\nRegularised tree depth: {tree_reg.get_depth()}")
print(f"Train AUC: {roc_auc_score(y_train, tree_reg.predict_proba(X_train)[:, 1]):.4f}")
print(f"Test AUC: {roc_auc_score(y_test, tree_reg.predict_proba(X_test)[:, 1]):.4f}")

# Visualise the first few levels
fig, ax = plt.subplots(figsize=(20, 6))
plot_tree(tree_reg, max_depth=3, feature_names=[f'f{i}' for i in range(15)],
          class_names=['legit', 'fraud'], filled=True, ax=ax, fontsize=8)
plt.title('Decision Tree (first 3 levels)')
plt.tight_layout()
plt.savefig('decision_tree.png', dpi=100, bbox_inches='tight')
plt.show()

# Hyperparameter: max_depth effect on AUC
depths = range(1, 20)
train_aucs, test_aucs = [], []
for d in depths:
    dt = DecisionTreeClassifier(max_depth=d, class_weight='balanced', random_state=42)
    dt.fit(X_train, y_train)
    train_aucs.append(roc_auc_score(y_train, dt.predict_proba(X_train)[:, 1]))
    test_aucs.append(roc_auc_score(y_test, dt.predict_proba(X_test)[:, 1]))

plt.figure(figsize=(8, 4))
plt.plot(depths, train_aucs, label='Train AUC', color='steelblue')
plt.plot(depths, test_aucs, label='Test AUC', color='coral')
plt.xlabel('max_depth')
plt.ylabel('AUC-ROC')
plt.title('Tree Depth vs AUC: Overfitting Curve')
plt.legend()
plt.tight_layout()
plt.savefig('tree_depth_auc.png', dpi=150, bbox_inches='tight')
plt.show()
```
Random Forest: Bagging with Feature Subsampling
Random Forest extends bagging (bootstrap aggregating) by adding feature randomness. Each tree is trained on a bootstrap sample of the data, and at each split only a random subset of sqrt(n_features) features is considered. This decorrelates the trees — averaging uncorrelated errors reduces variance without increasing bias.
OOB (out-of-bag) error: each bootstrap sample uses about 63.2% of the data. The remaining 37% — never seen by that tree — provides a free validation estimate without a separate validation set.
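Both ideas can be seen in a short sketch. Here synthetic data from make_classification stands in for the chapter's X_train/y_train (the 15-feature, ~3%-positive setup is an assumption mirroring the fraud example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in data: 15 features, ~3% positive class
X, y = make_classification(n_samples=2000, n_features=15,
                           weights=[0.97, 0.03], random_state=42)

rf = RandomForestClassifier(
    n_estimators=300,
    max_features='sqrt',      # consider only sqrt(n_features) features per split
    oob_score=True,           # score each sample using only trees that never saw it
    class_weight='balanced',
    random_state=42,
    n_jobs=-1,
)
rf.fit(X, y)
print(f"OOB accuracy estimate: {rf.oob_score_:.4f}")
```

Because each sample is evaluated only by the trees whose bootstrap excluded it, `oob_score_` behaves like a held-out estimate without sacrificing any training data.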
```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.inspection import permutation_importance

# rf is a RandomForestClassifier fitted earlier on (X_train, y_train)
feature_names = [f'feature_{i}' for i in range(15)]

# Impurity-based (mean decrease impurity): fast but biased toward high-cardinality features
mdi_importance = pd.Series(rf.feature_importances_,
                           index=feature_names).sort_values(ascending=False)

# Permutation importance: shuffle one feature at a time and measure AUC drop
# More reliable, especially when cardinality varies
perm_imp = permutation_importance(rf, X_test, y_test, scoring='roc_auc',
                                  n_repeats=10, random_state=42, n_jobs=-1)
perm_importance = pd.Series(perm_imp.importances_mean,
                            index=feature_names).sort_values(ascending=False)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
mdi_importance.head(10).plot.barh(ax=axes[0], color='steelblue')
axes[0].set_title('Impurity-Based Importance (MDI)\nFast but biased')
axes[0].invert_yaxis()
perm_importance.head(10).plot.barh(ax=axes[1], color='coral')
axes[1].set_title('Permutation Importance\nSlower but unbiased')
axes[1].invert_yaxis()
plt.tight_layout()
plt.savefig('feature_importance_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
```
Gradient Boosting: Additive Residual Learning
Gradient boosting builds an ensemble sequentially. Each new tree fits the pseudo-residuals — the gradient of the loss with respect to the current model's predictions. For MSE loss, pseudo-residuals are simply the residuals (y - y_hat). For log-loss, they are the differences between true labels and predicted probabilities.
The final prediction is the sum of all trees' predictions, each weighted by the learning rate:
F_m(x) = F_(m-1)(x) + learning_rate * h_m(x)
A smaller learning rate requires more trees but generalises better (shrinkage regularisation).
```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt

gbm = GradientBoostingClassifier(
    n_estimators=300,
    max_depth=4,
    learning_rate=0.05,
    subsample=0.8,  # Stochastic GBM: use 80% of data per tree
    min_samples_leaf=20,
    random_state=42,
)
gbm.fit(X_train, y_train)

# Staged predictions: AUC at each boosting round
train_staged = list(gbm.staged_decision_function(X_train))
test_staged = list(gbm.staged_decision_function(X_test))
train_auc_staged = [roc_auc_score(y_train, s[:, 0]) for s in train_staged]
test_auc_staged = [roc_auc_score(y_test, s[:, 0]) for s in test_staged]

plt.figure(figsize=(8, 4))
plt.plot(train_auc_staged, label='Train AUC', color='steelblue')
plt.plot(test_auc_staged, label='Test AUC', color='coral')
plt.xlabel('Boosting round')
plt.ylabel('AUC-ROC')
plt.title('GBM Learning Curve: Learning Rate-n_estimators Tradeoff')
plt.legend()
plt.tight_layout()
plt.savefig('gbm_learning_curve.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"GBM Test AUC: {roc_auc_score(y_test, gbm.predict_proba(X_test)[:, 1]):.4f}")
```
XGBoost: Regularised Gradient Boosting
XGBoost extends gradient boosting with explicit regularisation (L1 and L2 on leaf weights and tree complexity), second-order gradient approximations, and highly optimised tree construction. It is typically 2-10x faster than sklearn's GBM and often achieves better generalisation.
```python
from sklearn.metrics import roc_auc_score

try:
    import xgboost as xgb

    # Imbalance correction: ratio of negative to positive instances
    if hasattr(y_train, 'value_counts'):
        pos_weight = y_train.value_counts()[0] / y_train.value_counts()[1]
    else:
        pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

    xgb_model = xgb.XGBClassifier(
        n_estimators=500,
        max_depth=4,
        learning_rate=0.05,
        subsample=0.8,
        colsample_bytree=0.8,  # Feature subsampling per tree
        gamma=1.0,             # Minimum loss reduction for split (complexity)
        reg_alpha=0.1,         # L1 on leaf weights
        reg_lambda=1.0,        # L2 on leaf weights
        scale_pos_weight=pos_weight,
        early_stopping_rounds=30,
        eval_metric='auc',
        random_state=42,
        n_jobs=-1,
    )
    eval_set = [(X_train, y_train), (X_test, y_test)]
    xgb_model.fit(X_train, y_train, eval_set=eval_set, verbose=False)
    print(f"XGBoost Test AUC: {roc_auc_score(y_test, xgb_model.predict_proba(X_test)[:, 1]):.4f}")
    print(f"Best iteration: {xgb_model.best_iteration}")
except ImportError:
    print("XGBoost not installed. Run: pip install xgboost")
```
XGBoost Key Parameters
max_depth (3-6 typical): deeper trees learn more complex interactions but overfit faster
learning_rate (0.01-0.3): smaller requires more trees; use with early_stopping_rounds
subsample (0.6-0.9): fraction of training data per tree; adds stochasticity, reduces overfitting
colsample_bytree (0.6-1.0): fraction of features per tree; like Random Forest's max_features
min_child_weight: minimum sum of instance weights in a child node; higher prevents overfitting on small groups
gamma: minimum loss reduction required to make a split; acts like a complexity penalty
scale_pos_weight: ratio of negative to positive instances; critical for imbalanced classification
early_stopping_rounds: stop training when validation metric does not improve for N rounds
LightGBM vs XGBoost
LightGBM (LGBM) uses leaf-wise tree growth: instead of growing all nodes at a given depth (level-wise like XGBoost), it grows the leaf with the maximum loss reduction. This typically produces better accuracy with the same number of leaves, but requires careful tuning of num_leaves and min_data_in_leaf to prevent overfitting.
```python
from sklearn.metrics import roc_auc_score

try:
    import lightgbm as lgb

    lgbm_model = lgb.LGBMClassifier(
        n_estimators=500,
        num_leaves=31,          # Max number of leaves — key parameter
        learning_rate=0.05,
        subsample=0.8,
        colsample_bytree=0.8,
        min_child_samples=20,   # min_data_in_leaf — prevents overfit on small leaves
        class_weight='balanced',
        random_state=42,
        n_jobs=-1,
        verbose=-1,
    )
    lgbm_model.fit(
        X_train, y_train,
        eval_set=[(X_test, y_test)],
        callbacks=[lgb.early_stopping(30, verbose=False)],
    )
    print(f"LightGBM Test AUC: {roc_auc_score(y_test, lgbm_model.predict_proba(X_test)[:, 1]):.4f}")

    # Speed comparison (not run here)
    print("LightGBM is typically 3-10x faster than XGBoost on large datasets")
    print("due to histogram-based split finding and leaf-wise growth.")
except ImportError:
    print("LightGBM not installed. Run: pip install lightgbm")
```
Support Vector Machines
SVMs find the hyperplane that maximises the margin — the distance between the hyperplane and the nearest training points (support vectors) from each class. The soft-margin SVM allows some misclassification, controlled by the parameter C:
Large C: small margin tolerance, fewer misclassifications on training data (risk: overfit)
Small C: larger margin, more misclassifications allowed (risk: underfit)
The kernel trick maps data to higher-dimensional spaces without computing the mapping explicitly, allowing nonlinear decision boundaries. The RBF kernel measures Gaussian similarity between points.
SVM shines when: feature count is large relative to sample count (text classification), the data is relatively free of noise, and kernel tricks are needed for nonlinear boundaries. SVM struggles when: n is very large (training is O(n^2) to O(n^3)), class imbalance is severe, or probability estimates (not just predictions) are needed.
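The pieces above fit together in a short sketch on synthetic stand-in data (the names X_tr/X_te and the make_classification setup are illustrative, not from the chapter's code). Note the StandardScaler: SVMs are distance-based, so unscaled features with large ranges dominate the kernel.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=15, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# RBF kernel for a nonlinear boundary; C controls the soft-margin tradeoff
svm = make_pipeline(
    StandardScaler(),
    SVC(kernel='rbf', C=1.0, gamma='scale', class_weight='balanced', random_state=42),
)
svm.fit(X_tr, y_tr)
print(f"Test accuracy: {svm.score(X_te, y_te):.4f}")
```

Raising C would shrink the margin and fit the training set more tightly; lowering it would widen the margin and tolerate more training misclassifications.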
Handling Class Imbalance
The 3% fraud rate in our dataset means a model that predicts "no fraud" for everything achieves 97% accuracy. This is useless. Three main strategies exist: cost-sensitive learning (class_weight='balanced' and scale_pos_weight, as used throughout this chapter), resampling (e.g. SMOTE oversampling of the minority class), and tuning the decision threshold via the precision-recall curve instead of defaulting to 0.5.
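One of those strategies, threshold tuning, can be sketched as follows on synthetic imbalanced data (logistic regression is just a stand-in classifier; the same sweep works with any model exposing predict_proba):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Imbalanced stand-in data (~3% positives), mirroring the fraud rate
X, y = make_classification(n_samples=5000, n_features=15,
                           weights=[0.97, 0.03], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Sweep all candidate thresholds and pick the one maximising F1
precision, recall, thresholds = precision_recall_curve(y_te, proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = np.argmax(f1[:-1])  # the final (precision, recall) point has no threshold
print(f"Best threshold: {thresholds[best]:.3f}, F1 at that threshold: {f1[best]:.3f}")
```

The chosen threshold is rarely 0.5 on imbalanced data; with class_weight='balanced' the scores shift, and the operating point should come from the precision-recall curve and the business cost of each error type.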
Probability Calibration
Many classifiers (especially trees and SVMs with probability=True) produce poorly calibrated probabilities. A score of 0.7 from an uncalibrated model does not mean P(positive | score=0.7) = 0.7. Calibration fixes this.
```python
import matplotlib.pyplot as plt
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.tree import DecisionTreeClassifier

# Raw decision tree — tends to produce extreme probabilities (0 or 1)
dt_uncal = DecisionTreeClassifier(max_depth=5, class_weight='balanced', random_state=42)
dt_uncal.fit(X_train, y_train)

# Platt scaling (sigmoid calibration — works well for SVM-like models)
dt_platt = CalibratedClassifierCV(dt_uncal, method='sigmoid', cv=5)
dt_platt.fit(X_train, y_train)

# Isotonic regression (non-parametric — more flexible, needs more data)
dt_iso = CalibratedClassifierCV(
    DecisionTreeClassifier(max_depth=5, class_weight='balanced', random_state=42),
    method='isotonic', cv=5,
)
dt_iso.fit(X_train, y_train)

fig, ax = plt.subplots(figsize=(7, 5))
for model, name, color in [
    (dt_uncal, 'Uncalibrated Tree', 'red'),
    (dt_platt, 'Platt Scaling', 'steelblue'),
    (dt_iso, 'Isotonic', 'green'),
    (rf, 'Random Forest', 'orange'),  # rf fitted earlier in this chapter
]:
    try:
        proba = model.predict_proba(X_test)[:, 1]
        fraction_of_positives, mean_predicted = calibration_curve(y_test, proba, n_bins=10)
        ax.plot(mean_predicted, fraction_of_positives, 'o-', label=name, color=color)
    except Exception:
        pass

ax.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')
ax.set_xlabel('Mean predicted probability')
ax.set_ylabel('Fraction of positives')
ax.set_title('Calibration Curves (Reliability Diagrams)')
ax.legend(fontsize=8)
plt.tight_layout()
plt.savefig('calibration_curves.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nCalibration matters when:")
print("- You need to rank predictions by probability")
print("- Downstream systems use probabilities as inputs")
print("- You compare predicted probability to a business threshold")
```
Key Takeaways
Decision trees split greedily by maximising information gain (entropy) or minimising Gini impurity; they overfit easily and require regularisation via max_depth, min_samples_leaf, and min_samples_split.
Random Forest reduces variance by averaging decorrelated trees (bootstrap sampling plus feature subsampling); OOB error is a free cross-validated accuracy estimate; permutation importance is more reliable than impurity-based importance for feature selection.
Gradient boosting fits pseudo-residuals sequentially; the learning_rate-n_estimators tradeoff is managed by using a small learning rate with early stopping rather than grid searching n_estimators directly.
XGBoost adds L1/L2 regularisation, second-order gradient approximations, and column subsampling; scale_pos_weight and early_stopping_rounds are critical for imbalanced data; LightGBM's leaf-wise growth is faster on large datasets.
SVMs maximise the margin between classes; the C parameter controls the bias-variance tradeoff; the RBF kernel handles nonlinear boundaries; SVMs are best on small-medium, high-dimensional datasets and struggle with large n.
Class imbalance requires explicit handling: class_weight='balanced', SMOTE oversampling, or threshold tuning via the precision-recall curve — never rely on accuracy as the metric.
Probability calibration (Platt scaling or isotonic regression) aligns model scores with true probabilities; uncalibrated scores from trees or SVMs should not be used as probabilities in downstream systems.
Choosing the right algorithm depends on dataset size, feature dimensionality, class balance, whether probabilities are needed, and training time constraints — no single algorithm wins universally.