Linear Regression
Linear regression is both the simplest and one of the most interpretable supervised learning models. Despite its simplicity, it remains the backbone of much quantitative analysis in finance, economics, and social science — and a conceptually essential stepping stone to understanding more complex models.
Ordinary Least Squares (OLS)
Linear regression models the relationship between a dependent variable y and one or more independent variables X as:
y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ + ε
where β₀ is the intercept, β₁…βₙ are the coefficients, and ε is the error term (the residuals are its sample estimates).
OLS minimises the sum of squared residuals: it finds the β values that make the total squared vertical distance between each data point and the regression line as small as possible.
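The OLS criterion has a closed-form solution, β = (XᵀX)⁻¹Xᵀy. A minimal NumPy sketch on synthetic data (the feature values and true coefficients here are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
# True relationship: y = 2 + 3·x1 − 1.5·x2, plus a little noise
y = 2.0 + 3.0 * x1 - 1.5 * x2 + rng.normal(0, 0.1, n)

# Design matrix with a column of ones so the intercept β₀ is estimated too
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares solve (numerically stabler than inverting XᵀX directly)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # ≈ [2.0, 3.0, -1.5]
```

With little noise the estimates land very close to the true coefficients; as noise grows, they scatter around them.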
Interpreting Coefficients
Each coefficient represents the expected change in y for a one-unit increase in that feature, holding all other features constant.
For example, suppose a fitted house-price model has intercept $50,000, a sqft coefficient of $150, and a bedrooms coefficient of $30,000:
- Intercept ($50,000): The predicted price of a house with 0 sqft and 0 bedrooms — an extrapolation far outside the data, with no practical meaning on its own.
- sqft coefficient ($150): Each additional square foot of space is associated with a $150 increase in price.
- bedrooms coefficient ($30,000): Each additional bedroom is associated with a $30,000 increase, holding sqft constant.
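These numbers can be reproduced with scikit-learn. A sketch on synthetic data constructed (noise-free) so that OLS recovers exactly the values above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
sqft = rng.uniform(800, 3000, 100)
bedrooms = rng.integers(1, 6, 100).astype(float)

# Prices generated without noise, so the fit recovers the coefficients exactly
price = 50_000 + 150 * sqft + 30_000 * bedrooms

model = LinearRegression().fit(np.column_stack([sqft, bedrooms]), price)
print(model.intercept_)  # ≈ 50000.0
print(model.coef_)       # ≈ [150., 30000.]
```

On real data the coefficients would only approximate the underlying relationship, and their standard errors would matter as much as their point values.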
R² (coefficient of determination) measures the proportion of variance in y explained by the model. R² = 0.90 means the model explains 90% of the variation in house prices.
Patterns in residual plots indicate problems:
| Residual Pattern | Implication |
|---|---|
| Random scatter around 0 | Assumptions satisfied |
| Funnel shape (heteroscedasticity) | Variance increases with predicted value; try log(y) |
| Curved pattern | A non-linear term is needed |
| Systematic clusters | A missing categorical variable |
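The "curved pattern" row can be demonstrated numerically: fit a straight line to data with a genuinely quadratic relationship and the residuals curve systematically (a synthetic sketch, not a real diagnostic workflow):

```python
import numpy as np

x = np.linspace(-1, 1, 101)
y = x ** 2  # the true relationship is quadratic

# Fit a straight line anyway and look at what is left over
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Positive at the extremes, negative in the middle: a curved residual
# pattern, signalling that a non-linear (x²) term is missing.
print(residuals[0] > 0, residuals[50] < 0, residuals[-1] > 0)  # True True True
```

In practice you would plot residuals against fitted values rather than inspect them numerically, but the signature is the same.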
Ridge and Lasso Regularisation
When a model has many features (or features that are correlated), OLS can overfit — it fits noise in the training data and performs poorly on new data. Regularisation adds a penalty term to the loss function that discourages large coefficient values.
Ridge (L2 regularisation) adds the sum of squared coefficients: Loss = RSS + α × Σβᵢ²
Lasso (L1 regularisation) adds the sum of absolute coefficient values: Loss = RSS + α × Σ|βᵢ|
Lasso's key property is sparsity: it can drive coefficients exactly to zero, performing automatic feature selection.
| Method | Penalty | Effect on coefficients | Use when |
|---|---|---|---|
| OLS | None | Unpenalised | Few features, n >> p |
| Ridge | L2 (sum of squares) | Shrinks toward zero, never exactly | Many correlated features |
| Lasso | L1 (sum of absolutes) | Drives some to exactly zero | Feature selection needed |
| ElasticNet | L1 + L2 | Combination of both | Many features in correlated groups |
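The shrink-versus-zero contrast in the table can be seen directly with scikit-learn (synthetic data; the α values are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two of the ten features actually matter
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.5, 200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(np.sum(ridge.coef_ == 0))  # 0 — Ridge shrinks but never zeroes
print(np.sum(lasso.coef_ == 0))  # several — Lasso zeroes the weak predictors
```

Ridge leaves all ten coefficients nonzero (just smaller), while Lasso typically eliminates most of the eight irrelevant ones outright.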
Tuning the Regularisation Parameter
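The regularisation strength α is a hyperparameter, and the standard way to choose it is cross-validation: evaluate candidate values on held-out folds and keep the one that predicts best. A sketch using scikit-learn's built-in CV estimators (synthetic data; the α grid is an illustrative choice):

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.3, 200)

# RidgeCV scores every candidate α by cross-validation and keeps the best
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)

# LassoCV builds its own path of α values and 5-fold cross-validates along it
lasso = LassoCV(cv=5, random_state=0).fit(X, y)

print(ridge.alpha_, lasso.alpha_)  # the selected regularisation strengths
```

Too small an α behaves like plain OLS (risking overfit); too large an α shrinks everything toward zero (underfit). Cross-validation locates the trade-off empirically.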
Summary
- OLS finds the coefficient values that minimise the sum of squared residuals — the ordinary least squares criterion.
- Each coefficient represents the marginal effect of one feature on the target, holding all others constant; always inspect residuals to verify assumptions.
- R² measures the fraction of variance explained; RMSE (root mean squared error) measures prediction error in the original units of y.
- Ridge (L2) regularisation shrinks coefficients to reduce overfitting; Lasso (L1) additionally performs feature selection by zeroing out weak predictors.
- Always scale features before applying regularised models — the penalty is applied uniformly and is sensitive to feature magnitude.
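The scaling advice can be followed with a pipeline, so the scaler is fitted only on training data and applied consistently at prediction time (a sketch reusing the hypothetical house-price features from earlier):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
# Two features on wildly different scales: sqft in the thousands,
# bedrooms in single digits
sqft = rng.uniform(800, 3000, 100)
bedrooms = rng.integers(1, 6, 100).astype(float)
X = np.column_stack([sqft, bedrooms])
y = 50_000 + 150 * sqft + 30_000 * bedrooms + rng.normal(0, 5_000, 100)

# StandardScaler puts both features on comparable scales, so the L1
# penalty does not unfairly punish the small-magnitude bedrooms feature
model = make_pipeline(StandardScaler(), Lasso(alpha=1.0)).fit(X, y)
print(model.score(X, y))  # R² close to 1 on this nearly-linear data
```

Without the scaler, the same penalty would bear far harder on the bedrooms coefficient (which must be large to matter) than on the sqft coefficient.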