GadaaLabs
Data Science Fundamentals
Lesson 4

Feature Engineering

14 min

Raw data rarely arrives in a form that machine learning models can consume directly. Categorical strings must be encoded as numbers, skewed distributions must be transformed, and domain knowledge must be crystallised into interaction features. Feature engineering is where practitioner expertise translates into model performance — it is often the single highest-leverage activity in a data science project.

One-Hot Encoding

Most ML models require purely numeric input. One-hot encoding converts a categorical column with k unique values into k binary columns, one per category:

python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "colour": ["red", "blue", "red", "green", "blue"],
    "size":   ["S", "M", "L", "S", "XL"],
    "price":  [10, 15, 12, 8, 20]
})

# pandas get_dummies
encoded = pd.get_dummies(df, columns=["colour", "size"], drop_first=True)
# drop_first=True removes one category to avoid multicollinearity
print(encoded)

# Scikit-learn OneHotEncoder — preferred for pipelines
from sklearn.preprocessing import OneHotEncoder

# sparse_output requires scikit-learn >= 1.2 (older versions use sparse=)
ohe = OneHotEncoder(sparse_output=False, drop="first", handle_unknown="ignore")
cat_encoded = ohe.fit_transform(df[["colour", "size"]])
print(ohe.get_feature_names_out(["colour", "size"]))

When to one-hot encode: Low-cardinality columns (< ~20 unique values). For high-cardinality columns (country codes, zip codes with thousands of values), one-hot encoding creates too many features and target encoding is preferable.
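A quick cardinality check makes the choice mechanical. A minimal sketch — `choose_encoding` and the 20-value threshold are illustrative helpers, not a library API:

```python
import pandas as pd

def choose_encoding(series: pd.Series, threshold: int = 20) -> str:
    """Illustrative heuristic: one-hot below the cardinality threshold,
    target encoding at or above it."""
    return "one-hot" if series.nunique() < threshold else "target"

colours = pd.Series(["red", "blue", "red", "green"])
zips    = pd.Series([f"{i:05d}" for i in range(500)])

print(choose_encoding(colours))  # low cardinality -> one-hot
print(choose_encoding(zips))     # high cardinality -> target
```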

Binning (Discretisation)

Converting a continuous variable into ordered buckets captures non-linear relationships and makes the model more robust to outliers:

python
ages = pd.Series([17, 23, 31, 45, 52, 60, 71, 82])
df_ages = pd.DataFrame({"age": ages})

# Equal-width bins (uniform range per bin; edges span the data's min to max)
df_ages["age_band_equal"] = pd.cut(ages, bins=4)

# Equal-frequency bins (same number of records per bin — quantile-based)
df_ages["age_band_quantile"] = pd.qcut(ages, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Custom business-defined bins
bins   = [0, 18, 35, 55, 100]
labels = ["minor", "young_adult", "mid_career", "senior"]
df_ages["age_segment"] = pd.cut(ages, bins=bins, labels=labels, right=False)

Use equal-frequency bins (quantile cuts) when you need balanced class counts per bin. Use custom business-defined bins when domain rules exist (e.g., regulatory age thresholds).
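The difference is easy to see by counting records per bin on the same data — equal-width edges give uneven counts, quantile edges give uniform ones:

```python
import pandas as pd

ages = pd.Series([17, 23, 31, 45, 52, 60, 71, 82])

equal_width = pd.cut(ages, bins=4)   # uniform edge spacing, uneven counts
quantile    = pd.qcut(ages, q=4)     # uneven edges, uniform counts

print(equal_width.value_counts().sort_index())  # counts: 3, 1, 2, 2
print(quantile.value_counts().sort_index())     # counts: 2, 2, 2, 2
```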

Log and Power Transforms

Right-skewed variables — revenue, page views, house prices — violate the normality assumptions of many models. A log transform compresses the long right tail:

python
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(42)
revenue = np.random.exponential(scale=5000, size=1000)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(revenue, bins=50, color="#2196F3")
axes[0].set_title("Revenue (raw — heavily right-skewed)")

axes[1].hist(np.log1p(revenue), bins=50, color="#4CAF50")
axes[1].set_title("log(1 + Revenue) — far less skewed")
plt.tight_layout(); plt.show()

# Apply to a DataFrame column
df_rev = pd.DataFrame({"revenue": revenue})
df_rev["log_revenue"] = np.log1p(df_rev["revenue"])

# Box-Cox transform — automatically finds the optimal power
# (requires strictly positive input, hence the +1 shift)
from scipy.stats import boxcox
transformed, lambda_val = boxcox(revenue + 1)
print(f"Optimal Box-Cox lambda: {lambda_val:.3f}")

Use np.log1p (log(1 + x)) instead of np.log when the column can contain zeros.
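The difference matters as soon as a single zero appears, since `np.log(0)` yields `-inf` and poisons any downstream model. A small check:

```python
import numpy as np

values = np.array([0.0, 1.0, 100.0])

with np.errstate(divide="ignore"):
    raw_log = np.log(values)   # log(0) = -inf — breaks downstream models
safe_log = np.log1p(values)    # log1p(0) = 0.0 — safe for zero-valued rows

print(raw_log[0], safe_log[0])  # -inf 0.0

# np.expm1 inverts log1p exactly, recovering the original scale
recovered = np.expm1(safe_log)
```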

Interaction Features

Interaction terms capture relationships that are not additive — when the effect of one variable depends on another:

python
df = pd.DataFrame({
    "hours_studied": [2, 3, 5, 7, 8],
    "sleep_hours":   [4, 6, 8, 5, 7],
    "test_score":    [55, 65, 82, 75, 89]
})

# Multiplicative interaction
df["study_x_sleep"] = df["hours_studied"] * df["sleep_hours"]

# Ratio feature
df["study_per_sleep"] = df["hours_studied"] / df["sleep_hours"]

# Polynomial features — degree 2 expands (a, b) to (a, b, a², ab, b²)
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[["hours_studied", "sleep_hours"]])
poly_df = pd.DataFrame(poly_features, columns=poly.get_feature_names_out())
print(poly_df)

Target Encoding

For high-cardinality categorical columns, target encoding replaces each category with the mean of the target variable for that category:

python
df_big = pd.DataFrame({
    "city":   ["NYC", "LA", "NYC", "Chicago", "LA", "NYC", "Chicago"],
    "revenue": [200, 150, 220, 130, 170, 210, 140]
})

# Naive target encoding, computed on the full frame for illustration only
target_means = df_big.groupby("city")["revenue"].mean()
df_big["city_encoded"] = df_big["city"].map(target_means)
print(df_big)

Leakage guard: Target encoding must be computed inside cross-validation folds — only on the training portion of each fold. Computing it on the full dataset before splitting leaks target information into the features, producing artificially inflated validation scores. Use scikit-learn's TargetEncoder (sklearn ≥ 1.3) which handles this correctly inside a Pipeline.
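The fold-wise computation can be sketched by hand with KFold before reaching for a library encoder — a minimal illustration on the toy frame above, not production code:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df_big = pd.DataFrame({
    "city":    ["NYC", "LA", "NYC", "Chicago", "LA", "NYC", "Chicago"],
    "revenue": [200, 150, 220, 130, 170, 210, 140],
})

global_mean = df_big["revenue"].mean()
df_big["city_te"] = np.nan

kf = KFold(n_splits=3, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(df_big):
    # Category means computed ONLY from the training fold
    fold_means = df_big.iloc[train_idx].groupby("city")["revenue"].mean()
    # Validation rows get training-fold means; cities unseen in the
    # training fold fall back to the global mean
    encoded = df_big.iloc[val_idx]["city"].map(fold_means).fillna(global_mean)
    df_big.iloc[val_idx, df_big.columns.get_loc("city_te")] = encoded.values

print(df_big)
```

Each row's encoding is derived only from rows it was not grouped with, which is exactly what TargetEncoder automates.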

The Leakage Guard

Data leakage is the inclusion of information in training features that would not be available at prediction time. It is the most common source of models that look great in validation but fail in production:

| Leakage Type | Example | Fix |
|---|---|---|
| Target leakage | Including "claim_approved" as a feature when predicting claim fraud | Drop features derived from the target |
| Train-test contamination | Fitting a scaler on the full dataset before the split | Always fit transformers only on training data |
| Temporal leakage | Using future data to predict the past | Respect chronological splits |
| Aggregation leakage | Computing a global mean that includes test rows | Compute aggregations inside cross-validation |

python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# X, y assumed defined earlier: feature matrix and target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# CORRECT: fit scaler only on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit AND transform
X_test_scaled  = scaler.transform(X_test)         # only transform — no fit!
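The same discipline covers temporal leakage at the splitting step itself: on time-ordered data, disable shuffling so every training row strictly precedes every test row. A sketch with hypothetical time-stamped data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical time-stamped sales data
ts = pd.DataFrame({
    "date":  pd.date_range("2024-01-01", periods=10, freq="D"),
    "sales": range(10),
})

# Sort chronologically, then split WITHOUT shuffling so the training
# window ends before the test window begins
ts = ts.sort_values("date")
train, test = train_test_split(ts, test_size=0.3, shuffle=False)

print(train["date"].max(), "<", test["date"].min())
```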

Summary

  • One-hot encoding is the standard approach for low-cardinality categoricals; use drop_first=True to avoid multicollinearity.
  • Binning with pd.cut or pd.qcut turns continuous variables into ordered categories; quantile bins ensure balanced group sizes.
  • Log transforms (via np.log1p) are the default fix for right-skewed distributions; Box-Cox automatically finds the optimal power.
  • Interaction and polynomial features capture non-additive effects but grow the feature space quadratically — use with caution.
  • Target encoding is powerful for high-cardinality columns but must be computed within cross-validation folds to prevent leakage.
  • Data leakage — letting future or target information into training features — is the most dangerous and common modelling error.