Feature Engineering
Raw data rarely arrives in a form that machine learning models can consume directly. Categorical strings must be encoded as numbers, skewed distributions must be transformed, and domain knowledge must be crystallised into interaction features. Feature engineering is where practitioner expertise translates into model performance — it is often the single highest-leverage activity in a data science project.
One-Hot Encoding
Most ML models require purely numeric input. One-hot encoding converts a categorical column with k unique values into k binary columns, one per category:
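A minimal sketch with pandas (the data and column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# get_dummies expands the k=3 categories into 3 binary columns;
# pass drop_first=True to keep k-1 columns and avoid multicollinearity
encoded = pd.get_dummies(df, columns=["color"])
print(encoded.columns.tolist())  # ['color_blue', 'color_green', 'color_red']
```

Each row has exactly one active indicator among the new columns, so no ordering is imposed on the categories.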
When to one-hot encode: Low-cardinality columns (< ~20 unique values). For high-cardinality columns (country codes, zip codes with thousands of values), one-hot encoding creates too many features and target encoding is preferable.
Binning (Discretisation)
Converting a continuous variable into ordered buckets captures non-linear relationships and makes the model more robust to outliers:
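A sketch of both binning styles with pandas (the age thresholds are illustrative, not regulatory):

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 42, 67, 80])

# pd.cut: fixed, business-defined edges (custom bins)
fixed = pd.cut(ages, bins=[0, 18, 65, 120], labels=["minor", "adult", "senior"])

# pd.qcut: equal-frequency bins -- each bucket gets roughly the same row count
quantile = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

print(fixed.tolist())     # ['minor', 'minor', 'adult', 'adult', 'senior', 'senior']
print(sorted(quantile.value_counts().tolist()))  # [2, 2, 2]
```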
Use equal-frequency bins (quantile cuts) when you need balanced class counts per bin. Use custom business-defined bins when domain rules exist (e.g., regulatory age thresholds).
Log and Power Transforms
Right-skewed variables — revenue, page views, house prices — violate the normality assumptions of many models. A log transform compresses the long right tail:
Use np.log1p (log(1 + x)) instead of np.log when the column can contain zeros.
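A short sketch of the transform on a skewed column that contains a zero (the values are illustrative):

```python
import numpy as np

revenue = np.array([0.0, 10.0, 100.0, 10_000.0])  # right-skewed, includes a zero

# np.log would produce -inf at 0; log1p computes log(1 + x) safely
compressed = np.log1p(revenue)
print(np.round(compressed, 2))  # [0.   2.4  4.62 9.21]
```

Four orders of magnitude in the raw column collapse to a range of roughly 0 to 9 after the transform.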
Interaction Features
Interaction terms capture relationships that are not additive — when the effect of one variable depends on another:
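A minimal sketch of a hand-built interaction term (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, 20.0, 30.0],
    "discount_rate": [0.0, 0.5, 0.1],
})

# The effect of a discount depends on the price it applies to:
# encode that dependence explicitly as a product term
df["price_x_discount"] = df["price"] * df["discount_rate"]
print(df["price_x_discount"].tolist())  # [0.0, 10.0, 3.0]
```

For generating all pairwise products at once, scikit-learn's PolynomialFeatures does the same thing systematically, at the cost of a quadratically growing feature space.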
Target Encoding
For high-cardinality categorical columns, target encoding replaces each category with the mean of the target variable for that category:
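A sketch of the basic mechanic with pandas (illustrative data; note the caveat in the comment):

```python
import pandas as pd

df = pd.DataFrame({
    "zip_code": ["10001", "10001", "94105", "94105", "94105"],
    "target":   [1,       0,       1,       1,       0],
})

# Replace each category with the mean target for that category.
# NOTE: in practice this must be computed on the training fold only
# (see the leakage guard) -- this sketch encodes the whole frame.
means = df.groupby("zip_code")["target"].mean()
df["zip_encoded"] = df["zip_code"].map(means)
print(df["zip_encoded"].round(3).tolist())  # [0.5, 0.5, 0.667, 0.667, 0.667]
```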
Leakage guard: Target encoding must be computed inside cross-validation folds — only on the training portion of each fold. Computing it on the full dataset before splitting leaks target information into the features, producing artificially inflated validation scores. Use scikit-learn's TargetEncoder (sklearn ≥ 1.3) which handles this correctly inside a Pipeline.
The Leakage Guard
Data leakage is the inclusion of information in training features that would not be available at prediction time. It is the most common source of models that look great in validation but fail in production:
| Leakage Type | Example | Fix |
|---|---|---|
| Target leakage | Including "claim_approved" as a feature when predicting claim fraud | Drop features derived from the target |
| Train-test contamination | Fitting a scaler on the full dataset before the split | Always fit transformers only on training data |
| Temporal leakage | Using future data to predict the past | Respect chronological splits |
| Aggregation leakage | Computing a global mean that includes test rows | Compute aggregations inside cross-validation |
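The train-test contamination row can be sketched with a hand-rolled standardiser (plain numpy, synthetic data) to make the fit-on-train-only rule concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=100)
X_train, X_test = X[:80], X[80:]

# Correct: estimate the statistics on the training split only...
mu, sigma = X_train.mean(), X_train.std()

# ...then apply those same statistics to both splits
X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma

# The contaminated version -- X.mean(), X.std() over the full array --
# would let the held-out rows influence the training features.
```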
Summary
- One-hot encoding is the standard approach for low-cardinality categoricals; use `drop_first=True` to avoid multicollinearity.
- Binning with `pd.cut` or `pd.qcut` turns continuous variables into ordered categories; quantile bins ensure balanced group sizes.
- Log transforms (via `np.log1p`) are the default fix for right-skewed distributions; Box-Cox automatically finds the optimal power.
- Interaction and polynomial features capture non-additive effects but grow the feature space quadratically — use with caution.
- Target encoding is powerful for high-cardinality columns but must be computed within cross-validation folds to prevent leakage.
- Data leakage — letting future or target information into training features — is the most dangerous and common modelling error.