GadaaLabs
Python Mastery — From Zero to AI Engineering
Lesson 14

Machine Learning with scikit-learn


The ML Workflow

Every machine learning project follows the same arc:

Problem Definition → Data Collection → Preprocessing → Feature Engineering
→ Model Selection → Training → Evaluation → Hyperparameter Tuning → Deployment

scikit-learn covers everything from preprocessing through evaluation in a single, consistent API. Understanding its conventions unlocks the entire ecosystem.

The Estimator Protocol

Every scikit-learn object follows the same interface:

  • fit(X, y) — learn from data
  • predict(X) — make predictions (models)
  • transform(X) — transform data (preprocessors)
  • fit_transform(X, y) — fit then transform in one step

This consistency is the library's superpower. A StandardScaler and a RandomForestClassifier have the same method names. Pipeline composition becomes trivial.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)   # Learn mean/std, then scale
X_test_scaled = scaler.transform(X_test)   # Use learned mean/std (never refit!)

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_scaled, y_train)
predictions = clf.predict(X_test_scaled)
```

The critical rule: fit only on training data, transform both train and test. Fitting on test data is data leakage.

Data Preprocessing

Scaling and Imputation
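The interactive example isn't reproduced here, but a minimal sketch of the two steps (on a toy array invented for illustration) looks like this — impute missing values first, then scale:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with one missing value
X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [3.0, 400.0]])

# Replace NaN with the column mean
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)       # nan -> 300.0

# Rescale each column to zero mean, unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)

print(X_scaled.mean(axis=0))   # ~[0, 0]
```

Order matters: StandardScaler raises on NaN, so imputation must come first.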
Encoding and Feature Engineering

Train/Test Split and Cross-Validation

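In place of the embedded example, a minimal sketch on the built-in iris dataset — a stratified hold-out split plus 5-fold cross-validation on the training portion:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)

X, y = load_iris(return_X_y=True)

# Hold out 25% as a final test set; stratify preserves class ratios
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X_train, y_train, cv=cv)
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

The test set stays untouched until the very end; cross-validation estimates generalization using only the training data.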

Models Overview

Model Comparison
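The embedded comparison isn't shown here; a sketch of the idea — score several model families with identical cross-validation so the numbers are comparable (models chosen here for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "tree":   DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    # Same data, same CV scheme for every model
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name:8s} {results[name]:.3f}")
```

Note that the linear model is wrapped in a pipeline with scaling (it needs it), while the tree-based models are scale-invariant and can take raw features.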

Evaluation Metrics

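As a stand-in for the embedded example, a sketch computing the core classification metrics by hand on tiny invented labels:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))    # 0.75 — 6 of 8 correct
print(precision_score(y_true, y_pred))   # 0.75 — 3 of 4 predicted positives are real
print(recall_score(y_true, y_pred))      # 0.75 — 3 of 4 real positives were found
print(f1_score(y_true, y_pred))          # 0.75 — harmonic mean of the two
print(confusion_matrix(y_true, y_pred))  # rows = true class: [[TN, FP], [FN, TP]]
```

On imbalanced data these four numbers diverge sharply, which is exactly why accuracy alone misleads.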

Pipelines and ColumnTransformer

Pipelines are non-negotiable in production ML. They prevent data leakage and make deployment simple:

Pipeline with ColumnTransformer
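In place of the embedded example, a minimal sketch (column names and data invented) routing numeric columns through impute + scale and the categorical column through one-hot encoding, all inside one fittable object:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age":    [25, 32, None, 51],
    "income": [40_000, 60_000, 52_000, 80_000],
    "plan":   ["basic", "pro", "basic", "pro"],
    "churned": [1, 0, 1, 0],
})
numeric, categorical = ["age", "income"], ["plan"]

preprocess = ColumnTransformer([
    # Numeric columns: fill missing values, then standardize
    ("num", Pipeline([("impute", SimpleImputer()),
                      ("scale", StandardScaler())]), numeric),
    # Categorical columns: one-hot, tolerating unseen categories at predict time
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(df[numeric + categorical], df["churned"])
preds = model.predict(df[numeric + categorical])
print(preds)
```

Because preprocessing lives inside the pipeline, cross-validation refits the imputer and scaler on each training fold automatically — no leakage by construction.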

Hyperparameter Tuning

GridSearchCV vs RandomizedSearchCV
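The embedded comparison isn't reproduced here; a sketch of the contrast (parameter ranges chosen for illustration) — GridSearchCV tries every combination, RandomizedSearchCV samples a fixed budget from distributions:

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(random_state=0)

# Exhaustive: 2 x 2 = 4 combinations, each cross-validated
grid = GridSearchCV(clf, {"n_estimators": [50, 100],
                          "max_depth": [3, None]}, cv=3)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))

# Sampled: 5 draws from distributions — scales to huge spaces
rand = RandomizedSearchCV(
    clf, {"n_estimators": randint(50, 200), "max_depth": randint(2, 10)},
    n_iter=5, cv=3, random_state=0)
rand.fit(X, y)
print(rand.best_params_, round(rand.best_score_, 3))
```

Both refit the best configuration on the full data afterward, so `grid.predict(...)` works directly.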

PROJECT: Customer Churn Predictor

Customer Churn Predictor — Full Pipeline
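The full project runs in the embedded editor; as a stand-in, here is a hedged end-to-end sketch on synthetic churn-like data (all column names and the label rule are invented for illustration) that ties the lesson together — split, preprocess in a ColumnTransformer, fit, evaluate with F1:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for a churn dataset
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "tenure_months":  rng.integers(1, 72, n),
    "monthly_charge": rng.uniform(20, 120, n).round(2),
    "contract":       rng.choice(["monthly", "yearly"], n),
})
# Short tenure on a monthly contract -> churn (a toy rule, not real data)
df["churned"] = ((df["tenure_months"] < 12)
                 & (df["contract"] == "monthly")).astype(int)

X, y = df.drop(columns="churned"), df["churned"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["tenure_months", "monthly_charge"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["contract"]),
    ])),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipe.fit(X_train, y_train)

# F1 rather than accuracy: churn labels are imbalanced
f1 = f1_score(y_test, pipe.predict(X_test))
print("Test F1:", round(f1, 3))
```

The entire trained object — preprocessing and model — can then be serialized as one artifact, which is the "single pickle file" deployment story from the takeaways.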

Key Takeaways

  • The estimator protocol (fit/predict/transform) is universal — learn it once, use it everywhere
  • Always use Pipeline — it prevents data leakage and makes deployment a single pickle file
  • ColumnTransformer handles mixed numeric/categorical data in a single, leak-proof transform
  • Cross-validation beats a single train/test split; StratifiedKFold is the default for classification
  • Accuracy is misleading on imbalanced datasets — prefer F1, precision/recall, and ROC-AUC
  • RandomizedSearchCV is usually better than GridSearchCV for large hyperparameter spaces
  • Feature importance from Random Forest gives free feature selection; remove low-importance features to reduce noise
  • Fit preprocessors only on training data, then use transform on test — this is the most common ML bug to avoid