Python Mastery — From Zero to AI Engineering
Lesson 14
Machine Learning with scikit-learn
32 min
The ML Workflow
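Every machine learning project follows the same arc: preprocess the data, split it into training and test sets, train a model, evaluate it, and tune its hyperparameters.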
scikit-learn covers everything from preprocessing through evaluation in a single, consistent API. Understanding its conventions unlocks the entire ecosystem.
The Estimator Protocol
Every scikit-learn object follows the same interface:
- fit(X, y): learn from data
- predict(X): make predictions (models)
- transform(X): transform data (preprocessors)
- fit_transform(X, y): fit, then transform in one step
This consistency is the library's superpower. A StandardScaler and a RandomForestClassifier are both trained with the same fit(X, y) call, which makes Pipeline composition trivial.
The critical rule: fit only on training data, transform both train and test. Fitting on test data is data leakage.
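A minimal sketch of the protocol and the fit-on-train rule, assuming a small randomly generated dataset (the data and variable names are illustrative only):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 200 samples, 3 features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = (X[:, 0] + X[:, 1] > 10).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Preprocessor: fit on the training split only, then transform both splits.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from train
X_test_scaled = scaler.transform(X_test)        # reuses the train statistics

# Model: same fit() method name, but predict() instead of transform().
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

print("Scaler means learned from train:", scaler.mean_.round(3))
print("Predictions shape:", y_pred.shape)
```

Calling scaler.fit_transform(X_test) instead of scaler.transform(X_test) would run without error, which is exactly why this leak is easy to miss.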
Data Preprocessing
Scaling and Imputation
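A minimal sketch of imputation followed by scaling, assuming a made-up four-row feature matrix with one missing value. SimpleImputer, StandardScaler, and MinMaxScaler are real scikit-learn classes; the data is invented for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up feature matrix: columns are age and income, np.nan marks a gap.
X = np.array([
    [25.0, 40000.0],
    [32.0, np.nan],
    [47.0, 85000.0],
    [51.0, 62000.0],
])

# Fill each missing value with the median of its column.
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)

# StandardScaler: zero mean and unit variance per column.
X_standard = StandardScaler().fit_transform(X_filled)

# MinMaxScaler: rescale each column to the [0, 1] range.
X_minmax = MinMaxScaler().fit_transform(X_filled)

print("Imputed:\n", X_filled)
print("Standardized:\n", X_standard.round(2))
print("Min-max scaled:\n", X_minmax.round(2))
```

Imputation comes first because StandardScaler ignores NaN when computing statistics but passes it through unchanged, and most models will not accept it.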
Encoding and Feature Engineering
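A minimal sketch of categorical encoding and simple derived features, assuming a hypothetical customer table (all column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical customer table; all column names are illustrative.
df = pd.DataFrame({
    "plan": ["basic", "pro", "basic", "enterprise"],
    "size": ["small", "large", "medium", "small"],
    "monthly_fee": [10.0, 50.0, 10.0, 200.0],
    "months_active": [3, 24, 12, 6],
})

# One-hot encode a nominal column (no natural order between categories).
onehot = OneHotEncoder(handle_unknown="ignore")
plan_encoded = onehot.fit_transform(df[["plan"]]).toarray()

# Ordinal-encode a column whose categories do have an order.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_encoded = ordinal.fit_transform(df[["size"]])

# Feature engineering: derive new columns from existing ones.
df["total_paid"] = df["monthly_fee"] * df["months_active"]
df["log_fee"] = np.log1p(df["monthly_fee"])

print("One-hot categories:", list(onehot.categories_[0]))
print("Ordinal codes:", size_encoded.ravel())
print(df[["total_paid", "log_fee"]])
```

Setting handle_unknown="ignore" encodes a category never seen during fit as all zeros instead of raising an error at prediction time.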
Train/Test Split and Cross-Validation
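A minimal sketch of a stratified hold-out split followed by 5-fold cross-validation, assuming synthetic data from make_classification as a stand-in for a real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)

# Synthetic, mildly imbalanced data standing in for a real dataset.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.7, 0.3], random_state=42
)

# Hold out 20% for a final check; stratify keeps class ratios similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 5-fold stratified cross-validation on the training portion only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")

print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The spread across folds is as informative as the mean: a large standard deviation suggests the score from any single split would not be trustworthy.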
Models Overview
Model Comparison
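One way to compare several model families is to run each through the same cross-validation loop. The sketch below assumes synthetic data and an arbitrary selection of four classifiers; the choice of models and settings is illustrative, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data; a real comparison would use your own features and target.
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=5, random_state=0
)

# Every model exposes the same fit/predict interface, so one loop suffices.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "k-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name:<22} {scores.mean():.3f} +/- {scores.std():.3f}")
```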
Evaluation Metrics
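A minimal sketch of the common classification metrics, assuming ten hand-written labels for an imbalanced problem (8 negatives, 2 positives). The labels and scores are invented so the arithmetic is easy to follow:

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Invented labels: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# Hard predictions: one false positive and one missed positive.
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
# Predicted probabilities of the positive class (needed for ROC-AUC).
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6, 0.9, 0.45]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, cols: predicted

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f} roc_auc={auc:.4f}")
print(cm)
```

Here accuracy is 0.80 even though the model finds only one of the two positives (recall 0.50), which is why accuracy alone misleads on imbalanced data.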
Pipelines and ColumnTransformer
Pipelines are non-negotiable in production ML. They prevent data leakage and make deployment simple:
Pipeline with ColumnTransformer
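A minimal sketch of a Pipeline wrapping a ColumnTransformer, assuming a hypothetical eight-row table with two numeric columns and one categorical column (names and values are invented):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical mixed-type data with a few missing numeric values.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46, 38, 29, 60],
    "income": [40e3, 52e3, 85e3, np.nan, 61e3, 58e3, 45e3, 90e3],
    "plan": ["basic", "pro", "pro", "basic", "pro", "basic", "basic", "pro"],
})
y = np.array([0, 1, 1, 0, 1, 0, 0, 1])

numeric_cols = ["age", "income"]
categorical_cols = ["plan"]

# Numeric branch: fill gaps with the median, then standardize.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical branch: fill gaps with the mode, then one-hot encode.
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Route each group of columns to its own branch.
preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

# Preprocessing and model travel together as one estimator.
clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

clf.fit(df, y)
preds = clf.predict(df)
print(preds)
```

Because clf is a single estimator, passing it to cross_val_score refits the imputers, scaler, and encoder inside each fold, so no statistics leak from validation data.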
Hyperparameter Tuning
GridSearchCV vs RandomizedSearchCV
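A minimal sketch contrasting the two search strategies, assuming synthetic data and deliberately small search spaces so it runs quickly. The parameter ranges are illustrative, not tuned recommendations:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Grid search: tries every combination (2 x 2 = 4 candidates).
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [25, 50], "max_depth": [3, None]},
    cv=3,
    scoring="f1",
)
grid.fit(X, y)

# Randomized search: samples a fixed budget of candidates (n_iter=6)
# from distributions, so cost does not grow with the size of the space.
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(20, 100),
        "max_depth": [3, 5, 10, None],
        "min_samples_leaf": randint(1, 10),
    },
    n_iter=6,
    cv=3,
    scoring="f1",
    random_state=0,
)
rand.fit(X, y)

print("Grid best:  ", grid.best_params_, f"{grid.best_score_:.3f}")
print("Random best:", rand.best_params_, f"{rand.best_score_:.3f}")
```

Adding a third hyperparameter multiplies the grid's cost, while the randomized search still evaluates exactly n_iter candidates.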
PROJECT: Customer Churn Predictor
Customer Churn Predictor — Full Pipeline
Key Takeaways
- The estimator protocol (fit/predict/transform) is universal: learn it once, use it everywhere
- Always use Pipeline: it prevents data leakage and makes deployment a single pickle file
- ColumnTransformer handles mixed numeric/categorical data in a single, leak-proof transform
- Cross-validation beats a single train/test split; StratifiedKFold is the default for classification
- Accuracy is misleading on imbalanced datasets; prefer F1, precision/recall, and ROC-AUC
- RandomizedSearchCV is usually better than GridSearchCV for large hyperparameter spaces
- Feature importance from Random Forest gives free feature selection; remove low-importance features to reduce noise
- Fit preprocessors only on training data, then use transform on test; fitting on test data is the most common ML bug to avoid