Notebook code that runs top-to-bottom once is not a pipeline — it is a script. A real pipeline is a reusable, reproducible object that applies the same transformations to training data, validation data, and future production data in a guaranteed-consistent way. Scikit-learn's Pipeline and ColumnTransformer give you exactly that.
Why Pipelines Matter
Without a pipeline:
You might accidentally fit a scaler on test data (leakage).
Preprocessing code gets copy-pasted and can diverge between training and serving.
Hyperparameter tuning loops become error-prone boilerplate.
Serialising "a model" requires separately serialising and loading multiple objects.
With a pipeline, the entire transformation + estimation chain is a single, serialisable object.
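The leakage point is worth seeing concretely: when a scaler is fitted on the full dataset before cross-validation, every CV test fold has already influenced the scaling statistics; inside a pipeline, the scaler is refitted on the training portion of each fold. A minimal sketch on synthetic data (the variable names here are illustrative, not from the Titanic example used below):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Leaky: the scaler sees the full dataset, including every CV test fold
X_scaled = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(), X_scaled, y, cv=5)

# Safe: the scaler is refitted on only the training portion of each fold
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
safe_scores = cross_val_score(pipe, X, y, cv=5)

print(f"leaky CV accuracy: {leaky_scores.mean():.3f}")
print(f"safe CV accuracy:  {safe_scores.mean():.3f}")
```

On this toy problem the two scores will be close; the gap widens with small datasets and transformers that depend heavily on global statistics, which is exactly when leakage is most misleading.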
ColumnTransformer applies different transformation sequences to different column subsets — numeric and categorical features require completely different preprocessing:
```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numeric pipeline: impute with median → scale
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical pipeline: impute with most frequent → one-hot encode
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
])

# Combine into a single ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, NUMERIC_FEATURES),
    ("cat", categorical_transformer, CATEGORICAL_FEATURES),
])
```
Assembling the Full Pipeline
```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.pipeline import Pipeline

# Full pipeline: preprocessor → model
pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", GradientBoostingClassifier(
        n_estimators=200, max_depth=3, learning_rate=0.1, random_state=42)),
])

# Train — a single call handles ALL preprocessing and model training
pipeline.fit(X_train, y_train)

# Predict — the same pipeline applies preprocessing automatically
y_pred = pipeline.predict(X_test)
y_prob = pipeline.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred, target_names=["Died", "Survived"]))
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")
```
Hyperparameter Tuning with GridSearchCV
GridSearchCV searches over a parameter grid using cross-validation. Inside a pipeline, parameters are addressed with the double-underscore syntax `stepname__parametername`:
```python
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {
    "classifier__n_estimators": [100, 200],
    "classifier__max_depth": [2, 3, 4],
    "classifier__learning_rate": [0.05, 0.10, 0.20],
    "preprocessor__num__imputer__strategy": ["mean", "median"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=cv,
    scoring="roc_auc",
    n_jobs=-1,
    verbose=1,
    refit=True,  # refit the best model on all training data
)
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV AUC: {grid_search.best_score_:.4f}")

best_pipeline = grid_search.best_estimator_
y_prob_best = best_pipeline.predict_proba(X_test)[:, 1]
print(f"Test AUC: {roc_auc_score(y_test, y_prob_best):.4f}")
```
Inspecting Feature Importance Through the Pipeline
```python
import pandas as pd

# Access the fitted model inside the pipeline
best_clf = best_pipeline.named_steps["classifier"]

# Recover feature names from the preprocessor
cat_feature_names = (
    best_pipeline.named_steps["preprocessor"]
    .named_transformers_["cat"]
    .named_steps["encoder"]
    .get_feature_names_out(CATEGORICAL_FEATURES)
)
all_feature_names = NUMERIC_FEATURES + list(cat_feature_names)

importance_df = (
    pd.DataFrame({
        "feature": all_feature_names,
        "importance": best_clf.feature_importances_,
    })
    .sort_values("importance", ascending=False)
    .head(10)
)
print(importance_df.to_string(index=False))
```
A trained pipeline is serialised with joblib — a single file contains the entire preprocessing chain plus the model:
```python
import joblib
import pandas as pd

# Save to disk
joblib.dump(best_pipeline, "titanic_pipeline_v1.pkl")

# Load and use — no code changes needed
loaded_pipeline = joblib.load("titanic_pipeline_v1.pkl")

# The loaded pipeline is ready to predict on raw, unprocessed data
new_passenger = pd.DataFrame([{
    "age": 28, "fare": 52.0, "sibsp": 0, "parch": 0,
    "pclass": "1", "sex": "female", "embarked": "S",
}])
prob = loaded_pipeline.predict_proba(new_passenger)[0, 1]
print(f"Survival probability: {prob:.3f}")
```
Complete Pipeline Checklist
Before shipping any pipeline to production, verify:
Leakage audit: All transformers are fitted only on training data (fit_transform on train, transform on test).
Unknown category handling: OneHotEncoder(handle_unknown="ignore") prevents failures on unseen categories.
Null robustness: SimpleImputer is present for every column that can contain nulls.
Version pinning: Log the scikit-learn version alongside the saved model file.
Input validation: Assert that the input DataFrame has the expected columns and dtypes before predict().
Performance baseline: Compare against a trivial model (always predict majority class) to confirm the pipeline adds value.
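Three of these checks — input validation, version pinning, and the trivial baseline — can be sketched in a few lines. This is a self-contained illustration: `EXPECTED_COLUMNS`, `validate_input`, and the synthetic labels are stand-ins, not part of the pipeline built above.

```python
import numpy as np
import pandas as pd
import sklearn
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score

# Input validation: fail fast on schema drift before calling predict()
EXPECTED_COLUMNS = ["age", "fare", "sibsp", "parch", "pclass", "sex", "embarked"]

def validate_input(df: pd.DataFrame) -> None:
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"input missing columns: {sorted(missing)}")

validate_input(pd.DataFrame(columns=EXPECTED_COLUMNS))  # passes silently

# Version pinning: record the library version next to the model artefact
print(f"trained with scikit-learn {sklearn.__version__}")

# Performance baseline: a majority-class model produces constant scores,
# so its AUC is 0.5 by construction; the real pipeline must beat it
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=100)
baseline = DummyClassifier(strategy="most_frequent").fit(np.zeros((100, 1)), y_true)
baseline_prob = baseline.predict_proba(np.zeros((100, 1)))[:, 1]
print(f"Baseline AUC: {roc_auc_score(y_true, baseline_prob):.4f}")  # 0.5000
```

In a real deployment the schema check belongs at the serving boundary, and the version string belongs in the model registry entry alongside the `.pkl` file, not just in a log line.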
Summary
A scikit-learn Pipeline chains preprocessing and modelling steps into a single object that prevents leakage and is serialisable with joblib.
ColumnTransformer applies different transformation sequences to numeric and categorical columns simultaneously.
Parameters within a pipeline are tuned via GridSearchCV using the `stepname__parametername` syntax; refit=True trains the final model on all training data.
The trained pipeline applies the same preprocessing to any new DataFrame automatically — no separate transformation code required at serving time.
Always inspect feature importances and validate on the held-out test set exactly once, after all tuning is complete, to report an honest generalisation estimate.