# Experiment Tracking with MLflow
Without systematic experiment tracking, ML development quickly devolves into "I think the model with 256 hidden units was better, but I'm not sure which checkpoint that was." MLflow gives every run a permanent, queryable record: hyperparameters, metrics, artifacts, and environment — all linked to a single run ID you can reproduce weeks later.
## Core MLflow Concepts
| Concept | What it stores | Example |
|---|---|---|
| Experiment | A named group of runs | `click-rate-prediction` |
| Run | One training execution | `run_id: a1b2c3d4` |
| Params | Hyperparameters (logged once) | `lr=0.001`, `dropout=0.3` |
| Metrics | Scalar values over steps | `val_loss` at each epoch |
| Artifacts | Files (models, plots, configs) | `best_model.pt`, `confusion_matrix.png` |
| Tags | Free-form key-value labels | `team=rec-sys`, `dataset=v7` |
## Instrumenting a Training Run
The `with` block automatically records run duration, the git commit hash (if run inside a git repository), and the run status (`FINISHED`/`FAILED`).
## Comparing Runs Programmatically
## Model Registry and Promotion
The registry enforces an explicit promotion gate: a model version must be manually (or programmatically) moved from `None` → `Staging` → `Production`. This prevents accidental deployment of untested checkpoints.
## Reproducing an Experiment
## Summary
- MLflow organises work into experiments → runs, with params, metrics, artifacts, and tags per run.
- Wrap training in `mlflow.start_run()` and call `log_params` once and `log_metrics` per step.
- Use `client.search_runs()` with filter strings to find the best run without manually scanning the UI.
- Register production candidates in the Model Registry and require explicit stage transitions before deployment.
- Reproduce any experiment by loading its logged params and re-running the training script against the same DVC-pinned dataset.