
    From CSV to Reproducible ML Experiments: Building a Clean ML Pipeline with DVC + MLflow on DagsHub

    A compact, hiring-manager-friendly ML pipeline that goes from raw CSV to reproducible experiments using DVC stages and MLflow tracking on DagsHub, with metrics and model artifacts logged every run.

    Originsoft Team, Engineering Team
    November 16, 2025
    5 min read

    In today’s machine learning job market, a single Jupyter notebook is no longer a compelling artifact. While notebooks remain useful for exploration, hiring managers evaluating machine learning engineers are looking for something much more substantial: reproducibility, experiment traceability, environment isolation, artifact management, and evidence that the candidate understands the difference between experimentation and engineering. The ability to move from raw data to a structured, versioned, repeatable experiment pipeline is now a baseline expectation.

    This article walks through a compact yet production-lean machine learning pipeline designed to demonstrate exactly those capabilities. The goal is not to build the most complex model, but to build the most structurally sound workflow. We will combine DVC for data and pipeline versioning, MLflow for experiment tracking, scikit-learn for modeling, and DagsHub as a remote collaboration and tracking hub. By the end, we will have a system that transforms a raw CSV file into a reproducible experiment complete with hyperparameter search, metrics logging, model artifacts, and remote traceability — all structured in a way that scales beyond toy experimentation.

    The emphasis here is engineering discipline. The dataset is simple. The architecture is what matters.


    What we’ll build

    We will construct a simple yet fully structured machine learning pipeline using the Pima Indians Diabetes dataset, where the target variable `Outcome` represents a binary classification problem (0 or 1). While the modeling task is straightforward, the pipeline architecture is intentionally clean and modular.

    params.yaml
    README.md
    requirements.txt
    
    data/
      raw/
        data.csv
      processed/
        data.csv
    
    src/
      preprocess.py
      train.py
      evaluate.py
    
    models/
      model.pkl   # generated

    This structure is deliberate. Data is separated into raw and processed layers to prevent accidental contamination of original inputs. Source code is isolated under `src/`, and model artifacts are written to a dedicated directory. Parameters are centralized in `params.yaml` to ensure that configuration changes are tracked independently of code modifications. This separation of concerns is the foundation of reproducible machine learning.

    The pipeline consists of three stages:

    * Preprocess: ingest raw CSV data and produce a processed dataset.

    * Train: perform a reproducible train/test split, conduct hyperparameter search over a RandomForest classifier, log metrics and artifacts to MLflow, and serialize the trained model.

    * Evaluate: reload the trained model, compute evaluation metrics, and log evaluation results for traceability.

    Each stage can be executed independently, but when combined with DVC, they form a declarative, dependency-tracked workflow that can be re-run automatically whenever inputs change.
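
    Declared in DVC, these three stages end up recorded in a `dvc.yaml` along the following lines (a sketch; paths and parameter names match the layout and `params.yaml` shown in this article):

```yaml
stages:
  preprocess:
    cmd: python src/preprocess.py
    params:
      - preprocess.input
      - preprocess.output
    deps:
      - src/preprocess.py
      - data/raw/data.csv
    outs:
      - data/processed/data.csv
  train:
    cmd: python src/train.py
    params:
      - train          # tracks the whole train section of params.yaml
    deps:
      - src/train.py
      - data/processed/data.csv
    outs:
      - models/model.pkl
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/processed/data.csv
```

    You rarely write this file by hand; the `dvc stage add` commands shown later generate it for you.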

    This design mirrors real-world MLOps systems in miniature.


    Tech stack

    The stack is intentionally pragmatic and widely adopted in industry:

    * Python 3.12 (compatible with 3.9+)

    * scikit-learn for modeling

    * DVC for data versioning and pipeline orchestration

    * MLflow for experiment tracking

    * DagsHub as a hosted remote for MLflow and optional DVC storage

    This combination represents a modern baseline for reproducible ML experimentation without requiring heavy infrastructure investment.

    Install dependencies in an isolated environment:

    python -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt

    The isolated virtual environment ensures dependency reproducibility. In production contexts, this would be extended with Docker or CI environment pinning, but for this compact pipeline, a properly versioned `requirements.txt` suffices.
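
    For reference, a pinned `requirements.txt` for this stack might look like the following (version numbers are illustrative, not prescriptive; pin whatever versions you actually tested against):

```text
pandas==2.2.2
scikit-learn==1.4.2
pyyaml==6.0.1
mlflow==2.12.1
dvc==3.50.1
```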


    Parameters

    A central principle of reproducible experimentation is that configuration must be externalized from code. Hardcoding hyperparameters inside training scripts makes experiment comparison opaque and error-prone. Instead, we define parameters declaratively in `params.yaml`:

    preprocess:
      input: data/raw/data.csv
      output: data/processed/data.csv
    
    train:
      data: data/processed/data.csv
      model: models/model.pkl
      random_state: 42
      n_estimators: 100
      max_depth: 5

    This file becomes the canonical source of truth for experiment configuration. When hyperparameters change, DVC can detect the modification and selectively re-run dependent stages. Combined with Git version control, every experiment becomes traceable not only by metrics but by exact configuration state.

    This approach scales naturally into more complex projects where feature engineering, model architecture, or training schedules vary across runs.
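
    That traceability is directly queryable. After editing `params.yaml`, DVC can report exactly which parameters differ from the last committed state:

```shell
# Show parameters that changed relative to the last Git commit
dvc params diff

# Compare two arbitrary revisions, e.g. a feature branch against main
dvc params diff main my-experiment
```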


    Stage 1 — Preprocess

    The preprocessing stage is intentionally minimal, but structurally important. Even when transformation is simple, isolating it into its own stage enforces discipline and prepares the pipeline for future expansion.

    # src/preprocess.py
    import os
    
    import pandas as pd
    import yaml
    
    with open("params.yaml") as f:
        params = yaml.safe_load(f)["preprocess"]
    
    def preprocess(input_path, output_path):
        data = pd.read_csv(input_path)
        os.makedirs(os.path.dirname(output_path), exist_ok=True)
        # index=False keeps the pandas index out of the CSV, so downstream
        # stages see exactly the original schema
        data.to_csv(output_path, index=False)
        print(f"Preprocessed data saved to {output_path}")
    
    if __name__ == "__main__":
        preprocess(params["input"], params["output"])

    Although this example merely copies the CSV, in production scenarios this stage would include imputations, scaling, encoding categorical variables, and schema validation. By separating preprocessing into its own executable unit, we prevent leakage of transformation logic into training scripts and make data lineage explicit.
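
    As a hedged sketch of that fuller stage: in the Pima dataset, zeros in columns such as `Glucose` or `BMI` are physiologically implausible and are commonly treated as missing. A `transform` function like the one below (a hypothetical extension, not part of the article's actual script) could impute and scale before writing the processed CSV:

```python
# Hypothetical extension of the preprocess stage: treat implausible zeros
# as missing, impute them, and standardize the feature columns.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Columns where a literal 0 is physiologically implausible in this dataset
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

def transform(data: pd.DataFrame) -> pd.DataFrame:
    data = data.copy()
    cols = [c for c in ZERO_AS_MISSING if c in data.columns]
    # Recode implausible zeros as NaN, then fill with the column median
    data[cols] = data[cols].replace(0, np.nan)
    data[cols] = SimpleImputer(strategy="median").fit_transform(data[cols])
    # Standardize features; keep the target column untouched
    features = data.drop(columns=["Outcome"])
    scaled = StandardScaler().fit_transform(features)
    out = pd.DataFrame(scaled, columns=features.columns, index=data.index)
    out["Outcome"] = data["Outcome"]
    return out
```

    Note that fitting the imputer and scaler on the full dataset, as done here for simplicity, leaks statistics across the later train/test split; a leakage-safe version would fit these transformers on the training split only, for example inside a scikit-learn `Pipeline`.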

    Run locally:

    python src/preprocess.py

    Later, when declared as a DVC stage, this step becomes dependency-tracked and automatically re-executed when input files or parameters change.


    Stage 2 — Train with hyperparameter search and MLflow tracking

    The training stage demonstrates where engineering discipline truly matters. Instead of a naive fit-and-print script, we implement:

    * Deterministic train/test splitting

    * GridSearchCV for hyperparameter exploration

    * Logging of parameters and metrics

    * Artifact tracking

    * Model serialization
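
    A hedged sketch of what `src/train.py` could look like is shown below. The grid values, function names, and run structure are illustrative assumptions, not the article's exact script; the search logic is kept in its own function so it can be exercised without a tracking server configured:

```python
# src/train.py -- illustrative sketch; paths and keys follow params.yaml above
import os
import pickle

import pandas as pd
import yaml
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

def run_search(X, y, random_state=42):
    """Deterministic train/test split plus a small RandomForest grid search."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    grid = {"n_estimators": [50, 100], "max_depth": [3, 5]}  # illustrative grid
    search = GridSearchCV(
        RandomForestClassifier(random_state=random_state), grid, cv=3, n_jobs=-1
    )
    search.fit(X_tr, y_tr)
    accuracy = accuracy_score(y_te, search.best_estimator_.predict(X_te))
    return search.best_estimator_, search.best_params_, accuracy

def main():
    # mlflow is imported here so run_search stays usable without tracking set up
    import mlflow

    with open("params.yaml") as f:
        params = yaml.safe_load(f)["train"]
    data = pd.read_csv(params["data"])
    X, y = data.drop(columns=["Outcome"]), data["Outcome"]

    mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
    with mlflow.start_run():
        model, best_params, accuracy = run_search(X, y, params["random_state"])
        mlflow.log_params(best_params)
        mlflow.log_metric("accuracy", accuracy)

        # Serialize locally, then attach the file to the run as an artifact
        os.makedirs(os.path.dirname(params["model"]), exist_ok=True)
        with open(params["model"], "wb") as f:
            pickle.dump(model, f)
        mlflow.log_artifact(params["model"], artifact_path="model")
```

    When wired as a DVC stage, the script would be invoked as `python src/train.py`, calling `main()` under the usual `if __name__ == "__main__":` guard.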

    Credential configuration should never be hardcoded:

    export MLFLOW_TRACKING_URI="https://dagshub.com/<user>/<repo>.mlflow"
    export MLFLOW_TRACKING_USERNAME="<your_username>"
    export MLFLOW_TRACKING_PASSWORD="<your_token>"

    This ensures secure connection to a remote MLflow tracking server hosted on DagsHub. Every run is logged remotely, enabling collaboration, auditability, and experiment comparison across machines and team members.

    On typical runs, accuracy for this dataset ranges between 0.75 and 0.80 depending on the split. However, the raw metric is less important than the fact that:

    * Hyperparameters are logged.

    * Artifacts are preserved.

    * Runs are comparable in a centralized UI.

    This is the difference between experimentation and engineering.


    Stage 3 — Evaluate

    Evaluation is intentionally separated from training. While some pipelines combine these steps, isolating evaluation enforces clarity of responsibility.

    # src/evaluate.py (core excerpts)
    import os
    import pickle
    
    import mlflow
    import pandas as pd
    import yaml
    from sklearn.metrics import accuracy_score
    
    with open("params.yaml") as f:
        params = yaml.safe_load(f)["train"]
    
    def evaluate(data_path, model_path):
        data = pd.read_csv(data_path)
        X = data.drop(columns=["Outcome"])
        y = data["Outcome"]
    
        mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
    
        with open(model_path, "rb") as f:
            model = pickle.load(f)
    
        y_pred = model.predict(X)
        accuracy = accuracy_score(y, y_pred)
        # Open an explicit run so the evaluation metric is clearly attributed
        with mlflow.start_run():
            mlflow.log_metric("evaluation_accuracy", accuracy)
        print(f"Evaluation Accuracy: {accuracy}")
    
    if __name__ == "__main__":
        evaluate(params["data"], params["model"])

    By logging evaluation metrics separately, we preserve clarity in experiment lifecycle stages. In more advanced systems, this stage might evaluate against a hold-out test set or a temporal validation slice to avoid data leakage.


    Running locally

    Execution flow:

    python src/preprocess.py
    python src/train.py
    python src/evaluate.py

    This linear workflow demonstrates modularity. Each script performs a single responsibility and can be tested independently.


    Optional: DVC stages for full reproducibility

    DVC transforms this pipeline into a fully declarative DAG.

    dvc stage add -n preprocess \
      -p preprocess.input,preprocess.output \
      -d src/preprocess.py -d data/raw/data.csv \
      -o data/processed/data.csv \
      python src/preprocess.py

    Subsequent stages follow similarly.
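
    For example (with paths matching `params.yaml` above), the remaining two stages could be declared as:

```shell
dvc stage add -n train \
  -p train \
  -d src/train.py -d data/processed/data.csv \
  -o models/model.pkl \
  python src/train.py

dvc stage add -n evaluate \
  -p train.data,train.model \
  -d src/evaluate.py -d models/model.pkl -d data/processed/data.csv \
  python src/evaluate.py
```

    Then reproduce the whole DAG with a single command: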

    dvc repro

    Now, whenever:

    * Code changes

    * Data changes

    * Parameters change

    DVC intelligently re-runs only affected stages. This selective recomputation is critical for scaling experiments efficiently.

    Large artifacts such as datasets and models can be pushed to remote storage while keeping the Git repository lightweight.
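
    With DagsHub, that remote can be the repository's own DVC endpoint. The commands below are a sketch using standard DVC remote configuration; substitute your own user, repo, and token (the `--local` flag keeps credentials out of the committed config):

```shell
# Point DVC at the DagsHub-hosted remote for this repository
dvc remote add -d origin https://dagshub.com/<user>/<repo>.dvc
dvc remote modify origin --local auth basic
dvc remote modify origin --local user <your_username>
dvc remote modify origin --local password <your_token>

# Upload tracked data and model artifacts
dvc push
```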


    A common gotcha: MLflow Model Registry endpoints on DagsHub

    Not all hosted MLflow services support full Model Registry APIs. Attempting to use registry features may produce endpoint errors.

    Instead of:

    mlflow.sklearn.log_model(best_model, "model", registered_model_name="Best Model")

    Use:

    with open(model_path, "wb") as f:
        pickle.dump(best_model, f)
    mlflow.log_artifact(model_path, artifact_path="model")

    This ensures compatibility while preserving artifact versioning.


    Why this project is hiring-manager friendly

    This pipeline demonstrates:

    * Configuration discipline (`params.yaml`)

    * Data versioning (DVC)

    * Experiment traceability (MLflow)

    * Remote collaboration (DagsHub)

    * Artifact management

    * Environment isolation

    It signals that you understand reproducibility, not just modeling.


    What I’d build next

    To elevate this into production-grade MLOps:

    * Add schema validation tests

    * Introduce CI pipelines to execute smoke experiments

    * Containerize with Docker

    * Add feature stores or transformation pipelines

    * Transition to MLflow’s `pyfunc` format when registry support is available


    Wrap-up

    This pipeline moves beyond exploratory notebooks into structured experimentation. It demonstrates that you can version data, track experiments, manage artifacts, and collaborate remotely. It shows engineering maturity — and that is what hiring managers evaluate.

    Reproducibility is not a feature.

    It is the foundation of trustworthy machine learning.

    Tags: DVC, MLflow, DagsHub, MLOps, Machine Learning, Reproducibility, Automation

    The engineering team at Originsoft Consultancy brings together decades of combined experience in software architecture, AI/ML, and cloud-native development. We are passionate about sharing knowledge and helping developers build better software.