# From CSV to Reproducible ML Experiments: Building a Clean ML Pipeline with DVC + MLflow on DagsHub
Hiring managers want to see more than a Jupyter notebook — they want reproducibility, traceability, and the ability to collaborate at scale. This article shows a compact yet production-lean pipeline that checks those boxes using:
- DVC for data/model versioning and pipeline orchestration
- MLflow for experiment tracking and artifact logging
- scikit-learn for modeling (RandomForest)
- DagsHub as a remote hub for experiments and data
We go from raw CSV to a tracked, reproducible pipeline with hyperparameter search, metrics, and model artifacts — all wired to a remote tracking server.
## What we’ll build
A simple ML pipeline on the Pima Indians Diabetes dataset (Outcome = 0/1), structured as:
```
params.yaml
README.md
requirements.txt
data/
  raw/
    data.csv
  processed/
    data.csv
src/
  preprocess.py
  train.py
  evaluate.py
models/
  model.pkl        # generated
```

Stages:
- Preprocess: read raw CSV → write cleaned/processed CSV
- Train: split, grid search a RandomForest, log metrics/artifacts to MLflow
- Evaluate: load the trained model and log evaluation metrics
This repo is DVC-ready and MLflow-enabled so you can rerun/compare experiments reliably and collaborate with others.
## Tech stack
- Python 3.12 (works with 3.9+ as well)
- scikit-learn
- DVC (with optional remote storage)
- MLflow (remote tracking on DagsHub)
Install dependencies:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
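For reference, a requirements.txt for this stack could look roughly like the following (illustrative; pin the exact versions you actually test with):

```text
pandas>=2.0
scikit-learn>=1.3
pyyaml>=6.0
dvc>=3.0
mlflow>=2.9
```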
## Parameters
`params.yaml` keeps configuration centralized and reproducible:
```yaml
preprocess:
  input: data/raw/data.csv
  output: data/processed/data.csv

train:
  data: data/processed/data.csv
  model: models/model.pkl
  random_state: 42
  n_estimators: 100
  max_depth: 5
```

## Stage 1 — Preprocess
A minimal preprocessing step that reads the raw CSV and writes a processed CSV. You can extend it to handle imputation, feature engineering, and so on (a small example follows the listing below).
```python
# src/preprocess.py
import os

import pandas as pd
import yaml

params = yaml.safe_load(open("params.yaml"))["preprocess"]


def preprocess(input_path, output_path):
    data = pd.read_csv(input_path)
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    # index=False keeps the row index out of the CSV so it doesn't become a bogus feature downstream
    data.to_csv(output_path, index=False)
    print(f"Preprocessed data saved to {output_path}")


if __name__ == "__main__":
    preprocess(params["input"], params["output"])
```
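As one example of such an extension: the standard Pima CSV encodes missing values as zeros in several physiological columns. Here is a hedged sketch of a median-imputation helper you could call inside preprocess() (the column names are assumptions based on the usual dataset layout):

```python
# Optional helper for preprocess(): treat zero-coded values as missing and impute medians.
# Column names assume the standard Pima Indians Diabetes CSV; adjust them to your file.
import numpy as np
import pandas as pd

ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def impute_zero_missing(data: pd.DataFrame) -> pd.DataFrame:
    data = data.copy()
    for col in ZERO_AS_MISSING:
        data[col] = data[col].replace(0, np.nan)          # mark zero-coded values as missing
        data[col] = data[col].fillna(data[col].median())  # fill with the column median
    return data
```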
Run it:

```bash
python src/preprocess.py
```

With DVC you can also declare this as a stage and make it fully reproducible with dependency tracking (see the DVC section below).
## Stage 2 — Train with hyperparameter search and MLflow tracking
Key ideas:
- Reproducible train/test split
- GridSearchCV over a RandomForest
- Logging metrics, params, confusion matrix, classification report
- Saving the model locally (pickle)
- Logging the model file to MLflow as an artifact (compatible with DagsHub)
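Putting those ideas together, src/train.py might look roughly like the sketch below. The exact search grid, metric names, and artifact layout here are assumptions; the tracking URI is read from the environment, as shown in the credentials tip that follows.

```python
# src/train.py (illustrative sketch; the real script may differ in details)
import os
import pickle

import mlflow
import pandas as pd
import yaml
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

params = yaml.safe_load(open("params.yaml"))["train"]


def train(data_path, model_path, random_state, n_estimators, max_depth):
    data = pd.read_csv(data_path)
    X = data.drop(columns=["Outcome"])
    y = data["Outcome"]

    # Reproducible train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )

    mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])

    with mlflow.start_run():
        # Grid search over a small RandomForest space (grid values are illustrative)
        grid = GridSearchCV(
            RandomForestClassifier(random_state=random_state),
            param_grid={
                "n_estimators": [n_estimators, 200],
                "max_depth": [max_depth, 10, None],
            },
            cv=3,
            n_jobs=-1,
        )
        grid.fit(X_train, y_train)
        best_model = grid.best_estimator_

        # Log params, metrics, and text artifacts for the run
        mlflow.log_params(grid.best_params_)
        y_pred = best_model.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        mlflow.log_metric("accuracy", accuracy)
        mlflow.log_text(str(confusion_matrix(y_test, y_pred)), "confusion_matrix.txt")
        mlflow.log_text(classification_report(y_test, y_pred), "classification_report.txt")

        # Save the model locally and upload the file as a plain artifact
        os.makedirs(os.path.dirname(model_path), exist_ok=True)
        with open(model_path, "wb") as f:
            pickle.dump(best_model, f)
        mlflow.log_artifact(model_path, artifact_path="model")

        print(f"Best params: {grid.best_params_}, test accuracy: {accuracy:.3f}")


if __name__ == "__main__":
    train(
        params["data"],
        params["model"],
        params["random_state"],
        params["n_estimators"],
        params["max_depth"],
    )
```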
Tip for credentials (don’t hardcode secrets):
```bash
export MLFLOW_TRACKING_URI="https://dagshub.com/<user>/<repo>.mlflow"
export MLFLOW_TRACKING_USERNAME="<your_username>"
export MLFLOW_TRACKING_PASSWORD="<your_token>"
```

On typical runs, accuracy is often around 0.75–0.80 (it varies by split). You can inspect the full metrics and artifacts in the MLflow UI.
## Stage 3 — Evaluate
Load the saved model and compute evaluation metrics across the processed dataset; log the evaluation accuracy to MLflow.
```python
# src/evaluate.py (core excerpts)
import os
import pickle

import mlflow
import pandas as pd
import yaml
from sklearn.metrics import accuracy_score

params = yaml.safe_load(open("params.yaml"))["train"]


def evaluate(data_path, model_path):
    data = pd.read_csv(data_path)
    X = data.drop(columns=["Outcome"])
    y = data["Outcome"]

    mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])

    with open(model_path, "rb") as f:
        model = pickle.load(f)

    y_pred = model.predict(X)
    accuracy = accuracy_score(y, y_pred)
    mlflow.log_metric("evaluation_accuracy", accuracy)
    print(f"Evaluation Accuracy: {accuracy}")


if __name__ == "__main__":
    evaluate(params["data"], params["model"])
```

## Running locally
1) Prepare environment and install deps
2) Configure MLflow tracking to DagsHub
3) Run the stages directly:
```bash
python src/preprocess.py
python src/train.py
python src/evaluate.py
```

## Optional: DVC stages for full reproducibility
Declare the workflow as DVC stages:
```bash
dvc stage add -n preprocess \
  -p preprocess.input,preprocess.output \
  -d src/preprocess.py -d data/raw/data.csv \
  -o data/processed/data.csv \
  python src/preprocess.py

dvc stage add -n train \
  -p train.data,train.model,train.random_state,train.n_estimators,train.max_depth \
  -d src/train.py -d data/processed/data.csv \
  -o models/model.pkl \
  python src/train.py

dvc stage add -n evaluate \
  -d src/evaluate.py -d models/model.pkl -d data/processed/data.csv \
  python src/evaluate.py
```

Then:
```bash
dvc repro   # re-run the pipeline when data/code/params change
```

You can push large files (data, models) to a DVC remote (S3, DagsHub, etc.) and keep the Git repo light.
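These commands record the pipeline in dvc.yaml; given the stage definitions above, the generated file should look roughly like this sketch:

```yaml
# dvc.yaml (approximately what `dvc stage add` produces for the stages above)
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw/data.csv
    params:
      - preprocess.input
      - preprocess.output
    outs:
      - data/processed/data.csv
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed/data.csv
    params:
      - train.data
      - train.model
      - train.random_state
      - train.n_estimators
      - train.max_depth
    outs:
      - models/model.pkl
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/processed/data.csv
```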
## A common gotcha: MLflow Model Registry endpoints on DagsHub
On some hosted MLflow endpoints (including DagsHub), Model Registry APIs may not be supported. If you try:
```python
mlflow.sklearn.log_model(best_model, "model", registered_model_name="Best Model")
```

…you may see an error mentioning an unsupported endpoint. The safe alternative is to save and upload your model as a plain artifact:
```python
pickle.dump(best_model, open(model_path, "wb"))
mlflow.log_artifact(model_path, artifact_path="model")
```

This keeps your model versioned within the run while avoiding registry-specific APIs.
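To pull that artifact back down later (for example, on a teammate's machine), something like the following should work; the run ID is a placeholder you would copy from the MLflow UI on DagsHub:

```python
import pickle

import mlflow

# Download the pickled model that was logged under artifact_path="model".
# "<run_id>" is a placeholder; copy the actual run ID from the MLflow UI.
local_path = mlflow.artifacts.download_artifacts(
    run_id="<run_id>", artifact_path="model/model.pkl"
)
with open(local_path, "rb") as f:
    model = pickle.load(f)
```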
## Why this project is hiring-manager friendly
- Reproducibility by design: params.yaml, DVC stages, pinned requirements
- Observability: MLflow logs metrics, params, and artifacts for every run
- Good engineering practices: isolated environment, clear structure
- Practical MLOps exposure: remote experiment tracking (DagsHub) and pipeline orchestration (DVC)
## What I’d build next
- Add unit tests for data and model contracts (e.g., column schemas, shapes)
- Add CI to validate pipeline stages and run a smoke experiment on PRs
- Package model for inference (FastAPI + Docker) and deploy a small endpoint
- Add feature engineering and proper train/validation/test splits
- Switch to MLflow pyfunc format when using a registry-capable server
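As a taste of that first bullet, a data-contract test could start as simply as the sketch below (column names assume the standard Pima CSV; none of this exists in the repo yet):

```python
# tests/test_data_contract.py (illustrative starting point)
import pandas as pd

EXPECTED_COLUMNS = {
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
}


def test_processed_data_contract():
    data = pd.read_csv("data/processed/data.csv")
    assert EXPECTED_COLUMNS.issubset(data.columns)   # required columns are present
    assert set(data["Outcome"].unique()) <= {0, 1}   # target is binary
    assert len(data) > 0                             # dataset is non-empty
```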
## Wrap-up
You now have a reproducible, remotely tracked ML pipeline that’s easy to demo and extend. Fork it, tweak the model/params, and compare runs like a pro.
Engineering Team
The engineering team at Originsoft Consultancy brings together decades of combined experience in software architecture, AI/ML, and cloud-native development. We are passionate about sharing knowledge and helping developers build better software.
