    From CSV to Reproducible ML Experiments: Building a Clean ML Pipeline with DVC + MLflow on DagsHub

    A compact, hiring-manager-friendly ML pipeline that goes from raw CSV to reproducible experiments using DVC stages and MLflow tracking on DagsHub, with metrics and model artifacts logged every run.

    Originsoft Team · Engineering Team
    November 16, 2025
    5 min read

    # From CSV to Reproducible ML Experiments: Building a Clean ML Pipeline with DVC + MLflow on DagsHub

    Hiring managers want to see more than a Jupyter notebook — they want reproducibility, traceability, and the ability to collaborate at scale. This article shows a compact yet production-lean pipeline that checks those boxes using:

    • DVC for data/model versioning and pipeline orchestration
    • MLflow for experiment tracking and artifact logging
    • scikit-learn for modeling (RandomForest)
    • DagsHub as a remote hub for experiments and data

    We go from raw CSV to a tracked, reproducible pipeline with hyperparameter search, metrics, and model artifacts — all wired to a remote tracking server.

    What we’ll build

    A simple ML pipeline on the Pima Indians Diabetes dataset (Outcome = 0/1), structured as:

    params.yaml
    README.md
    requirements.txt
    
    data/
      raw/
        data.csv
      processed/
        data.csv
    
    src/
      preprocess.py
      train.py
      evaluate.py
    
    models/
      model.pkl   # generated

    Stages:

    • Preprocess: read raw CSV → write cleaned/processed CSV
    • Train: split, grid search a RandomForest, log metrics/artifacts to MLflow
    • Evaluate: load the trained model and log evaluation metrics

    This repo is DVC-ready and MLflow-enabled so you can rerun/compare experiments reliably and collaborate with others.

    Tech stack

    • Python 3.12 (works with 3.9+ as well)
    • scikit-learn
    • DVC (with optional remote storage)
    • MLflow (remote tracking on DagsHub)

    Install dependencies:

    python -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
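
    If you're assembling requirements.txt yourself, something along these lines covers the stack; pin the exact versions you test with (the list below is illustrative):

    # requirements.txt (illustrative)
    pandas
    scikit-learn
    pyyaml
    dvc
    mlflow
    dagshub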

    Parameters

    `params.yaml` keeps configuration centralized and reproducible:

    preprocess:
      input: data/raw/data.csv
      output: data/processed/data.csv
    
    train:
      data: data/processed/data.csv
      model: models/model.pkl
      random_state: 42
      n_estimators: 100
      max_depth: 5

    Stage 1 — Preprocess

    A minimal preprocessing step that reads raw CSV and writes processed CSV. You can extend this to handle imputations, feature engineering, etc.

    # src/preprocess.py
    import pandas as pd
    import yaml
    import os
    
    params = yaml.safe_load(open("params.yaml"))["preprocess"]
    
    def preprocess(input_path, output_path):
        data = pd.read_csv(input_path)
        os.makedirs(os.path.dirname(output_path), exist_ok=True)
        data.to_csv(output_path, index=False)  # index=False keeps the index out of the processed CSV
        print(f"Preprocessed data saved to {output_path}")
    
    if __name__ == "__main__":
        preprocess(params["input"], params["output"])

    Run it:

    python src/preprocess.py

    With DVC you can also declare this as a stage and make it fully reproducible with dependency tracking (see the DVC section below).

    Stage 2 — Train with hyperparameter search and MLflow tracking

    Key ideas (a sketch of src/train.py follows the list):

    • Reproducible train/test split
    • GridSearchCV over a RandomForest
    • Logging metrics, params, confusion matrix, classification report
    • Saving the model locally (pickle)
    • Logging the model file to MLflow as an artifact (compatible with DagsHub)
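
    A minimal sketch of what src/train.py can look like, following the ideas above; the grid values and artifact names here are illustrative choices, not the only sensible ones:

    # src/train.py (illustrative sketch)
    import os
    import pickle
    import pandas as pd
    import yaml
    import mlflow
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

    params = yaml.safe_load(open("params.yaml"))["train"]

    def train(data_path, model_path, random_state, n_estimators, max_depth):
        data = pd.read_csv(data_path)
        X = data.drop(columns=["Outcome"])
        y = data["Outcome"]

        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=random_state
        )

        mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])

        with mlflow.start_run():
            # small grid around the values in params.yaml
            grid = {
                "n_estimators": [n_estimators, 200],
                "max_depth": [max_depth, 10, None],
            }
            search = GridSearchCV(
                RandomForestClassifier(random_state=random_state), grid, cv=3
            )
            search.fit(X_train, y_train)
            best_model = search.best_estimator_

            y_pred = best_model.predict(X_test)
            accuracy = accuracy_score(y_test, y_pred)

            # metrics, params, and text artifacts for this run
            mlflow.log_params(search.best_params_)
            mlflow.log_metric("accuracy", accuracy)
            mlflow.log_text(str(confusion_matrix(y_test, y_pred)), "confusion_matrix.txt")
            mlflow.log_text(classification_report(y_test, y_pred), "classification_report.txt")

            # save the model locally, then attach the pickle to the run as a plain artifact
            os.makedirs(os.path.dirname(model_path), exist_ok=True)
            with open(model_path, "wb") as f:
                pickle.dump(best_model, f)
            mlflow.log_artifact(model_path, artifact_path="model")

            print(f"Best params: {search.best_params_}, accuracy: {accuracy}")

    if __name__ == "__main__":
        train(
            params["data"],
            params["model"],
            params["random_state"],
            params["n_estimators"],
            params["max_depth"],
        )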

    Tip for credentials (don’t hardcode secrets):

    export MLFLOW_TRACKING_URI="https://dagshub.com/<user>/<repo>.mlflow"
    export MLFLOW_TRACKING_USERNAME="<your_username>"
    export MLFLOW_TRACKING_PASSWORD="<your_token>"
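
    Alternatively, the dagshub client package can point MLflow at your repo and handle auth in one call; a sketch, assuming the dagshub package is installed and you have authenticated with DagsHub:

    import dagshub

    # configures the MLflow tracking URI and credentials for the repo
    dagshub.init(repo_owner="<user>", repo_name="<repo>", mlflow=True)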

    On typical runs, accuracy lands around 0.75–0.80 (it varies by split). You can inspect full metrics and artifacts in the MLflow UI.

    Stage 3 — Evaluate

    Load the saved model and compute evaluation metrics across the processed dataset; log the evaluation accuracy to MLflow.

    # src/evaluate.py (core excerpts)
    import pandas as pd
    import pickle
    from sklearn.metrics import accuracy_score
    import yaml
    import os
    import mlflow
    
    params = yaml.safe_load(open("params.yaml"))["train"]
    
    def evaluate(data_path, model_path):
        data = pd.read_csv(data_path)
        X = data.drop(columns=["Outcome"])
        y = data["Outcome"]
    
        mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
    
        with open(model_path, "rb") as f:
            model = pickle.load(f)
    
        # an explicit run keeps the evaluation metric grouped under its own entry
        with mlflow.start_run():
            y_pred = model.predict(X)
            accuracy = accuracy_score(y, y_pred)
            mlflow.log_metric("evaluation_accuracy", accuracy)
            print(f"Evaluation Accuracy: {accuracy}")
    
    if __name__ == "__main__":
        evaluate(params["data"], params["model"])

    Running locally

    1) Prepare environment and install deps

    2) Configure MLflow tracking to DagsHub

    3) Run the stages directly:

    python src/preprocess.py
    python src/train.py
    python src/evaluate.py

    Optional: DVC stages for full reproducibility

    Declare the workflow as DVC stages:

    dvc stage add -n preprocess \
      -p preprocess.input,preprocess.output \
      -d src/preprocess.py -d data/raw/data.csv \
      -o data/processed/data.csv \
      python src/preprocess.py
    
    dvc stage add -n train \
      -p train.data,train.model,train.random_state,train.n_estimators,train.max_depth \
      -d src/train.py -d data/processed/data.csv \
      -o models/model.pkl \
      python src/train.py
    
    dvc stage add -n evaluate \
      -d src/evaluate.py -d models/model.pkl -d data/processed/data.csv \
      python src/evaluate.py
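
    dvc stage add writes these definitions into dvc.yaml. For orientation, the generated preprocess entry looks roughly like this (the file is created for you, so there's no need to write it by hand):

    # dvc.yaml (generated; preprocess stage excerpt)
    stages:
      preprocess:
        cmd: python src/preprocess.py
        deps:
          - data/raw/data.csv
          - src/preprocess.py
        params:
          - preprocess.input
          - preprocess.output
        outs:
          - data/processed/data.csv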

    Then:

    dvc repro  # re-run pipeline when data/code/params change

    You can push large files (data, models) to a DVC remote (S3, DagsHub, etc.) and keep the git repo light.
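
    For a DagsHub remote specifically, the setup is a few commands; this is a sketch with placeholder values, using --local config so the token never lands in git:

    dvc remote add -d origin https://dagshub.com/<user>/<repo>.dvc
    dvc remote modify origin --local auth basic
    dvc remote modify origin --local user <your_username>
    dvc remote modify origin --local password <your_token>
    dvc push   # upload DVC-tracked data and models to the remote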

    A common gotcha: MLflow Model Registry endpoints on DagsHub

    On some hosted MLflow endpoints (including DagsHub), Model Registry APIs may not be supported. If you try:

    mlflow.sklearn.log_model(best_model, "model", registered_model_name="Best Model")

    …you may see an error mentioning an unsupported endpoint. The safe alternative is to save and upload your model as a plain artifact:

    pickle.dump(best_model, open(model_path, "wb"))
    mlflow.log_artifact(model_path, artifact_path="model")

    This keeps your model versioned within the run while avoiding registry-specific APIs.
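
    Pulling the pickle back out of a run later is straightforward; a sketch assuming MLflow 2.x, where <run_id> is a placeholder you'd copy from the UI:

    import mlflow
    import pickle

    # downloads everything under the run's "model/" artifact path to a local directory
    local_dir = mlflow.artifacts.download_artifacts(run_id="<run_id>", artifact_path="model")
    with open(f"{local_dir}/model.pkl", "rb") as f:
        model = pickle.load(f)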

    Why this project is hiring-manager friendly

    • Reproducibility by design: params.yaml, DVC stages, pinned requirements
    • Observability: MLflow logs metrics, params, and artifacts for every run
    • Good engineering practices: isolated environment, clear structure
    • Practical MLOps exposure: remote experiment tracking (DagsHub) and pipeline orchestration (DVC)

    What I’d build next

    • Add unit tests for data and model contracts (e.g., column schemas, shapes)
    • Add CI to validate pipeline stages and run a smoke experiment on PRs
    • Package model for inference (FastAPI + Docker) and deploy a small endpoint
    • Add feature engineering and proper train/validation/test splits
    • Switch to MLflow pyfunc format when using a registry-capable server

    Wrap-up

    You now have a reproducible, remotely tracked ML pipeline that’s easy to demo and extend. Fork it, tweak the model/params, and compare runs like a pro.

    #DVC #MLflow #DagsHub #MLOps #MachineLearning #Reproducibility #Automation
    Originsoft Team

    Engineering Team

    The engineering team at Originsoft Consultancy brings together decades of combined experience in software architecture, AI/ML, and cloud-native development. We are passionate about sharing knowledge and helping developers build better software.