
    From CSV to Reproducible ML Experiments: Building a Clean ML Pipeline with DVC + MLflow on DagsHub

    A compact, hiring-manager-friendly ML pipeline that goes from raw CSV to reproducible experiments using DVC stages and MLflow tracking on DagsHub, with metrics and model artifacts logged every run.

    Originsoft Team, Engineering Team
    November 16, 2025
    5 min read

    In today’s machine learning job market, a single Jupyter notebook is no longer a compelling artifact. While notebooks remain useful for exploration, hiring managers evaluating machine learning engineers are looking for something much more substantial: reproducibility, experiment traceability, environment isolation, artifact management, and evidence that the candidate understands the difference between experimentation and engineering. The ability to move from raw data to a structured, versioned, repeatable experiment pipeline is now a baseline expectation.

    This article walks through a compact yet production-lean machine learning pipeline designed to demonstrate exactly those capabilities. The goal is not to build the most complex model, but to build the most structurally sound workflow. We will combine DVC for data and pipeline versioning, MLflow for experiment tracking, scikit-learn for modeling, and DagsHub as a remote collaboration and tracking hub. By the end, we will have a system that transforms a raw CSV file into a reproducible experiment complete with hyperparameter search, metrics logging, model artifacts, and remote traceability — all structured in a way that scales beyond toy experimentation.

    The emphasis here is engineering discipline. The dataset is simple. The architecture is what matters.


    What we’ll build

    We will construct a simple yet fully structured machine learning pipeline using the Pima Indians Diabetes dataset, where the target variable `Outcome` represents a binary classification problem (0 or 1). While the modeling task is straightforward, the pipeline architecture is intentionally clean and modular.

    params.yaml
    README.md
    requirements.txt
    
    data/
      raw/
        data.csv
      processed/
        data.csv
    
    src/
      preprocess.py
      train.py
      evaluate.py
    
    models/
      model.pkl   # generated

    This structure is deliberate. Data is separated into raw and processed layers to prevent accidental contamination of original inputs. Source code is isolated under `src/`, and model artifacts are written to a dedicated directory. Parameters are centralized in `params.yaml` to ensure that configuration changes are tracked independently of code modifications. This separation of concerns is the foundation of reproducible machine learning.

    The pipeline consists of three stages:

    * Preprocess: ingest raw CSV data and produce a processed dataset.

    * Train: perform a reproducible train/test split, conduct hyperparameter search over a RandomForest classifier, log metrics and artifacts to MLflow, and serialize the trained model.

    * Evaluate: reload the trained model, compute evaluation metrics, and log evaluation results for traceability.

    Each stage can be executed independently, but when combined with DVC, they form a declarative, dependency-tracked workflow that can be re-run automatically whenever inputs change.
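
    Declared in DVC, these three stages end up recorded in a `dvc.yaml` along the following lines (a sketch; paths and parameter names match the layout and `params.yaml` shown in this article):

```yaml
stages:
  preprocess:
    cmd: python src/preprocess.py
    params:
      - preprocess.input
      - preprocess.output
    deps:
      - src/preprocess.py
      - data/raw/data.csv
    outs:
      - data/processed/data.csv
  train:
    cmd: python src/train.py
    params:
      - train          # tracks the whole train section of params.yaml
    deps:
      - src/train.py
      - data/processed/data.csv
    outs:
      - models/model.pkl
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/processed/data.csv
```

    You rarely write this file by hand; the `dvc stage add` commands shown later generate it for you.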

    This design mirrors real-world MLOps systems in miniature.


    Tech stack

    The stack is intentionally pragmatic and widely adopted in industry:

    * Python 3.12 (compatible with 3.9+)

    * scikit-learn for modeling

    * DVC for data versioning and pipeline orchestration

    * MLflow for experiment tracking

    * DagsHub as a hosted remote for MLflow and optional DVC storage

    This combination represents a modern baseline for reproducible ML experimentation without requiring heavy infrastructure investment.

    Install dependencies in an isolated environment:

    python -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt

    The isolated virtual environment ensures dependency reproducibility. In production contexts, this would be extended with Docker or CI environment pinning, but for this compact pipeline, a properly versioned `requirements.txt` suffices.
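
    For reference, a pinned `requirements.txt` for this stack might look like the following (version numbers are illustrative, not prescriptive; pin whatever versions you actually tested against):

```text
pandas==2.2.2
scikit-learn==1.4.2
pyyaml==6.0.1
mlflow==2.12.1
dvc==3.50.1
```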


    Parameters

    A central principle of reproducible experimentation is that configuration must be externalized from code. Hardcoding hyperparameters inside training scripts makes experiment comparison opaque and error-prone. Instead, we define parameters declaratively in `params.yaml`:

    preprocess:
      input: data/raw/data.csv
      output: data/processed/data.csv
    
    train:
      data: data/processed/data.csv
      model: models/model.pkl
      random_state: 42
      n_estimators: 100
      max_depth: 5

    This file becomes the canonical source of truth for experiment configuration. When hyperparameters change, DVC can detect the modification and selectively re-run dependent stages. Combined with Git version control, every experiment becomes traceable not only by metrics but by exact configuration state.

    This approach scales naturally into more complex projects where feature engineering, model architecture, or training schedules vary across runs.
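
    That traceability is directly queryable. After editing `params.yaml`, DVC can report exactly which parameters differ from the last committed state:

```shell
# Show parameters that changed relative to the last Git commit
dvc params diff

# Compare two arbitrary revisions, e.g. a feature branch against main
dvc params diff main my-experiment
```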


    Stage 1 — Preprocess

    The preprocessing stage is intentionally minimal, but structurally important. Even when transformation is simple, isolating it into its own stage enforces discipline and prepares the pipeline for future expansion.

    # src/preprocess.py
    import os
    
    import pandas as pd
    import yaml
    
    with open("params.yaml") as f:
        params = yaml.safe_load(f)["preprocess"]
    
    def preprocess(input_path, output_path):
        data = pd.read_csv(input_path)
        os.makedirs(os.path.dirname(output_path), exist_ok=True)
        # index=False keeps the pandas index out of the CSV, so downstream
        # stages see exactly the original schema
        data.to_csv(output_path, index=False)
        print(f"Preprocessed data saved to {output_path}")
    
    if __name__ == "__main__":
        preprocess(params["input"], params["output"])

    Although this example merely copies the CSV, in production scenarios this stage would include imputations, scaling, encoding categorical variables, and schema validation. By separating preprocessing into its own executable unit, we prevent leakage of transformation logic into training scripts and make data lineage explicit.
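
    As a hedged sketch of that fuller stage: in the Pima dataset, zeros in columns such as `Glucose` or `BMI` are physiologically implausible and are commonly treated as missing. A `transform` function like the one below (a hypothetical extension, not part of the article's actual script) could impute and scale before writing the processed CSV:

```python
# Hypothetical extension of the preprocess stage: treat implausible zeros
# as missing, impute them, and standardize the feature columns.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Columns where a literal 0 is physiologically implausible in this dataset
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

def transform(data: pd.DataFrame) -> pd.DataFrame:
    data = data.copy()
    cols = [c for c in ZERO_AS_MISSING if c in data.columns]
    # Recode implausible zeros as NaN, then fill with the column median
    data[cols] = data[cols].replace(0, np.nan)
    data[cols] = SimpleImputer(strategy="median").fit_transform(data[cols])
    # Standardize features; keep the target column untouched
    features = data.drop(columns=["Outcome"])
    scaled = StandardScaler().fit_transform(features)
    out = pd.DataFrame(scaled, columns=features.columns, index=data.index)
    out["Outcome"] = data["Outcome"]
    return out
```

    Note that fitting the imputer and scaler on the full dataset, as done here for simplicity, leaks statistics across the later train/test split; a leakage-safe version would fit these transformers on the training split only, for example inside a scikit-learn `Pipeline`.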

    Run locally:

    python src/preprocess.py

    Later, when declared as a DVC stage, this step becomes dependency-tracked and automatically re-executed when input files or parameters change.


    Stage 2 — Train with hyperparameter search and MLflow tracking

    The training stage demonstrates where engineering discipline truly matters. Instead of a naive fit-and-print script, we implement:

    * Deterministic train/test splitting

    * GridSearchCV for hyperparameter exploration

    * Logging of parameters and metrics

    * Artifact tracking

    * Model serialization
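
    A hedged sketch of what `src/train.py` could look like is shown below. The grid values, function names, and run structure are illustrative assumptions, not the article's exact script; the search logic is kept in its own function so it can be exercised without a tracking server configured:

```python
# src/train.py -- illustrative sketch; paths and keys follow params.yaml above
import os
import pickle

import pandas as pd
import yaml
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

def run_search(X, y, random_state=42):
    """Deterministic train/test split plus a small RandomForest grid search."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    grid = {"n_estimators": [50, 100], "max_depth": [3, 5]}  # illustrative grid
    search = GridSearchCV(
        RandomForestClassifier(random_state=random_state), grid, cv=3, n_jobs=-1
    )
    search.fit(X_tr, y_tr)
    accuracy = accuracy_score(y_te, search.best_estimator_.predict(X_te))
    return search.best_estimator_, search.best_params_, accuracy

def main():
    # mlflow is imported here so run_search stays usable without tracking set up
    import mlflow

    with open("params.yaml") as f:
        params = yaml.safe_load(f)["train"]
    data = pd.read_csv(params["data"])
    X, y = data.drop(columns=["Outcome"]), data["Outcome"]

    mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
    with mlflow.start_run():
        model, best_params, accuracy = run_search(X, y, params["random_state"])
        mlflow.log_params(best_params)
        mlflow.log_metric("accuracy", accuracy)

        # Serialize locally, then attach the file to the run as an artifact
        os.makedirs(os.path.dirname(params["model"]), exist_ok=True)
        with open(params["model"], "wb") as f:
            pickle.dump(model, f)
        mlflow.log_artifact(params["model"], artifact_path="model")
```

    When wired as a DVC stage, the script would be invoked as `python src/train.py`, calling `main()` under the usual `if __name__ == "__main__":` guard.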

    Credential configuration should never be hardcoded:

    export MLFLOW_TRACKING_URI="https://dagshub.com/<user>/<repo>.mlflow"
    export MLFLOW_TRACKING_USERNAME="<your_username>"
    export MLFLOW_TRACKING_PASSWORD="<your_token>"

    This ensures secure connection to a remote MLflow tracking server hosted on DagsHub. Every run is logged remotely, enabling collaboration, auditability, and experiment comparison across machines and team members.

    On typical runs, accuracy for this dataset ranges between 0.75 and 0.80 depending on the split. However, the raw metric is less important than the fact that:

    * Hyperparameters are logged.

    * Artifacts are preserved.

    * Runs are comparable in a centralized UI.

    This is the difference between experimentation and engineering.


    Stage 3 — Evaluate

    Evaluation is intentionally separated from training. While some pipelines combine these steps, isolating evaluation enforces clarity of responsibility.

    # src/evaluate.py (core excerpts)
    import os
    import pickle
    
    import mlflow
    import pandas as pd
    import yaml
    from sklearn.metrics import accuracy_score
    
    with open("params.yaml") as f:
        params = yaml.safe_load(f)["train"]
    
    def evaluate(data_path, model_path):
        data = pd.read_csv(data_path)
        X = data.drop(columns=["Outcome"])
        y = data["Outcome"]
    
        mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
    
        with open(model_path, "rb") as f:
            model = pickle.load(f)
    
        y_pred = model.predict(X)
        accuracy = accuracy_score(y, y_pred)
        # Open an explicit run so the evaluation metric is clearly attributed
        with mlflow.start_run():
            mlflow.log_metric("evaluation_accuracy", accuracy)
        print(f"Evaluation Accuracy: {accuracy}")
    
    if __name__ == "__main__":
        evaluate(params["data"], params["model"])

    By logging evaluation metrics separately, we preserve clarity in experiment lifecycle stages. In more advanced systems, this stage might evaluate against a hold-out test set or a temporal validation slice to avoid data leakage.


    Running locally

    Execution flow:

    python src/preprocess.py
    python src/train.py
    python src/evaluate.py

    This linear workflow demonstrates modularity. Each script performs a single responsibility and can be tested independently.


    Optional: DVC stages for full reproducibility

    DVC transforms this pipeline into a fully declarative DAG.

    dvc stage add -n preprocess \
      -p preprocess.input,preprocess.output \
      -d src/preprocess.py -d data/raw/data.csv \
      -o data/processed/data.csv \
      python src/preprocess.py

    Subsequent stages follow similarly.
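
    For example (with paths matching `params.yaml` above), the remaining two stages could be declared as:

```shell
dvc stage add -n train \
  -p train \
  -d src/train.py -d data/processed/data.csv \
  -o models/model.pkl \
  python src/train.py

dvc stage add -n evaluate \
  -p train.data,train.model \
  -d src/evaluate.py -d models/model.pkl -d data/processed/data.csv \
  python src/evaluate.py
```

    Then reproduce the whole DAG with a single command: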

    dvc repro

    Now, whenever:

    * Code changes

    * Data changes

    * Parameters change

    DVC intelligently re-runs only affected stages. This selective recomputation is critical for scaling experiments efficiently.

    Large artifacts such as datasets and models can be pushed to remote storage while keeping the Git repository lightweight.
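
    With DagsHub, that remote can be the repository's own DVC endpoint. The commands below are a sketch using standard DVC remote configuration; substitute your own user, repo, and token (the `--local` flag keeps credentials out of the committed config):

```shell
# Point DVC at the DagsHub-hosted remote for this repository
dvc remote add -d origin https://dagshub.com/<user>/<repo>.dvc
dvc remote modify origin --local auth basic
dvc remote modify origin --local user <your_username>
dvc remote modify origin --local password <your_token>

# Upload tracked data and model artifacts
dvc push
```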


    A common gotcha: MLflow Model Registry endpoints on DagsHub

    Not all hosted MLflow services support full Model Registry APIs. Attempting to use registry features may produce endpoint errors.

    Instead of:

    mlflow.sklearn.log_model(best_model, "model", registered_model_name="Best Model")

    Use:

    with open(model_path, "wb") as f:
        pickle.dump(best_model, f)
    mlflow.log_artifact(model_path, artifact_path="model")

    This ensures compatibility while preserving artifact versioning.


    Why this project is hiring-manager friendly

    This pipeline demonstrates:

    * Configuration discipline (`params.yaml`)

    * Data versioning (DVC)

    * Experiment traceability (MLflow)

    * Remote collaboration (DagsHub)

    * Artifact management

    * Environment isolation

    It signals that you understand reproducibility, not just modeling.


    What I’d build next

    To elevate this into production-grade MLOps:

    * Add schema validation tests

    * Introduce CI pipelines to execute smoke experiments

    * Containerize with Docker

    * Add feature stores or transformation pipelines

    * Transition to MLflow’s `pyfunc` format when registry support is available


    Wrap-up

    This pipeline moves beyond exploratory notebooks into structured experimentation. It demonstrates that you can version data, track experiments, manage artifacts, and collaborate remotely. It shows engineering maturity — and that is what hiring managers evaluate.

    Reproducibility is not a feature.

    It is the foundation of trustworthy machine learning.

    Tags: DVC, MLflow, DagsHub, MLOps, Machine Learning, Reproducibility, Automation

    The engineering team at Originsoft Consultancy brings together decades of combined experience in software architecture, AI/ML, and cloud-native development. We are passionate about sharing knowledge and helping developers build better software.