Data is the fuel that powers machine learning models. As data continuously evolves during ML experimentation, tracking different versions becomes critical. MLflow is a popular open source platform that provides integrated capabilities for managing data versions and lineage.
Understanding MLflow
MLflow stitches together the complete machine learning lifecycle within a single, shared platform. Its modular components help track experiments, package models, manage artifacts, and visualize results.

MLflow components for managing machine learning lifecycle
Tracking records metrics and metadata as runs while training models. It creates a full audit trail of model development from raw datasets to final models.
Projects packages ML code as reproducible runs to test easily across environments. They define parameters, dependencies and entry points for consistent executions.
Models packages trained models from diverse ML libraries in a standard format, enabling deployment to targets like Kubernetes and TensorFlow Serving.
Registry creates a central model repository for managing versions, annotations, workflows and approval policies.
Artifacts are files, datasets and models logged by runs for persistence and sharing between runs.
Of these, artifacts and tracking are essential for data versioning capabilities.
Why Data Versioning Matters
Maintaining multiple versions of evolving datasets is important because ML models are extremely sensitive to data changes. Without visibility into the impact of data on model quality, aberrations can creep in silently, putting models at risk.
According to a 2021 Alteryx survey, 36% of companies lose track of which training data their models were built on within months. Further, 24% attributed bad AI behavior to a lack of data lineage.
Data versioning delivers various advantages:
- Reproducibility: Reuse exact training dataset for accurate model recreations
- Safeguarding: Prevent undetected data errors from cascading to models
- Auditability: Document societal impact of data changes for governance
- Debugging: Identify data issues responsible for performance dips
- Attribution: Quantify influence of new data sources on metrics
- Collaboration: Enable teams to concurrently improve datasets
Despite wide recognition, only 34% of firms practice systematic data versioning, as per VentureBeat. The common workaround of copying files under incrementally numbered names has proven inadequate for scale, speed and completeness.
Achieving Integrated Versioning With MLflow
The MLflow Artifacts subsystem offers turnkey management for data versions, model files, images, log files and more. The key versioning APIs are:
log_artifact and log_artifacts copy local files/directories into run-specific artifact stores. Uploaded artifacts can be large datasets, configuration files, intermediate outputs, or supplementary material. Artifacts persist across client sessions.
download_artifacts retrieves artifacts from specified runs, enabling prior versions to be used for tests, diagnostics or compliance. Integrations for cloud stores like S3, Azure Blob, GCS handle large artifacts.
Artifacts logged to different runs are stored independently. Re-logging a file with the same name in a new run stores a fully distinct copy; the earlier run's copy is never overwritten, preventing accidental data damage or loss.
Such uploads automatically create a complete activity trail even without custom logging code. The full evolution of data can be replayed for audits. Any production model can be traced back to its distinct training dataset.

Artifacts enable tracking multiple dataset versions across runs
Frameworks like DVC and Daten offer plugins that use MLflow tracking as the central control plane for their versioned datasets, and data build pipelines can natively integrate artifact logging.
Lineage tracking becomes effortless even as datasets change form – raw loads to transformed to final training sets. Intermediate failed iterations are still preserved thanks to artifacts.
Step-by-Step Guide For Data Versioning
Here is an end-to-end demonstration of managing data versions for a churn prediction model using MLflow APIs:
1. Initialize tracking: Set the backend store as SQLite database to record runs
import mlflow

mlflow.set_tracking_uri("sqlite:///mlruns.db")
2. Load raw data: Source dataset raw_data.csv, start run raw_version to log
with mlflow.start_run(run_name='raw_version') as run:
    mlflow.log_artifact("raw_data.csv", "data")
3. Cleansed version: Fix errors, filter anomalies, save as cleansed_data.csv. Log it in run v1.
with mlflow.start_run(run_name='v1') as run:
    mlflow.log_artifact("cleansed_data.csv", "data")
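The cleansing step itself might look like the following pandas sketch; the column names, sample values and filtering rules are illustrative assumptions, not part of the original pipeline.

```python
import pandas as pd

# Illustrative raw churn data with a duplicate row, an invalid charge,
# and an unlabeled record.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "monthly_charges": [29.9, -5.0, -5.0, 61.2],  # negative value is an error
    "churned": [0, 1, 1, None],
})

cleansed = (
    raw.drop_duplicates(subset="customer_id")  # remove duplicate customers
       .query("monthly_charges > 0")           # filter anomalous charges
       .dropna(subset=["churned"])             # drop unlabeled rows
)
cleansed.to_csv("cleansed_data.csv", index=False)
```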
4. Enriched version: Join external customer data, apply feature encoding, output enriched_data.csv. Log it as the v2 artifact.
with mlflow.start_run(run_name='v2') as run:
    mlflow.log_artifact("enriched_data.csv", "data")
5. Model building: Load logged datasets, train models, evaluate accuracy.
def train_model(run_name):
    # load_dataset, build_model and evaluate are helpers defined elsewhere;
    # runs are looked up by the names used when logging ('v1', 'v2').
    dataset = load_dataset(run_name)
    model = build_model(dataset)
    return evaluate(model)

v1_acc = train_model('v1')
v2_acc = train_model('v2')
6. Run Comparison: Leverage tracking UI to visually compare metrics across versions. Identify optimal data schema.
7. Query artifacts: Retrieve specific data versions programmatically for experiments.
# Look up the run ID of the 'v2' run first; run names are not run IDs.
dataset = mlflow.artifacts.download_artifacts(
    run_id=v2_run_id, artifact_path='data', dst_path='.')
This self-contained pipeline keeps data synchronized with modeling without external coordination. Shared visibility across the project lifecycle helps align decisions to arrive at robust models.
Integrating With Data Platforms
For enterprise usage, MLflow artifact stores can be persisted in centralized data lakes and warehouses instead of local storage.

Ingest datasets from and log artifacts to shared data lakes
This enables uniform data discovery and governance simultaneously with ML experimentation. It also facilitates reuse of curated data beyond individual projects.
Here are some standard integrations possible:
S3 and MinIO: Objects uploaded as artifacts are versioned by the underlying object store's versioning capabilities. Data teams can directly access versions from other applications.
HDFS: Clustered storage combined with Spark integration allows large scale artifact logging at low latency. Data pipelines can source versioned datasets.
Azure Blob: Native integration allows both uploading and mounting Blob containers as artifact stores. Data Factory can materialize datasets from mounted locations.
Snowflake and Delta Lake: Custom hooks log artifact meta to tables. Bulk uploaded artifacts become queryable via external stages.
DVC Remote Stores: Remote caches like S3, GCS buckets used by DVC CLI can double as artifact repositories sharing same data.
Securing and Managing Data Access
With data scattered across versioned artifacts and runs, access controls and organizational best practices become necessary.
- Consistency: Ensure consistent naming, formats, schemas for artifact datasets
- Documentation: Attach metadata like data dictionaries, schema definitions to ease discovery
- Granular access: Leverage provider native IAM roles to restrict data visibility
- Retention rules: Set artifact expiry duration based on governance policies
- Query interfaces: Build search features and audit trails into experimentation interfaces
- Monitoring: Track artifact usage volume, run frequency to identify hotspots

According to Fivetran's 2021 survey, over 92% of executives prioritized implementing data access controls.
Real-World Examples
Here are some inspiring examples highlighting innovative usage of MLflow artifacts:
- Trustworthy AI: Log test dataset splits along with artifacts to detect hidden data biases before productionization
- Geospatial ML: Version satellite, aerial imagery datasets as artifacts to assess model accuracy across locations and weather conditions
- Robotics: Log raw sensor stream data as artifacts to simulate virtual test environments reducing physical trials
- Drug Discovery: Create bioactivity data dashboard aggregating statistics across artifact versions identifying candidate molecule growth
Key Takeaways
- MLflow provides flexible, integrated versioning for datasets and artifacts via lightweight APIs
- It builds complete lineage tracking of how data transforms over successive ML iterations
- Broader infrastructure integrations enable scalable, governed data flows across the organization
- Advanced usage like bias mitigation and simulation testing provide standout benefits
About the Author
As a machine learning expert with over 7 years of full-stack development experience, I now lead data science initiatives for Fortune 500 companies. I frequently blog about leveraging the latest ML tools and architectures to build impactful AI applications.


