What is lakeFS?
lakeFS provides a highly scalable data version control architecture designed to bridge the AI infrastructure gap. It serves as a control plane for AI-ready data, managing the complete data lifecycle, provenance, and unified access for AI and data teams. The platform enables organizations to ensure data quality, enforce compliance standards, and catch errors before they impact production environments.
With lakeFS, teams can test pipeline and model changes in isolation on production data without copying it, instantly roll back from data incidents, and track the data used in experiments or model training. The system offers full visibility into data history with built-in audit trails and helps satisfy model governance requirements. It reduces data access friction by letting teams work with remote data as if it were local, managing access permissions across all storage from a single interface, and keeping GPUs busy instead of waiting on data.
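The isolation-and-rollback workflow above can be sketched with the `lakectl` CLI. This is a minimal sketch, not a full walkthrough; the repository name `example-repo`, the branch name `exp`, and the file paths are hypothetical, and the commands assume a configured lakeFS endpoint.

```shell
# Create an isolated branch from main: a zero-copy snapshot of production data.
lakectl branch create lakefs://example-repo/exp -s lakefs://example-repo/main

# Write a changed dataset to the branch and commit it.
lakectl fs upload lakefs://example-repo/exp/data/users.parquet -s ./users.parquet
lakectl commit lakefs://example-repo/exp -m "recompute users table"

# Inspect what changed relative to production, then promote atomically.
lakectl diff lakefs://example-repo/main lakefs://example-repo/exp
lakectl merge lakefs://example-repo/exp lakefs://example-repo/main

# If the promoted data turns out to be bad, revert the merge commit on main.
lakectl branch revert lakefs://example-repo/main <commit-id>
```

The branch is metadata-only, so creating it does not duplicate any objects in the underlying store; the merge is the atomic promotion step.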
Features
- Format-Agnostic Data Version Control: Supports any data format including Parquet, CSV, JSON, and unstructured data
- Zero Clone Copy for Isolated Environments: Test changes on production data without copying via branches
- Atomic Data Promotion: Promote data changes safely via merge operations
- Cloud-Agnostic Storage: Connects to any object store through an S3-compatible interface, including AWS, Azure, and GCP
- Data CI/CD Using Hooks: Implement continuous integration and deployment for data pipelines
- Role-Based Access Control: Manage permissions across all storage from a single interface with RBAC
- Multiple Storage Backends Support: Works with various storage solutions including local and cloud options
- Audit Logs: Maintain complete visibility into data history and changes
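The "Data CI/CD Using Hooks" feature works by placing action definitions under the `_lakefs_actions/` prefix of a repository; lakeFS runs them on events such as pre-merge. The sketch below writes a minimal webhook action and commits it; the repository name, branch, and webhook URL are hypothetical, and the exact properties your hook needs depend on your setup.

```shell
# Define a pre-merge hook that calls a validation webhook before
# any merge into main is allowed to complete.
mkdir -p _lakefs_actions
cat > _lakefs_actions/pre-merge-checks.yaml <<'EOF'
name: pre-merge data quality checks
on:
  pre-merge:
    branches:
      - main
hooks:
  - id: schema_validator
    type: webhook
    properties:
      url: https://ci.example.com/validate   # hypothetical endpoint
EOF

# Hooks live inside the repository itself, under _lakefs_actions/.
lakectl fs upload lakefs://example-repo/main/_lakefs_actions/pre-merge-checks.yaml \
  -s _lakefs_actions/pre-merge-checks.yaml
lakectl commit lakefs://example-repo/main -m "add pre-merge data quality hook"
```

If the webhook returns a failure, the merge is blocked, which is how bad data is stopped before it reaches production branches.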
Use Cases
- Testing pipeline and model changes in isolation on production data
- Ensuring data quality and compliance standards before production impact
- Tracking data used in machine learning experiments and model training
- Managing data access permissions across distributed teams and storage systems
- Implementing data version control for AI and ML projects
- Running parallel pipelines with different logic for experimentation
- Comparing large result sets for data science and machine learning
- Streamlining data science and MLOps workflows
FAQs
- What cloud providers does lakeFS support?
  lakeFS Cloud currently supports AWS, Azure, and GCP.
- How does lakeFS connect securely to my data?
  You can configure a private link between your VPC and lakeFS Cloud, providing private connectivity without exposing your traffic to the public internet.
- Does my data stay in place with lakeFS?
  Yes. Your data and metadata always stay within your VPC. lakeFS manages metadata: per-commit pointers to the locations of the files in your buckets, and that metadata also sits inside your buckets.
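Because lakeFS exposes an S3-compatible endpoint, existing S3 clients can read versioned data in place: the repository acts as the bucket, and the branch or commit ID is the first path segment. A minimal sketch, assuming a hypothetical endpoint `https://lakefs.example.com` and repository `example-repo`:

```shell
# List objects on the main branch through the lakeFS S3 gateway.
aws s3 ls s3://example-repo/main/ --endpoint-url https://lakefs.example.com

# Read a file exactly as it was at a specific commit by using
# the commit ID as the ref in the path.
aws s3 cp s3://example-repo/<commit-id>/data/users.parquet . \
  --endpoint-url https://lakefs.example.com
```

No data is copied or moved: the gateway resolves the ref to per-commit pointers and serves the objects from your own buckets.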