A data streaming pipeline for efficient model training on large datasets.
This project implements streaming data ingestion for ML model training, allowing models to train on data larger than available RAM by streaming batches directly from storage.
Data Source → Stream Processor → Batch Generator → Model Trainer
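
The four stages above map naturally onto chained Python generators, so only one batch is held in memory at a time. The following is a minimal sketch, not the notebook's actual code: the function names (`read_rows`, `process`, `batches`, `train`), the CSV input, and the placeholder trainer are all assumptions made for illustration.

```python
import csv
import io
from itertools import islice
from typing import Iterable, Iterator


def read_rows(source: Iterable[str]) -> Iterator[list[str]]:
    """Data source (hypothetical): yield one parsed CSV row at a time."""
    yield from csv.reader(source)


def process(rows: Iterable[list[str]]) -> Iterator[list[float]]:
    """Stream processor (hypothetical): convert each row to floats lazily."""
    for row in rows:
        yield [float(value) for value in row]


def batches(items: Iterable[list[float]], batch_size: int) -> Iterator[list[list[float]]]:
    """Batch generator: group items into lists of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # islice pulls only batch_size items at a time from the upstream generators.
    while batch := list(islice(iterator, batch_size)):
        yield batch


def train(batch_stream: Iterable[list[list[float]]]) -> int:
    """Model trainer placeholder: counts examples instead of updating weights."""
    seen = 0
    for batch in batch_stream:
        seen += len(batch)
    return seen


# In-memory stand-in for a file on disk; an open file handle works the same way.
sample = io.StringIO("1,2\n3,4\n5,6\n7,8\n9,10\n")
batch_list = list(batches(process(read_rows(sample)), batch_size=2))
```

Because every stage is lazy, nothing is read from the source until the trainer asks for the next batch. Replacing `train` with a real training step leaves the upstream stages unchanged.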
- Python 3.10+
- Jupyter Notebook
- Required packages (see notebooks for imports)

- Open the notebook in Jupyter or Google Colab
- Configure your data source path
- Run cells sequentially

- Memory-efficient data streaming
- Configurable batch sizes
- Support for multiple data formats

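One way to combine a configurable batch size with support for several formats is to dispatch on the file extension and batch whatever records the chosen reader yields. The sketch below is illustrative only: the formats shown (CSV and JSON Lines), the `READERS` table, and the `stream_batches` function are assumptions, not the formats or API this project actually provides.

```python
import csv
import json
from pathlib import Path
from typing import Callable, Iterator


def _stream_csv(path: Path) -> Iterator[dict]:
    """Yield one dict per CSV row, keyed by the header line."""
    with path.open(newline="") as handle:
        yield from csv.DictReader(handle)


def _stream_jsonl(path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines file."""
    with path.open() as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Hypothetical registry: add an entry here to support another format.
READERS: dict[str, Callable[[Path], Iterator[dict]]] = {
    ".csv": _stream_csv,
    ".jsonl": _stream_jsonl,
}


def stream_batches(path: str, batch_size: int = 32) -> Iterator[list[dict]]:
    """Yield lists of up to batch_size records without loading the whole file."""
    source = Path(path)
    try:
        reader = READERS[source.suffix.lower()]
    except KeyError:
        raise ValueError(f"unsupported format: {source.suffix!r}") from None
    batch: list[dict] = []
    for record in reader(source):
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch
```

A call such as `for batch in stream_batches("data.csv", batch_size=64): ...` (with `data.csv` standing in for your own path) reads the file incrementally. Note that CSV values arrive as strings and need converting before training.
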
License: MIT