Streaming Data for Model Training

A data streaming pipeline for efficient model training on large datasets.

Overview

This project implements streaming data ingestion for ML model training, allowing models to train on data larger than available RAM by streaming batches directly from storage.
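The idea of streaming batches from storage can be sketched with a plain Python generator. This is a minimal illustration, not the project's actual code: the names `stream_batches` and `stream_csv_batches` are hypothetical, and CSV is assumed only as an example format.

```python
import csv
from typing import Iterable, Iterator, List


def stream_batches(rows: Iterable[List[str]], batch_size: int) -> Iterator[List[List[str]]]:
    """Group an iterable of rows into lists of at most batch_size rows.

    Only the current batch is held in memory, so the input can be
    arbitrarily large as long as it is itself read lazily.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[List[str]] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


def stream_csv_batches(path: str, batch_size: int) -> Iterator[List[List[str]]]:
    """Hypothetical helper: read a CSV file lazily and yield row batches."""
    with open(path, newline="") as f:
        # csv.reader pulls one line at a time from the open file.
        yield from stream_batches(csv.reader(f), batch_size)
```

A training loop would then iterate `for batch in stream_csv_batches(path, batch_size): ...`, so peak memory depends on the batch size rather than on the file size.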

Architecture

Data Source → Stream Processor → Batch Generator → Model Trainer
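One way to picture the four stages is as chained generators feeding a trainer object. The sketch below is illustrative only: every name in it (`data_source`, `stream_processor`, `batch_generator`, `RunningMeanTrainer`, `run_pipeline`) is an assumption, the data is synthetic, and the "model" is a placeholder that tracks a running mean instead of doing real training.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, float]


def data_source(n: int) -> Iterator[Record]:
    """Stand-in data source: yields n synthetic records one at a time."""
    for i in range(n):
        yield {"x": float(i), "y": 2.0 * i}


def stream_processor(records: Iterable[Record]) -> Iterator[Record]:
    """Per-record transform; here it only rescales the x feature."""
    for r in records:
        yield {"x": r["x"] / 10.0, "y": r["y"]}


def batch_generator(records: Iterable[Record], batch_size: int) -> Iterator[List[Record]]:
    """Group processed records into batches of at most batch_size."""
    batch: List[Record] = []
    for r in records:
        batch.append(r)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


class RunningMeanTrainer:
    """Placeholder trainer: tracks the mean of y over all batches seen."""

    def __init__(self) -> None:
        self.count = 0
        self.total = 0.0
        self.batches_seen = 0

    def train_on_batch(self, batch: List[Record]) -> None:
        self.batches_seen += 1
        self.count += len(batch)
        self.total += sum(r["y"] for r in batch)

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0


def run_pipeline(n_records: int, batch_size: int) -> RunningMeanTrainer:
    """Wire the stages: source, processor, batcher, trainer."""
    trainer = RunningMeanTrainer()
    stream = batch_generator(stream_processor(data_source(n_records)), batch_size)
    for batch in stream:
        trainer.train_on_batch(batch)
    return trainer
```

Because each stage is lazy, a record is read, transformed, and batched only when the trainer asks for the next batch, which keeps memory use bounded by one batch.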

Getting Started

Prerequisites

- Python 3.10+
- Jupyter Notebook
- Required packages (see notebooks for imports)

Running

1. Open the notebook in Jupyter or Google Colab
2. Configure your data source path
3. Run cells sequentially

Features

- Memory-efficient data streaming
- Configurable batch sizes
- Support for multiple data formats

License

MIT