kevbuh/nn-trainer

Distributed Neural Network Training

Kevin Buhler

A service that fully automates deep-learning training jobs. It works for any MLP model and dataset, because general PyTorch models can be uploaded and downloaded as TorchScript.

You upload a TorchScript model and a NumPy dataset, then make a POST request to /train, which creates a new training job for you. Compute is handled automatically, so you don't need to worry about which GPU is used. The request returns a hash ID, which you can use to query /status/<hash_id> for the real-time loss of your model. Once you are satisfied with the loss, or the training job is done, send a GET request to /model/<hash_id> to download the model.
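The status-polling step of this workflow could look roughly like the sketch below. It uses only the Python standard library. The response format of /status is not documented here, so the JSON body with a "loss" field is an assumption, and the helper names fetch_loss and wait_for_loss are hypothetical. The base URL is the local address from the "How to run" section.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "http://127.0.0.1:5000"  # local Flask server address from this README


def fetch_loss(hash_id: str) -> float:
    """Query /status/<hash_id> once.

    ASSUMPTION: the server answers with JSON such as {"loss": 0.42}.
    Adjust the parsing to whatever the service actually returns.
    """
    with urllib.request.urlopen(f"{BASE_URL}/status/{hash_id}") as resp:
        return float(json.load(resp)["loss"])


def wait_for_loss(get_loss: Callable[[], float], target_loss: float,
                  poll_seconds: float = 2.0, max_polls: int = 100) -> float:
    """Poll get_loss until it reports a loss <= target_loss.

    Gives up after max_polls attempts and returns the last loss seen.
    """
    loss = float("inf")
    for _ in range(max_polls):
        loss = get_loss()
        if loss <= target_loss:
            break
        time.sleep(poll_seconds)
    return loss
```

With a hash ID returned by /train, a call such as wait_for_loss(lambda: fetch_loss(hash_id), target_loss=0.1) would block until the loss is low enough or the poll limit is reached, after which the model can be downloaded from /model/<hash_id>.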

Presentation: https://youtu.be/28F5CsZQOkM

Slideshow: https://docs.google.com/presentation/d/1k9oQsrNGXxIDuBNtueDJWNB6viMZAbZzK02bsGo-EB4/edit?usp=sharing

How to run (locally)

Ensure that Docker and Kubernetes are installed on your machine and then run:

pip install -r requirements.txt
chmod +x run.sh
./run.sh

This starts the Flask server at http://127.0.0.1:5000. The number of workers is managed by Kubernetes horizontal pod autoscaling. You can configure the maximum number of workers in worker/autoscale.yaml.

Demo

Once you have started the system up using the above commands, you can run the demo:

python3 demo.py

This uploads a simple MLP to the cloud and trains it on MNIST for 2 epochs. It then downloads the model, runs inference, and reports the accuracy. Overall, this should take about 3 minutes.

Prometheus

If you want to set up the basic Prometheus server:

chmod +x prometheus.sh
./prometheus.sh

You can then go to http://localhost:9090/query to see the metrics dashboard.

API Routes

POST /train
- Creates a training run
- Returns a SHA-256 hash that identifies the training run

Body parameters
- model: base64-encoded TorchScript buffer
- data: NumPy data to train on
- labels: NumPy labels
- lr: learning rate
- epochs: number of passes over the entire dataset
- batch_size: number of samples per batch

GET /status/<string:hash_id>
- Returns the latest batch loss from your training run

GET /model/<string:hash_id>
- Downloads the trained model to your machine
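
As an illustration of the /train body parameters, here is a minimal sketch that assembles the request with the Python standard library. Only the base64 encoding of the model is stated above. Sending the body as JSON and base64-encoding the data and labels are assumptions, and build_train_request is a hypothetical helper, not part of this repository. Check demo.py for the encoding the server actually expects.

```python
import base64
import json
import urllib.request


def build_train_request(base_url: str, model_bytes: bytes, data_bytes: bytes,
                        labels_bytes: bytes, lr: float, epochs: int,
                        batch_size: int) -> urllib.request.Request:
    """Build (but do not send) a POST /train request.

    model_bytes would typically be a TorchScript module serialized into an
    in-memory buffer; data_bytes and labels_bytes are serialized NumPy arrays.
    """
    payload = {
        # Stated in the API description: base64-encoded TorchScript buffer.
        "model": base64.b64encode(model_bytes).decode("ascii"),
        # ASSUMPTION: arrays are also sent as base64 strings.
        "data": base64.b64encode(data_bytes).decode("ascii"),
        "labels": base64.b64encode(labels_bytes).decode("ascii"),
        "lr": lr,
        "epochs": epochs,
        "batch_size": batch_size,
    }
    return urllib.request.Request(
        f"{base_url}/train",
        data=json.dumps(payload).encode("utf-8"),
        # ASSUMPTION: the server accepts a JSON body.
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send the request. The response should contain the hash used by the /status and /model routes.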

Software

  • Kubernetes/Prometheus
  • PyTorch/TorchScript
  • MinIO
  • Flask
  • RabbitMQ
  • Redis

About

Distributed Neural Network Training with TorchScript, Redis, RabbitMQ, K8
