Distributed Neural Network Training

Kevin Buhler

A service that fully automates deep-learning training jobs. It works for any MLP model and data, as I found a way to upload and download general PyTorch models using TorchScript.

You upload a TorchScript model and a NumPy dataset by making a POST request to /train, which creates a new training job for you. The compute side is taken care of automatically, so you do not need to worry about which GPU is used. The response contains a hash ID, which you can use to query /status/<hash_id> for the real-time status of your model’s loss. Once you are satisfied with the loss, or once the training job is done, you can send a GET request to /model/<hash_id> to download the model.
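The submit, poll, download flow described above can be sketched as a small client. This is only an illustration, not the project's demo.py: the `request` callable, the JSON field names `hash_id` and `loss`, and the `target_loss` threshold are assumptions made for the example. Only the three routes come from the description above.

```python
import json
import time
from typing import Callable, Optional

# Hypothetical transport: (method, path, body) -> raw response bytes.
# In practice this could wrap urllib.request or an HTTP client library.
Request = Callable[[str, str, Optional[bytes]], bytes]


def train_and_download(
    request: Request,
    payload: dict,
    target_loss: float,
    poll_seconds: float = 1.0,
    max_polls: int = 100,
) -> bytes:
    """Submit a job, poll its loss, then download the model bytes."""
    # POST /train creates the job and returns a hash ID.
    # ASSUMPTION: the response is JSON with a "hash_id" field.
    created = json.loads(request("POST", "/train", json.dumps(payload).encode()))
    hash_id = created["hash_id"]

    for _ in range(max_polls):
        # GET /status/<hash_id> reports the latest batch loss.
        # ASSUMPTION: the response is JSON with a "loss" field.
        status = json.loads(request("GET", f"/status/{hash_id}", None))
        loss = status.get("loss")
        if loss is not None and loss <= target_loss:
            break
        time.sleep(poll_seconds)

    # GET /model/<hash_id> returns the trained model.
    return request("GET", f"/model/{hash_id}", None)
```

Passing the transport in as a function keeps the sketch independent of any particular HTTP library and makes it easy to try against a stub before pointing it at the Flask server.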

Presentation: https://youtu.be/28F5CsZQOkM

Slideshow: https://docs.google.com/presentation/d/1k9oQsrNGXxIDuBNtueDJWNB6viMZAbZzK02bsGo-EB4/edit?usp=sharing

How to run (locally)

Ensure that Docker and Kubernetes are installed on your machine and then run:

pip install -r requirements.txt
chmod +x run.sh
./run.sh

This will run the Flask server at http://127.0.0.1:5000. The number of workers is handled by Kubernetes' horizontal pod autoscaling. You can configure the maximum number of workers in worker/autoscale.yaml.

Demo

Once you have started the system up using the above commands, you can run the demo:

python3 demo.py

This will upload a simple MLP to the cloud and train it on MNIST for 2 epochs. It will then download the model, run inference, and report the accuracy. Overall, this should take about 3 minutes.
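The accuracy step amounts to comparing predicted classes with the true labels. A minimal sketch in plain Python follows; the function name and the sample values are illustrative and are not taken from demo.py.

```python
from typing import Sequence


def accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Three of the four digits are classified correctly.
print(accuracy([7, 2, 1, 0], [7, 2, 1, 4]))  # 0.75
```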

Prometheus

If you want to set up the basic Prometheus server:

chmod +x prometheus.sh
./prometheus.sh

You can then go to http://localhost:9090/query to see the metrics dashboard.

API Routes

POST /train
- Creates a training run
- Returns a SHA-256 hash to track the training run

Body parameters
- model: base64-encoded TorchScript buffer
- data: NumPy data to train on
- labels: NumPy labels
- lr: learning rate
- epochs: number of passes over the entire dataset
- batch_size: number of samples per batch
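A request body with these fields could be assembled roughly as follows. Only the `model` field is documented as base64-encoded; encoding `data` and `labels` the same way, and sending the body as JSON, are assumptions made for this sketch. The placeholder byte strings stand in for a serialized TorchScript model (for example, one written with torch.jit.save into an io.BytesIO buffer) and for serialized NumPy arrays.

```python
import base64
import json


def b64(raw: bytes) -> str:
    """Base64-encode raw bytes into an ASCII string."""
    return base64.b64encode(raw).decode("ascii")


def build_train_body(model: bytes, data: bytes, labels: bytes,
                     lr: float, epochs: int, batch_size: int) -> str:
    """Return a JSON string carrying the six /train parameters."""
    if lr <= 0 or epochs < 1 or batch_size < 1:
        raise ValueError("lr must be positive; epochs and batch_size at least 1")
    return json.dumps({
        "model": b64(model),    # documented: base64-encoded TorchScript buffer
        "data": b64(data),      # ASSUMPTION: encoded the same way as the model
        "labels": b64(labels),  # ASSUMPTION: encoded the same way as the model
        "lr": lr,
        "epochs": epochs,
        "batch_size": batch_size,
    })


# Placeholder bytes; real values would be the serialized model and arrays.
body = build_train_body(b"<torchscript bytes>", b"<x bytes>", b"<y bytes>",
                        lr=0.01, epochs=2, batch_size=64)
```

Validating the hyperparameters on the client side is a design choice of this sketch; the server may apply its own checks.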

GET /status/<string:hash_id>
- Returns the latest batch loss from your training run

GET /model/<string:hash_id>
- Downloads the trained model to your machine

Software

Kubernetes/Prometheus
PyTorch/TorchScript
MinIO
Flask
RabbitMQ
Redis

