A service that fully automates deep-learning training jobs. It works for any MLP model and data, as I found a way to upload and download general PyTorch models using TorchScript.
You upload a TorchScript model and a NumPy dataset in a POST request to /train, which creates a new training job for you. The compute side is handled automatically, so you do not need to worry about which GPU is used. The response gives you a hash ID, which you can use to query /status/<hash_id> for the real-time status of your model’s loss. Once you are satisfied with the loss, or once the training job is done, you can send a GET request to /model/<hash_id> to download the model.
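From the client side, the train / status / download flow described above might look roughly like the sketch below. It uses only the Python standard library and is not the project's own client. It assumes the server accepts a JSON body (the real service may expect form data), and because the response format is not documented here, it returns raw response bodies instead of parsing them. The helper names `submit_job`, `watch_status`, and `download_model` are hypothetical.

```python
import json
import time
import urllib.request

BASE_URL = "http://127.0.0.1:5000"  # local Flask server address from the setup steps


def status_url(hash_id: str) -> str:
    return f"{BASE_URL}/status/{hash_id}"


def model_url(hash_id: str) -> str:
    return f"{BASE_URL}/model/{hash_id}"


def fetch(url: str) -> bytes:
    """Plain GET that returns the raw response body."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def submit_job(payload: dict) -> str:
    """POST a training job and return the raw response body.

    Assumption: the body is sent as JSON. How the hash ID is wrapped in the
    response is not specified, so the caller extracts it.
    """
    req = urllib.request.Request(
        f"{BASE_URL}/train",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


def watch_status(hash_id, polls=5, delay=2.0, fetch_fn=fetch, sleep_fn=time.sleep):
    """Poll /status/<hash_id> a fixed number of times; return the raw replies."""
    seen = []
    for i in range(polls):
        seen.append(fetch_fn(status_url(hash_id)).decode("utf-8"))
        if i < polls - 1:
            sleep_fn(delay)  # wait between polls, but not after the last one
    return seen


def download_model(hash_id, path="trained_model.pt", fetch_fn=fetch):
    """GET /model/<hash_id> and write the returned bytes to disk."""
    with open(path, "wb") as f:
        f.write(fetch_fn(model_url(hash_id)))
    return path
```

The fetch and sleep functions are passed in as parameters so the polling logic can be exercised without a running server.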
Presentation: https://youtu.be/28F5CsZQOkM
Slideshow: https://docs.google.com/presentation/d/1k9oQsrNGXxIDuBNtueDJWNB6viMZAbZzK02bsGo-EB4/edit?usp=sharing
Ensure that Docker and Kubernetes are installed on your machine and then run:
pip install -r requirements.txt
chmod +x run.sh
./run.sh
This will run the Flask server at http://127.0.0.1:5000. The number of workers is handled by Kubernetes' horizontal pod autoscaling. You can configure the maximum number of workers in worker/autoscale.yaml.
Once you have started the system up using the above commands, you can run the demo:
python3 demo.py
This will upload a simple MLP to the cloud and train it on MNIST for 2 epochs. It will then download the model, run inference, and report the accuracy. Overall, this should take about 3 minutes.
If you want to set up the basic Prometheus server:
chmod +x prometheus.sh
./prometheus.sh
You can then go to http://localhost:9090/query to see the metrics dashboard.
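Besides the dashboard, the same server can be queried programmatically through Prometheus's standard HTTP API (`/api/v1/query`). The sketch below is a minimal illustration, not part of this project. The metrics this system exports are not listed here, so the usage note relies on `up`, the built-in metric Prometheus records for every scrape target. The helper names are hypothetical, and only instant-vector results are handled.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # address from the setup step above


def query_url(promql: str) -> str:
    """Build the instant-query URL for a PromQL expression."""
    return f"{PROMETHEUS_URL}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_instant_vector(body: str):
    """Extract (labels, value) pairs from an instant-vector response body."""
    doc = json.loads(body)
    if doc.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {doc.get('error')}")
    # Each sample looks like {"metric": {...labels...}, "value": [timestamp, "value"]}
    return [(r["metric"], float(r["value"][1])) for r in doc["data"]["result"]]


def instant_query(promql: str):
    """Run a PromQL instant query against the local Prometheus server."""
    with urllib.request.urlopen(query_url(promql)) as resp:
        return parse_instant_vector(resp.read().decode("utf-8"))
```

With the server running, `instant_query("up")` would return one entry per scrape target, with a value of 1.0 for targets that are reachable.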
POST /train:
- Create a training run
- Returns a SHA-256 hash to track the training run
Body parameters
- model: a base64-encoded TorchScript buffer
- data: numpy data to train on
- labels: numpy labels
- lr: learning rate
- epochs: number of times to iterate through entire data
- batch_size: number of samples per batch
GET /status/<string:hash_id>
- Returns latest batch loss from your training run
GET /model/<string:hash_id>
- Downloads trained model to your machine

- Kubernetes/Prometheus
- PyTorch/TorchScript
- MinIO
- Flask
- RabbitMQ
- Redis
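
As an illustration of the POST /train body parameters listed in the API section, the sketch below assembles a request body from already-serialized bytes. Only the `model` encoding (a base64-encoded TorchScript buffer) is stated in this README. Sending `data` and `labels` as base64-encoded `np.save` bytes is an assumption, as are the default values and the hypothetical helper name `build_train_body`. See `demo.py` for the format the server actually expects.

```python
import base64
import json


def b64(raw: bytes) -> str:
    """Base64-encode raw bytes as an ASCII string suitable for a request body."""
    return base64.b64encode(raw).decode("ascii")


def build_train_body(model_bytes, data_bytes, labels_bytes,
                     lr=0.01, epochs=2, batch_size=64):
    """Assemble the POST /train parameters into a dict.

    model_bytes:  a serialized TorchScript module, e.g. written with
                  torch.jit.save(scripted_model, buf) into an io.BytesIO.
    data_bytes / labels_bytes: serialized arrays, e.g. written with
                  np.save(buf, array). ASSUMPTION: the server may expect
                  a different array encoding.
    """
    if lr <= 0:
        raise ValueError("lr must be positive")
    if epochs < 1 or batch_size < 1:
        raise ValueError("epochs and batch_size must be at least 1")
    return {
        "model": b64(model_bytes),
        "data": b64(data_bytes),      # assumption: same encoding as the model
        "labels": b64(labels_bytes),  # assumption: same encoding as the model
        "lr": lr,
        "epochs": epochs,
        "batch_size": batch_size,
    }


# Placeholder bytes stand in for real serialized buffers.
body = build_train_body(b"abc", b"\x00", b"\x01", lr=0.1, epochs=2, batch_size=32)
encoded = json.dumps(body)  # ready to send if the server accepts JSON
```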