Inspiration
Agentic workflows can be very expensive, especially when you sample completions from large, costly models hundreds or thousands of times with the same few prompts. Many of the tasks these workflows require are hyper-specific and can be handled just as well by a smaller, fine-tuned model. Such workflows can also gain large speed boosts from small models that match the big model's accuracy, or more generally move to a better (speed, accuracy, efficiency) Pareto frontier.
What it does
Distillery lets you rapidly create ("distill") efficient agent models by training a small, cheap model on a large model's outputs. You can then drop those smaller models into any agentic workflow to make it faster, cheaper in compute, or both.
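The core idea — training a small student model to imitate a large teacher's outputs — can be sketched with a toy soft-label distillation loop. Everything here is a hypothetical stand-in (tiny MLPs instead of real language models), not Distillery's actual training code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins: in practice the "teacher" is a large expensive
# model and the "student" is the small, cheap model being distilled.
teacher = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
student = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 4))

opt = torch.optim.Adam(student.parameters(), lr=1e-2)
x = torch.randn(256, 16)  # stands in for a batch of prompts/inputs

with torch.no_grad():
    teacher_logits = teacher(x)  # "sampled completions" from the big model

losses = []
for _ in range(200):
    opt.zero_grad()
    # Soft-label KD loss: pull the student's output distribution
    # toward the teacher's.
    loss = F.kl_div(
        F.log_softmax(student(x), dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

After training, the student alone serves requests, so per-call cost drops roughly in proportion to the size difference between the two models.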
With Distillery, a workflow that previously ran 100 agents can run 1000 agents on the same compute footprint, improving agent scaling without a major loss in per-agent accuracy. Depending on the use case, this can make or break the viability of an agentic workflow at a given compute budget.
How we built it
The client is a simple, efficient CLI for choosing your model and data. The server is hosted on Modal, which lets you spin up GPUs on demand both to sample completions and to fine-tune. The Modal integration means anyone can create these distillations without access to large amounts of their own compute.
Note: the demo is on a toy example of distillation, since a full distillation cycle would take too long.
Built With
- modal
- python
- torch