Inspiration
We noticed coworkers submitting huge prompts to a single AI model, with each request often taking tens of minutes to finish. That bottleneck highlighted a gap in our workflow and showed us there was a real need for a faster way to handle large problems.
What it does
Our solution uses a Kubernetes-style approach. It splits a large prompt into smaller sub-prompts, then distributes them across worker pods in proportion to each pod's efficiency. It also supports models from common LLM providers such as OpenAI, Anthropic, and Google (Gemini).
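The splitting step might look like the sketch below: a greedy splitter that packs paragraphs into chunks under a size limit. The function name, the paragraph-boundary strategy, and the `max_chars` parameter are illustrative assumptions, not the project's actual splitting logic.

```python
def split_prompt(prompt: str, max_chars: int = 2000) -> list[str]:
    """Greedily pack paragraphs into sub-prompts of at most max_chars."""
    paragraphs = [p.strip() for p in prompt.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        # Start a new chunk when adding this paragraph would exceed the limit.
        if current and len(current) + len(p) + 2 > max_chars:
            chunks.append(current)
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks
```

A real splitter would also need to keep shared context (instructions, system prompt) with each sub-prompt so every worker sees a self-contained task.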
How we built it
For our tech stack, we used Kubernetes and Docker for worker orchestration, a Python/FastAPI backend with API keys for OpenAI, Gemini, and Anthropic models, and React for the front-end. To distribute the sub-prompts among the pods, we measured the speed of each pod and divided it by the sum of all the speeds, giving the fraction of sub-prompts assigned to that pod.
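The speed-proportional distribution described above can be sketched as follows. Pod speeds are assumed to be measured throughput values (e.g. sub-prompts per second); the remainder handling is our own illustrative choice.

```python
def allocate(num_subprompts: int, speeds: list[float]) -> list[int]:
    """Assign each pod a sub-prompt count proportional to its speed."""
    total = sum(speeds)
    counts = [int(num_subprompts * s / total) for s in speeds]
    # Rounding down leaves a few sub-prompts unassigned; give them
    # to the fastest pods first so no work is dropped.
    remainder = num_subprompts - sum(counts)
    for i in sorted(range(len(speeds)), key=lambda i: -speeds[i])[:remainder]:
        counts[i] += 1
    return counts
```

For example, with 10 sub-prompts and pods running at speeds 3.0 and 1.0, the fast pod receives 8 and the slow pod 2.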
Challenges we ran into
We ran into challenges obtaining the correct number of sub-prompts to distribute to each pod. We eventually realized that we weren't clearing the nodes between sessions, which left stale state behind and produced an incorrect sub-prompt count. We also hit rate limits because we sent too many API requests at once during testing.
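A common way to soften the rate-limit problem is to retry with exponential backoff. This is a hedged sketch, not the project's actual fix: `RateLimitError` stands in for whichever exception the provider's client library raises, and `call_model` is any function that makes one API request.

```python
import random
import time

class RateLimitError(Exception):
    """Placeholder for the provider client's rate-limit exception."""

def call_with_backoff(call_model, *args, max_retries=5, base_delay=1.0):
    """Retry call_model with exponential backoff plus random jitter."""
    for attempt in range(max_retries):
        try:
            return call_model(*args)
        except RateLimitError:
            # Wait roughly base_delay * 2**attempt seconds, jittered
            # so parallel workers don't all retry in lockstep.
            time.sleep(base_delay * (2 ** attempt + random.random()))
    raise RuntimeError("rate limit: retries exhausted")
```

With many pods firing requests in parallel, a shared client-side limiter (token bucket per provider key) would be the longer-term answer; backoff just keeps individual calls from failing outright.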
Accomplishments that we're proud of
We’re proud that we built a platform that significantly reduces the time to complete requests across multiple AI models. We pushed through debugging issues that could have derailed the project, and we proved that parallelizing LLM workloads can deliver real performance gains. Seeing our system scale across multiple models and pods felt like a strong validation of our idea.
What we learned
We learned how important orchestration and cleanup are when working with distributed systems. Even small oversights, like not resetting nodes between sessions, can create large downstream problems. We also learned how rate limits shape real-world performance, and how valuable efficient load distribution is when dealing with expensive model calls. Beyond the technical lessons, we learned how to divide work, debug as a team, and keep moving when a problem doesn’t have an obvious fix.
What's next for Squarenetes
Next, we want to expand automated load balancing so pods can scale dynamically based on workload. We plan to add support for more model providers and improve the front-end to give users clearer visibility into task distribution and performance. We also want to integrate caching and model-selection logic so Squarenetes can route each subprompt to the model that handles it best. Our long-term goal is to turn Squarenetes into a fully managed platform for fast, parallel LLM computation.
Built With
- claude
- docker
- fastapi
- gemini
- kubernetes
- openai
- openrouter
- python
