Inspiration

Most apps I build these days have some kind of AI integration, and I keep hitting the same wall: which model should I actually use? There are comparison tools baked into various platforms, but nothing standalone, nothing free, and nothing that easily lets me test with my own files and my own prompts. I just wanted a simple way to throw a real task at a bunch of models and see which one does it best for the money and the speed.

What it does

You pick your models, write a prompt (you can also upload your own files + schemas), and hit run. Every model streams its response side by side, in real time, colour-coded so you can tell them apart. When they finish, an AI judge reads all the responses blind and scores them on accuracy, clarity, and completeness. Scores accumulate on a leaderboard across runs. It tracks cost per response down to fractions of a cent, so you can actually make informed decisions about which model fits your budget and your use case.

How I built it

Next.js 16 with App Router, Tailwind CSS, and Framer Motion for the UI. OpenRouter as the LLM gateway, which gives one API key access to every provider. The streaming works through Server-Sent Events: the server opens a parallel fetch to each model, tags every chunk with its model ID, and multiplexes them down a single stream. Deployed on Cloudflare Workers. No database, everything persists in localStorage.
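The writeup doesn't specify the wire format, so here's a hedged sketch of how the per-chunk model tagging could look as SSE frames. The field names (`model`, `delta`) and function names are illustrative, not the project's actual format:

```typescript
// Hedged sketch of tagging chunks with a model ID inside SSE frames.
// The payload shape here is an assumption, not the project's real format.
function encodeEvent(model: string, delta: string): string {
  // An SSE frame is "data: <payload>\n\n"; the payload carries the model ID
  // so the client can route each token to the right pane.
  return `data: ${JSON.stringify({ model, delta })}\n\n`;
}

function decodeEvent(frame: string): { model: string; delta: string } | null {
  const line = frame.split("\n").find((l) => l.startsWith("data: "));
  return line ? JSON.parse(line.slice("data: ".length)) : null;
}
```

Because every frame is self-describing, the client can demultiplex chunks in whatever order they arrive.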

Challenges I ran into

SSE multiplexing was the real puzzle. One client request fans out to N models on the server, and all those chunks need to interleave back through a single stream without dropping tokens or mixing up model IDs. Getting that reliable took most of the build time.
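The fan-out/merge step described above could be sketched like this: each model's stream is read concurrently with one in-flight read per source, and whichever chunk resolves first is yielded with its model tag. This is a minimal sketch with assumed names, not the actual implementation:

```typescript
// Minimal fan-out/merge sketch (assumed shape, not the project's source):
// each model exposes an async chunk stream; whichever chunk resolves first
// is yielded with its model tag, so streams interleave in arrival order
// without losing track of which model produced which token.
async function* multiplex<T>(
  sources: Record<string, AsyncIterable<T>>
): AsyncGenerator<{ model: string; chunk: T }> {
  const iters = Object.entries(sources).map(([model, src]) => ({
    model,
    it: src[Symbol.asyncIterator](),
  }));
  // One pending read per source; Promise.race picks the next arrival.
  const pending = new Map<
    string,
    Promise<{ model: string; res: IteratorResult<T> }>
  >();
  for (const { model, it } of iters) {
    pending.set(model, it.next().then((res) => ({ model, res })));
  }
  while (pending.size > 0) {
    const { model, res } = await Promise.race(pending.values());
    if (res.done) {
      pending.delete(model); // this model finished streaming
    } else {
      const it = iters.find((x) => x.model === model)!.it;
      pending.set(model, it.next().then((r) => ({ model, res: r })));
      yield { model, chunk: res.value };
    }
  }
}
```

Keeping exactly one pending read per source is what prevents dropped tokens: a chunk is only consumed once its predecessor from the same model has been yielded.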

The other tricky bit was making the judge fair. Responses get shuffled before the judge sees them so position in the prompt doesn't bias scores, then the structured JSON scores get mapped back to the original model IDs.
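A minimal sketch of that shuffle-and-unblind idea (the labels and helper names are mine, not the project's):

```typescript
// Sketch of blind judging: shuffle responses, relabel them "Response A/B/...",
// then map the judge's structured scores back to the original model IDs.
interface ModelResponse {
  model: string;
  text: string;
}

function shuffleForJudge(
  responses: ModelResponse[],
  rand: () => number = Math.random
) {
  const order = responses
    .map((r) => ({ r, key: rand() }))
    .sort((a, b) => a.key - b.key)
    .map((x) => x.r);
  const labelToModel = new Map<string, string>();
  const prompt = order
    .map((r, i) => {
      const label = String.fromCharCode(65 + i); // "A", "B", "C", ...
      labelToModel.set(label, r.model);
      return `Response ${label}:\n${r.text}`;
    })
    .join("\n\n");
  return { prompt, labelToModel };
}

// The judge returns structured JSON like { "A": 8, "B": 6 }; map it back.
function unblind(
  scores: Record<string, number>,
  labelToModel: Map<string, string>
): Record<string, number> {
  const byModel: Record<string, number> = {};
  for (const [label, score] of Object.entries(scores)) {
    const model = labelToModel.get(label);
    if (model !== undefined) byModel[model] = score;
  }
  return byModel;
}
```

The judge only ever sees anonymous labels, so its scores can't be skewed by model names or by always seeing the same model first.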

Accomplishments that I'm proud of

The streaming race. Watching four models stream tokens at different speeds, with panes glowing in their model colours and latency counters ticking live, genuinely feels like a race. That moment when you hit run and everything lights up is exactly what I had in my head from the start.

Also, the whole thing deploys to Cloudflare Workers in under 2 MiB. No database, no auth, no sign-up. Paste a key and go.

What I learned

OpenRouter's OpenAI-compatible API made multi-provider support almost free. Switching between Claude, GPT, Gemini, and DeepSeek is just a model ID string. I also learned that Cloudflare Workers handles SSE streaming natively with the right compatibility flags, which saved a lot of pain.
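To illustrate the "just a model ID string" point, here's a hedged sketch of a request builder. The endpoint is OpenRouter's documented chat-completions URL; the helper name, return shape, and example model slugs are my own:

```typescript
// Hedged sketch: with an OpenAI-compatible gateway, switching providers
// reduces to changing one string. Helper name and shape are illustrative.
const OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions";

function buildRequest(model: string, prompt: string, apiKey: string) {
  return {
    url: OPENROUTER_URL,
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    // The same body shape works whether model is e.g. "openai/gpt-4o"
    // or "deepseek/deepseek-chat" -- only the ID changes.
    body: JSON.stringify({
      model,
      stream: true, // ask for an SSE token stream
      messages: [{ role: "user", content: prompt }],
    }),
  };
}
```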

What's next for Model Bench

Saved prompt templates so you can re-run the same comparison as new models drop. Side-by-side diff view for spotting where responses diverge. And support for image and audio model comparisons, not just text.

Built With

  • ai
  • cloudflareworkers
  • framermotion
  • next.js
  • openrouter
  • typescript