Arcada Labs

Evaluating Mobile App-Building Agents

As coding agents mature, evaluation has become a core bottleneck. Design Arena’s goal is to find the edges of model capability by evaluating agents on real-world application building, not toy tasks. We’ve seen models steadily improve in quality and reliability in our React/web arena. To keep pushing

Fullstack Arena Methodology

Real-World Full-Stack Web Application Benchmarking. This platform tests whether AI coding agents can autonomously build complete, production-grade full-stack web applications with real persistent databases. Each model receives a user prompt describing an application, and must independently design a database schema, seed it with realistic data, build Vercel serverless API routes,

Founding Launch Designer

What if you got to announce the moments that define a new era of intelligence? When Christopher broke Enigma, when Deep Blue beat Kasparov, when AlphaGo beat Lee Sedol—these weren't promotional moments. They were the world discovering what our machines could do, in large part because someone

Design Arena Methodology

Leaderboard. Rankings emerge from collective community preferences rather than curated opinions. Each pairwise comparison is weighted equally. The Elo score is approximated through the Bradley-Terry model, which estimates each model's inherent strength through an iterative algorithm that converges when strength estimates stabilize (threshold: 0.0001) or reaches
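As a rough illustration of the Bradley-Terry fit described above, here is a minimal sketch that assumes a standard minorization-maximization update over equally weighted (winner, loser) pairs. Only the 0.0001 stabilization threshold comes from the text. The function names, the iteration cap, the 1000-point baseline, and the 400-point log scale are assumptions made for the example, not Design Arena's actual implementation.

```python
import math
from collections import defaultdict


def bradley_terry(comparisons, tol=1e-4, max_iter=1000):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    tol mirrors the 0.0001 threshold in the text; max_iter is an assumed cap.
    """
    wins = defaultdict(float)         # total wins per model
    pair_counts = defaultdict(float)  # comparisons per unordered pair
    models = set()
    for winner, loser in comparisons:
        wins[winner] += 1.0           # every comparison weighted equally
        pair_counts[frozenset((winner, loser))] += 1.0
        models.update((winner, loser))
    if not models:
        return {}

    strengths = {m: 1.0 for m in models}
    for _ in range(max_iter):
        updated = {}
        for m in models:
            denom = sum(
                pair_counts.get(frozenset((m, other)), 0.0)
                / (strengths[m] + strengths[other])
                for other in models
                if other != m
            )
            updated[m] = wins[m] / denom if denom > 0 else strengths[m]
        # Normalize so the mean strength stays at 1 (the scale is arbitrary).
        total = sum(updated.values())
        updated = {m: s * len(models) / total for m, s in updated.items()}
        delta = max(abs(updated[m] - strengths[m]) for m in models)
        strengths = updated
        if delta < tol:               # strength estimates have stabilized
            break
    return strengths


def to_elo_scale(strengths, base=1000.0, scale=400.0):
    """Map strengths to an Elo-like scale (base and scale are assumptions)."""
    floor = 1e-12  # guards log10 for a model with zero wins
    return {m: base + scale * math.log10(max(s, floor))
            for m, s in strengths.items()}
```

For example, if model "a" beats model "b" three times and loses once, the fitted strengths settle at a 3:1 ratio, and `to_elo_scale` places "a" above "b".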

Agent Runner: A Standard, Open-Source Agent Harness for Evaluating Real-World Coding Agents

As LLM-based coding agents mature, evaluation has become a core bottleneck. Existing benchmarks largely rely on static tasks and capture only the model’s final output, not the intermediate reasoning steps or tool interactions that define modern agentic systems. To study these systems rigorously, we need instrumentation that reflects how

How Microsoft Research Asia used Design Arena to validate their aesthetic coding benchmark

We recently added AesCoder-4B to Design Arena: a new model trained by a team from Microsoft Research Asia. AesCoder-4B is comparatively tiny at just 4 billion parameters, but as of this writing, it is beating several flagship models in the Website rankings, including Gemini 2.5 Pro,

Introducing Micro Evals

An overview of our live skills assessment methodology. What are Micro Evals? The current bottleneck in model progress isn’t figuring out how to improve models—it’s knowing what needs improvement. Creating realistic coding tasks for training or RL environments is difficult when you don’t know where models

In pursuit of a benchmark for human taste

We know when we like something, but when asked why, we stumble over our words and hallucinate reasoning. We contradict ourselves—we don’t know why we like a thing, we just do. There exist people who understand our preferences better than we do ourselves. Let’s call them Tastemakers.

The prompt is not the picture

“When an artist uses a conceptual form of art, it means that all of the planning and decisions are made beforehand and the execution is a perfunctory affair. The idea becomes a machine that makes the art.” – Sol LeWitt, Paragraphs on Conceptual Art (1966)

Latest