Inspiration
Most AI demos are difficult to compare in any meaningful way. Different prompts, hidden system instructions, and cherry-picked outputs make it hard to tell whether a model actually performed better or just benefited from setup advantages. Results rarely come with replayable evidence or any way to audit how they were produced.
We built Norbel Arena to make AI vs. AI and human vs. AI evaluation transparent, deterministic, and competitive: infrastructure for fair benchmarking that strictly enforces the rules, clearly exposes outcomes, and lets anyone replay a match to verify what happened. We started with Codenames, then implemented Wavelength to show that this is not a one-off game implementation but a reusable framework for evaluating social reasoning across multiple environments.
What It Does
Norbel Arena is a state-based multi-agent competition platform that runs complete matches autonomously and produces structured, replayable results. It enforces legal moves, validates model outputs against strict JSON schemas, handles malformed or invalid responses gracefully, and records turn-by-turn events for later inspection. Every match produces a winner, a termination reason, and detailed statistics, all of which can be replayed through our interface.
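As a minimal sketch of what a structured, replayable match result might look like (the type and field names here are illustrative, not the platform's actual schema):

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class TurnEvent:
    # One replayable entry in the match log.
    turn: int
    role: str
    move: dict[str, Any]
    legal: bool

@dataclass
class MatchResult:
    winner: str
    termination_reason: str  # e.g. "all_words_found", "illegal_move"
    events: list[TurnEvent] = field(default_factory=list)

# Replaying a match is just iterating the recorded events in order.
result = MatchResult(winner="red", termination_reason="all_words_found")
result.events.append(
    TurnEvent(turn=1, role="spymaster",
              move={"clue": "ocean", "count": 2}, legal=True)
)
for ev in result.events:
    print(ev.turn, ev.role, ev.move)
```

Because every turn is recorded as data rather than rendered output, a replay interface can step through the same match arbitrarily many times and always show the same events.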
We currently support Codenames and Wavelength, two games that stress different aspects of social reasoning. Codenames tests hidden information and role asymmetry, where a spymaster has access to a key that operatives do not. The system ensures that only the correct role sees private information and that all moves conform to the rules of the game.
Wavelength adds multi-round estimation on a continuous spectrum. A “psychic” agent communicates a clue about a hidden position on that spectrum, and a “guesser” agent attempts to infer the position. This stresses calibration, communication clarity, and probabilistic reasoning across rounds.
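A proximity-based scoring function captures the flavor of this estimation task. The band widths below are assumptions for illustration, not the platform's actual scoring rules:

```python
def wavelength_score(target: float, guess: float) -> int:
    """Score a guess by its distance from the hidden target on a
    0-100 spectrum. Band widths are illustrative assumptions."""
    distance = abs(target - guess)
    if distance <= 5:
        return 4  # near-perfect read of the clue
    if distance <= 10:
        return 3
    if distance <= 15:
        return 2
    return 0  # too far off to score
```

Scoring by distance rather than exact match is what makes the game a calibration test: a well-calibrated guesser is rewarded for being consistently close, not for occasional lucky hits.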
All matches are exposed through a FastAPI backend and a React frontend that support live play, replay controls, transcripts, leaderboards, and persistent report cards with role-aware Elo tracking. The platform supports AI vs. AI competitions as well as human participation.
How We Built It
Under the hood, Norbel Arena is built around typed, extensible abstractions including Game, State, Move, Observation, Agent, MatchRunner, and Arena. This structure allows us to add new games without rewriting core infrastructure. Deterministic seeded game creation ensures that matches are reproducible. Partial observability is enforced at the state level so that each role only sees what it is allowed to see.
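A toy version of these abstractions can show how seeded determinism and role-filtered observations fit together. The class and field names below are a simplified sketch, not the platform's real interfaces:

```python
import random
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any

@dataclass
class Observation:
    role: str
    public: dict[str, Any]
    private: dict[str, Any]  # only what this role is allowed to see

class Game(ABC):
    @abstractmethod
    def initial_state(self, seed: int) -> dict: ...

    @abstractmethod
    def observe(self, state: dict, role: str) -> Observation: ...

class ToyHiddenInfoGame(Game):
    def initial_state(self, seed: int) -> dict:
        # A seeded RNG makes board generation reproducible: the same
        # seed always yields the same hidden key.
        rng = random.Random(seed)
        words = ["ocean", "wave", "spy", "code", "arena", "night"]
        return {"words": words, "key": rng.sample(words, 3)}

    def observe(self, state: dict, role: str) -> Observation:
        # Partial observability is enforced here: only the spymaster's
        # observation carries the hidden key.
        private = {"key": state["key"]} if role == "spymaster" else {}
        return Observation(role=role,
                           public={"words": state["words"]},
                           private=private)
```

Because agents only ever receive an `Observation`, information leakage is prevented structurally rather than by trusting each agent implementation.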
We designed strict JSON move contracts for LLM agents and implemented parsing and repair logic to handle imperfect model outputs without breaking game flow. The agent layer is provider-agnostic and supports OpenAI, Anthropic, Perplexity, local models, Nemotron variants, random agents, and human players. This flexibility allows side-by-side comparisons across providers under identical conditions.
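A simplified version of such a parse-and-repair ladder might look like the following (the real platform's logic may differ; this sketches the general approach of trying progressively more forgiving extractions before rejecting a move):

```python
import json
import re
from typing import Any, Optional

def parse_move(raw: str, required: set[str]) -> Optional[dict[str, Any]]:
    """Attempt to recover a valid JSON move from imperfect model output.
    Returns None if no candidate parses with the required keys."""
    candidates = [raw]
    # Models often wrap JSON in markdown code fences; strip them.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", raw, re.DOTALL)
    if fenced:
        candidates.append(fenced.group(1))
    # Fall back to the first {...} span embedded in surrounding prose.
    braced = re.search(r"\{.*\}", raw, re.DOTALL)
    if braced:
        candidates.append(braced.group(0))
    for text in candidates:
        try:
            move = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(move, dict) and required <= move.keys():
            return move
    return None  # caller treats this as an invalid move
```

The key design point is that repair failures degrade gracefully: a `None` result becomes an invalid-move event in the match log rather than a crash, so one malformed response never breaks the game flow.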
We also built persistent report cards that track role-specific Elo ratings, since performance can vary significantly depending on whether a model is acting as a clue giver, guesser, or estimator. The system includes robust failure handling for illegal moves, exceptions, and output validation errors. We validated the framework with a comprehensive test suite covering the engine, rules, API, provider integrations, and local model execution paths.
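Keying ratings by (model, role) instead of by model alone is a small change to the standard Elo update. A minimal sketch, where the K-factor and starting rating are assumptions rather than the platform's actual parameters:

```python
def expected(r_a: float, r_b: float) -> float:
    # Standard Elo expected score for a player rated r_a vs. r_b.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_role_elo(ratings: dict, model: str, role: str,
                    opponent_rating: float, score: float,
                    k: float = 32.0) -> None:
    """Update one (model, role) rating after a match.
    score is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
    Separate ratings per role let the same model rank differently
    as a clue giver, guesser, or estimator."""
    key = (model, role)
    rating = ratings.get(key, 1000.0)
    ratings[key] = rating + k * (score - expected(rating, opponent_rating))
```

With this structure, a leaderboard can surface that a model is strong as a spymaster but weak as an operative, which a single aggregate rating would hide.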
Technical Complexity
Although the user-facing experience is simple, the underlying system handles deterministic state transitions, strict schema enforcement, multi-provider LLM integration, replayable event logs, and role-aware ranking. Preventing hidden-information leakage while still giving agents enough context to reason correctly required careful design. Ensuring that LLM outputs conform to structured move schemas without constantly breaking gameplay required a layered validation and repair strategy.
Supporting both hosted APIs and local models introduced practical runtime and dependency constraints that we had to resolve within a tight time frame. Designing evaluation modes that isolate model quality by role required rethinking traditional Elo approaches to account for asymmetric gameplay.
Social Impact
As AI systems become more integrated into education, negotiation, customer service, and collaborative decision-making, we need better ways to evaluate how they reason socially and strategically. Many real-world applications involve partial information, role asymmetry, and communication under uncertainty. Hidden-information games provide a compact and controllable way to simulate those dynamics.
Norbel Arena provides infrastructure for transparent and reproducible benchmarking of these capabilities. Researchers can compare models fairly under identical conditions. Developers can identify failure modes in communication and coordination. Organizations can demand auditable evaluation before deploying multi-agent systems in sensitive contexts.
By focusing on replayability, determinism, and structured evaluation, we aim to raise the standard for how collaborative AI systems are tested and compared.
Accomplishments
In 36 hours, we designed and implemented a general multi-agent arena framework, shipped two fully integrated social-reasoning games, and delivered an end-to-end product that includes the core engine, API server, and interactive frontend. We built deterministic replayability into the system from the start, implemented role-specific Elo tracking, and created a provider-agnostic agent stack capable of supporting both hosted and local models. The system is backed by a comprehensive test suite to ensure stability and reliability.
What’s Next
We plan to expand Norbel Arena with additional cooperative and adversarial games that stress different reasoning capabilities. We also want to build large-scale tournament tooling, richer leaderboard analytics, deeper replay diagnostics, and standardized benchmark suites for longitudinal cross-model comparison.
Our long-term vision is to use Norbel Arena as infrastructure for safer, more accountable multi-agent AI systems that interact with humans in meaningful, high-stakes environments.