RiskyRag

Choose your battle!
devstral looking for intelligence through temporally filtered RAG snippets
A civil war between Mistral's devstral and Meta's llama-3.3-70b
Our solutions pushes models to the maximum; lightweight models can't handle our future.

Inspiration

The idea for RiskyRag came at 3:40 AM in Paris. I'd been building a RAG engine for the legal sector when it hit me: what if we could make LLMs forget the future?

Every LLM today knows how history ends. Ask GPT-4 about the Fall of Constantinople and it'll tell you the Ottomans won, even if you're roleplaying as Byzantine Emperor Constantine XI on May 28th, 1453. This "knowledge leak" breaks immersion and makes historical games with AI opponents feel hollow.

We wanted to build something where you could genuinely discover history alongside AI agents who don't know the outcome yet. A game where asking "Will I win this war?" gets an honest "I don't know" instead of a spoiler.

What it does

RiskyRag is a multiplayer strategy game (inspired by Risk) where humans compete against LLM agents using temporal RAG with a time filter. Pick a historical scenario (1453, 1861, or 2026), and every AI agent can only retrieve knowledge from before that date.

The result: AI opponents that reason about history the way people at the time actually did—with uncertainty, incomplete information, and no hindsight bias. You can ask the Ottoman AI about its past conquests, but it genuinely doesn't know what happens next.

Beyond gameplay, RiskyRag doubles as a benchmark for LLM reasoning. We pit different models against each other in historical scenarios and measure win rates, strategic depth, and how well they use limited historical context.

How we built it

Frontend: React + Vite + TailwindCSS with TanStack Router. Interactive map visualization with OpenLayers for historical accuracy.
Backend: Convex for real-time multiplayer sync—game state updates instantly across all players.
Temporal RAG Pipeline: Historical snippets are indexed with event dates. Every RAG query filters by eventDate <= currentGameDate before vector search. Simple but critical.
LLM Routing: OpenRouter for unified access to GPT-4, Claude, Gemini, and Llama models. Self-hosted vLLM option for local inference.
AI Agents: Tool-calling interface where agents can query game state, attack territories, negotiate with opponents, and ask historical questions, but only receive era-appropriate answers.
Game Mechanics: Full Risk implementation with card system, dice-based combat, diplomacy, and fortification rules.
Tournament System: Automated evaluation framework with 4-metric scoring (tool usage, strategic eagerness, outcome, decision quality.
Data Pipeline: Python + uv for scraping Wikipedia historical events, extracting dates, regions, and embeddings via Voyage AI.

Challenges we ran into

Team coordination across time zones. With teammates in Paris (GMT+1) and Pittsburgh (EST), our calls happened at midnight or later. One teammate missed his bus to the venue and couldn't make it, so we had to lock in and do three-man work with two people.

Temporal filtering edge cases. History doesn't have clean timestamps. When exactly did the "siege of Constantinople" become "knowledge"? We had to make judgment calls about publication dates vs. event dates.

Balancing historical accuracy with gameplay. The 2026 scenario required deep geopolitical research—knowing which territories Russia actually controls in Ukraine, Iran's proxy network status post-2025 strikes, and Trump's Greenland stance.

Keeping scope realistic. We initially planned three historical scenarios but ended up with 1453, 1861, and a complex 2026 modern scenario with 12 nations. Better to have three polished scenarios than five broken ones.

What we learned

Temporal RAG is surprisingly underexplored. We couldn't find prior work on knowledge cutoffs for game AI.
Convex makes real-time multiplayer almost trivially easy—most of our time went into game logic, not sync infrastructure.
OpenRouter simplified multi-model testing dramatically—switching from GPT-4 to Claude to Llama became a one-line config change.
AI coding tools (Claude Code) were essential for a small team building a full-stack app in 48 hours.