Project Delphi

Test Run on 4 Resolved Events
Test Run on 24 Resolved Events (default starter dataset pulled directly from the ai-prophet-datasets)

Inspiration

Going into Prophet Arena, I knew one thing for sure: treating an LLM like a magic 8-ball is a great way to absolutely tank your Brier score.

LLMs are notoriously overconfident and terrible at probability math. If your model blindly assigns a 95% probability to an outcome and misses, the quadratic penalty of the Brier score completely destroys your average. I realized I didn't need a "smarter" model; I needed an agent that actually understood its own limitations. I wanted to build a hyper-defensive forecaster that completely divorced reading comprehension from probability math. That led to Delphi: a hybrid architecture that uses an LLM purely for forensic text extraction (the Brain) and relies on hardcoded Python logic to place the actual bets (the Brawn).
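
To make the quadratic penalty concrete, here is a minimal sketch of a Brier score summed over both outcomes of a binary market. The two-outcome convention (range 0 to 2) is an assumption, picked because the penalties quoted in this write-up exceed 1; `brier_score` is an illustrative name, not a function from Delphi's codebase.

```python
def brier_score(forecast: list[float], outcome_index: int) -> float:
    """Sum of squared errors between a forecast distribution and the one-hot outcome."""
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(forecast)
    )


# Binary market: index 0 = "yes", index 1 = "no". Suppose "no" resolves,
# so every forecast below put its confidence on the wrong side.
for confidence in (0.95, 0.75, 0.60):
    penalty = brier_score([confidence, 1.0 - confidence], outcome_index=1)
    print(f"{confidence:.2f} on the wrong side -> {penalty:.4f}")
```

Under this convention a miss at 0.95 costs about 1.805, while a miss at 0.60 costs about 0.72, which is why capping confidence limits the damage of any single bad call.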

What it does

Delphi is a fully automated prediction market agent, but underneath, it operates more like a forensic accountant than a gambler. When you feed it a slate of events, whether that's an NBA game or a Senate confirmation vote, it doesn't just guess. Instead, Delphi:

Scrapes the live web to figure out if the event has already happened or is still pending.
Extracts live betting odds, polling data, or historical base rates.
Passes those findings through a hardcoded Python engine that mathematically bounds the probabilities to ensure the agent never takes catastrophic Brier score penalties.

How we built it

I built Delphi using FastAPI, Python, OpenRouter (gpt-4o-mini), and the Tavily Search API. The core of the project relies on a Two-Pass Search Architecture and a strict structural prompt. The architecture breaks down into three main pillars:

The Two-Pass Search: To stop the LLM from getting confused by temporal data, I split the web scraping. Pass 1 strictly hunts for historical, concrete facts to see if an event is over. Pass 2 actively hunts for "current odds, polling, and consensus" for future events. I merge these streams so the LLM sees both the settled facts and the forward-looking signals.
The Forensic JSON Engine: I completely banned the LLM from doing its own math. Instead, I forced it into a strict JSON schema. Before it's allowed to label an event as CONFIRMED, PROJECTION, or UNKNOWN, it first has to output a temporal_date_check and a vote_breakdown_check.
Deterministic Clamps: This is my safety net. Python intercepts the LLM's findings and applies rigid mathematical rules. If we have hard textual proof of a winner, it locks in 0.95. If we have live polling for a future event, it limits max confidence to 0.75 to account for volatility. If the LLM is completely flying blind, it falls back to historical base rates clamped at 0.60.
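
The clamp rules above can be sketched roughly as follows. This is a minimal illustration, not Delphi's actual engine: it assumes a binary market, assumes the cap is mirrored as a floor (so 0.75 also means "never below 0.25"), and uses hypothetical names (`Status`, `CAPS`, `clamp_probability`). Only the three labels and the 0.95 / 0.75 / 0.60 caps come from the description.

```python
from enum import Enum


class Status(str, Enum):
    CONFIRMED = "CONFIRMED"    # hard textual proof of a winner
    PROJECTION = "PROJECTION"  # live odds or polling for a future event
    UNKNOWN = "UNKNOWN"        # no usable evidence, base rate only


# Maximum confidence allowed per evidence level (values from the write-up).
CAPS = {Status.CONFIRMED: 0.95, Status.PROJECTION: 0.75, Status.UNKNOWN: 0.60}


def clamp_probability(status: Status, estimate: float) -> float:
    """Bound a raw 'yes' probability according to the evidence level."""
    cap = CAPS[status]
    if status is Status.CONFIRMED:
        # Lock in the cap on whichever side the evidence points to.
        return cap if estimate >= 0.5 else 1.0 - cap
    # Assumption: the cap applies symmetrically to both sides of the market.
    return min(max(estimate, 1.0 - cap), cap)


print(clamp_probability(Status.PROJECTION, 0.92))  # capped at 0.75
print(clamp_probability(Status.UNKNOWN, 0.55))     # already inside the band
```

Keeping the caps in one table means the LLM never sees or chooses a number; it only supplies the label and the evidence.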

Challenges we ran into

The "SCOTUS Trap": Supreme Court predictions completely broke my LLM. In Louisiana v. Callais, the ruling was 6-3, and the market asked how many justices voted in favor of Louisiana (the correct answer was 3, the dissenters). The LLM saw the text "6-3", grabbed the 6, slapped a 95% confidence on it, and handed me a devastating 1.8908 Brier penalty. I fixed this by forcing the model to explicitly write out a vote_tally_deduction in the JSON, manually mapping majority vs. minority counts before it picked a number.
Search Context Poisoning: When I initially optimized my search query to find live odds, it ruined my past events. Tavily started pulling up old Reddit fan theories for The Masked Singer instead of the actual finale results, making the LLM hallucinate the winner. Writing the asynchronous Two-Pass Search to cleanly separate past facts from future odds solved this temporal headache.
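
A rough sketch of how an asynchronous two-pass search can keep past facts and future signals apart is shown below. It deliberately does not call the Tavily client: `search` is an injected coroutine, `fake_search` is a stand-in, and the query suffixes and result keys are assumptions made for illustration.

```python
import asyncio
from typing import Awaitable, Callable

# Any coroutine that takes a query string and returns text snippets.
SearchFn = Callable[[str], Awaitable[list[str]]]


async def two_pass_search(question: str, search: SearchFn) -> dict[str, list[str]]:
    """Run the 'past facts' and 'future odds' queries concurrently, keeping them labeled."""
    past_query = f"{question} final result confirmed outcome"
    future_query = f"{question} current odds, polling, and consensus"
    past, future = await asyncio.gather(search(past_query), search(future_query))
    # The two streams stay under separate keys so the prompt can present
    # settled facts and forward-looking signals in distinct sections.
    return {"past_facts": past, "future_signals": future}


async def fake_search(query: str) -> list[str]:
    """Stand-in for a real search client; returns one canned snippet."""
    await asyncio.sleep(0)
    return [f"snippet for: {query}"]


context = asyncio.run(two_pass_search("Who won The Masked Singer finale?", fake_search))
print(context)
```

Because each pass has its own query and its own key in the result, speculative forum posts surfaced by the odds-oriented query are never mistaken for the finale result.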

Accomplishments that we're proud of

After tweaking the math clamps, my local calibration on resolved events settled at an average Brier score of 0.0766. Given that 0.35 is usually a highly competitive baseline in prediction markets, hitting the 0.07 range felt incredible.

During our final stress test on the Prophet Arena 1,200-event subset, I actually ran out of Tavily API credits halfway through. Instead of the server crashing or throwing 500 errors, Delphi caught the exception, realized it was blind, seamlessly fell back to its clamped 60/40 historical base rates, and completed the run.
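
The fallback behaviour can be illustrated with a small sketch. Everything named here is hypothetical (`SearchUnavailable`, `forecast_with_fallback`, the `research` callable); the only details taken from the write-up are that the exception is caught and that the forecast degrades to a base rate held within a 60/40 band.

```python
from typing import Callable

FALLBACK_CAP = 0.60  # blind forecasts stay within 0.40 to 0.60


class SearchUnavailable(Exception):
    """Hypothetical error raised when the search provider rejects a request."""


def forecast_with_fallback(
    question: str,
    base_rate: float,
    research: Callable[[str], float],
) -> tuple[str, float]:
    """Try the researched forecast; if search is down, fall back to a clamped base rate."""
    try:
        return "RESEARCHED", research(question)
    except SearchUnavailable:
        # No evidence available: treat the event as UNKNOWN and bound the base rate.
        clamped = min(max(base_rate, 1.0 - FALLBACK_CAP), FALLBACK_CAP)
        return "UNKNOWN", clamped


def out_of_credits(question: str) -> float:
    """Simulates a search provider whose quota has been exhausted."""
    raise SearchUnavailable("quota exhausted")


print(forecast_with_fallback("Will the bill pass?", 0.80, out_of_credits))
```

Catching a specific exception type, rather than every exception, keeps genuine bugs visible while still letting a quota failure degrade gracefully.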

What we learned

The biggest takeaway for me was that prompt engineering isn't about asking the LLM to "think carefully." It's about enforcing structural constraints. The second I forced the model to fill out a rigid, step-by-step forensic checklist before declaring an answer, the hallucination rate tanked.
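
One way to enforce a checklist like this outside the prompt is to validate the model's JSON before any number is produced. The sketch below is an assumption-heavy illustration: the `status` key name, the rule that every field must be a non-empty string, and the choice to downgrade incomplete reports to UNKNOWN are all mine, while `temporal_date_check`, `vote_breakdown_check`, and the three labels come from the description above.

```python
import json

REQUIRED_FIELDS = ("temporal_date_check", "vote_breakdown_check", "status")
ALLOWED_STATUSES = {"CONFIRMED", "PROJECTION", "UNKNOWN"}


def parse_forensic_report(raw: str) -> dict:
    """Return the parsed report, or a bare UNKNOWN report if the checklist is incomplete."""
    fallback = {"status": "UNKNOWN"}
    try:
        report = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(report, dict):
        return fallback
    # Every checklist field must be present and filled in with text.
    for field in REQUIRED_FIELDS:
        value = report.get(field)
        if not isinstance(value, str) or not value.strip():
            return fallback
    if report["status"] not in ALLOWED_STATUSES:
        return fallback
    return report


skipped_step = '{"status": "CONFIRMED", "temporal_date_check": "event date has passed"}'
print(parse_forensic_report(skipped_step))  # vote_breakdown_check missing -> UNKNOWN
```

A report that skips a step never reaches the high-confidence branch, so a shortcut by the model costs confidence rather than accuracy.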

What's next for Project Delphi

Right now, Delphi scrapes the web to estimate live odds. The immediate next step is wiring up direct API hooks to Polymarket, Kalshi, or DraftKings so it can ingest raw, structured market consensus data natively. I'm also really interested in building out a multi-agent debate system, having an open-source model like Llama-3 act as a "devil's advocate" to aggressively fact-check gpt-4o-mini before the Python engine finalizes the bet.

Built With

fastapi
gpt4o
python
uvicorn

Updates

Nasrullah Babar Mufti started this project — May 17, 2026 05:29 PM EDT
