Prophit (Profit + Prophet)
Grading LLMs on what actually happened.
Inspiration
Alpha Arena made headlines in late 2025 by giving six LLMs $10,000 each and watching them trade.
The problem is structural: perpetual-futures trading is dominated by execution noise (leverage choices, slippage, micro-timing), not reasoning quality. A model that got lucky on one leveraged position could outrank a model that was right about everything. And with no resolution event, no ground truth, and a single short season, you were mostly measuring variance.
We wanted to ask a harder question: which LLMs actually have good reasoning?
Prediction markets turned out to be the right environment. Every Polymarket contract resolves to a definitive yes or no. The crowd sets a probability baseline that reflects aggregated human judgment and real money. And the domain coverage — politics, science, sports, economics — lets you decompose a model's reasoning across the full breadth of what LLMs are supposed to know about the world.
The insight that unlocked Prophit was simple: Polymarket's resolution mechanism is a free, continuously-updating ground truth oracle. We didn't need to build a benchmark — it already existed. We just needed to run LLMs through it.
What it does
Prophit benchmarks LLMs as probabilistic reasoners using resolved Polymarket prediction markets as ground truth.
For each historical market, every model receives an identical context bundle assembled as of 48 hours before resolution: the market question, full resolution criteria, 7-day price history, bid-ask spread, related markets in the same category, and up to five news articles published before the snapshot date. No model sees anything that wasn't available at that moment in time.
Each model returns a probability estimate between 0 and 1, a confidence score, a structured reasoning trace, and a single key factor driving its estimate. These are scored against the actual outcome using Brier score — a proper scoring rule where a perfectly calibrated model scores 0.0 and random guessing scores 0.25.
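The scoring rule described above is a one-liner; a minimal sketch (the function name is ours, not from the codebase):

```python
def brier_score(estimate: float, outcome: int) -> float:
    """Squared error between a probability estimate and the 0/1 outcome.

    0.0 is a perfect forecast, 0.25 is what always guessing 0.5 earns,
    and 1.0 is maximal confidence in the wrong direction.
    """
    return (estimate - outcome) ** 2

# A model that said 0.9 on a market that resolved YES:
brier_score(0.9, 1)   # ≈ 0.01
# The same confidence on a market that resolved NO:
brier_score(0.9, 0)   # ≈ 0.81
```

Because the penalty is quadratic, confident misses dominate the average, which is what makes the score "proper": the best strategy is to report your true belief.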
The terminal surfaces three views:
Leaderboard — overall and domain-stratified Brier scores across all models, sortable by category. See which model is best at politics versus science versus sports, not just overall.
Disagreement Feed — markets ranked by the gap between LLM consensus and market odds, after controlling for identical information. When GPT-4o assigns 56% to an event the market priced at 22%, both having seen the same news, that gap is a signal about reasoning failure — not information asymmetry.
Calibration curves — reliability diagrams showing whether a model's stated confidence tracks reality. A model that says 80% should be right about 80% of the time. Most aren't.
How we built it
The stack is Python on the backend with FastAPI, SQLite for persistence, and a Next.js frontend with Recharts for visualization.
The Polymarket integration uses two public unauthenticated APIs. The Gamma API at gamma-api.polymarket.com provides resolved market metadata — question text, resolution criteria, category, volume, and outcome prices. The CLOB API at clob.polymarket.com provides price history via the /prices-history endpoint, queried with a one-week interval and hourly fidelity, giving us the full pre-resolution price trajectory for each market.
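The price-history query can be sketched as a parameter builder (per the CLOB docs, fidelity appears to be the sample resolution in minutes, so 60 yields hourly points; the helper name is ours):

```python
# Endpoints described above; both are public and unauthenticated.
GAMMA_URL = "https://gamma-api.polymarket.com/markets"
CLOB_HISTORY_URL = "https://clob.polymarket.com/prices-history"

def price_history_params(clob_token_id: str) -> dict:
    # interval=1w with fidelity=60 (minutes) gives the final
    # pre-resolution week of trading at hourly resolution.
    return {
        "market": clob_token_id,
        "interval": "1w",
        "fidelity": 60,
    }

# Usage (requires network access):
# import requests
# resp = requests.get(CLOB_HISTORY_URL, params=price_history_params(token_id))
# history = resp.json()["history"]
```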
Outcome determination reads directly from outcomePrices after JSON parsing — a resolved YES market has outcomePrices[0] >= 0.99, a resolved NO has outcomePrices[0] <= 0.01. Markets that settled ambiguously are discarded.
The snapshot moment is fixed at 48 hours before closedTime for every market. This is the temporal anchor for all context assembly — price history is truncated here, and news articles are filtered to only those published before this timestamp. Lookahead bias prevention is enforced at the article level: any article without a parseable publication date is discarded rather than included.
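The temporal anchor can be sketched as follows (assuming price points carry a unix timestamp field `t`, as the CLOB history format suggests; names are ours):

```python
from datetime import datetime, timedelta, timezone

SNAPSHOT_OFFSET = timedelta(hours=48)

def snapshot_time(closed_time: datetime) -> datetime:
    """The single anchor for all context assembly: 48h before resolution."""
    return closed_time - SNAPSHOT_OFFSET

def truncate_history(history: list[dict], snapshot: datetime) -> list[dict]:
    """Keep only price points at or before the snapshot moment."""
    cutoff = snapshot.timestamp()
    return [point for point in history if point["t"] <= cutoff]
```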
News context is fetched via the Exa API using the market question as the search query, with endPublishedDate set to the snapshot timestamp. We take the top five results by relevance, truncating each to 400 characters to stay within context limits across all models.
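The news request can be sketched as a body builder plus the truncation rule (field names follow the Exa search docs as we understand them; treat this as a sketch, not a verified client):

```python
SNIPPET_LIMIT = 400  # characters per article, to fit every model's context budget

def exa_search_body(question: str, snapshot_iso: str) -> dict:
    """Request body for Exa search, capped at the snapshot timestamp."""
    return {
        "query": question,
        "numResults": 5,
        "endPublishedDate": snapshot_iso,  # nothing published after the snapshot
    }

def clip(text: str) -> str:
    """Truncate an article snippet before it enters the context bundle."""
    return text[:SNIPPET_LIMIT]
```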
LLM inference runs all three models in parallel using asyncio.gather. Every model receives an identical system prompt defining the superforecaster role and requiring JSON-only output, and an identical user prompt populated from the context bundle. Temperature is set to 0.1 across all models for consistency. Responses are parsed and validated, with raw outputs stored for debugging. Batches of ten markets are processed at a time with a five-second cooldown between batches to respect rate limits.
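The shape of that loop, with the provider call stubbed out (all names here are illustrative, not the project's actual code):

```python
import asyncio

async def query_model(model: str, system: str, user: str) -> dict:
    """Stand-in for one provider call (temperature=0.1, JSON-only prompt)."""
    await asyncio.sleep(0)  # placeholder for the HTTP round trip
    return {"model": model, "probability": 0.5}

async def run_market(models: list[str], system: str, user: str) -> list[dict]:
    # Every model sees the identical prompt pair; calls run concurrently.
    return await asyncio.gather(*(query_model(m, system, user) for m in models))

async def run_batches(markets: list[dict], models: list[str],
                      batch_size: int = 10, cooldown: float = 5.0) -> list:
    results = []
    for i in range(0, len(markets), batch_size):
        for market in markets[i:i + batch_size]:
            results.append(await run_market(models, "system prompt", market["prompt"]))
        if i + batch_size < len(markets):
            await asyncio.sleep(cooldown)  # rate-limit cooldown between batches
    return results
```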
Brier scores are computed immediately after estimates are stored. Calibration bins are precomputed across ten probability buckets for each model and category combination, enabling fast frontend queries without runtime aggregation.
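The bin precompute can be sketched like this (ten equal-width buckets over (estimate, outcome) pairs; names are ours):

```python
def calibration_bins(records: list[tuple[float, int]], n_bins: int = 10) -> list[dict]:
    """Bucket (estimate, outcome) pairs into equal-width probability bins.

    Each bin reports mean predicted probability vs. observed frequency,
    the two axes of a reliability diagram.
    """
    bins = [{"lo": i / n_bins, "hi": (i + 1) / n_bins, "preds": [], "hits": []}
            for i in range(n_bins)]
    for p, outcome in records:
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 lands in the top bin
        bins[idx]["preds"].append(p)
        bins[idx]["hits"].append(outcome)
    return [
        {
            "bin": f'{b["lo"]:.1f}-{b["hi"]:.1f}',
            "mean_pred": sum(b["preds"]) / len(b["preds"]),
            "observed": sum(b["hits"]) / len(b["hits"]),
            "n": len(b["preds"]),
        }
        for b in bins if b["preds"]
    ]
```

Storing these precomputed rows per model and category is what keeps the frontend queries free of runtime aggregation.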
Challenges we ran into
Lookahead bias is subtle and persistent. The obvious fix — filtering news by publication date — turned out to be incomplete. Some articles are backdated. Some aggregators republish old content with new timestamps. Some news APIs don't surface publication dates at all. We implemented a strict discard policy for any article where the date was missing or unparseable, and added a news_quality flag to every context bundle so we could audit which markets had thin context.
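The strict discard policy plus the audit flag can be sketched as (field and function names are ours; ISO-8601 dates assumed):

```python
from datetime import datetime, timezone

def keep_article(article: dict, snapshot: datetime) -> bool:
    """Strict discard policy: no parseable pre-snapshot date, no article."""
    raw = article.get("publishedDate")
    if not raw:
        return False  # missing date -> discard rather than guess
    try:
        published = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    except ValueError:
        return False  # unparseable date -> discard
    if published.tzinfo is None:
        published = published.replace(tzinfo=timezone.utc)  # assume UTC if untagged
    return published <= snapshot

def news_quality(kept: list) -> str:
    # Coarse audit flag stored on each context bundle.
    return "rich" if len(kept) >= 3 else "thin" if kept else "none"
```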
outcomePrices and clobTokenIds come back as stringified JSON, and this isn't documented prominently. Both fields look like lists in the API response but arrive as strings, so indexing them directly fails silently. One json.loads() call fixes it, but this cost us an hour of confusing null results early on.
LLM JSON compliance degrades under pressure. At low temperature, models occasionally wrap their JSON in markdown fences, prepend a sentence of preamble, or return malformed probability values like "~0.7" or "70%" instead of 0.7. We wrote a cleaning layer that strips fences, attempts multiple parse strategies, and validates the probability field before storing. Models that fail all strategies get a null estimate and an error logged rather than a bad number silently entering the scoring.
The Brier score leaderboard is unintuitive at first glance. Lower is better, which is backwards from every other leaderboard people are used to seeing. We spent real time on the UI labeling and color coding to make this clear — green for low scores, red for high, with explicit "lower = better" annotations — because getting this wrong in a demo kills the narrative.
Market selection matters enormously. Early runs included thin markets with under $5,000 volume where a few trades could move the price 20 points. These produced noisy context bundles and unrepresentative scores. We settled on a $10,000 volume floor, a maximum spread of 10%, and a snapshot price between 5% and 95% — filtering out markets already at near-certainty where there's nothing to reason about.
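The final eligibility predicate is small; a sketch with the thresholds from above (field names are ours):

```python
MIN_VOLUME = 10_000         # USD floor; thin markets produce noisy price signals
MAX_SPREAD = 0.10           # bid-ask spread ceiling
PRICE_BAND = (0.05, 0.95)   # skip near-certainties with nothing to reason about

def eligible(market: dict) -> bool:
    """Volume, spread, and snapshot-price filters applied to every candidate."""
    lo, hi = PRICE_BAND
    return (
        market["volume"] >= MIN_VOLUME
        and market["spread"] <= MAX_SPREAD
        and lo <= market["snapshot_price"] <= hi
    )
```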
Accomplishments that we're proud of
Running a clean backtest on 200+ resolved markets with genuine lookahead bias prevention is harder than it sounds, and we got it right. Every estimate in the database reflects only information that was available at the snapshot moment. The benchmark is honest.
The disagreement feed turned out to be more interesting than we expected. Seeing the specific markets where GPT-4o systematically diverged from crowd consensus — and then checking whether the crowd or the model was right — reveals patterns that aggregate Brier scores hide. GPT-4o overestimates on economics markets specifically. Claude underestimates on sports markets involving recent trades. These are findings, not just leaderboard positions.
The calibration curves are the thing we're most proud of technically. Most LLM evaluations report accuracy on a benchmark. Calibration curves show you something deeper — whether a model knows what it doesn't know. A model that says 90% and is right 60% of the time is worse in practice than its accuracy score suggests, because you'd overtrade on its confidence. Prophit surfaces this directly.
What we learned
The market price is an incredibly strong prior. In the majority of markets, the model closest to the market price at the snapshot moment posts the best Brier score. This isn't surprising (it's exactly what efficient-market theory predicts), but it has a practical implication: a model that simply parrots the crowd will look well-calibrated in aggregate. The interesting signal is in the residuals: the markets where a model diverges from the crowd and is right.
News context quality drives estimate quality more than model capability in most categories. When a market had rich pre-resolution news coverage, all three models produced reasonable estimates. When coverage was thin — obscure local elections, niche science questions — estimates scattered widely regardless of model. This suggests that for near-future events, information availability is the binding constraint, not reasoning capability.
Brier score aggregates hide domain-specific strengths. Overall leaderboard rankings are nearly meaningless — the right question is which model to trust for which category of question, and that answer differs substantially across politics, science, sports, and economics.
What's next for Prophit
The natural extension is live inference on active markets — running the same pipeline on unresolved markets and watching scores update as markets close over time. This turns Prophit from a historical benchmark into a continuously-updating evaluation that gets richer with every resolution event. The disagreement feed on live markets becomes a genuinely useful signal: markets where LLM consensus diverges from crowd odds are worth paying attention to.
The deeper research question is whether the disagreement feed has predictive value. When a well-calibrated model disagrees with the market, does the market tend to move toward the model's estimate over the following hours? We have the infrastructure to test this now.
Domain expansion is straightforward given the architecture — sports markets via the Polymarket Sports API, a broader set of models including Llama and Grok, and finer-grained category decomposition within politics and science.
The long-horizon version of this project is a living leaderboard that updates continuously as new models ship and new markets resolve — the benchmark that Alpha Arena tried to be, built on ground truth that actually resolves.
Powered by Polymarket API, Tavily
