Team Brier Patch entry for the Prophet Hacks AI forecasting hackathon. Forecasting Track. Build window May 16-17, 2026; live evaluation through May 31 on Prophet Arena.
The agent is deployed on Google Cloud Run and registered with the Prophet Arena eval server:
POST https://prophet-hacks-agent-651046060481.us-central1.run.app/predictGET https://prophet-hacks-agent-651046060481.us-central1.run.app/health
Accepts one event per request, returns the probability distribution across that event's outcomes. See "Agent contract" below.
- Python 3.11+
ai-prophet-core— SDK and API clientai-prophet—prophetCLI (retrieve / predict / evaluate)- FastAPI + uvicorn for the live
/predictendpoint - Anthropic Claude + OpenAI + Google Gemini for the cross-vendor LLM ensemble
- Polymarket (gamma-api) and ESPN scoreboard for market-anchored priors
- NWS forecasts (weather) and yfinance (crypto) for typed external priors
For binary events:
- Fetch the Kalshi market by
market_ticker. - Tail-anchor triage: confident liquid market (vol >= $500, price outside [0.05, 0.95]) returns directly with a 3% safety shrink.
- Cross-venue agreement gate (politics / world / company / etc.):
- In the safe band ([0.20, 0.80] with Kalshi vol >= $10k): fetch Polymarket and only blend when |kalshi - poly| > 0.03. When the venues agree, there's no signal in the cross-reference.
- Outside the safe band: always volume-weighted-blend with Kalshi.
- Volume-weighted shrinkage toward 0.5.
- If no market signal, category priors (NWS for weather, yfinance for crypto, ESPN for sports, Manifold elsewhere).
- Final fallback: cross-vendor LLM ensemble (Claude Opus
extended-thinking + GPT-5-mini + Gemini 2.5 Flash) with shared web
search (Anthropic anchors search; OpenAI + Gemini receive its
findings as
search_context). Median aggregation. Three-tier tail-aware shrinkage:decisive(alpha=0.02 when rationale describes a resolved outcome),grounded(alpha=0.05 when it cites current data),speculative(alpha=0.15 base-rate reasoning), with extra alpha at |p - 0.5| > 0.40 and a hard cap at alpha=0.50. If the whole ensemble fails, retry once with web search disabled before falling to uniform 0.5. - Market sanity guardrail: if final p deviates >0.30 from a deep liquid Kalshi mid, anchor 60/40 toward market.
- Path-stratified calibration when a fitted table is present (GCS,
refit daily during eval, ±0.05 shift cap). The pipeline branch is
stamped at the producer (
tail-anchor,kalshi-anchor,kalshi+poly-blend,guardrail-anchored,prior,llm-{decisive,grounded,speculative}, etc.) rather than re-derived from the rationale text, so composed rationales don't corrupt stratification. Per-bucket yes-rates are Beta-Bernoulli shrunk toward the bucket'smean_pwith prior strengthN_0=10so small-N buckets behave sensibly (min_n=3).
For multi-outcome events (3+ outcomes):
- Kalshi event lookup via
/trade-api/v2/events/{event_ticker} ?with_nested_markets=true. Requiresmutually_exclusive=true, maps each child market to one of the event's outcomes, requires=60% coverage.
- Polymarket event lookup with the same coverage contract.
- Capped volume-weighted blend of the two (KALSHI_POLY_MAX_WEIGHT = 0.75) when both are available; otherwise use whichever returned.
- LLM ensemble with explicit
p_yes = P(outcomes[0])framing and two worked top-K examples in the prompt; retry without web search on total failure. - Uniform 1/N as the final fallback.
All paths produce a {probabilities: [...]} response strictly summing
to 1, with market names matching the event's outcomes list exactly.
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
cp .env.example .env # fill in API keys (see Environment below)# Pull a slate of sample events (no API key needed for dataset retrieval)
.venv/bin/prophet forecast retrieve --dataset sample-sports -o events.json
# Run our agent against those events via the prophet CLI
.venv/bin/prophet forecast predict \
--events events.json \
--local agent.predict \
-o submission.json
cat submission.jsonuvicorn agent.predict:app --host 0.0.0.0 --port 8000
# In another shell, exercise the endpoint via prophet CLI:
.venv/bin/prophet forecast predict \
--events events.json \
--agent-url http://localhost:8000/predict
# Or hit /predict directly with curl:
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d @events.jsonOne-shot script (recommended): runs a clean end-to-end evaluation against a sample event slate.
git clone https://github.com/BenRossJenkins/prophet-hacks-agent.git
cd prophet-hacks-agent
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-proj-...
export GEMINI_API_KEY=AIza...
bash scripts/evaluate_agent.shscripts/evaluate_agent.sh handles Python version checks, virtualenv
setup, dependency installation, dataset retrieval, prediction, and a
human-readable summary of outputs. Approximate cost is $0.25 for the
default 3-event smoke test; pass EVAL_EVENT_COUNT= (empty) to run the
full sample slate.
Manual setup, if you prefer step-by-step:
python3.11 -m venv .venv
source .venv/bin/activate
pip install -e .
export ANTHROPIC_API_KEY=... OPENAI_API_KEY=... GEMINI_API_KEY=...
# Option A: run the agent as a local module via the official prophet CLI
prophet forecast retrieve --dataset sample-sports -o events.json
prophet forecast predict --events events.json --local agent.predict -o predictions.json
# Option B: run as an HTTP server (matches the deployed Cloud Run setup)
uvicorn agent.predict:app --host 0.0.0.0 --port 8000
prophet forecast predict --events events.json --agent-url http://localhost:8000/predictThe container image is also available — docker build -t agent . then
docker run -p 8000:8000 -e ANTHROPIC_API_KEY=... agent reproduces the
Cloud Run deployment locally.
Completion rate is multiplicative on the final score, so a hung or crashed request hurts as much as a wrong prediction. The agent is engineered so no single failure can break a /predict response:
- Ensemble hard deadline. Each call to
llm_forecast_ensembleis capped at 8 minutes (well below the 10-min per-event budget). Vendors still running at the deadline are abandoned and the median of whatever responses arrived is returned. A single hung LLM vendor cannot consume the whole budget. - Retry-without-search on total failure. When the whole ensemble returns None (all 3 vendors failed simultaneously, typically a search tool rate-limit), we retry once with web search disabled. Base chat-completion APIs rate-limit independently of search tools, so this converts a 0.5-fallback into a real prediction roughly 50% of the time in stress tests. The eval server does NOT retry timed-out requests on our behalf, so this is on us.
- Multi-tier market fallback chain. When the LLM stack is
unreachable, the agent falls through tail-anchor → Polymarket blend →
category prior (NWS / yfinance / ESPN / Manifold) → uniform 0.5. Every
external call is wrapped to return
Noneon failure rather than raise, so any tier can fail without blocking the next. - Probabilities-only response contract is enforced at exactly one
place. Every code path produces its final distribution via
_wrap_binary(binary events) or_normalize_distribution(multi-outcome), so the wire response is always well-formed JSON with probabilities summing to 1 and markets matching the event's outcomes. - Bounded calibration shift. Any single calibration adjustment is
capped at ±0.05 from the raw prediction so a noisy small-N bucket
cannot yank a confident forecast off track. Per-bucket yes-rates
are additionally Beta-Bernoulli shrunk toward the bucket's mean
prediction so a 3-event "all yes" bucket doesn't output
1.0. - Diff-sanity guard on calibration publishes. The daily refit
loads the previously-published table from GCS, compares each new
small-N bucket against its predecessor, and refuses to publish
(exit 3) when any small-N bucket moved by more than 0.20. Operators
override with
--skip-diff-sanityonly after inspecting the data. - GCS-mirrored prediction log. Every prediction is also written as a
per-event JSON object to GCS so the daily calibration refit job has a
durable read source — predictions survive Cloud Run container
restarts and post-hoc audit is possible. The log records the
producing pipeline branch and the agent
versionfor each entry so post-eval analysis can attribute Brier deltas to specific versions.
The calibration table is refit nightly from resolved questions over the eval window; this is a parameter update from observed data, not a mid-eval code change. Architecture, prompts, and pipeline structure are frozen at submission.
The /predict endpoint matches the Prophet Arena spec
(https://prophetarena.co/developer):
Input (one event per request):
{
"event_ticker": "...",
"market_ticker": "...",
"title": "Who will win: A or B?",
"category": "Sports",
"rules": "Resolves to the winner.",
"close_time": "2026-05-25T23:59:59Z",
"outcomes": ["A", "B"]
}Output:
{
"probabilities": [
{"market": "A", "probability": 0.62},
{"market": "B", "probability": 0.38}
]
}- Probabilities sum to 1 across the event's outcomes
- Each
marketmatches one of the event'soutcomesexactly p_yesandrationaleare included as extra fields for our own logging; the eval server ignores them
Brier scoring (lower is better) per the developer docs. Each outcome's squared error contributes to the per-event Brier.
agent/ forecasting logic + FastAPI app
predict.py pipeline orchestration, /predict, /health
kalshi.py Kalshi market client
polymarket.py Polymarket binary + multi-outcome lookup
sports.py ESPN moneyline-derived prior
weather.py NWS-backed prior
financials.py yfinance-backed crypto prior
manifold.py Manifold fallback prior
llm.py multi-vendor ensemble with shared web search
calibrate.py path-stratified calibration (GCS-loaded)
prediction_log.py defensive append-only log (GCS-mirrored)
scripts/ operational utilities
daily_calibration.sh daily cron wrapper (resolve + refit + push)
resolve_predictions.py marks resolved predictions from PA + Kalshi
fit_calibration.py fits path-stratified calibration table
backtest.py local Brier evaluation against a fixture
capture_live_snapshots.py / resolve_captures.py
daily capture flow for ongoing fixture growth
build_backtest_fixture.py / build_diverse_fixture.py
reproducible candlestick-based fixtures
tests/ unit tests; tests/fixtures/ for backtest data
320 tests under pytest tests/.
See .env.example. Required for the full pipeline:
ANTHROPIC_API_KEY,OPENAI_API_KEY,GEMINI_API_KEY— LLM vendorsPA_SERVER_URL=https://api.aiprophet.dev— Prophet Arena basePA_SERVER_API_KEY=prophet_...— issued by organizers, X-API-Key headerPA_TEAM_NAME=Brier Patch— registered team name
Optional:
PREDICTION_LOG_PATH(defaultdata/predictions.jsonl)PREDICTION_LOG_GCS_PREFIX=gs://...— mirror every prediction to GCSCALIBRATION_GCS_URI=gs://.../calibration.json— daily refit sourceFORECAST_MODEL=claude-opus-4-7— override the single-model path
See CLAUDE.md for the in-source architecture summary, defensive-
degradation rules, the path-stratified calibration design, and the
running gotchas list (Kalshi DNS, Polymarket DNS, the empty
LLM-denylist policy, etc.).