Prophet Hacks 2026: Forecasting Agent

Team Brier Patch entry for the Prophet Hacks AI forecasting hackathon. Forecasting Track. Build window May 16-17, 2026; live evaluation through May 31 on Prophet Arena.

Team

Live endpoint

The agent is deployed on Google Cloud Run and registered with the Prophet Arena eval server:

POST https://prophet-hacks-agent-651046060481.us-central1.run.app/predict
GET https://prophet-hacks-agent-651046060481.us-central1.run.app/health

The endpoint accepts one event per request and returns a probability distribution across that event's outcomes. See "Agent contract" below.

Stack

Python 3.11+
ai-prophet-core — SDK and API client
ai-prophet — prophet CLI (retrieve / predict / evaluate)
FastAPI + uvicorn for the live /predict endpoint
Anthropic Claude + OpenAI + Google Gemini for the cross-vendor LLM ensemble
Polymarket (gamma-api) and ESPN scoreboard for market-anchored priors
NWS forecasts (weather) and yfinance (crypto) for typed external priors

Pipeline

For binary events:

Fetch the Kalshi market by market_ticker.
Tail-anchor triage: confident liquid market (vol >= $500, price outside [0.05, 0.95]) returns directly with a 3% safety shrink.
Cross-venue agreement gate (politics / world / company / etc.):
- In the safe band ([0.20, 0.80] with Kalshi vol >= $10k): fetch Polymarket and only blend when |kalshi - poly| > 0.03. When the venues agree, there's no signal in the cross-reference.
- Outside the safe band: always volume-weighted-blend Polymarket with Kalshi.
Volume-weighted shrinkage toward 0.5.
If no market signal, category priors (NWS for weather, yfinance for crypto, ESPN for sports, Manifold elsewhere).
Final fallback: cross-vendor LLM ensemble (Claude Opus extended-thinking + GPT-5-mini + Gemini 2.5 Flash) with shared web search (Anthropic anchors search; OpenAI + Gemini receive its findings as search_context). Median aggregation. Three-tier tail-aware shrinkage: decisive (alpha=0.02 when rationale describes a resolved outcome), grounded (alpha=0.05 when it cites current data), speculative (alpha=0.15 base-rate reasoning), with extra alpha at |p - 0.5| > 0.40 and a hard cap at alpha=0.50. If the whole ensemble fails, retry once with web search disabled before falling to uniform 0.5.
Market sanity guardrail: if final p deviates >0.30 from a deep liquid Kalshi mid, anchor 60/40 toward market.
Path-stratified calibration when a fitted table is present (GCS, refit daily during eval, ±0.05 shift cap). The pipeline branch is stamped at the producer (tail-anchor, kalshi-anchor, kalshi+poly-blend, guardrail-anchored, prior, llm-{decisive,grounded,speculative}, etc.) rather than re-derived from the rationale text, so composed rationales don't corrupt stratification. Per-bucket yes-rates are Beta-Bernoulli shrunk toward the bucket's mean_p with prior strength N_0=10 so small-N buckets behave sensibly (min_n=3).
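
The tiered shrinkage and the market guardrail from the steps above can be sketched as follows. This is a minimal illustration, not the code in agent/: the function names are hypothetical, the shrink is assumed to be a linear pull toward 0.5, the size of the extra tail alpha (0.05 here) is not stated in this README, and "60/40 toward market" is read as 60% market, 40% model.

```python
# Alphas per tier, as listed in the pipeline description.
TIER_ALPHA = {"decisive": 0.02, "grounded": 0.05, "speculative": 0.15}
TAIL_THRESHOLD = 0.40     # extra shrink applies when |p - 0.5| > 0.40
ALPHA_CAP = 0.50          # hard cap on the total alpha
EXTRA_TAIL_ALPHA = 0.05   # ASSUMPTION: the README does not give this value


def shrink_toward_half(p: float, alpha: float) -> float:
    """Linear shrink: alpha=0 leaves p unchanged, alpha=1 returns 0.5."""
    return (1.0 - alpha) * p + alpha * 0.5


def tiered_shrink(p: float, tier: str) -> float:
    """Apply the tier's alpha, add the tail penalty, then enforce the cap."""
    alpha = TIER_ALPHA[tier]
    if abs(p - 0.5) > TAIL_THRESHOLD:
        alpha += EXTRA_TAIL_ALPHA
    return shrink_toward_half(p, min(alpha, ALPHA_CAP))


def market_guardrail(p: float, market_mid: float,
                     max_dev: float = 0.30,
                     market_weight: float = 0.60) -> float:
    """Pull p toward the market mid only when the two disagree strongly."""
    if abs(p - market_mid) > max_dev:
        # ASSUMPTION: 60/40 means 60% market, 40% model.
        return market_weight * market_mid + (1.0 - market_weight) * p
    return p
```

Under these assumptions a grounded forecast of 0.80 moves to 0.785, and a forecast of 0.90 against a market mid of 0.50 is anchored to 0.66.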

For multi-outcome events (3+ outcomes):

Kalshi event lookup via /trade-api/v2/events/{event_ticker}?with_nested_markets=true. Requires mutually_exclusive=true, maps each child market to one of the event's outcomes, and requires >=60% coverage.
Polymarket event lookup with the same coverage contract.
Capped volume-weighted blend of the two (KALSHI_POLY_MAX_WEIGHT = 0.75) when both are available; otherwise use whichever returned.
LLM ensemble with explicit p_yes = P(outcomes[0]) framing and two worked top-K examples in the prompt; retry without web search on total failure.
Uniform 1/N as the final fallback.
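
One way to read the capped volume-weighted blend in the list above is sketched below. Only the constant KALSHI_POLY_MAX_WEIGHT = 0.75 comes from this README; the function name, the argument layout, and the interpretation that neither venue's weight may exceed the cap are assumptions made for illustration.

```python
from typing import List, Optional

KALSHI_POLY_MAX_WEIGHT = 0.75  # from the pipeline description


def capped_blend(kalshi: Optional[List[float]],
                 poly: Optional[List[float]],
                 kalshi_vol: float,
                 poly_vol: float,
                 max_weight: float = KALSHI_POLY_MAX_WEIGHT) -> Optional[List[float]]:
    """Blend two per-outcome distributions by volume, capping either venue's weight."""
    # Use whichever venue returned when only one is available.
    if kalshi is None and poly is None:
        return None
    if kalshi is None:
        return list(poly)
    if poly is None:
        return list(kalshi)
    if len(kalshi) != len(poly):
        raise ValueError("distributions must cover the same outcomes")

    total_vol = kalshi_vol + poly_vol
    w = 0.5 if total_vol <= 0 else kalshi_vol / total_vol
    # ASSUMPTION: the cap is symmetric, so the Kalshi weight stays in [0.25, 0.75].
    w = min(max(w, 1.0 - max_weight), max_weight)

    blended = [w * k + (1.0 - w) * q for k, q in zip(kalshi, poly)]
    total = sum(blended)
    if total <= 0:
        return [1.0 / len(blended)] * len(blended)
    return [b / total for b in blended]
```

With volumes of 9000 and 1000 the raw Kalshi weight would be 0.9; the cap reduces it to 0.75, so a thin venue still contributes a quarter of the blend.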

All paths produce a {probabilities: [...]} response strictly summing to 1, with market names matching the event's outcomes list exactly.
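
A minimal sketch of that final normalization step is shown below. It is not the repository's _normalize_distribution; the name normalize_distribution_sketch and the choice to treat missing, negative, or non-finite values as zero are assumptions for illustration.

```python
import math
from typing import Dict, List


def normalize_distribution_sketch(outcomes: List[str],
                                  raw: Dict[str, float]) -> dict:
    """Build a response whose markets match `outcomes` exactly and whose
    probabilities sum to 1 (up to floating-point rounding)."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("event has no outcomes")

    cleaned = []
    for name in outcomes:
        value = raw.get(name)
        usable = (isinstance(value, (int, float))
                  and math.isfinite(value) and value > 0)
        # ASSUMPTION: anything unusable contributes zero mass.
        cleaned.append(float(value) if usable else 0.0)

    total = sum(cleaned)
    # Fall back to uniform 1/N when no usable mass remains.
    probs = [1.0 / n] * n if total <= 0 else [c / total for c in cleaned]
    return {"probabilities": [{"market": m, "probability": p}
                              for m, p in zip(outcomes, probs)]}
```

Iterating over the event's own outcomes list, rather than over the keys of the raw prediction, is what keeps the market names and their order aligned with the request.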

Setup

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
cp .env.example .env       # fill in API keys (see Environment below)

Run locally — full pipeline

# Pull a slate of sample events (no API key needed for dataset retrieval)
.venv/bin/prophet forecast retrieve --dataset sample-sports -o events.json

# Run our agent against those events via the prophet CLI
.venv/bin/prophet forecast predict \
    --events events.json \
    --local agent.predict \
    -o submission.json

cat submission.json

Run as an HTTP server (matches the live deployment)

uvicorn agent.predict:app --host 0.0.0.0 --port 8000

# In another shell, exercise the endpoint via prophet CLI:
.venv/bin/prophet forecast predict \
    --events events.json \
    --agent-url http://localhost:8000/predict

# Or hit /predict directly with curl:
curl -X POST http://localhost:8000/predict \
    -H "Content-Type: application/json" \
    -d @events.json

For evaluators / organizers

One-shot script (recommended): runs a clean end-to-end evaluation against a sample event slate.

git clone https://github.com/BenRossJenkins/prophet-hacks-agent.git
cd prophet-hacks-agent

export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-proj-...
export GEMINI_API_KEY=AIza...

bash scripts/evaluate_agent.sh

scripts/evaluate_agent.sh handles Python version checks, virtualenv setup, dependency installation, dataset retrieval, prediction, and a human-readable summary of outputs. Approximate cost is $0.25 for the default 3-event smoke test; pass EVAL_EVENT_COUNT= (empty) to run the full sample slate.

Manual setup, if you prefer step-by-step:

python3.11 -m venv .venv
source .venv/bin/activate
pip install -e .

export ANTHROPIC_API_KEY=... OPENAI_API_KEY=... GEMINI_API_KEY=...

# Option A: run the agent as a local module via the official prophet CLI
prophet forecast retrieve --dataset sample-sports -o events.json
prophet forecast predict --events events.json --local agent.predict -o predictions.json

# Option B: run as an HTTP server (matches the deployed Cloud Run setup)
uvicorn agent.predict:app --host 0.0.0.0 --port 8000
prophet forecast predict --events events.json --agent-url http://localhost:8000/predict

A Dockerfile is also included: docker build -t agent . followed by docker run -p 8000:8000 -e ANTHROPIC_API_KEY=... agent reproduces the Cloud Run deployment locally.

Reliability strategy

Completion rate is multiplicative on the final score, so a hung or crashed request hurts as much as a wrong prediction. The agent is engineered so no single failure can break a /predict response:

Ensemble hard deadline. Each call to llm_forecast_ensemble is capped at 8 minutes (well below the 10-min per-event budget). Vendors still running at the deadline are abandoned and the median of whatever responses arrived is returned. A single hung LLM vendor cannot consume the whole budget.
Retry-without-search on total failure. When the whole ensemble returns None (all 3 vendors failed simultaneously, typically a search tool rate-limit), we retry once with web search disabled. Base chat-completion APIs rate-limit independently of search tools, so this converts a 0.5-fallback into a real prediction roughly 50% of the time in stress tests. The eval server does NOT retry timed-out requests on our behalf, so this is on us.
Multi-tier market fallback chain. When the LLM stack is unreachable, the agent falls through tail-anchor → Polymarket blend → category prior (NWS / yfinance / ESPN / Manifold) → uniform 0.5. Every external call is wrapped to return None on failure rather than raise, so any tier can fail without blocking the next.
Probabilities-only response contract is enforced at exactly one place. Every code path produces its final distribution via _wrap_binary (binary events) or _normalize_distribution (multi-outcome), so the wire response is always well-formed JSON with probabilities summing to 1 and markets matching the event's outcomes.
Bounded calibration shift. Any single calibration adjustment is capped at ±0.05 from the raw prediction so a noisy small-N bucket cannot yank a confident forecast off track. Per-bucket yes-rates are additionally Beta-Bernoulli shrunk toward the bucket's mean prediction so a 3-event "all yes" bucket doesn't output 1.0.
Diff-sanity guard on calibration publishes. The daily refit loads the previously-published table from GCS, compares each new small-N bucket against its predecessor, and refuses to publish (exit 3) when any small-N bucket moved by more than 0.20. Operators override with --skip-diff-sanity only after inspecting the data.
GCS-mirrored prediction log. Every prediction is also written as a per-event JSON object to GCS so the daily calibration refit job has a durable read source — predictions survive Cloud Run container restarts and post-hoc audit is possible. The log records the producing pipeline branch and the agent version for each entry so post-eval analysis can attribute Brier deltas to specific versions.
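
The ensemble hard deadline described in the list above can be illustrated with the standard library alone. This is a hedged sketch rather than the actual llm_forecast_ensemble: the function name ensemble_with_deadline is hypothetical, and each vendor is modeled as a zero-argument callable that returns a probability, returns None, or raises.

```python
import concurrent.futures as cf
import statistics
from typing import Callable, Optional, Sequence


def ensemble_with_deadline(vendors: Sequence[Callable[[], Optional[float]]],
                           deadline_s: float) -> Optional[float]:
    """Run all vendors in parallel and return the median of whatever
    finished before the deadline, or None if nothing usable arrived."""
    pool = cf.ThreadPoolExecutor(max_workers=max(1, len(vendors)))
    futures = [pool.submit(vendor) for vendor in vendors]

    # Wait at most deadline_s; slower vendors end up in the second set.
    done, _abandoned = cf.wait(futures, timeout=deadline_s)

    # Do not block on stragglers; their threads finish in the background.
    pool.shutdown(wait=False, cancel_futures=True)

    results = []
    for future in done:
        try:
            value = future.result()
        except Exception:
            continue  # a crashed vendor is treated like a missing one
        if value is not None:
            results.append(value)
    return statistics.median(results) if results else None
```

A None return is the signal for the caller to move on to the retry-without-search step or the next fallback tier. Note that Python threads cannot be killed, so an abandoned vendor call keeps running until its own timeout fires; the sketch only guarantees that the caller stops waiting.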

The calibration table is refit nightly from resolved questions over the eval window; this is a parameter update from observed data, not a mid-eval code change. Architecture, prompts, and pipeline structure are frozen at submission.
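
As a rough illustration of the per-bucket arithmetic, the sketch below combines the Beta-Bernoulli shrinkage (N_0=10, min_n=3) with the ±0.05 shift cap. The constants come from this README; the function names and the way the table is applied (moving the raw prediction toward the bucket's shrunk yes-rate) are assumptions, not the code in agent/calibrate.py.

```python
N0 = 10.0         # prior strength from the pipeline description
MIN_N = 3         # buckets with fewer resolved events are ignored
MAX_SHIFT = 0.05  # cap on any single calibration adjustment


def shrunk_yes_rate(n_yes: int, n: int, mean_p: float, n0: float = N0) -> float:
    """Posterior-mean yes-rate with a prior centered on the bucket's mean_p."""
    return (n_yes + n0 * mean_p) / (n + n0)


def calibrate(p: float, n_yes: int, n: int, mean_p: float) -> float:
    """Move p toward the bucket's shrunk yes-rate, by at most MAX_SHIFT."""
    if n < MIN_N:
        return p  # not enough data: leave the raw prediction alone
    target = shrunk_yes_rate(n_yes, n, mean_p)
    return min(max(target, p - MAX_SHIFT), p + MAX_SHIFT)
```

For a bucket with 3 resolved events, all yes, and mean_p = 0.70, the shrunk rate is (3 + 10 × 0.70) / (3 + 10) = 10/13 ≈ 0.769 rather than 1.0, and a raw prediction of 0.70 is moved only as far as 0.75 by the cap.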

Agent contract

The /predict endpoint matches the Prophet Arena spec (https://prophetarena.co/developer):

Input (one event per request):

{
  "event_ticker": "...",
  "market_ticker": "...",
  "title": "Who will win: A or B?",
  "category": "Sports",
  "rules": "Resolves to the winner.",
  "close_time": "2026-05-25T23:59:59Z",
  "outcomes": ["A", "B"]
}

Output:

{
  "probabilities": [
    {"market": "A", "probability": 0.62},
    {"market": "B", "probability": 0.38}
  ]
}

Probabilities sum to 1 across the event's outcomes
Each market matches one of the event's outcomes exactly
p_yes and rationale are included as extra fields for our own logging; the eval server ignores them

Brier scoring (lower is better) per the developer docs. Each outcome's squared error contributes to the per-event Brier.
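
To make the scoring concrete, here is a small sketch of a per-event Brier computation over the output format above. It assumes the per-event score is the sum of squared errors across outcomes; the developer docs may instead average over outcomes, which would only rescale the number. The function name event_brier is hypothetical.

```python
from typing import Dict, List, Union

Entry = Dict[str, Union[str, float]]


def event_brier(probabilities: List[Entry], winner: str) -> float:
    """Sum of squared errors between forecast and the one-hot resolution."""
    score = 0.0
    for entry in probabilities:
        actual = 1.0 if entry["market"] == winner else 0.0
        score += (float(entry["probability"]) - actual) ** 2
    return score


example = [
    {"market": "A", "probability": 0.62},
    {"market": "B", "probability": 0.38},
]
```

Under that assumption, the example response scores 0.38² + 0.38² = 0.2888 if A wins and 0.62² + 0.62² = 0.7688 if B wins.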

Layout

agent/                 forecasting logic + FastAPI app
  predict.py             pipeline orchestration, /predict, /health
  kalshi.py              Kalshi market client
  polymarket.py          Polymarket binary + multi-outcome lookup
  sports.py              ESPN moneyline-derived prior
  weather.py             NWS-backed prior
  financials.py          yfinance-backed crypto prior
  manifold.py            Manifold fallback prior
  llm.py                 multi-vendor ensemble with shared web search
  calibrate.py           path-stratified calibration (GCS-loaded)
  prediction_log.py      defensive append-only log (GCS-mirrored)

scripts/               operational utilities
  daily_calibration.sh   daily cron wrapper (resolve + refit + push)
  resolve_predictions.py marks resolved predictions from PA + Kalshi
  fit_calibration.py     fits path-stratified calibration table
  backtest.py            local Brier evaluation against a fixture
  capture_live_snapshots.py / resolve_captures.py
                         daily capture flow for ongoing fixture growth
  build_backtest_fixture.py / build_diverse_fixture.py
                         reproducible candlestick-based fixtures

tests/                 unit tests; tests/fixtures/ for backtest data

The suite has 320 tests; run them with pytest tests/.

Environment

See .env.example. Required for the full pipeline:

ANTHROPIC_API_KEY, OPENAI_API_KEY, GEMINI_API_KEY — LLM vendors
PA_SERVER_URL=https://api.aiprophet.dev — Prophet Arena base
PA_SERVER_API_KEY=prophet_... — issued by organizers, X-API-Key header
PA_TEAM_NAME=Brier Patch — registered team name

Optional:

PREDICTION_LOG_PATH (default data/predictions.jsonl)
PREDICTION_LOG_GCS_PREFIX=gs://... — mirror every prediction to GCS
CALIBRATION_GCS_URI=gs://.../calibration.json — daily refit source
FORECAST_MODEL=claude-opus-4-7 — override the single-model path

Architecture conventions and gotchas

See CLAUDE.md for the in-source architecture summary, defensive-degradation rules, the path-stratified calibration design, and the running gotchas list (Kalshi DNS, Polymarket DNS, the empty LLM-denylist policy, etc.).
