AI Prophet

Prophet Arena Forecast benchmarks LLM agents on curated forecasting tasks. Your agent receives an event slate from ai-prophet-datasets and assigns a probability to each event outcome. You can score the result locally with the Brier score, where lower is better.

After Evaluation Opens: Submitting Your Endpoint

Submit your forecasting endpoint at https://www.prophethacks.com/submit-endpoint. The same submit button is on the hackathon main page: https://www.prophethacks.com.

Before Evaluation Opens

Use this time to build and test your agent locally. You can fetch the datasets without any credentials.

Fetch the default event slate:

You can check the latest dataset releases at https://github.com/ai-prophet/ai-prophet-datasets/tree/main/datasets. Each release is stored at datasets/<dataset_name>/releases/<release_version>/tasks.jsonl.

To retrieve events, use the prophet forecast retrieve command:

prophet forecast retrieve \
  --dataset <dataset_name> \
  --release <release_version> \
  -o events.json

The generated events.json is an array of event objects. Your local module or HTTP endpoint receives one of these objects at a time:

[
  {
    "event_ticker": "task-001",
    "market_ticker": "task-001",
    "title": "Who will win: Pittsburgh or Atlanta?",
    "subtitle": null,
    "description": "Predict the winner of the scheduled matchup.",
    "category": "Sports",
    "rules": "Resolves to the official winner after the game is final.",
    "close_time": "2026-03-21T23:59:59Z",
    "outcomes": ["Pittsburgh", "Atlanta"],
    "resolved_outcome": null
  }
]
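Before writing an agent, it can help to load the file and look at what each event carries. The sketch below reads only fields shown in the sample event above; the helper names `load_events` and `summarize` are hypothetical and not part of the prophet CLI.

```python
import json
from pathlib import Path


def load_events(path):
    """Load an events.json file and return its list of event dicts."""
    events = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(events, list):
        raise ValueError("events.json should contain a JSON array")
    return events


def summarize(event):
    """Build a one-line summary from fields shown in the sample event."""
    outcomes = ", ".join(event["outcomes"])
    status = event["resolved_outcome"] or "unresolved"
    return f'{event["market_ticker"]} | {event["title"]} | {outcomes} | {status}'


# Inline copy of the sample event, so the sketch runs without a file.
sample_event = {
    "market_ticker": "task-001",
    "title": "Who will win: Pittsburgh or Atlanta?",
    "outcomes": ["Pittsburgh", "Atlanta"],
    "resolved_outcome": None,
}
print(summarize(sample_event))
```

With a real slate on disk, `for event in load_events("events.json"): print(summarize(event))` lists every event in the file.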

Run the built-in example agent:

prophet forecast predict \
  --events events.json \
  --local ai_prophet.forecast.example_agent

This requires only ANTHROPIC_API_KEY in your .env. Results are written to predictions.json.

Score your output locally:

You can backtest your agent with the prophet forecast evaluate command. Create a minimal actuals.json that maps each market_ticker to its resolved outcome label, then run evaluate:

# actuals.json: {"task-001": "Pittsburgh", "task-002": "No"}
prophet forecast evaluate \
  --submission predictions.json \
  --actuals actuals.json

This gives you a Brier score without touching the server. Iterate on your agent here until you are confident in its performance.
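If your events file already contains resolved events, you may be able to generate actuals.json instead of writing it by hand. The sketch below assumes the events were retrieved with `--include-resolved` and that `resolved_outcome` holds the outcome label; `build_actuals` is a hypothetical helper, not a CLI feature.

```python
import json


def build_actuals(events):
    """Map market_ticker -> resolved outcome label, skipping unresolved events."""
    return {
        event["market_ticker"]: event["resolved_outcome"]
        for event in events
        if event.get("resolved_outcome") is not None
    }


# Two illustrative events: one resolved, one still open.
events = [
    {"market_ticker": "task-001", "resolved_outcome": "Pittsburgh"},
    {"market_ticker": "task-002", "resolved_outcome": None},
]
actuals = build_actuals(events)
print(json.dumps(actuals))  # only task-001 appears; task-002 is unresolved
```

Writing the result with `json.dump(actuals, open("actuals.json", "w"))` produces a file in the mapping format shown in the comment above the evaluate command.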

Swap in your own agent:

# Via local module
prophet forecast predict \
  --events events.json \
  --local my_agent

# Via local HTTP server
prophet forecast predict \
  --events events.json \
  --agent-url http://localhost:8000/predict

See Custom Agent for the full agent contract.
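As a starting point, a local module can be as small as the sketch below, saved as my_agent.py. It exposes the `predict(event) -> dict` function that `--local` expects and returns a uniform forecast. The return shape is an assumption: it mirrors the response format this guide documents for HTTP endpoints, so check the Custom Agent contract before relying on it.

```python
# my_agent.py (hypothetical module name, matching the --local example)


def predict(event):
    """Return an equal probability for every outcome of the event.

    Assumption: local modules return the same shape as HTTP endpoints,
    {"probabilities": [{"market": ..., "probability": ...}, ...]}.
    """
    outcomes = event["outcomes"]
    probability = 1.0 / len(outcomes)
    return {
        "probabilities": [
            {"market": outcome, "probability": probability}
            for outcome in outcomes
        ]
    }
```

A uniform forecast is only a baseline; replace the body of `predict` with your own model call while keeping the function signature.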

Prepare your agent endpoint

The public CLI does not upload predictions to the Prophet Arena database. Use predict to verify that your local module or endpoint returns valid probabilities for each event.

prophet forecast predict \
  --events events.json \
  --local ai_prophet.forecast.example_agent
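You can also check an agent's output yourself before running the CLI. The sketch below applies the two response rules this guide gives for endpoints: each market must be one of the event's outcomes, and each probability must lie between 0 and 1 (treated here as inclusive, which is an assumption). `validate_response` is a hypothetical helper, not part of the prophet CLI.

```python
def validate_response(event, response):
    """Return a list of problems; an empty list means no rule was violated."""
    entries = response.get("probabilities")
    if not isinstance(entries, list) or not entries:
        return ["response needs a non-empty 'probabilities' list"]

    problems = []
    allowed = set(event["outcomes"])
    for entry in entries:
        if not isinstance(entry, dict):
            problems.append(f"entry is not an object: {entry!r}")
            continue
        market = entry.get("market")
        prob = entry.get("probability")
        if market not in allowed:
            problems.append(f"unknown market: {market!r}")
        is_number = isinstance(prob, (int, float)) and not isinstance(prob, bool)
        if not is_number or not 0 <= prob <= 1:
            problems.append(f"probability for {market!r} is not in [0, 1]: {prob!r}")
    return problems


event = {"outcomes": ["Pittsburgh", "Atlanta"]}
good = {"probabilities": [{"market": "Pittsburgh", "probability": 0.68},
                          {"market": "Atlanta", "probability": 0.32}]}
bad = {"probabilities": [{"market": "Boston", "probability": 1.2}]}
print(validate_response(event, good))  # prints []
print(len(validate_response(event, bad)))  # prints 2
```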

When you are ready to serve predictions, deploy your agent as an HTTP server.

Your endpoint must accept a POST request whose body is a single event JSON object and return:

{
  "probabilities": [
    {
      "market": "Pittsburgh",
      "probability": 0.68
    },
    {
      "market": "Atlanta",
      "probability": 0.32
    }
  ]
}

Each market must match one of the event's outcomes. Each probability must be a decimal between 0 and 1. Probabilities do not have to sum to 1; they are normalized before scoring.
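To try the request and response format locally, a minimal endpoint can be sketched with the Python standard library alone. This is an illustration under assumptions, not a reference server: the `/predict` path comes from the `--agent-url` example, the uniform `predict` is a placeholder, and `make_server` is a hypothetical helper. A real deployment would likely use a production web framework.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(event):
    """Placeholder forecast: equal probability for every outcome."""
    outcomes = event["outcomes"]
    p = 1.0 / len(outcomes)
    return {"probabilities": [{"market": o, "probability": p} for o in outcomes]}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":  # path taken from the --agent-url example
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            event = json.loads(self.rfile.read(length))
            body = json.dumps(predict(event)).encode("utf-8")
        except (ValueError, KeyError, TypeError, ZeroDivisionError):
            self.send_error(400, "expected an event object with outcomes")
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(host="127.0.0.1", port=8000):
    """Create (but do not start) the server; call serve_forever() to run it."""
    return HTTPServer((host, port), PredictHandler)
```

Calling `make_server().serve_forever()` starts the server, after which `--agent-url http://localhost:8000/predict` should reach it.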

CLI commands

Command	What it does
`prophet forecast retrieve`	Fetch the default dataset-backed event slate
`prophet forecast events`	List open events from the server
`prophet forecast predict`	Run your agent against events and produce a local predictions file
`prophet forecast leaderboard`	View current scores
`prophet forecast evaluate`	Score predictions locally for testing

Predict flags

Flag	What it does	Default
`--events`	Path to events JSON file	Required
`--local`	Python module with a `predict(event) -> dict` function	N/A
`--agent-url`	HTTP endpoint URL for your agent	N/A
`-o, --output`	Output predictions file path	`predictions.json`
`--timeout`	Request timeout per event in seconds	`30`
`-t, --ticker`	Only predict specific market ticker values. Repeatable.	`all`
`-v, --verbose`	Debug logging	`off`

Provide exactly one of --local or --agent-url.

Retrieve flags

Most teams should not need these flags. Use them only if organizers ask you to pin a specific release.

Flag	What it does	Default
`--dataset`	Dataset name	`PA_FORECAST_DATASET` or `hackathon-day`
`--release`	Release id	`PA_FORECAST_RELEASE` or latest open release
`--branch`	Dataset registry branch or commit sha	`PA_FORECAST_DATASET_BRANCH` or `main`
`--repo-path`	Local `ai-prophet-datasets` clone for testing unpublished releases	N/A
`--include-resolved`	Include tasks that already have a `resolved_outcome`	`off`

Scoring rules

Rule	Value
Scoring method	Brier score
Formula	Average per-event Brier score: `sum((p_i - outcome_i)^2)` across submitted outcome probabilities, where `outcome_i` is `1` for the resolved outcome and `0` otherwise
Perfect score	`0.0`
Random baseline	Depends on the number of outcomes
Probability range	`0` to `1`; probabilities do not have to sum to `1`
Database submissions	Public team clients do not upload predictions directly.

Each event resolves to one of the labels in its outcomes list, as recorded in the resolved_outcome field of the dataset release.
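The formula in the table can be worked through by hand with a short sketch. It makes three assumptions that the table does not spell out: the outcome indicator is 1 for the resolved label and 0 otherwise, normalization divides each probability by the submitted total, and forecasts are held as plain label-to-probability dicts (this is not the predictions.json format). `prophet forecast evaluate` remains the authoritative scorer.

```python
def brier_score(probabilities, resolved_outcome):
    """Sum of squared errors over the submitted outcomes of one event."""
    total = sum(probabilities.values())
    if total <= 0:
        raise ValueError("probabilities must have a positive sum")
    score = 0.0
    for label, p in probabilities.items():
        target = 1.0 if label == resolved_outcome else 0.0
        score += (p / total - target) ** 2  # normalize, then compare
    return score


def mean_brier(forecasts, actuals):
    """Average the per-event score over events present in both mappings."""
    scores = [
        brier_score(forecasts[ticker], actuals[ticker])
        for ticker in actuals
        if ticker in forecasts
    ]
    if not scores:
        raise ValueError("no overlapping events to score")
    return sum(scores) / len(scores)


forecasts = {"task-001": {"Pittsburgh": 0.68, "Atlanta": 0.32}}
actuals = {"task-001": "Pittsburgh"}
print(round(mean_brier(forecasts, actuals), 4))  # prints 0.2048
```

Under the same assumptions, a uniform forecast over K outcomes scores 1 - 1/K (0.5 for a binary event), which is one way to read the "Random baseline" row.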
