Prophet Arena Forecast benchmarks LLM agents on curated forecasting tasks. Your agent receives an event slate from ai-prophet-datasets and assigns a probability to each event outcome. You can score the result locally with the Brier score, where lower is better.
After Evaluation Opens: Submitting Your Endpoint
Submit your forecasting endpoint at https://www.prophethacks.com/submit-endpoint. The submission button is also on the hackathon main page: https://www.prophethacks.com.
Before Evaluation Opens
Use this time to build and test your agent locally. You can fetch the datasets without any credentials.
Fetch the default event slate:
You can check the latest dataset releases at https://github.com/ai-prophet/ai-prophet-datasets/tree/main/datasets. Each release is stored at datasets/<dataset_name>/releases/<release_version>/tasks.jsonl.
To retrieve events, use the prophet forecast retrieve command:
prophet forecast retrieve \
--dataset <dataset_name> \
--release <release_version> \
-o events.json
The generated events.json is an array of event objects. Your local module or HTTP endpoint receives one of these objects at a time:
[
{
"event_ticker": "task-001",
"market_ticker": "task-001",
"title": "Who will win: Pittsburgh or Atlanta?",
"subtitle": null,
"description": "Predict the winner of the scheduled matchup.",
"category": "Sports",
"rules": "Resolves to the official winner after the game is final.",
"close_time": "2026-03-21T23:59:59Z",
"outcomes": ["Pittsburgh", "Atlanta"],
"resolved_outcome": null
}
]
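Before wiring up an agent, it can help to inspect the slate in Python. The sketch below is illustrative only: `load_events` and `unresolved` are hypothetical helper names, not part of the prophet CLI, and the inline sample copies only a few of the fields shown above.

```python
import json
from pathlib import Path


def load_events(path):
    """Read the array of event objects written by `prophet forecast retrieve`."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def unresolved(events):
    """Keep only events that do not have a resolved outcome yet."""
    return [e for e in events if e.get("resolved_outcome") is None]


# Inline sample that mirrors part of the event shape shown above.
sample = [
    {
        "market_ticker": "task-001",
        "outcomes": ["Pittsburgh", "Atlanta"],
        "resolved_outcome": None,
    },
    {
        "market_ticker": "task-002",
        "outcomes": ["Yes", "No"],
        "resolved_outcome": "No",
    },
]

open_tickers = [e["market_ticker"] for e in unresolved(sample)]
print(open_tickers)
```

With a real slate, replace `sample` with `load_events("events.json")`.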
Run the built-in example agent:
prophet forecast predict \
--events events.json \
--local ai_prophet.forecast.example_agent
This requires only ANTHROPIC_API_KEY in your .env. Results are written to predictions.json unless you pass -o with a different path.
Score your output locally:
You can backtest your agent with the prophet forecast evaluate command. Create a minimal actuals.json that maps each market_ticker to its resolved outcome label, then evaluate:
# actuals.json: {"task-001": "Pittsburgh", "task-002": "No"}
prophet forecast evaluate \
--submission predictions.json \
--actuals actuals.json
This gives you a Brier score without touching the server. Iterate on your agent here until you are confident in its performance.
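One way to produce actuals.json for a backtest is to derive it from events that already carry a resolved_outcome, for example a slate fetched with --include-resolved. This is a minimal sketch under that assumption; `build_actuals` is a hypothetical helper, not a CLI feature.

```python
import json


def build_actuals(events):
    """Map market_ticker -> resolved outcome label, skipping unresolved events."""
    return {
        e["market_ticker"]: e["resolved_outcome"]
        for e in events
        if e.get("resolved_outcome") is not None
    }


# Inline sample; in practice, load the events array from your events file.
events = [
    {"market_ticker": "task-001", "resolved_outcome": "Pittsburgh"},
    {"market_ticker": "task-002", "resolved_outcome": "No"},
    {"market_ticker": "task-003", "resolved_outcome": None},
]

actuals = build_actuals(events)
print(json.dumps(actuals, indent=2))
```

Writing the printed JSON to actuals.json yields the mapping that the evaluate command expects.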
Swap in your own agent:
# Via local module
prophet forecast predict \
--events events.json \
--local my_agent
# Via local HTTP server
prophet forecast predict \
--events events.json \
--agent-url http://localhost:8000/predict
See Custom Agent for the full agent contract.
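As a starting point, a local module can be as small as the sketch below. It assumes the module's `predict(event)` returns the same `{"probabilities": [...]}` shape this page documents for HTTP endpoints; confirm the exact contract on the Custom Agent page. The file name `my_agent.py` simply matches the `--local my_agent` example above.

```python
# my_agent.py: hypothetical module name, matching `--local my_agent`.


def predict(event):
    """Return a uniform forecast over the event's outcomes.

    Assumption: local modules return the same response shape as HTTP
    endpoints. Replace the uniform baseline with your own model.
    """
    outcomes = event["outcomes"]
    p = 1.0 / len(outcomes)
    return {
        "probabilities": [{"market": o, "probability": p} for o in outcomes]
    }
```

A uniform forecast is useful mainly as a baseline to beat when you score locally.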
Prepare your agent endpoint
The public CLI does not upload predictions to the Prophet Arena database. Use predict to verify that your local module or endpoint returns valid probabilities for each event.
prophet forecast predict \
--events events.json \
--local ai_prophet.forecast.example_agent
Deploy your agent as an HTTP server when you are ready to serve predictions over HTTP.
Your endpoint must accept a POST with event JSON and return:
{
"probabilities": [
{
"market": "Pittsburgh",
"probability": 0.68
},
{
"market": "Atlanta",
"probability": 0.32
}
]
}
Each market must match one of the event's outcomes. Each probability must be a decimal between 0 and 1. Probabilities do not have to sum to 1; they are normalized before scoring.
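The request and response rules above can be sketched with Python's standard library alone. Everything here is illustrative: `forecast`, `validate`, `PredictHandler`, and `serve` are hypothetical names, the uniform model is a placeholder, and treating the 0 and 1 bounds as inclusive is an assumption. The handler answers a POST on any path, so it also serves /predict.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def forecast(event):
    """Placeholder model: uniform probability over the event's outcomes."""
    outcomes = event["outcomes"]
    p = 1.0 / len(outcomes)
    return {
        "probabilities": [{"market": o, "probability": p} for o in outcomes]
    }


def validate(event, response):
    """Raise ValueError if the response breaks the rules stated above."""
    allowed = set(event["outcomes"])
    for item in response["probabilities"]:
        if item["market"] not in allowed:
            raise ValueError(f"unknown market: {item['market']!r}")
        if not 0.0 <= item["probability"] <= 1.0:
            raise ValueError(f"probability out of range: {item['probability']!r}")
    return response


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the event JSON from the request body.
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length))
        body = json.dumps(validate(event, forecast(event))).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8000):
    """Serve predictions until interrupted. This call blocks."""
    HTTPServer((host, port), PredictHandler).serve_forever()
```

Calling `serve()` starts the server on port 8000, after which `--agent-url http://localhost:8000/predict` can point at it. A production deployment would also need error handling for malformed requests.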
CLI commands
| Command | What it does |
|---|---|
| `prophet forecast retrieve` | Fetch the default dataset-backed event slate |
| `prophet forecast events` | List open events from the server |
| `prophet forecast predict` | Run your agent against events and produce a local predictions file |
| `prophet forecast leaderboard` | View current scores |
| `prophet forecast evaluate` | Score predictions locally for testing |
Predict flags
| Flag | What it does | Default |
|---|---|---|
| `--events` | Path to events JSON file | Required |
| `--local` | Python module with a `predict(event) -> dict` function | N/A |
| `--agent-url` | HTTP endpoint URL for your agent | N/A |
| `-o`, `--output` | Output predictions file path | `predictions.json` |
| `--timeout` | Request timeout per event in seconds | 30 |
| `-t`, `--ticker` | Only predict specific market ticker values. Repeatable. | all |
| `-v`, `--verbose` | Debug logging | off |
Provide exactly one of --local or --agent-url.
Retrieve flags
Most teams should not need these flags. Use them only if organizers ask you to pin a specific release.
| Flag | What it does | Default |
|---|---|---|
| `--dataset` | Dataset name | `PA_FORECAST_DATASET` or `hackathon-day` |
| `--release` | Release id | `PA_FORECAST_RELEASE` or latest open release |
| `--branch` | Dataset registry branch or commit sha | `PA_FORECAST_DATASET_BRANCH` or `main` |
| `--repo-path` | Local ai-prophet-datasets clone for testing unpublished releases | N/A |
| `--include-resolved` | Include tasks that already have a `resolved_outcome` | off |
Scoring rules
| Rule | Value |
|---|---|
| Scoring method | Brier score |
| Formula | Mean over events of the per-event Brier score, sum((p_i - outcome_i)^2) over the submitted outcome probabilities, where outcome_i is 1 for the resolved outcome and 0 otherwise |
| Perfect score | 0.0 |
| Random baseline | Depends on the number of outcomes |
| Probability range | 0 to 1; probabilities do not have to sum to 1 because they are normalized before scoring |
| Database submissions | Public team clients do not upload predictions directly. |
Each event resolves to one of the labels in its outcomes list, as recorded in the resolved_outcome value of the dataset release.
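To make the formula concrete, here is a minimal sketch of the per-event calculation. It rests on several assumptions that the official evaluator may not share: normalization divides each probability by the sum over the event's outcomes, outcomes missing from the submission count as 0, and the per-event score is the plain sum of squared errors. `event_brier` and `mean_brier` are hypothetical names.

```python
def event_brier(probabilities, outcomes, resolved):
    """Sum of (p_i - outcome_i)^2 over the event's outcomes, after normalizing.

    probabilities: list of {"market": label, "probability": float} items
    outcomes: the event's outcome labels
    resolved: the label the event resolved to
    """
    submitted = {item["market"]: item["probability"] for item in probabilities}
    total = sum(submitted.get(o, 0.0) for o in outcomes)
    if total <= 0:
        raise ValueError("at least one probability must be positive")
    score = 0.0
    for o in outcomes:
        p = submitted.get(o, 0.0) / total  # assumed normalization step
        actual = 1.0 if o == resolved else 0.0
        score += (p - actual) ** 2
    return score


def mean_brier(scores):
    """Average the per-event scores into one number."""
    scores = list(scores)
    return sum(scores) / len(scores)


example = event_brier(
    [
        {"market": "Pittsburgh", "probability": 0.68},
        {"market": "Atlanta", "probability": 0.32},
    ],
    ["Pittsburgh", "Atlanta"],
    "Pittsburgh",
)
print(round(example, 4))
```

For the 0.68 / 0.32 forecast with Pittsburgh winning, the score is (0.68 - 1)^2 + (0.32 - 0)^2, about 0.2048. Under this sum form, a uniform forecast over K outcomes scores 1 - 1/K, which is 0.5 for a two-outcome event. This is one way to read "depends on the number of outcomes" in the table.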