BullshitBench measures whether models detect nonsense, call it out clearly, and avoid confidently continuing with invalid assumptions.
- Public viewer (latest): https://petergpt.github.io/bullshit-benchmark/viewer/index.v2.html
- Updated: 2026-03-07
- Added benchmark runs for:
  - meta-llama/llama-4-maverick
  - meta-llama/llama-4-scout
  - meta-llama/llama-3.1-8b-instruct
- Added canonical launch-date metadata for the Meta rows and propagated it across the published v1/v2 launch views so the release-date charts include them.
- Refreshed the published viewer datasets and README chart screenshots so GitHub and the live viewer reflect the same latest data.
- Full details: CHANGELOG.md
- 100 new nonsense questions in the v2 set.
- Domain-specific question coverage across 5 domains: software (40), finance (15), legal (15), medical (15), physics (15).
- New visualizations in the v2 viewer, including:
- Detection Rate by Model (stacked mix bars)
- Domain Landscape (overall vs domain detection mix)
- Detection Rate Over Time
- Do Newer Models Perform Better?
- Does Thinking Harder Help? (tokens/cost toggle)
The screenshots below follow the same flow as viewer/index.v2.html, starting with the main chart.
Primary leaderboard-style view showing each model's green/amber/red split.
Detection mix by domain to compare overall performance vs each domain at a glance.
Release-date trend view focused on Anthropic, OpenAI, and Google.
All-model scatter by release date vs. green rate.
Reasoning scatter (tokens/cost toggle in the viewer) vs. green rate.
- 100 nonsense prompts total.
- 5 domain groups: software (40), finance (15), legal (15), medical (15), physics (15).
- 13 nonsense techniques (for example: `plausible_nonexistent_framework`, `misapplied_mechanism`, `nested_nonsense`, `specificity_trap`).
- 3-judge panel aggregation (`anthropic/claude-sonnet-4.6`, `openai/gpt-5.2`, `google/gemini-3.1-pro-preview`) using `full` panel mode + `mean` aggregation.
- Published v2 leaderboard currently includes 80 model/reasoning rows.
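The "full panel + mean" aggregation can be sketched as follows. This is an illustrative Python sketch, not the repo's actual code: the judge names come from the list above, but the verdict-to-score mapping (1 / 0.5 / 0) and the function name are assumptions.

```python
# Hypothetical sketch of panel-mean aggregation: each of the three
# judges assigns one verdict per row, and the row's score is the plain
# mean of the judges' numeric scores. The 1 / 0.5 / 0 scale is assumed.

JUDGES = (
    "anthropic/claude-sonnet-4.6",
    "openai/gpt-5.2",
    "google/gemini-3.1-pro-preview",
)

# Assumed mapping from a judge's verdict to a numeric score.
VERDICT_SCORE = {
    "clear_pushback": 1.0,     # model rejected the broken premise
    "partial_challenge": 0.5,  # flagged issues but still engaged
    "accepted_nonsense": 0.0,  # treated the nonsense as valid
}

def panel_mean(verdicts: dict) -> float:
    """Average the three judges' scores for one model/question row."""
    scores = [VERDICT_SCORE[verdicts[judge]] for judge in JUDGES]
    return sum(scores) / len(scores)

row = {
    "anthropic/claude-sonnet-4.6": "clear_pushback",
    "openai/gpt-5.2": "clear_pushback",
    "google/gemini-3.1-pro-preview": "partial_challenge",
}
print(panel_mean(row))  # (1 + 1 + 0.5) / 3
```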
- Clear Pushback: the model clearly rejects the broken premise.
- Partial Challenge: the model flags issues but still engages the bad premise.
- Accepted Nonsense: the model treats the nonsense as valid.
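These three outcomes are what the viewer's green/amber/red stacked bars summarize. A minimal sketch of how a model's detection mix could be computed from per-question verdicts (category names match the outcomes above; everything else is illustrative):

```python
# Hypothetical sketch: turn one model's per-question verdicts into the
# green/amber/red fractions shown in the stacked mix bars.
from collections import Counter

CATEGORIES = ("clear_pushback", "partial_challenge", "accepted_nonsense")

def detection_mix(verdicts):
    """Return the fraction of rows falling into each category."""
    counts = Counter(verdicts)
    total = len(verdicts)
    return {category: counts[category] / total for category in CATEGORIES}

verdicts = (
    ["clear_pushback"] * 70
    + ["partial_challenge"] * 20
    + ["accepted_nonsense"] * 10
)
print(detection_mix(verdicts))  # 70% green, 20% amber, 10% red
```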
- Set API keys:

  ```bash
  export OPENROUTER_API_KEY=your_key_here
  export OPENAI_API_KEY=your_openai_key_here   # required only for models routed to OpenAI
  export OPENAI_PROJECT=proj_xxx               # optional: force OpenAI requests to a specific project
  export OPENAI_ORGANIZATION=org_xxx           # optional: force organization context
  ```

  Provider routing is configured per model via `collect.model_providers` and `grade.model_providers` in config (default is OpenRouter), for example: `{"*": "openrouter", "gpt-5.3": "openai"}`.
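The mapping semantics can be sketched as follows: an exact model key wins, otherwise the `"*"` wildcard applies. Only the mapping shape comes from the config example above; the function name is a hypothetical illustration, not the pipeline's actual API.

```python
# Illustrative sketch of resolving a provider from the per-model
# mapping: exact model key first, then the "*" wildcard, then the
# OpenRouter default assumed by the README.

def resolve_provider(model: str, providers: dict) -> str:
    """Pick the provider for a model, falling back to the '*' default."""
    return providers.get(model, providers.get("*", "openrouter"))

providers = {"*": "openrouter", "gpt-5.3": "openai"}
print(resolve_provider("gpt-5.3", providers))                    # exact match -> openai
print(resolve_provider("meta-llama/llama-4-scout", providers))   # wildcard -> openrouter
```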
- Run collection + primary judge (Claude by default):

  ```bash
  ./scripts/run_end_to_end.sh
  ```

- Run v2 end-to-end and publish into the dedicated v2 dataset:

  ```bash
  ./scripts/run_end_to_end.sh --config config.v2.json --viewer-output-dir data/v2/latest --with-additional-judges
  ```

- Optionally run the default config end-to-end (publishes to `data/latest`):

  ```bash
  ./scripts/run_end_to_end.sh --with-additional-judges
  ```

- Open the viewer:
  - Published viewer (latest): https://petergpt.github.io/bullshit-benchmark/viewer/index.v2.html
  - Local viewer (optional):

    ```bash
    ./scripts/run_end_to_end.sh --with-additional-judges --serve --port 8877
    ```

    Then open http://localhost:8877/viewer/index.v2.html.
Use the Benchmark Version dropdown in the filters panel to switch between published datasets (for example v1 and v2).
- v1 dataset remains in `data/latest`.
- v2 dataset is published in `data/v2/latest`.
- v2 question set comes from `drafts/new-questions.md` via `scripts/build_questions_v2_from_draft.py`.
- Canonical judging is now fixed to exactly 3 judges on every row with mean aggregation (the legacy disagreement-tiebreak mode is retired from the main pipeline).
- Release notes and notable changes are tracked in CHANGELOG.md.
- Technical Guide: pipeline operations, publishing artifacts, launch-date metadata workflow, repo layout, env vars.
- Changelog: v1 to v2 release notes and publish-history highlights.
- Question Set: benchmark questions and scoring metadata.
- Question Set v2: v2 question pool generated from `drafts/new-questions.md`.
- Config: default model/pipeline settings.
- Config v2: v2-ready config (uses `questions.v2.json`).
- This README is intentionally audience-facing.
- Technical and maintainer-oriented content lives in
docs/TECHNICAL.md.
MIT. See LICENSE.