Can we automate policy evaluation?

AI may soon be capable of producing rigorous economic research. If that happens, policy evaluation could scale dramatically: highlighting what works, what fails, and what harms, far faster than human researchers alone.

We want to find out whether an autonomous system can generate, replicate, and revise empirical policy research, with everything made public.

This is an experiment in building reliable AI research systems. For a global overview, click here.

2,464 Ideas (+342 this week)

982 Papers (+217 this week)

17k+ Matches

4% Win Rate

Last updated: April 8, 2026

Most policies, probably millions of them globally, are never rigorously evaluated. Data is plentiful, but there aren't enough researchers. Could AI help? We genuinely don't know, so we're running an experiment. An AI system attempts to produce economics research at scale, using publicly available data.

Will any of it be good? How would we even know? Ideally, PhDs or editors of top journals would evaluate every paper, but they are busy. Instead, we run an automated tournament that evaluates the papers against human benchmarks from top journals. This could help with triage and get us to a "you know it when you see it" moment faster.

Most importantly, everything is public: papers, code, data, failures. The more people look, the faster mistakes get caught. And we want feedback! In fact, the core thesis is that recursive self-improvement is possible and can be enhanced by human feedback.

The next milestone: generate 1,000 papers, evaluate them, and share lessons in a report. Can policy evaluation be automated? Or is hallucinated slop unavoidable? Let's find out!

How the Tournament Works

The Tournament

Papers compete in head-to-head matches. Each day at 9:00 AM UTC, we run a round of matches. An LLM judge reads both papers and picks one as preferred (or declares a tie). The preferred paper gains rating points; the other loses points. Over time, papers consistently favored by the LLM judge rise in the rankings — though this is a noisy and potentially biased signal, not ground truth.

How Matches Work

For each match, we send both paper PDFs directly to Gemini 3.1 Flash Lite (Google's LLM). The model sees the full papers — text, figures, tables, formatting — exactly as a human reviewer would.

The judge is prompted to act as a "senior editor at a top economics journal" and evaluate papers on: identification strategy (is the causal inference credible?), novelty, policy relevance, execution quality, and appropriate scope.

To control for position bias (LLMs sometimes prefer whichever paper they see first), we run each comparison twice with the papers swapped. A paper must win both rounds to win the match; otherwise it's a tie.
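The both-rounds rule can be sketched as a small function. This is an illustrative reading of the rule described above, not the project's actual code; the names `Verdict` and `match_outcome` and the verdict labels are assumptions.

```python
from typing import Literal

# Hypothetical labels for what the judge returns in one round:
# it prefers the paper shown first, the paper shown second, or neither.
Verdict = Literal["first", "second", "tie"]


def match_outcome(round_ab: Verdict, round_ba: Verdict) -> str:
    """Combine two judge verdicts into a match result.

    round_ab: verdict when paper A was shown first and B second.
    round_ba: verdict when the order was swapped (B first, A second).
    Returns "A", "B", or "tie".
    """
    # Translate positional verdicts into paper identities.
    winner_ab = {"first": "A", "second": "B", "tie": None}[round_ab]
    winner_ba = {"first": "B", "second": "A", "tie": None}[round_ba]

    # A paper wins the match only if it is preferred in both orderings.
    if winner_ab is not None and winner_ab == winner_ba:
        return winner_ab
    return "tie"


print(match_outcome("first", "second"))  # A preferred in both orderings
print(match_outcome("first", "first"))   # judge followed position, so a tie
```

Under this rule, a judge that always prefers the first paper can never produce a winner, which is the point of the swap.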

Rating System

Rankings use TrueSkill, a Bayesian rating system developed by Microsoft. Each paper has two numbers:

μ (mu) — estimated skill level. Higher = paper wins more often.
σ (sigma) — uncertainty. Decreases as the paper plays more matches.

Papers are ranked by their conservative rating (μ − 3σ), which represents the lower bound of estimated skill. This means papers need consistent wins across multiple matches — a single lucky win won't send a paper to the top.
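A minimal sketch of how ranking by μ − 3σ plays out, using two (μ, σ) pairs taken from the leaderboard for illustration. The `Rating` class and the paper labels are hypothetical, and published conservative ratings may differ slightly because of rounding and other adjustments.

```python
from dataclasses import dataclass


@dataclass
class Rating:
    title: str
    mu: float     # estimated skill
    sigma: float  # uncertainty about that estimate

    @property
    def conservative(self) -> float:
        # Lower bound of estimated skill, as described above.
        return self.mu - 3 * self.sigma


papers = [
    Rating("Paper with many matches", mu=31.7, sigma=1.1),
    Rating("Paper with few matches", mu=37.7, sigma=4.2),
]

# Sort from highest to lowest conservative rating.
ranked = sorted(papers, key=lambda r: r.conservative, reverse=True)
for position, r in enumerate(ranked, start=1):
    print(position, r.title, round(r.conservative, 1))
```

The second paper has the higher μ but ranks lower, because its large σ pulls its conservative rating down until it has played more matches.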

Head-to-Head Statistics

Win counts exclude 896 ties where the judge couldn't determine a clear winner.

Prob(Human Win) answers: if we randomly pick one human paper and one APE paper, what's the probability the human paper wins according to the LLM judge? Computed from TrueSkill ratings, accounting for uncertainty — papers with fewer matches contribute less certainty. The main metric compares recent cohorts (last 25 of each); all-time (89.4%) includes all 909 AI and 43 human papers with 5+ matches.
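The exact computation is not spelled out here. One common way to turn two TrueSkill ratings into a win probability is the Gaussian formula below, averaged over every human/APE pair. Treat this as a sketch under assumptions: the value of `BETA` (the default performance noise in TrueSkill's standard settings), the decision to ignore ties, and the function names are not confirmed by the source.

```python
import math
from itertools import product

# Assumed: performance noise from TrueSkill's standard defaults (mu=25, sigma=25/3).
BETA = 25 / 6


def win_probability(mu_a, sigma_a, mu_b, sigma_b, beta=BETA):
    """P(a outperforms b) when both skills are Gaussian; ties are ignored."""
    denom = math.sqrt(2 * beta**2 + sigma_a**2 + sigma_b**2)
    z = (mu_a - mu_b) / denom
    # Standard normal CDF written with the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def prob_human_win(humans, apes):
    """Average win probability over every (human, APE) pair of (mu, sigma) tuples."""
    probs = [win_probability(*h, *a) for h, a in product(humans, apes)]
    return sum(probs) / len(probs)


# Hypothetical ratings, for illustration only.
humans = [(35.0, 1.2), (32.0, 1.0)]
apes = [(28.0, 1.2), (25.0, 1.5)]
print(round(prob_human_win(humans, apes), 3))
```

Larger σ values widen the denominator and pull each pairwise probability toward 50%, which is one way that papers with fewer matches contribute less certainty.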

Matchup Selection

Each day we run 50 matches (100 LLM calls with position swapping) in 10 batches of 5. Within each batch, no paper plays twice. We combine random matching with structured matching.
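The random part of this schedule can be sketched as follows. The structured matching is not described in enough detail to reproduce, so this example covers only random pairing under the "no paper plays twice within a batch" constraint; `schedule_batches` and its parameters are hypothetical names.

```python
import random


def schedule_batches(paper_ids, n_batches=10, matches_per_batch=5, seed=None):
    """Return n_batches lists of (paper, paper) pairs.

    Within a batch every paper appears at most once; the same paper
    may still appear again in a different batch.
    """
    rng = random.Random(seed)
    needed = 2 * matches_per_batch
    if len(paper_ids) < needed:
        raise ValueError("not enough papers to fill one batch")

    batches = []
    for _ in range(n_batches):
        # Sampling without replacement guarantees distinct papers in this batch.
        picked = rng.sample(paper_ids, needed)
        batches.append([(picked[i], picked[i + 1]) for i in range(0, needed, 2)])
    return batches


batches = schedule_batches(list(range(100)), seed=0)
total_matches = sum(len(b) for b in batches)
print(total_matches, "matches,", 2 * total_matches, "judge calls with position swapping")
```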

Important Caveats ⚠️

The ⚠️ warning icon in the leaderboard indicates AI-generated papers that have not been peer reviewed. The LLM judge is not a substitute for human peer review. AI-generated papers may contain errors, hallucinations, or fabricated results. In fact, we have found that these are very common and sometimes take a lot of effort to spot. If a paper looks too good to be true, it probably is.

The judge evaluates the PDF only, not the underlying code or data. Rankings should not be taken at face value. That's why everything is open source — code, data, and papers are all public so anyone can spot errors, report issues, and contribute improvements.

Ranking Metrics

Rank: Position based on conservative rating (lower bound of estimated skill)

48h: Change in rank over the past 48 hours

μ: Estimated skill level; higher means the paper wins more often

σ: Uncertainty; decreases as the paper plays more matches

Cons.: Conservative rating (μ − 3σ); papers need consistent wins to rank highly (TrueSkill)

Elo: Familiar chess-style rating for comparison (1500 = average)

MP: Number of head-to-head matches played

Reviewed: Human expert review; Yes for journal papers (peer-reviewed), No for APE papers (pending)
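For readers used to chess ratings, the leaderboard's Elo column is described as a scale on which a 400-point gap corresponds to roughly a 90% win probability. That matches the standard Elo expected-score formula, sketched below; whether the site computes its Elo column in exactly this way is an assumption, and `elo_expected_score` is a hypothetical name.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (between 0 and 1)."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


# A paper rated 400 points above the 1500 average against an average paper.
print(round(elo_expected_score(1900, 1500), 3))
```

A 400-point gap gives 10-to-1 odds, that is 10/11, or about 0.909.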

[Charts: Average Elo by Source · Elo Rating Distribution · Conservative Rating Distribution · TrueSkill Conservative: Top-3 Papers · Papers & Matches]

Data as of 2026-04-08 21:24:28 CET

Highest Ranked APE Papers

The Price of Subsidy Limits: Multi-Cutoff Evidence from Help to Buy's Regional Caps
Rank #24 overall · 100 matches

Connected Backlash: Social Networks and the Political Economy of Carbon Taxation in France
Rank #26 overall · 94 matches

Back to Work? Early Termination of Pandemic Unemployment Benefits and Medicaid Home Care Provider Supply
Rank #33 overall · 110 matches

Column notes:
48h: Rank change over the last 48 hours.
μ: Estimated skill rating. Higher values indicate better research quality based on pairwise comparisons.
σ: Uncertainty. Lower values mean higher confidence in the rating.
Cons.: Conservative Rating (μ − 3σ), adjusted for integrity penalties. Used for ranking.
Elo: Elo rating. Standard chess-like rating where 400 points difference = 90% win probability.
MP: Matches Played. Valid head-to-head comparisons, excluding annulled matches against papers flagged with severe issues during automated code review.
Status: ✅ Peer reviewed · 🔎 Awaiting review · 🧐 Issues detected · 🚫 Critical errors

Rank	48h	Paper	μ	σ	Cons.	Elo	MP	Status
1	—	Dynamics of the Long Term Housing Yield: Evidence from Natural Experiments AER	40.0	1.7	35.0	2102	408	✅
2	—	The Value of Clean Water: Experimental Evidence from Rural India AER	37.6	1.6	32.9	2004	366	✅
3	—	Why is Workplace Sexual Harassment Underreported? The Value of Outside Options Amid the Threat of Retaliation AER	35.3	1.2	31.7	1911	405	✅
4	—	Vertical Integration and Cream Skimming of Profitable Referrals: The Case of Hospital-Owned Skilled Nursing Facilities AEJ: Policy	35.0	1.2	31.5	1902	416	✅
5	—	Are Complementary Policies Substitutes? Evidence from R&D Subsidies in the UK AEJ: Policy	35.4	1.4	31.4	1917	362	✅
6	▲2	Estimating the Economic Value of Zoning Reform AEJ: Policy	34.6	1.1	31.1	1883	419	✅
7	—	Punishing Financial Crimes: The Impact of Prison Sentences on Defendants and Their Colleagues AEJ: Policy	34.5	1.2	31.0	1879	364	✅
8	▼2	Plata y Plomo: How Higher Wages Expose Politicians to Criminal Violence AEJ: Policy	34.6	1.2	31.0	1885	350	✅
9	▲3	Flood Risk Mapping and the Distributional Impacts of Climate Information AEJ: Policy	34.0	1.2	30.4	1862	373	✅
10	▼1	Abundance from Abroad: Migrant Income and Long-Run Economic Development AER	33.8	1.1	30.4	1853	399	✅
11	▼1	Immigration, Innovation, and Growth AER	33.5	1.1	30.1	1841	399	✅
12	▲1	The Welfare Effects of Eligibility Expansions: Theory and Evidence from SNAP AEJ: Policy	33.5	1.2	30.0	1841	376	✅
13	▼2	Invisible Wounds: How Mental Disability Benefits Shape Veteran Well-Being AEJ: Policy	33.5	1.2	30.0	1840	344	✅
14	—	Hacked to Pieces? The Effects of Ransomware Attacks on Hospitals and Patients AEJ: Policy	33.0	1.1	29.8	1822	387	✅
15	—	Short- and Long-Term Effects of Universal Preschool: Evidence from the Arab Population in Israel AEJ: Policy	32.7	1.1	29.5	1809	377	✅
16	▲3	Can You Erase the Mark of a Criminal Record? Labor Market Impacts of Criminal Record Remediation AEJ: Policy	32.7	1.1	29.4	1810	404	✅
17	—	Adjustable Product Attributes, Indirect Network Effects, And Subsidy Design: The Case of Electric Vehicles AEJ: Policy	32.4	1.0	29.3	1796	429	✅
18	—	Collective Bargaining Rights, Policing, and Civilian Deaths AEJ: Policy	32.2	1.1	29.0	1788	421	✅
19	▲4	The Pass-Through of Retail Crime AEJ: Policy	32.2	1.1	29.0	1789	452	✅
20	▲7	Market Power and Capital Constraints AER	32.2	1.1	29.0	1788	422	✅
21	▲1	Trade Protection Along Supply Chains AEJ: Policy	32.0	1.0	29.0	1781	373	✅
22	▼1	Cooking, Health, and Daily Exposure to Pollution Spikes AEJ: Policy	32.1	1.0	29.0	1783	391	✅
23	▲5	The Unintended Consequences of Academic Leniency AEJ: Policy	31.7	1.0	28.7	1769	466	✅
24	—	The Price of Subsidy Limits: Multi-Cutoff Evidence from Help to Buy's Regional Caps APE working paper #492 (v1)	31.7	1.1	28.4	1767	100
25	▲10	Temporary Layoffs, Loss-of-Recall and Cyclical Unemployment Dynamics AER	31.6	1.1	28.3	1763	398	✅
26	▼1	Connected Backlash: Social Networks and the Political Economy of Carbon Taxation in France APE working paper #464 (v7)	31.5	1.1	28.1	1760	94
27	▲3	Harnessing Deductions to Increase Tax Compliance and Formalization AEJ: Policy	31.1	1.0	28.1	1744	414	✅
28	▼2	A Matter of Time? Measuring Effects of Public Schooling Expansions on Families AEJ: Policy	31.1	1.0	28.1	1743	427	✅
29	▲11	The Effect of Deactivating Facebook and Instagram on Users' Emotional State AEJ: Policy	31.1	1.0	28.1	1744	409	✅
30	▲2	Childhood Health Shocks and the Intergenerational Transmission of Inequality AEJ: Policy	30.8	1.0	27.8	1733	394	✅
31	▲7	Zero-Sum Thinking and the Roots of US Political Differences AER	30.6	1.0	27.7	1725	404	✅
32	▲14	Cross-State Strategic Voting AEJ: Policy	29.7	1.0	26.8	1689	430	✅
33	▲3	Back to Work? Early Termination of Pandemic Unemployment Benefits and Medicaid Home Care Provider Supply APE working paper #448 (v2)	30.0	1.1	26.8	1698	110
34	▲3	The Option Value of Municipal Liquidity AEJ: Policy	29.8	1.0	26.7	1690	391	✅
35	▲9	Tax and Occupancy of Business Properties: Evidence from UK Business Rate Reliefs AEJ: Policy	29.4	0.9	26.6	1677	447	✅
36	▲9	Female Leaders and Intrahousehold Dynamics: Evidence from State Elections in India AEJ: Policy	29.4	1.0	26.6	1678	441	✅
37	▲4	Perils of the Paperwork: The Impact of Information and Application Assistance on Social Benefit Take-Up in India AEJ: Policy	29.1	0.9	26.3	1664	488	✅
38	▲4	The Hidden Pre-Trend: How a Third Census Decade Exposes Identification Failure in WWII Service-Return Estimates APE working paper #586 (v1)	29.7	1.1	25.9	1689	68
39	▲36	Work From Home and the Office Real Estate Apocalypse AER	28.1	0.9	25.4	1625	477	✅
40	▼9	The Conviction Lottery: Legal Indeterminacy and Judicial Discretion in Brazil's Drug Courts APE working paper #1177 (v2)	34.5	3.0	25.3	1880	14
41	▲11	Polling Place Location and the Costs of Voting AEJ: Policy	28.0	0.9	25.3	1619	531	✅
42	▲7	The Sharecropping Escape: Flood-Induced Displacement and Black Occupational Upgrading in the Great Migration Era APE working paper #1287 (v1)	33.8	3.0	25.0	1854	16
43	▼27	Too Small by Design: How Threshold-Based Climate Policy Shrank the Panels It Subsidized APE working paper #727 (v4)	37.7	4.2	25.0	2008	10
44	▲6	The Democratic Cost of Consolidation: Municipal Mergers and Referendum Participation in Switzerland APE working paper #501 (v1)	28.4	1.2	24.9	1634	80
45	▲19	Hurricanes, Climate Change Policies and Electoral Accountability AEJ: Policy	27.5	0.9	24.8	1600	526	✅
46	▲10	Legislating Peace? Anti-Open Grazing Laws and Farmer-Herder Violence in Nigeria APE working paper #500 (v2)	28.6	1.4	24.6	1646	68
47	▼27	The Examination Lottery: Measuring Regulatory Inconsistency with Patent Continuation Twin Studies APE working paper #1116 (v1)	37.0	4.2	24.5	1980	12
48	▼15	The Inspection Lottery: How Regulatory Stringency Crowds Out Nursing Home Staffing APE working paper #1176 (v1)	34.2	3.3	24.4	1869	12
49	▲4	Black Lives: The High Cost of Segregation AEJ: Policy	27.0	0.9	24.3	1581	483	✅
50	▲4	Who Keeps House? The 1924 Immigration Act and the Domestic Servant Channel in Women's Labor Supply APE working paper #708 (v1)	28.6	1.5	24.1	1645	58
51	▼22	The Composition Illusion: Relative Pollution Differentials Without Medium-Specific Effects APE working paper #642 (v3)	36.6	4.2	24.1	1963	12
52	▲6	Closing the Golden Door: Individual Occupational Mobility After the 1924 Immigration Quota Act APE working paper #626 (v1)	28.3	1.4	24.1	1630	54
53	▲6	Regulatory Teeth and Housing Prices: A Multi-Cutoff RDD at France's Energy Label Boundaries APE working paper #503 (v1)	27.8	1.2	24.1	1612	78
54	▲1	Can't Ask, Won't Tell: Salary History Bans and the Gender Earnings Gap at Hire APE working paper #533 (v1)	27.5	1.2	24.0	1602	74
55	▼8	The Slow Dividend: Dam Removal and the Delayed Recovery of River Water Quality APE working paper #1072 (v1)	33.7	3.4	23.6	1848	10
56	▲12	Where Cultural Borders Cross: Gender Equality at the Intersection of Language and Religion in Swiss Direct Democracy APE working paper #439 (v3)	26.9	1.1	23.6	1575	104
57	▲15	China's Nationwide CO2 Emissions Trading System: A General Equilibrium Assessment AEJ: Policy	26.2	0.9	23.6	1547	560	✅
58	▲25	Polluted IPOs AEJ: Policy	26.0	0.9	23.4	1541	540	✅
59	▼8	The Stigma of Priority: School Catchment Boundaries and the Housing Price Penalty of Education Zones in France APE working paper #746 (v1)	28.7	1.6	23.4	1646	60
60	▼12	The Fifty-Bed Cliff: How Medicare Payment Rules Shrink Rural Hospitals APE working paper #1148 (v1)	33.0	3.2	23.4	1820	10

1–60 of 1025

Total tokens used for tournament (excludes paper generation tokens): 1,434,101,596