How to Correctly Report LLM-as-a-Judge Evaluations

Chungpa Lee¹, Thomas Zeng², Jongwon Jeong², Jy‑yong Sohn¹, Kangwook Lee²,³

¹Yonsei University · ²University of Wisconsin–Madison · ³KRAFTON

Large language models (LLMs) are increasingly used as evaluators in lieu of humans. While scalable, their judgments are noisy because LLM judges have imperfect specificity and sensitivity, which biases the resulting accuracy estimates. Bias-correction methods exist, but they are underutilized in LLM research and typically assume exact knowledge of the judge's specificity and sensitivity. In practice only estimates of these values are available, and it is not well known how to construct confidence intervals properly from estimates alone. This work presents a simple plug-in framework that corrects this bias and constructs confidence intervals reflecting uncertainty from both the test and calibration datasets, enabling practical and statistically sound LLM-based evaluation. To further reduce uncertainty in the accuracy estimate, we introduce an adaptive algorithm that efficiently allocates calibration sample sizes.

Overview

Bias‑adjusted point estimate: theta = (p + q0 - 1) / (q0 + q1 - 1)
- Function: point_estimator(p, q0, q1) in llm_judge_reporting/calibration.py
Confidence interval: reflects uncertainty from both test (p) and calibration (q0, q1)
- Function: confidence_interval(p, q0, q1, n, m0, m1, alpha) in llm_judge_reporting/calibration.py
Calibration set allocation: distribute total budget m across m0 (specificity) and m1 (sensitivity)
- Function: allocate_calibration_sample(m, p, q0_pilot, q1_pilot, m_pilot, eps=1e-6) in llm_judge_reporting/allocation.py
Key Inputs:
- p: proportion judged “correct” on the test set, Pr(Predict = correct)
- q0: specificity, Pr(Predict = incorrect | True = incorrect)
- q1: sensitivity, Pr(Predict = correct | True = correct)
- n: test set size; m0, m1: calibration subset sizes for items whose true label is incorrect (m0) and correct (m1)
- The judge is better than random: q0 + q1 > 1 (otherwise the denominator q0 + q1 - 1 is zero or negative and the adjustment breaks down).
- p, q0, and q1 are proportions in [0, 1]; n, m0, and m1 are positive integers.
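
The adjustment formula comes from writing the judged-correct rate as a mixture over truly correct and truly incorrect items, p = theta * q1 + (1 - theta) * (1 - q0), and solving for theta. The following is a minimal sketch of that calculation, not the package's implementation: the names `estimate_inputs` and `bias_adjusted_accuracy` are hypothetical, and clipping the result to [0, 1] is an assumption of this sketch.

```python
from typing import Sequence, Tuple


def estimate_inputs(
    test_verdicts: Sequence[bool],
    calib_truth: Sequence[bool],
    calib_verdicts: Sequence[bool],
) -> Tuple[float, float, float]:
    """Hypothetical helper: estimate p, q0, q1 from raw labels (True = correct)."""
    p = sum(test_verdicts) / len(test_verdicts)
    # Judge verdicts on calibration items whose true label is incorrect / correct.
    on_incorrect = [v for t, v in zip(calib_truth, calib_verdicts) if not t]
    on_correct = [v for t, v in zip(calib_truth, calib_verdicts) if t]
    q0 = sum(not v for v in on_incorrect) / len(on_incorrect)  # specificity
    q1 = sum(on_correct) / len(on_correct)  # sensitivity
    return p, q0, q1


def bias_adjusted_accuracy(p: float, q0: float, q1: float) -> float:
    """Solve p = theta*q1 + (1 - theta)*(1 - q0) for theta."""
    denom = q0 + q1 - 1.0
    if denom <= 0.0:
        raise ValueError("requires q0 + q1 > 1 (judge better than random)")
    theta = (p + q0 - 1.0) / denom
    # Assumption of this sketch: clip to [0, 1], because sampling noise
    # can push the raw ratio outside the valid range for an accuracy.
    return min(1.0, max(0.0, theta))
```

With p = 0.4, q0 = 0.7, and q1 = 0.9, the adjusted value is 0.1 / 0.6, or about 0.167, so the raw judged rate of 0.4 overstates accuracy in this case because the judge accepts 30% of incorrect items.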

Install

From GitHub:

pip install "git+https://github.com/UW-Madison-Lee-Lab/LLM-judge-reporting.git"

From source (editable, for development):

git clone https://github.com/UW-Madison-Lee-Lab/LLM-judge-reporting.git
cd LLM-judge-reporting
pip install -e .

Usage

Point estimate and confidence interval:

from llm_judge_reporting import point_estimator, confidence_interval

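p = 0.4             # proportion of test items the judge marked correct
n = 1000            # test set size
q0, q1 = 0.7, 0.9   # estimated specificity and sensitivity
m0, m1 = 200, 200   # calibration items whose true label is incorrect / correct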

th_hat = point_estimator(p, q0, q1)
ci = confidence_interval(p, q0, q1, n, m0, m1, alpha=0.05)
print(f"theta_hat = {th_hat:.4f}")
print(f"95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
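
To see how uncertainty in p, q0, and q1 each feeds into the interval, the sketch below builds a plain Wald-type interval from a first-order (delta-method) variance, treating the three proportions as independent binomial estimates. This is an illustration under those assumptions and is not necessarily what `confidence_interval` computes; the function name `delta_method_ci` is hypothetical.

```python
import math
from statistics import NormalDist
from typing import Tuple


def delta_method_ci(
    p: float, q0: float, q1: float,
    n: int, m0: int, m1: int,
    alpha: float = 0.05,
) -> Tuple[float, float]:
    """Wald-type interval for theta = (p + q0 - 1) / (q0 + q1 - 1)."""
    d = q0 + q1 - 1.0
    if d <= 0.0:
        raise ValueError("requires q0 + q1 > 1")
    theta = (p + q0 - 1.0) / d
    # First-order variance with partial derivatives
    # d(theta)/dp = 1/d, d(theta)/dq0 = (1 - theta)/d, d(theta)/dq1 = -theta/d.
    var = (
        p * (1.0 - p) / n
        + (1.0 - theta) ** 2 * q0 * (1.0 - q0) / m0
        + theta ** 2 * q1 * (1.0 - q1) / m1
    ) / d ** 2
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half = z * math.sqrt(var)
    # Clip to the valid range for an accuracy.
    return max(0.0, theta - half), min(1.0, theta + half)
```

For the values in the usage example, the specificity term (the one divided by m0) is the largest of the three variance contributions, which is why the size of the calibration set matters as much as the size of the test set.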

Allocate calibration samples:

from llm_judge_reporting import allocate_calibration_sample

p = 0.4   # proportion judged correct on the test set
m = 200   # total calibration budget to split into m0 and m1

m0, m1 = allocate_calibration_sample(m, p, q0_pilot=0.7, q1_pilot=0.9, m_pilot=10)
print("allocate m0,m1:", m0, m1)
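
One way to reason about the split is to minimize the calibration part of a delta-method variance, a/m0 + b/m1 with a = (1 - theta)^2 * q0 * (1 - q0) and b = theta^2 * q1 * (1 - q1), subject to m0 + m1 = m. The minimizer has m0 : m1 = sqrt(a) : sqrt(b). The sketch below implements only that rule; it is an assumption-laden illustration, not the package's `allocate_calibration_sample`, which also takes a pilot size. The name `variance_minimizing_split` is hypothetical, and `eps` here is this sketch's own guard against zero weights.

```python
import math
from typing import Tuple


def variance_minimizing_split(
    m: int, p: float, q0: float, q1: float, eps: float = 1e-6
) -> Tuple[int, int]:
    """Split m >= 2 calibration labels into (m0, m1) under a delta-method variance."""
    d = q0 + q1 - 1.0
    if d <= 0.0:
        raise ValueError("requires q0 + q1 > 1")
    theta = min(1.0, max(0.0, (p + q0 - 1.0) / d))
    # Minimizing a/m0 + b/m1 subject to m0 + m1 = m gives
    # m0 : m1 = sqrt(a) : sqrt(b); eps keeps both weights positive.
    w0 = (1.0 - theta) * math.sqrt(q0 * (1.0 - q0)) + eps
    w1 = theta * math.sqrt(q1 * (1.0 - q1)) + eps
    m0 = round(m * w0 / (w0 + w1))
    m0 = min(m - 1, max(1, m0))  # keep at least one label on each side
    return m0, m - m0
```

Under this rule, a low adjusted accuracy shifts most of the budget toward m0, since errors in the specificity estimate then dominate the variance.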

Figures

Reproduce figures or further experiments:
- Figure 2 (bias and adjustment) can be regenerated without the notebook:
```
python -m run.figure2_bias_adjustment --output figures/figure2_bias_adjustment.png
```
- Figure 3 (CI length vs. calibration size) without the notebook:
```
python -m run.figure3_ci_length --output figures/figure3_ci_length.png
```
- Figure 4 (Monte Carlo simulation) without the notebook:
```
python -m run.figure4_monte_carlo --output figures/figure4_monte_carlo.png
```
Notebooks remain available for exploratory runs:
- figure2_bias and its adjustment.ipynb
- figure3_confidence–interval length across calibration size.ipynb
- figure4_Monte Carlo simulation.ipynb

Citation

@article{lee2025correctly,
  title         = {How to Correctly Report LLM-as-a-Judge Evaluations},
  author        = {Lee, Chungpa and Zeng, Thomas and Jeong, Jongwon and Sohn, Jy-yong and Lee, Kangwook},
  year          = {2025},
  eprint        = {2511.21140},
  archivePrefix = {arXiv}
}
