After generating the reports, you can visualize contamination examples with:
python visualize/visualize_search.py
This highlights the matched parts of the benchmark samples.
MMLU - An example of contamination visualization
Question: The economy is in a deep recession. Given this economic situation which of the following statements about monetary policy is accurate?
Matches:
| Page Name | Overlapping Text | Match Ratio | URL |
|---|---|---|---|
| AP Macroeconomics Question 445: Answer and Explanation - CrackAP.com | The economy is in a deep recession. Given this economic situation, which of the following statements about monetary policy is accurate? A. Expansionary policy would only worsen the recession. B. Expansionary policy greatly increases aggregate demand if investment is sensitive to changes in the interest rate. | 0.901 | Link |
| AP Macroeconomics Practice Test 21 - CrackAP.com | Given this economic situation, which of the following statements about monetary policy is accurate? A. Expansionary policy would only worsen the recession. B. Expansionary policy greatly increases aggregate demand if investment is sensitive to changes in the interest rate. | 0.615 | Link |
The visualization highlights the parts of the benchmark sample that overlap with the web pages.
See more contamination examples: MMLU here, and C-Eval here.
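To give a feel for what highlighting an overlap involves, here is a minimal sketch using Python's standard `difflib`. It is not the implementation in `visualize/visualize_search.py`: the helper name `highlight_overlap`, the `min_len` threshold, and the way the ratio is computed (fraction of sample characters covered by matched spans) are all assumptions made for illustration, so the numbers will not necessarily agree with the Match Ratio column above.

```python
from difflib import SequenceMatcher


def highlight_overlap(sample: str, page_text: str, min_len: int = 10):
    """Return (highlighted_sample, ratio) for a benchmark sample and page text.

    Hypothetical helper, not part of the repository. Spans of `sample` that
    also occur in `page_text` and are at least `min_len` characters long
    (min_len should be >= 1) are wrapped in ** markers. The ratio is the
    fraction of sample characters covered by those spans.
    """
    # autojunk=False disables difflib's heuristic that ignores frequent characters.
    matcher = SequenceMatcher(None, sample, page_text, autojunk=False)
    pieces = []
    covered = 0
    cursor = 0
    for block in matcher.get_matching_blocks():
        if block.size < min_len:
            continue  # skip short, likely coincidental matches and the final dummy block
        pieces.append(sample[cursor:block.a])  # unmatched text before the span
        pieces.append("**" + sample[block.a:block.a + block.size] + "**")
        covered += block.size
        cursor = block.a + block.size
    pieces.append(sample[cursor:])  # trailing unmatched text
    ratio = covered / len(sample) if sample else 0.0
    return "".join(pieces), ratio


if __name__ == "__main__":
    question = (
        "Given this economic situation which of the following statements "
        "about monetary policy is accurate?"
    )
    page = (
        "Given this economic situation, which of the following statements "
        "about monetary policy is accurate? A. Expansionary policy would "
        "only worsen the recession."
    )
    highlighted, ratio = highlight_overlap(question, page)
    print(highlighted)
    print(round(ratio, 3))
```

Character-level matching is used here only because it needs no tokenizer; a word-level or n-gram comparison would be an equally reasonable choice.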
You can generate a figure that visualizes the perplexity results with:
python visualize/visualize_perplexity.py
See an example perplexity analysis of the QA benchmarks (BoolQ, SQuAD, QuAD) here.
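For readers unfamiliar with the quantity being plotted, the sketch below shows the standard definition of perplexity as the exponential of the negative mean token log-probability. It is only an illustration of the formula: the function name `perplexity` and the assumption that per-token natural-log probabilities are already available are hypothetical, and the repository's scripts may compute and store the values differently.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence from per-token natural-log probabilities.

    Hypothetical helper: perplexity = exp(-mean(log p(token))).
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must contain at least one value")
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-mean_logprob)


if __name__ == "__main__":
    # Four tokens, each assigned probability 0.5 by a model.
    logprobs = [math.log(0.5)] * 4
    print(perplexity(logprobs))
```

A model that assigns higher probability to each token of a sample yields lower perplexity on that sample.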