Methods
Obfuscation example. A problem before (left) and after (right) obfuscation with the simplified inductive reasoning steps needed for answering. For each obfuscation, a character mapping is sampled based on the ruleset, then applied to obfuscate the problem and the answer. Only the original problem can be solved with the aid of models' internalised knowledge.
LingOly-TOO extends the LingOly benchmark by applying reasoning-equivariant permutations to 82 Linguistics Olympiad problems from the UK Linguistics Olympiad (UKLO). These problems are self-contained, require no prior linguistic knowledge, and can be solved from context using general reasoning and pattern matching.
Experts crafted an obfuscation ruleset for each problem that permutes the orthography of the target language (Problemese) at the grapheme level. These permutations preserve the underlying linguistic mechanisms and solution logic, treating graphemes, morphemes, and phonological relationships as atomic units, while transforming the text into forms that cannot appear in any training corpus. Models therefore cannot rely on memorised knowledge or language familiarity, and must instead reason inductively from context.
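A minimal sketch of this obfuscation step, assuming a per-problem grapheme inventory and a sampled one-to-one mapping (function names and the example mapping are hypothetical, not the benchmark's actual implementation). Matching longer graphemes first keeps multi-character graphemes atomic:

```python
import random

def sample_mapping(graphemes, seed=0):
    """Sample a random permutation over a problem's grapheme inventory."""
    rng = random.Random(seed)
    targets = list(graphemes)
    rng.shuffle(targets)
    return dict(zip(graphemes, targets))

def obfuscate(text, mapping):
    """Apply the mapping, trying longer graphemes first so that
    multi-character graphemes (e.g. 'ng') are treated as atomic units."""
    keys = sorted(mapping, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for g in keys:
            if text.startswith(g, i):
                out.append(mapping[g])
                i += len(g)
                break
        else:
            out.append(text[i])  # characters outside the inventory pass through
            i += 1
    return "".join(out)

# Illustrative mapping: 'ng' is swapped as a single unit, not as 'n' + 'g'.
mapping = {"ng": "t", "a": "o", "t": "ng"}
print(obfuscate("tanga", mapping))  # -> "ngoto"
```

Because the same mapping is applied to the problem text and the answer key, the pattern-matching structure of the problem is unchanged; only its surface form moves out of the training distribution.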
We manually annotated 1,005 sub-question/answer pairs, removed metadata that could serve as shortcuts (language names, families, geographic clues), and generated up to 6 valid permutations per problem, yielding 6,995 sub-question/answer pairs in total. All annotations were validated by team members with Linguistics Olympiad expertise, and a sample was independently audited by two International Linguistics Olympiad medallists. A randomised controlled trial with 172 human participants confirmed only a modest 5.7% performance decrease from obfuscation.
Results
We evaluated fifteen reasoning and general-purpose models (both open-source and closed-source) on LingOly-TOO. We report two scores per model: Mog, based on original (unobfuscated) problems, and Mobf, based on obfuscated problems. The gap between these scores quantifies how much models rely on knowledge shortcuts rather than genuine reasoning.
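The gap between the two scores can be computed straightforwardly; this is a sketch assuming per-problem score lists (the function name is illustrative, not from the paper):

```python
from statistics import mean

def knowledge_gap(original_scores, obfuscated_scores):
    """Return (Mog, Mobf, gap): mean score on original problems, mean score
    on obfuscated variants, and their difference. A large gap indicates
    reliance on knowledge shortcuts rather than genuine reasoning."""
    m_og = mean(original_scores)
    m_obf = mean(obfuscated_scores)
    return m_og, m_obf, m_og - m_obf
```

For example, `knowledge_gap([1.0, 0.5], [0.5, 0.1])` yields Mog = 0.75, Mobf = 0.3, and a gap of 0.45.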
Reasoning performance on LingOly-TOO. Results without controlling for knowledge and memorisation overestimate reasoning abilities (light blue). Obfuscation mitigates this effect and offers improved estimates (dark blue).
Frontier models achieve around 0.60 on Mog, but no model exceeds 0.48 on Mobf. Reasoning models consistently outperform general-purpose models: GPT-5 outscores GPT-4.5, Claude 3.7 (thinking) outperforms Claude 3.7 (no thinking) at 0.43 vs. 0.30, and a larger inference-time budget helps (o3-mini high: 0.31 vs. o3-mini low: 0.12). At the hardest difficulty level (Round 2), even the best model scores only 0.31, showing the benchmark is far from saturated.
Key Contributions
- An unsaturated benchmark for frontier reasoning models. The top model, GPT-5, scores 48% overall and only 31% on the highest difficulty problems. No model exceeds 50%, confirming that multi-hop inductive reasoning remains an open challenge.
- A method to quantify knowledge effects. The gap between Mog and Mobf reveals reasoning shortcuts. Score inflation from knowledge correlates with language resourcedness—models perform disproportionately better on high-resource languages in the unobfuscated setting. Providing expert reasoning guidance bridges this gap, increasing mean scores from 0.66 to 0.76.
- A method for generating uncontaminated reasoning problems. Experiments with then-unpublished UKLO 2025 problems show that the performance drop from obfuscation persists, confirming effects are not solely due to training-set memorisation but reflect genuine reliance on language knowledge.
- Evidence for brittle reasoning in frontier reasoning models. Under a robust metric that takes the minimum score across all permutations of a problem, the best model (GPT-5) drops from 0.48 to 0.28, revealing high variance and inconsistency across permutations even in frontier reasoning models.
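A minimal sketch of such a min-over-permutations robust metric, assuming scores are grouped by problem (names hypothetical):

```python
from statistics import mean

def robust_score(scores_by_problem):
    """Take the minimum score across all permutations of each problem,
    then average over problems. A model that reasons consistently scores
    similarly on every permutation, so the minimum tracks the mean; a
    large drop from the mean-based score signals brittle reasoning."""
    return mean(min(perm_scores) for perm_scores in scores_by_problem.values())

scores = {"p1": [0.8, 0.4, 0.6], "p2": [0.6, 0.6, 0.6]}
print(robust_score(scores))  # -> 0.5  (mean of min(p1)=0.4 and min(p2)=0.6)
```

Under a mean-based aggregate the same model would score 0.6 on this toy data, so the 0.1 drop here mirrors, in miniature, the 0.48-to-0.28 drop reported for GPT-5.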