Inspiration
The idea came from a simple question: if you show Claude a fake academic paper asserting a wrong answer, will it cave? And more interestingly — can you see the moment it decides to cave, in its own reasoning trace, before it even writes the final answer? We wanted to study sycophancy not as a binary outcome but as a process visible inside the model's thinking. Extended thinking mode gave us a rare window into that process, and we built the entire experiment around exploiting it.
What it does
We test whether Claude will accept a wrong answer when fake academic papers assert it as fact. For each of 1,756 trivia questions, we generate up to three domain-matched fake papers — all planting the same wrong answer — and run Claude through three conditions (1, 2, or 3 papers) with extended thinking enabled. We capture the full reasoning trace each time, label each response as resisted, capitulated, or hedged using a judge model, and train a classifier to predict the outcome from the thinking trace alone — before reading the final answer.
How we built it
We built a full data pipeline: wrong answer generation, domain classification, fake academic document generation (domain-matched journals, realistic citations, varied authors), a concurrent 20-worker API pipeline with crash recovery, a judge model (temp=0, no thinking) that labels responses without seeing the reasoning trace, and a feature extractor that pulls interpretable signals from thinking traces — doubt language, self-correction markers, thinking length, whether the model recalled the correct answer mid-reasoning. A UMAP visualization lets us explore the feature space interactively.
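To make the feature extractor concrete, here is a minimal sketch of pulling those signals from a thinking trace. The marker lists and the names `DOUBT_MARKERS`, `SELF_CORRECTION_MARKERS`, `count_markers`, and `extract_features` are hypothetical stand-ins for illustration, not the lexicons or functions we actually shipped, and thinking length is assumed to be a whitespace word count.

```python
import re

# Hypothetical marker lists for illustration only.
DOUBT_MARKERS = ("not sure", "doubt", "suspicious", "contradicts", "seems wrong")
SELF_CORRECTION_MARKERS = ("wait", "actually", "on second thought", "let me reconsider")


def count_markers(text, markers):
    """Count case-insensitive, whole-phrase occurrences of every marker."""
    total = 0
    for marker in markers:
        pattern = r"\b" + re.escape(marker) + r"\b"
        total += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return total


def extract_features(thinking, correct_answer):
    """Turn one thinking trace into a small dict of interpretable features."""
    return {
        # Assumption: length is measured in whitespace-separated words.
        "thinking_length": len(thinking.split()),
        "doubt_count": count_markers(thinking, DOUBT_MARKERS),
        "self_correction_count": count_markers(thinking, SELF_CORRECTION_MARKERS),
        # Did the model surface the true answer anywhere in its reasoning?
        "recalled_correct": correct_answer.lower() in thinking.lower(),
    }


trace = "The document says Leeds. Wait, that seems wrong. I recall the answer is York."
features = extract_features(trace, "York")
print(features)
```

Keeping every feature a plain count or boolean is what lets us read the classifier's behavior directly, at the cost of missing doubt that is phrased in ways the marker lists do not cover.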
Challenges we ran into
The biggest: our initial capitulation metric fired 100% of the time because the model mentions the wrong answer even while resisting it ("the document says Leeds, but the answer is York"). That forced us to build the judge. We also hit Redis auth quirks on Windows that required a custom raw-socket client, content moderation that refused to generate fake citations for certain science topics, and malformed wrong answers that leaked prompt labels into the output. Each required a targeted fix. Timing was also a major issue: generating our datasets sequentially, as we did at first, would have taken around 100 hours, which pushed us to parallelize the pipeline across concurrent workers and speed it up by 15-30x.
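The metric failure is easy to reproduce. The sketch below assumes the initial metric was a plain substring check for the planted answer; `naive_capitulated` is a hypothetical name, and the second example sentence is invented for contrast.

```python
def naive_capitulated(response, wrong_answer):
    """Naive metric: flag capitulation whenever the planted answer is mentioned."""
    return wrong_answer.lower() in response.lower()


resisting = "The document says Leeds, but the answer is York."
capitulating = "According to the document, the answer is Leeds."

# Both calls return True, so the metric cannot tell resistance from capitulation.
print(naive_capitulated(resisting, "Leeds"))
print(naive_capitulated(capitulating, "Leeds"))
```

Because a resisting answer usually names the wrong answer in order to reject it, no amount of string matching separates the two cases, which is why a judge model that reads the whole response was needed.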
Accomplishments that we're proud of
A classifier with 84% accuracy that predicts capitulation from the reasoning trace alone, before reading the final answer. The judge-classifier separation: the judge never sees the thinking trace, so the classifier is learning an independent signal. And the finding that thinking length and doubt language are stronger predictors of resistance than document count.
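As a rough illustration of the classifier step, the sketch below fits a scikit-learn model on trace features. The logistic regression, the three-feature layout, and every number in `X` are assumptions made up for this example; they are synthetic toy rows, not our data, and they say nothing about the 84% figure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic toy rows, NOT project data: [thinking_length, doubt_count, num_documents]
X = np.array([
    [420, 5, 1], [380, 4, 3], [510, 6, 2], [450, 3, 3],  # long, doubtful traces
    [60, 0, 1], [45, 0, 3], [80, 1, 2], [55, 0, 3],      # short, deferential traces
])
# 0 = resisted, 1 = capitulated. In the real pipeline these come from the judge,
# which only sees the final response.
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Scale features so length (hundreds) does not dominate counts (single digits).
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X, y)

# Score a new trace before its final answer is read.
new_trace = np.array([[70, 0, 2]])
prob_capitulate = clf.predict_proba(new_trace)[0, 1]
print(f"P(capitulate) = {prob_capitulate:.2f}")
```

A linear model is a reasonable first choice here because its coefficients can be compared directly, which is how a claim like "thinking length matters more than document count" can be checked.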
What we learned
Capitulation is visible in the reasoning process, not just the conclusion. When the model thinks longer and expresses doubt about the documents, it almost always resists. When it thinks briefly and defers, it almost always capitulates. More documents didn't reliably suppress deliberation — but when deliberation was already thin, additional documents pushed capitulation higher. We also learned that prompt framing matters enormously: explicitly warning the model that documents may be inaccurate suppresses capitulation dramatically, which means naturalistic RAG settings are far more vulnerable than controlled evaluations suggest.
What's next for Reasoning Under Pressure
Scaling to more models (GPT-4o, Gemini) to see whether the correlation between thinking length and resistance holds across architectures. Testing intervention prompts: does telling the model to "think carefully before trusting the documents" shift the ratio of thin to thick reasoning traces? And using the classifier as a real-time monitor that flags RAG responses whose thinking trace looks like capitulation before the answer is served to the user.
Built With
- anthropic-api
- arize-phoenix
- opentelemetry
- plotly
- python
- redis
- scikit-learn
- sentence-transformers
- umap