To preserve or improve chain-of-thought (CoT) monitorability, we have to be able to measure it.
I'm excited to announce our new research on this at OpenAI
Modern reasoning models think in plain English.
Monitoring their thoughts could be a powerful, yet fragile, tool for overseeing future AI systems.
I and researchers across many organizations think we should work to evaluate, preserve, and even improve CoT monitorability.
Detecting misbehavior in frontier reasoning models
Chain-of-thought (CoT) reasoning models “think” in natural language understandable by humans. Monitoring their “thinking” has allowed us to detect misbehavior such as subverting tests in coding tasks, deceiving users, or giving
We need new technical breakthroughs to steer and control AI systems much smarter than us.
Our new Superalignment team aims to solve this problem within 4 years, and we’re dedicating 20% of the compute we've secured to date towards this problem.
Join us! openai.com/blog/introduci…
CoT monitoring is already useful!
In a recent OpenAI blog and paper, we showed that we could catch reward hacks in code via CoT monitoring.
Others have found they can catch early signals of misalignment, prompt injections, and evaluation awareness.
For hard enough tasks, models may have to reason out loud and be monitorable.
If actions that cause severe harm require complex reasoning, then this gives hope we could catch and stop them with CoT monitoring.
Like any safety layer, CoT monitoring has limitations.
- If CoT reasoning isn’t required, relying on a model’s propensity to reason out loud may not be enough for a strong safety case.
- More advanced, situationally aware models may be able to deliberately obfuscate their CoT
I am grateful to have worked closely with @tomekkorbak, @balesni, @rohinmshah and Vlad Mikulik on this paper, and I am very excited that researchers across many prominent AI institutions collaborated with us and came to consensus around this important direction.
One direction I'm excited to see more work on in the future is CoT monitoring as a potential scalable oversight method. In our work, we found that we could monitor a strong reasoning model (same class as o1 or o3-mini) with a weaker model (gpt-4o).
Furthermore, the existing CoT monitorability may be extremely fragile.
Higher-compute RL, alternative model architectures, certain forms of process supervision, etc. may all lead to models that obfuscate their thinking.
We argue that researchers should study
- How to evaluate monitorability
- When CoT monitoring can be relied on as a load-bearing safety measure
- How different pieces of the training stack effect monitorability
- How to construct better monitors
We tested a pre-release version of o3 and found that it frequently fabricates actions it never took, and then elaborately justifies these actions when confronted.
We were surprised, so we dug deeper 🔎🧵(1/)
x.com/OpenAI/status/…
I deeply regret my participation in the board's actions. I never intended to harm OpenAI. I love everything we've built together and I will do everything I can to reunite the company.