Beyond Linear Probes: Dynamic Safety Monitoring for Language Models

James Oldfield¹, Philip Torr², Ioannis Patras¹, Adel Bibi², Fazl Barez²,³,⁴

¹Queen Mary University of London, ²University of Oxford, ³WhiteBox, ⁴Martian

Monitoring the activations of large language models (LLMs) is an effective way to detect harmful requests before they lead to unsafe outputs. However, traditional safety monitors often require the same amount of compute for every query. This creates a trade-off: expensive monitors waste resources on easy inputs, while cheap ones risk missing subtle cases. We argue that safety monitors should be flexible: costs should rise only when inputs are difficult to assess, or when more compute is available. To achieve this, we introduce Truncated Polynomial Classifiers (TPCs), a natural extension of linear probes for dynamic activation monitoring. Our key insight is that polynomials can be trained and evaluated progressively, term by term. At test time, one can stop early for lightweight monitoring, or use more terms for stronger guardrails when needed. TPCs provide two modes of use. First, as a safety dial: by evaluating more terms, developers and regulators can "buy" stronger guardrails from the same model. Second, as an adaptive cascade: clear cases exit early after low-order checks, and higher-order guardrails are evaluated only for ambiguous inputs, reducing overall monitoring costs. On two large-scale safety datasets (WildGuardMix and BeaverTails), across four models with up to 30B parameters, we show that TPCs compete with or outperform MLP-based probe baselines of the same size, while being more interpretable than their black-box counterparts.
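To make the two modes of use concrete, here is a minimal pure-Python sketch of term-by-term evaluation with an early exit. It is an illustration under stated assumptions, not the code in model.py: the function names (`term`, `partial_logits`, `cascade`), the `margin` threshold, the toy parameters, and the low-rank symmetric form of each term are all hypothetical choices made for this example.

```python
import math

def dot(u, x):
    return sum(ui * xi for ui, xi in zip(u, x))

def term(k, factors, weights, x):
    # Assumed form of the degree-k term: sum_r weights[r] * (factors[r] . x) ** k
    return sum(w * dot(u, x) ** k for u, w in zip(factors, weights))

def partial_logits(x, bias, terms):
    # Yield (degree, running logit) after adding each successive term.
    logit = bias
    for k, (factors, weights) in enumerate(terms, start=1):
        logit += term(k, factors, weights, x)
        yield k, logit

def cascade(x, bias, terms, margin=0.4):
    # Stop at the first degree whose probability is at least `margin` from 0.5.
    # If no degree is confident enough, the highest-degree result is returned.
    for k, logit in partial_logits(x, bias, terms):
        p = 1.0 / (1.0 + math.exp(-logit))
        if abs(p - 0.5) >= margin:
            break
    return k, p

# Hypothetical toy parameters: one rank-1 linear term and one rank-1 quadratic term.
toy_terms = [
    ([[1.0, 0.0]], [1.0]),  # degree 1
    ([[0.0, 1.0]], [2.0]),  # degree 2
]
easy_input = [3.0, 0.0]   # confident after the linear term alone
hard_input = [0.1, 1.5]   # ambiguous until the quadratic term is added

easy_degree, easy_prob = cascade(easy_input, 0.0, toy_terms)
hard_degree, hard_prob = cascade(hard_input, 0.0, toy_terms)
```

In this sketch, the "safety dial" corresponds to reading `partial_logits` up to a fixed degree chosen in advance, while `cascade` chooses the degree per input.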

Overview

The codebase contains the following key files:

model.py contains the model definitions (for the TPC and the baselines)
train.py contains the training scripts
test_poly_forward.py contains unit tests that check the symmetric forward pass against the result obtained by materializing the full tensors
utils.py contains helper utilities
extract/* contains files to save intermediate activations to disk
sweep_monitors.py is the main script for reproducing the results
sweep.sh is an example script that trains all models and reproduces the results
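The kind of check described for test_poly_forward.py can be illustrated with a small standalone example. This is a hedged sketch rather than the repository's actual test: it assumes a degree-2 term stored as rank-one symmetric factors, and the helper names (`symmetric_quadratic`, `materialize`, `dense_quadratic`) and toy values are hypothetical.

```python
import math

def dot(u, x):
    return sum(ui * xi for ui, xi in zip(u, x))

def symmetric_quadratic(x, factors, weights):
    # Factored evaluation: sum_r weights[r] * (factors[r] . x) ** 2
    return sum(w * dot(u, x) ** 2 for u, w in zip(factors, weights))

def materialize(factors, weights):
    # Build the full matrix W = sum_r weights[r] * outer(factors[r], factors[r]).
    d = len(factors[0])
    W = [[0.0] * d for _ in range(d)]
    for u, w in zip(factors, weights):
        for i in range(d):
            for j in range(d):
                W[i][j] += w * u[i] * u[j]
    return W

def dense_quadratic(x, W):
    # Evaluate x^T W x with the materialized matrix.
    d = len(x)
    return sum(x[i] * W[i][j] * x[j] for i in range(d) for j in range(d))

# Hypothetical toy values for the comparison.
factors = [[1.0, 2.0], [0.5, -1.0]]
weights = [0.3, -0.7]
x = [2.0, -1.0]

fast = symmetric_quadratic(x, factors, weights)
slow = dense_quadratic(x, materialize(factors, weights))
assert math.isclose(fast, slow, abs_tol=1e-9)
```

The factored path never builds the d-by-d matrix, which is why a test comparing it against the materialized version is a useful correctness guard.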

Citation

If you find our work useful, please consider citing our paper:

@misc{oldfield2025tpc,
    title={Beyond Linear Probes: Dynamic Safety Monitoring for Language Models},
    author={James Oldfield and Philip Torr and Ioannis Patras and Adel Bibi and Fazl Barez},
    year={2025},
    eprint={2509.26238},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

Contact

Please feel free to get in touch at: jamesalexanderoldfield@gmail.com


Repository: james-oldfield/tpc
