Pinned
Ever wondered how likely your AI model is to misbehave?
We developed the *propensity lower bound* (PRBO), a variational lower bound on the probability of a model exhibiting a target (misaligned) behavior.
Is cutting off your finger a good way to fix writer’s block? Qwen-2.5 14B seems to think so! 🩸🩸🩸
We’re sharing an update on our investigator agents, which surface this pathological behavior and more using our new *propensity lower bound* 🔎










