Problem
The per-server centroid in EmbeddingAnomalyGuard is updated with every output flagged as clean. A patient attacker can gradually shift the centroid toward malicious content by sending subtly crafted outputs that stay just below the anomaly threshold, eventually making injections appear normal.
Root cause
EmbeddingAnomalyGuard::update_centroid() accepts all non-anomalous outputs as clean training data. There is no verification that the content is actually clean — only that it was not flagged anomalous by the current (potentially already drifted) threshold.
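To make the drift concrete, here is a minimal sketch of the attack against an unguarded update path. `NaiveCentroid`, its fields, `observe`, and `poison_steps` are hypothetical stand-ins rather than the project's actual types. The sketch also assumes a Euclidean distance check and an exponential moving average update, which may differ from what `EmbeddingAnomalyGuard` really does.

```rust
/// Hypothetical stand-in for the unguarded update path (not the real guard).
pub struct NaiveCentroid {
    pub centroid: Vec<f32>,
    pub threshold: f32, // assumed > 0; samples at or beyond this distance are rejected
    pub alpha: f32,     // assumed fixed moving-average weight
}

impl NaiveCentroid {
    /// Euclidean distance from the current centroid (assumed metric).
    pub fn distance(&self, e: &[f32]) -> f32 {
        self.centroid
            .iter()
            .zip(e)
            .map(|(c, x)| (c - x) * (c - x))
            .sum::<f32>()
            .sqrt()
    }

    /// Accepts any sample under the threshold and folds it into the centroid.
    /// Returns true if the sample was accepted as "clean".
    pub fn observe(&mut self, e: &[f32]) -> bool {
        if self.distance(e) >= self.threshold {
            return false;
        }
        for (c, x) in self.centroid.iter_mut().zip(e) {
            *c += self.alpha * (*x - *c);
        }
        true
    }
}

/// Counts how many crafted samples are needed before `target` is accepted.
/// Each crafted sample sits at 90% of the threshold, in the direction of `target`.
pub fn poison_steps(guard: &mut NaiveCentroid, target: &[f32], max_steps: usize) -> Option<usize> {
    for step in 0..max_steps {
        if guard.observe(target) {
            return Some(step);
        }
        // observe() rejected the target, so its distance is >= threshold > 0.
        let k = 0.9 * guard.threshold / guard.distance(target);
        let crafted: Vec<f32> = guard
            .centroid
            .iter()
            .zip(target)
            .map(|(c, t)| c + k * (t - c))
            .collect();
        guard.observe(&crafted);
    }
    None
}
```

With a threshold of 1.0, `alpha` of 0.1, and a target five units from the starting centroid, the target is rejected at first and accepted after a few dozen crafted samples, none of which were ever flagged.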
Proposed fix
- Cap the centroid update rate: apply Bayesian weighting so that each new sample has diminishing influence once the centroid is stable
- Periodic centroid re-anchoring from a trusted baseline (e.g., system prompt embeddings)
- Optionally: route centroid updates through the response verifier before accepting as clean
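The first two items could be combined roughly as follows. This is a sketch under stated assumptions, not a proposed final implementation: `AnchoredCentroid`, `max_drift`, `reanchor`, and `pull` are hypothetical names, a running-mean weight of 1/(n+1) stands in for the Bayesian weighting, and the hard clamp around the baseline is an extra safeguard not listed above. The verifier routing from the third item is not shown.

```rust
/// Hypothetical centroid with a trusted anchor; not the existing guard type.
#[derive(Debug, Clone)]
pub struct AnchoredCentroid {
    centroid: Vec<f32>,
    baseline: Vec<f32>, // trusted anchor, e.g. system prompt embeddings
    samples: u32,
    max_drift: f32, // assumed knob: largest allowed distance from the baseline
}

impl AnchoredCentroid {
    pub fn new(baseline: Vec<f32>, max_drift: f32) -> Self {
        Self { centroid: baseline.clone(), baseline, samples: 0, max_drift }
    }

    /// Running-mean update: the weight 1/(n+1) shrinks as samples accumulate,
    /// so late samples move the centroid less. The baseline counts as one prior sample.
    pub fn update(&mut self, embedding: &[f32]) {
        assert_eq!(embedding.len(), self.centroid.len(), "dimension mismatch");
        self.samples = self.samples.saturating_add(1);
        let alpha = 1.0 / (self.samples as f32 + 1.0);
        for (c, e) in self.centroid.iter_mut().zip(embedding) {
            *c += alpha * (*e - *c);
        }
        self.clamp_to_baseline();
    }

    /// Periodic re-anchoring: blend the centroid back toward the baseline.
    /// `pull` = 1.0 resets it completely; 0.0 leaves it unchanged.
    pub fn reanchor(&mut self, pull: f32) {
        let pull = pull.clamp(0.0, 1.0);
        for (c, b) in self.centroid.iter_mut().zip(&self.baseline) {
            *c += pull * (*b - *c);
        }
    }

    /// Euclidean distance between the centroid and the trusted baseline.
    pub fn drift(&self) -> f32 {
        self.centroid
            .iter()
            .zip(&self.baseline)
            .map(|(c, b)| (c - b) * (c - b))
            .sum::<f32>()
            .sqrt()
    }

    /// Projects the centroid back onto the ball of radius `max_drift` around the baseline.
    fn clamp_to_baseline(&mut self) {
        let d = self.drift();
        if d > self.max_drift && d > 0.0 {
            let scale = self.max_drift / d;
            for (c, b) in self.centroid.iter_mut().zip(&self.baseline) {
                *c = *b + scale * (*c - *b);
            }
        }
    }

    pub fn centroid(&self) -> &[f32] {
        &self.centroid
    }
}
```

The clamp bounds how far a patient attacker can move the centroid regardless of how many samples they send, while `reanchor` can be called on a timer or every N updates to undo slow drift. Both `max_drift` and the re-anchoring schedule would need tuning against real per-server traffic.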
Priority
P2 — affects the long-term reliability of the embedding anomaly guard but requires sustained attacker access.
Related: PR #2310