fix(mcp): embedding guard centroid drifts toward attacker content over time (boiling frog) #2311

## Problem

The per-server centroid in `EmbeddingAnomalyGuard` is updated with every output classified as clean. A patient attacker can shift the centroid toward malicious content by sending a sequence of subtly crafted outputs, each staying just below the anomaly threshold, until injection payloads eventually score as normal.
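To make the attack concrete, here is a minimal simulation sketch. It does not use the project's code: `ToyGuard`, its exponential-moving-average update, the Euclidean distance metric, and every parameter value are assumptions chosen for illustration (the real guard may well use cosine distance and a different update rule).

```rust
/// Hypothetical stand-in for the guard: a centroid updated by an
/// exponential moving average, with a fixed Euclidean distance threshold.
struct ToyGuard {
    centroid: Vec<f32>,
    threshold: f32,
    alpha: f32, // EMA weight given to each accepted sample (assumed)
}

/// Euclidean distance between two equal-length vectors.
fn distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>().sqrt()
}

impl ToyGuard {
    /// Returns true if the sample is anomalous. Otherwise the sample is
    /// treated as clean and folded into the centroid, which is the flaw.
    fn observe(&mut self, sample: &[f32]) -> bool {
        if distance(&self.centroid, sample) > self.threshold {
            return true;
        }
        for (c, s) in self.centroid.iter_mut().zip(sample) {
            *c += self.alpha * (s - *c);
        }
        false
    }
}

/// The attacker repeatedly sends a probe at 90% of the threshold distance,
/// aimed at `target`. Returns how many probes were accepted before `target`
/// itself falls inside the threshold, or None if that never happens.
fn samples_until_target_accepted(
    guard: &mut ToyGuard,
    target: &[f32],
    max_steps: usize,
) -> Option<usize> {
    for step in 0..max_steps {
        let d = distance(&guard.centroid, target);
        if d <= guard.threshold {
            return Some(step);
        }
        let scale = 0.9 * guard.threshold / d;
        let probe: Vec<f32> = guard
            .centroid
            .iter()
            .zip(target)
            .map(|(c, t)| c + scale * (t - c))
            .collect();
        if guard.observe(&probe) {
            return None; // a probe was flagged; the attack stalls
        }
    }
    None
}
```

In this toy model, a target ten threshold-widths away from the initial centroid is rejected when sent directly, yet becomes acceptable after a bounded number of sub-threshold probes, none of which is ever flagged.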

## Root cause

`EmbeddingAnomalyGuard::update_centroid()` accepts every non-anomalous output as clean training data. Nothing verifies that the content is actually clean, only that it was not flagged as anomalous relative to the current (potentially already drifted) centroid.

## Proposed fix

- Cap the centroid update rate: apply Bayesian weighting so that new samples have diminishing influence once the centroid is stable
- Re-anchor the centroid periodically from a trusted baseline (e.g., system prompt embeddings)
- Optionally: route centroid updates through the response verifier before accepting an output as clean
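The three proposals above could be combined roughly as follows. This is a sketch under stated assumptions, not the project's API: `AnchoredCentroid`, its fields, and the `verified` callback are hypothetical names. "Bayesian weighting" is interpreted here as a running mean with a pseudo-count prior on the baseline, and the hard `max_drift` cap is an extra safeguard added by this sketch rather than something the issue proposes.

```rust
/// Euclidean distance between two equal-length vectors.
fn distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>().sqrt()
}

/// Hypothetical hardened centroid (illustrative names and parameters).
struct AnchoredCentroid {
    baseline: Vec<f32>, // trusted anchor, e.g. a system prompt embedding
    centroid: Vec<f32>,
    samples: u32,      // number of accepted samples so far
    prior_weight: f32, // pseudo-count assigned to the baseline
    max_drift: f32,    // hard cap on distance from the baseline (assumed)
}

impl AnchoredCentroid {
    fn new(baseline: Vec<f32>, prior_weight: f32, max_drift: f32) -> Self {
        Self { centroid: baseline.clone(), baseline, samples: 0, prior_weight, max_drift }
    }

    /// Folds `sample` into the centroid only if `verified` approves it.
    /// Each accepted sample gets weight 1 / (prior_weight + n), so later
    /// samples move the centroid less. Returns whether the sample was used.
    fn update(&mut self, sample: &[f32], verified: impl Fn(&[f32]) -> bool) -> bool {
        if !verified(sample) {
            return false;
        }
        self.samples += 1;
        let w = 1.0 / (self.prior_weight + self.samples as f32);
        for (c, s) in self.centroid.iter_mut().zip(sample) {
            *c += w * (s - *c);
        }
        self.clamp_to_baseline();
        true
    }

    /// Projects the centroid back onto the ball of radius `max_drift`
    /// around the baseline if it has moved further than that.
    fn clamp_to_baseline(&mut self) {
        let d = distance(&self.centroid, &self.baseline);
        if d > self.max_drift {
            let scale = self.max_drift / d;
            for (c, b) in self.centroid.iter_mut().zip(&self.baseline) {
                *c = b + scale * (*c - b);
            }
        }
    }

    /// Periodic re-anchoring: pulls the centroid a fraction `strength`
    /// (0.0 to 1.0) of the way back toward the baseline.
    fn reanchor(&mut self, strength: f32) {
        for (c, b) in self.centroid.iter_mut().zip(&self.baseline) {
            *c += strength * (b - *c);
        }
    }
}
```

One design note: weights of the form 1 / (prior + n) shrink, but their sum still grows without bound, so diminishing weights alone only slow a sufficiently patient attacker. That is why the sketch pairs them with a hard drift cap and re-anchoring.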

## Priority

P2: affects the long-term reliability of the embedding anomaly guard but requires sustained attacker access.

Related: PR #2310
