Description
[llm.router] quality_gate = 0.75 applies globally to all LLM calls routed through the main provider, including graph entity extraction. The quality gate measures cosine similarity between the query embedding and the response embedding. For graph extraction tasks (JSON input → JSON output with entities/edges), the structural dissimilarity between the extraction prompt and the structured JSON response produces systematically low scores (~0.55–0.70, below the 0.75 threshold), so the gate fires even when the extraction output is correct.
This causes all graph extraction LLM calls to fall back to the next provider on every turn, adding latency and unnecessary provider cycling even when the extraction result is correct.
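For illustration, here is a minimal, self-contained sketch of the kind of embedding-similarity check described above (all names and values are illustrative, not the router's actual internals):

```rust
// Illustrative sketch only; not the project's actual router code.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 { 0.0 } else { dot / (norm_a * norm_b) }
}

fn main() {
    // A prose prompt and a structured JSON response embed far apart,
    // so a 0.75 gate rejects the call even when extraction is correct.
    let prompt_emb = vec![0.9_f32, 0.1, 0.0];
    let json_response_emb = vec![0.4_f32, 0.6, 0.7];
    let score = cosine_similarity(&prompt_emb, &json_response_emb);
    println!("score={score:.2} passes={}", score >= 0.75);
}
```

The point is that a gate failure here reflects prompt/response shape, not extraction quality.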
Reproduction Steps
- Configure [llm.router] quality_gate = 0.75 (default in testing.toml)
- Run a multi-turn session with graph memory enabled ([memory.graph] enabled = true)
- Observe logs:
INFO memory.graph_extract: thompson_quality_fallback provider="openai" score=0.56 threshold=0.75
INFO memory.graph_extract: thompson_quality_fallback provider="openai" score=0.58 threshold=0.75
INFO memory.graph_extract: thompson_quality_fallback provider="openai" score=0.57 threshold=0.75
All extraction calls fail the gate regardless of provider. The pattern repeats across every turn.
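For reference, the relevant configuration excerpt (assumed shape, reconstructed from the keys quoted above; surrounding keys omitted):

```toml
# .local/config/testing.toml (relevant excerpt)
[llm.router]
quality_gate = 0.75   # applies globally, including graph extraction calls

[memory.graph]
enabled = true
```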
Expected Behavior
Graph extraction calls should bypass the quality gate, or the gate should only apply to conversational LLM calls. The quality gate is designed for coherence between user queries and assistant responses — not for structured JSON extraction tasks.
Actual Behavior
Every graph extraction LLM call logs thompson_quality_fallback with score ~0.55–0.70, below the 0.75 threshold. Since all providers fail the gate, the router returns the best-seen response on exhaustion (M2 path), adding unnecessary latency (all provider calls are made before returning).
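To make the latency cost concrete, here is a hedged, self-contained sketch of an exhaustion path of this shape (illustrative only, not the actual M2 implementation): every provider is called, each fails the gate, and only then is the best-seen response returned.

```rust
// Illustrative sketch of gate-driven exhaustion; names are assumptions.
fn route(providers: &[&str], score_of: impl Fn(&str) -> f32, gate: f32) -> (String, f32) {
    let mut best: Option<(String, f32)> = None;
    for p in providers {
        let score = score_of(p); // one full LLM call per provider
        if score >= gate {
            return (p.to_string(), score); // never reached for extraction calls
        }
        if best.as_ref().map_or(true, |(_, s)| score > *s) {
            best = Some((p.to_string(), score));
        }
    }
    best.expect("at least one provider")
}

fn main() {
    let scores = [("openai", 0.58_f32), ("anthropic", 0.62)];
    let lookup = |name: &str| scores.iter().find(|(n, _)| *n == name).unwrap().1;
    // All provider calls are made before the best-seen response is returned.
    let (provider, score) = route(&["openai", "anthropic"], lookup, 0.75);
    println!("best-seen: {provider} score={score}");
}
```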
Root Cause
spawn_graph_extraction uses self.provider.clone() — the main SemanticMemory provider. apply_routing_signals() in src/bootstrap/provider.rs:184 applies quality_gate globally to this provider. GraphConfig does not expose a separate extract_provider: ProviderName field (unlike ReasoningConfig and CompressionConfig which do), so there is no way to configure a provider without the quality gate for graph extraction.
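A minimal, self-contained sketch of the shape of the problem (all type and field names are illustrative stand-ins for the project's actual ones):

```rust
// Illustrative stand-ins; the real types live in the project.
#[derive(Clone)]
struct Provider {
    quality_gate: Option<f32>, // set globally by apply_routing_signals()
}

struct SemanticMemory {
    provider: Provider,
}

impl SemanticMemory {
    // Graph extraction clones the main provider, inheriting its gate;
    // GraphConfig offers no extract_provider to use instead.
    fn spawn_graph_extraction(&self) -> Provider {
        self.provider.clone()
    }
}

fn main() {
    let mem = SemanticMemory {
        provider: Provider { quality_gate: Some(0.75) },
    };
    let extract_provider = mem.spawn_graph_extraction();
    // The extraction provider carries the conversational gate with it.
    assert_eq!(extract_provider.quality_gate, Some(0.75));
}
```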
Suggested Fix
- Add extract_provider: ProviderName to GraphConfig (matching the pattern already used by [memory.reasoning] and [memory.compression]); see the sketch after this list
- Build this provider without quality_gate for graph extraction calls (since JSON extraction coherence is not measurable by response/query embedding similarity)
- Alternatively, add a per-call context label so the quality gate can be skipped for task-specific (non-conversational) LLM calls
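A hedged sketch of the first two bullets; the field name follows the ReasoningConfig/CompressionConfig pattern noted above, but everything else is an assumption, not the actual implementation:

```rust
// Hypothetical sketch; the real GraphConfig/ProviderName live in the project.
type ProviderName = String; // stand-in for the project's actual type

#[derive(Debug, Clone)]
struct GraphConfig {
    enabled: bool,
    // Proposed: dedicated provider for extraction calls, mirroring
    // ReasoningConfig/CompressionConfig, so the conversational quality
    // gate never applies to structured JSON output.
    extract_provider: Option<ProviderName>,
}

#[derive(Debug, Clone)]
struct Provider {
    name: ProviderName,
    quality_gate: Option<f32>,
}

// Build the extraction provider without a quality gate; fall back to the
// main provider's name when no extract_provider is configured.
fn build_extract_provider(cfg: &GraphConfig, main: &Provider) -> Provider {
    Provider {
        name: cfg.extract_provider.clone().unwrap_or_else(|| main.name.clone()),
        quality_gate: None, // never gate structured JSON extraction
    }
}

fn main() {
    let main = Provider { name: "openai".into(), quality_gate: Some(0.75) };
    let cfg = GraphConfig { enabled: true, extract_provider: None };
    let extract = build_extract_provider(&cfg, &main);
    assert_eq!(extract.quality_gate, None);
    println!("extract provider: {} gate={:?}", extract.name, extract.quality_gate);
}
```

In testing.toml this would presumably read as extract_provider = "..." under [memory.graph], mirroring the reasoning and compression sections.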
Environment
- Version: 0.20.1 (a030b2a)
- Config: .local/config/testing.toml
- Features: full
- Observed: CI-668
Logs / Evidence
All graph_extract calls during a 2-turn session:
score=0.5765, score=0.5877, score=0.5695, score=0.7035, score=0.6081, score=0.6620, score=0.5601, score=0.5653
All below threshold 0.75. No single call passes the gate.