feat(hermes): add surface attribution evaluation suite #2692
Greptile Summary
This PR completes the Hermes RCA synthetic suite by adding a surface attribution evaluation track: scenario
Confidence Score: 5/5
Safe to merge; changes are additive test infrastructure with no modifications to production code paths. All production-facing changes follow established backend/tool patterns and are exercised by the new tests. The only gap is in tests/synthetic/hermes_rca/refresh_adapter_tuples.py: the
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Scenario 050 fixture loaded] --> B[run_suite.py: score_result]
B --> C{required_keywords / root_cause_category}
C -->|pass/fail| D[ScenarioScore]
A --> E[surface_scoring.py: score_surface_response]
E --> F[score_adapter_family]
E --> G[score_analog_identification]
E --> H[score_diagnostic_question]
F & G & H --> I[SurfaceScore: passed_dimensions >= 2]
I -. not wired into run_suite .-> D
J[make refresh-hermes-tuples] --> K{validate_tuples: field presence only}
K -->|missing semantic checks| L[false validated message]
N[pytest test_surface_adapter_tuples] --> O{family + analog_id validated}
O --> P[full semantic validation]
Reviews (6): Last reviewed commit: "feat(hermes): add surface attribution ev..."
f6cd1c2 to cdb4b61
0c1effc to 2c40550
2c40550 to af51dd0
@greptile review
Hi Anwesh, I only updated the README files for Part 5 in this PR. I’ll open separate PRs for the other parts.
nice work @cerencamkiran
🐸 Rebase? Handled. Conflicts? Squashed. CI? Vibing. @cerencamkiran touched the untouchable and lived. 🫡 👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome.

Fixes #1513
Summary
This PR completes the Hermes RCA synthetic suite by adding the final evaluation track: surface attribution.
Previous Hermes RCA scenarios focused on identifying failures within specific domains (providers, orchestration, memory, controls, runtime reliability). However, Hermes deployments increasingly combine multiple messaging adapters, LLM providers, execution backends, memory systems, and control layers. In those environments, investigations must first determine which subsystem family owns the failure before deeper RCA can begin.
This PR introduces a deterministic evaluation framework for that attribution step.
What was added
New synthetic scenario
Added:
050-surface-sprawl-unknown-adapter
The scenario evaluates whether an investigation can:
Adapter catalog evidence
Added a new Hermes evidence source:
hermes_adapter_catalog
and corresponding:
This allows attribution decisions to be grounded in an explicit catalog of registered Hermes surfaces rather than hard-coded assumptions.
Analog registry
Added:
analog_registry.py
The registry provides curated mappings across Parts 1–4 of the Hermes RCA suite.
Rather than treating surface attribution as an isolated task, evaluations can now compare failures against previously validated scenarios and verify attribution consistency over time.
Current coverage includes provider, runtime, orchestration, memory, and control-related failures.
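As a rough illustration of what a curated analog mapping could look like, here is a minimal sketch. It is not the contents of analog_registry.py: the `Analog` dataclass, the `ANALOG_REGISTRY` dict, both helper functions, and the scenario IDs are hypothetical names chosen for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Analog:
    scenario_id: str  # hypothetical ID of a previously validated scenario
    family: str       # subsystem family that owned the failure


# Hypothetical entries; the real registry may be shaped differently.
ANALOG_REGISTRY: dict[str, list[Analog]] = {
    "provider": [Analog("example-provider-timeout", "provider")],
    "memory": [Analog("example-memory-eviction", "memory")],
}


def analogs_for_family(family: str) -> list[Analog]:
    """Return curated analogs for a family, or an empty list if none exist."""
    return list(ANALOG_REGISTRY.get(family.strip().lower(), []))


def is_known_analog(scenario_id: str) -> bool:
    """Check whether a scenario ID appears anywhere in the registry."""
    return any(
        analog.scenario_id == scenario_id
        for entries in ANALOG_REGISTRY.values()
        for analog in entries
    )
```

Keying the mapping by family keeps the lookup deterministic and makes it straightforward to check that an attribution names an analog from the same family.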
Surface attribution scoring
Added:
surface_scoring.py
The scorer evaluates three independent dimensions: adapter family, analog identification, and diagnostic question.
A response must pass at least two of the three dimensions (passed_dimensions >= 2), which reduces false-positive successes.
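The pass rule can be sketched roughly as follows. The names `score_surface_response`, `SurfaceScore`, and `passed_dimensions` come from the review flowchart; the function signature and the substring-based checks are assumptions made for illustration and are likely simpler than the real scorer.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 2  # taken from the flowchart: passed_dimensions >= 2


@dataclass(frozen=True)
class SurfaceScore:
    adapter_family: bool
    analog_identification: bool
    diagnostic_question: bool

    @property
    def passed_dimensions(self) -> int:
        # Booleans sum as 0/1, giving the number of satisfied dimensions.
        return sum(
            (self.adapter_family, self.analog_identification, self.diagnostic_question)
        )

    @property
    def passed(self) -> bool:
        return self.passed_dimensions >= PASS_THRESHOLD


def score_surface_response(
    response: str, expected_family: str, expected_analog: str
) -> SurfaceScore:
    """Hypothetical signature; each check is a naive case-insensitive match."""
    text = response.lower()
    return SurfaceScore(
        adapter_family=expected_family.lower() in text,
        analog_identification=expected_analog.lower() in text,
        diagnostic_question="?" in text,  # assumption: a question mark counts
    )
```

Requiring two dimensions means a response that only happens to mention the right family name does not pass by itself.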
Adapter tuple corpus
Added a deterministic attribution corpus containing 23 adapter combinations spanning:
Each tuple maps to an expected family and analog scenario.
This provides repeatable attribution coverage without requiring external services.
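The review flowchart distinguishes field-presence checks from semantic validation of `family` and `analog_id`. A minimal sketch of the stricter check might look like this; the `adapters` field, the `validate_tuple` helper, and the exact family names are assumptions (the families are inferred from the coverage list above).

```python
from __future__ import annotations

# Assumed family names, inferred from the coverage list in this description.
KNOWN_FAMILIES = {"provider", "runtime", "orchestration", "memory", "control"}

# "family" and "analog_id" appear in the flowchart; "adapters" is hypothetical.
REQUIRED_FIELDS = ("adapters", "family", "analog_id")


def validate_tuple(entry: dict, known_analogs: set[str]) -> list[str]:
    """Return a list of validation errors; an empty list means the tuple is valid."""
    missing = [field for field in REQUIRED_FIELDS if field not in entry]
    if missing:
        # Presence check first: semantic checks need all fields to exist.
        return [f"missing fields: {', '.join(missing)}"]

    errors = []
    if entry["family"] not in KNOWN_FAMILIES:
        errors.append(f"unknown family: {entry['family']}")
    if entry["analog_id"] not in known_analogs:
        errors.append(f"unknown analog_id: {entry['analog_id']}")
    return errors
```

Running the same semantic check in both the refresh tooling and the test suite would keep a "validated" message from being printed for tuples that only have the right field names.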
Benchmark tooling
Added:
New commands:
Benchmark snapshots can also be generated through:
Validation and coverage
Added:
The suite now verifies that every registered Hermes failure mode has at least one synthetic scenario.
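A coverage check of this kind can be expressed as a set difference. The sketch below is illustrative only: `uncovered_failure_modes` and its input shapes are hypothetical, not the suite's actual helper.

```python
from __future__ import annotations


def uncovered_failure_modes(
    registered: set[str], scenarios: dict[str, str]
) -> list[str]:
    """Return registered failure modes that no synthetic scenario exercises.

    `scenarios` maps a scenario ID to the failure mode it covers
    (an assumed shape for this example).
    """
    covered = set(scenarios.values())
    return sorted(registered - covered)
```

A test can then assert that the returned list is empty, and the sorted output keeps failure messages stable between runs.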
Design goals
This implementation intentionally keeps attribution evaluation:
while still exercising reasoning patterns that are required in real multi-surface Hermes deployments.
The analog registry and tuple corpus are designed to be extensible as additional Hermes adapters and execution surfaces are added in future evaluation tracks.
Tests
Current result: