feat(hermes): add surface attribution evaluation suite #2692

Merged: muddlebee merged 2 commits into Tracer-Cloud:main from cerencamkiran:feat/hermes-surface-attribution-eval on Jun 5, 2026

cerencamkiran (Collaborator) commented Jun 1, 2026

Fixes #1513

Summary

This PR completes the Hermes RCA synthetic suite by adding the final evaluation track: surface attribution.

Previous Hermes RCA scenarios focused on identifying failures within specific domains (providers, orchestration, memory, controls, runtime reliability). However, Hermes deployments increasingly combine multiple messaging adapters, LLM providers, execution backends, memory systems, and control layers. In those environments, investigations must first determine which subsystem family owns the failure before deeper RCA can begin.

This PR introduces a deterministic evaluation framework for that attribution step.

What was added

New synthetic scenario

Added:

  • 050-surface-sprawl-unknown-adapter

The scenario evaluates whether an investigation can:

  • identify the failing surface family
  • attribute an unknown adapter to the closest known subsystem
  • select the closest historical analog from prior Hermes RCA scenarios
  • generate a targeted diagnostic follow-up question

Adapter catalog evidence

Added a new Hermes evidence source:

  • hermes_adapter_catalog

and corresponding:

  • schema validation
  • scenario loading support
  • mock backend support
  • investigation tool wiring

This allows attribution decisions to be grounded in an explicit catalog of registered Hermes surfaces rather than hard-coded assumptions.

Analog registry

Added:

  • analog_registry.py

The registry provides curated mappings across Parts 1–4 of the Hermes RCA suite.

Rather than treating surface attribution as an isolated task, evaluations can now compare failures against previously validated scenarios and verify attribution consistency over time.

Current coverage includes provider, runtime, orchestration, memory, and control-related failures.
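
For readers who want a concrete picture, here is a minimal sketch of what a curated analog registry with lookup helpers could look like. The entry fields, the helper names, and every scenario ID below are placeholders chosen for illustration; they are not taken from analog_registry.py.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AnalogEntry:
    """One curated analog: a prior scenario and the failure family it belongs to."""

    scenario_id: str  # placeholder IDs below, not real suite scenario IDs
    family: str
    summary: str


# Hypothetical entries for illustration only.
ANALOG_REGISTRY = (
    AnalogEntry("example-provider-timeout", "provider", "LLM provider stops responding"),
    AnalogEntry("example-orchestration-deadlock", "orchestration", "workers wait on each other"),
    AnalogEntry("example-memory-stale-read", "memory", "recall returns outdated context"),
    AnalogEntry("example-control-policy-block", "control", "policy layer rejects a valid action"),
)


def get_analog(scenario_id: str) -> Optional[AnalogEntry]:
    """Return the entry with this scenario ID, or None if it is not registered."""
    for entry in ANALOG_REGISTRY:
        if entry.scenario_id == scenario_id:
            return entry
    return None


def analogs_for_family(family: str) -> list:
    """Return every registered analog that belongs to the given family."""
    return [entry for entry in ANALOG_REGISTRY if entry.family == family]


def duplicate_ids(registry) -> list:
    """Return scenario IDs that appear more than once (should be empty)."""
    counts = Counter(entry.scenario_id for entry in registry)
    return sorted(sid for sid, n in counts.items() if n > 1)
```

Freezing the dataclass keeps entries immutable, so a test can treat the registry as a fixed reference when checking attribution consistency across runs.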

Surface attribution scoring

Added:

  • surface_scoring.py

The scorer evaluates three independent dimensions:

  1. Correct surface-family attribution
  2. Correct analog selection
  3. Quality of the diagnostic follow-up question

A response must satisfy at least two of the three dimensions to pass, reducing false-positive success cases.
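
The three dimensions and the pass rule can be sketched roughly as follows. This is an illustrative approximation, not the code in surface_scoring.py: the function names and the two-of-three threshold follow the review flowchart in this thread, while the alias table, the keyword-based question check, and the sample strings are assumptions.

```python
import re
from dataclasses import dataclass

# Hypothetical alias table; the real scorer's aliases are not shown in this thread.
FAMILY_ALIASES = {
    "provider": ("provider", "llm backend"),
    "memory": ("memory", "recall store"),
}


@dataclass(frozen=True)
class SurfaceScore:
    family_ok: bool
    analog_ok: bool
    question_ok: bool

    @property
    def passed_dimensions(self) -> int:
        return sum((self.family_ok, self.analog_ok, self.question_ok))

    @property
    def passed(self) -> bool:
        # Two of three dimensions must hold for the response to pass.
        return self.passed_dimensions >= 2


def score_adapter_family(response: str, expected_family: str) -> bool:
    """True if the response names the expected family or one of its aliases."""
    text = response.lower()
    aliases = FAMILY_ALIASES.get(expected_family, (expected_family,))
    return any(alias in text for alias in aliases)


def score_analog_identification(response: str, expected_analog: str) -> bool:
    """True if the response cites the expected analog scenario ID."""
    return expected_analog.lower() in response.lower()


def score_diagnostic_question(response: str, keywords) -> bool:
    """True if any question in the response mentions any expected keyword."""
    questions = re.findall(r"[^.?!\n]*\?", response)
    return any(any(k in q.lower() for k in keywords) for q in questions)


def score_surface_response(response, expected_family, expected_analog, keywords) -> SurfaceScore:
    return SurfaceScore(
        family_ok=score_adapter_family(response, expected_family),
        analog_ok=score_analog_identification(response, expected_analog),
        question_ok=score_diagnostic_question(response, keywords),
    )
```

Scoring each dimension separately means a response that guesses the right family but cites no analog and asks no useful question still fails.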

Adapter tuple corpus

Added a deterministic attribution corpus containing 23 adapter combinations spanning:

  • messaging adapters
  • LLM providers
  • execution backends
  • orchestration systems
  • memory systems
  • control layers

Each tuple maps to an expected family and analog scenario.

This provides repeatable attribution coverage without requiring external services.
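
As an illustration of the corpus shape, the sketch below models one tuple as six surface choices plus the expected family and analog. The field names, the sample values, and the helper functions are hypothetical; the actual corpus format is not shown in this description.

```python
from collections import Counter
from typing import NamedTuple, Optional


class AdapterTuple(NamedTuple):
    """One corpus row: six surface choices plus the expected attribution."""

    messaging: str
    llm_provider: str
    execution_backend: str
    orchestration: str
    memory: str
    control: str
    expected_family: str
    expected_analog: str


def surface_key(entry: AdapterTuple) -> tuple:
    """The six surface fields identify a combination; expectations are excluded."""
    return tuple(entry[:6])


# Hypothetical rows; names are placeholders, not real adapters or scenario IDs.
CORPUS = (
    AdapterTuple("chat-a", "llm-x", "local-shell", "queue-1", "vector-m", "policy-p",
                 "provider", "example-provider-timeout"),
    AdapterTuple("chat-b", "llm-x", "container", "queue-1", "vector-m", "policy-p",
                 "memory", "example-memory-stale-read"),
)


def expected_for(corpus, key: tuple) -> Optional[tuple]:
    """Look up (family, analog) for a surface combination, or None if absent."""
    for entry in corpus:
        if surface_key(entry) == key:
            return (entry.expected_family, entry.expected_analog)
    return None


def duplicate_keys(corpus) -> list:
    """Combinations listed more than once would make expectations ambiguous."""
    counts = Counter(surface_key(entry) for entry in corpus)
    return [key for key, n in counts.items() if n > 1]
```

Because the lookup is a pure function over static data, the same input always yields the same expected attribution, which is what makes the corpus usable offline and in CI.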

Benchmark tooling

Added:

  • benchmark history snapshots
  • benchmark report generation
  • tuple refresh utility
  • Makefile integration

New commands:

make test-hermes-synthetic-only
make refresh-hermes-tuples

Benchmark snapshots can also be generated through:

python -m tests.synthetic.hermes_rca.run_suite --offline-only --write-history

Validation and coverage

Added:

  • coverage-health validation
  • tuple corpus validation
  • benchmark-history validation
  • surface-scoring validation

The suite now verifies that every registered Hermes failure mode has at least one synthetic scenario.
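
A coverage-health check of this kind can be sketched as a set difference between registered failure modes and the modes that scenarios claim to cover. The function name, the `failure_mode` key, and the sample data are assumptions for illustration, not the suite's actual structures.

```python
def uncovered_failure_modes(registered, scenarios) -> list:
    """Return registered failure modes that no synthetic scenario covers."""
    covered = {scenario["failure_mode"] for scenario in scenarios}
    return sorted(set(registered) - covered)


# Hypothetical data: two registered modes, only one has a scenario.
REGISTERED = ["provider_timeout", "memory_stale_read"]
SCENARIOS = [{"id": "example-provider-timeout", "failure_mode": "provider_timeout"}]

missing = uncovered_failure_modes(REGISTERED, SCENARIOS)
if missing:
    print("failure modes without a scenario:", ", ".join(missing))
```

A test built on this would simply assert that the returned list is empty.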

Design goals

This implementation intentionally keeps attribution evaluation:

  • deterministic
  • offline-runnable
  • provider-independent
  • CI-friendly

while still exercising reasoning patterns that are required in real multi-surface Hermes deployments.

The analog registry and tuple corpus are designed to be extensible as additional Hermes adapters and execution surfaces are added in future evaluation tracks.

Tests

python -m ruff check tests/synthetic/hermes_rca app/tools/HermesSessionEvidenceTool tests/synthetic/mock_hermes_backend

python -m pytest tests/synthetic/hermes_rca -q

Current result:

25 passed
[Screenshots attached: 2026-06-01 195711, 2026-06-01 180753, 2026-06-01 180810]

github-actions (Bot, Contributor) commented Jun 1, 2026

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

greptile-apps (Bot, Contributor) commented Jun 1, 2026

Greptile Summary

This PR completes the Hermes RCA synthetic suite by adding a surface attribution evaluation track — scenario 050-surface-sprawl-unknown-adapter — along with the supporting analog registry, three-dimensional scorer, adapter tuple corpus, benchmark history tooling, and a new hermes_adapter_catalog evidence source wired through the schema, loader, mock backend, and tool layer.

  • New evaluation framework: analog_registry.py maps 21 prior scenarios across five failure families; surface_scoring.py scores family attribution, analog identification, and diagnostic question quality independently; a tuple corpus of 23 adapter combinations provides deterministic, offline-runnable coverage.
  • Tooling additions: --write-history flag in run_suite.py with a correctly isolated test (uses tmp_path + monkeypatch), benchmark_report.py for reading the latest snapshot, and make refresh-hermes-tuples / make test-hermes-synthetic-only Makefile targets.
  • Schema / backend plumbing: hermes_adapter_catalog added as evidence source, trajectory action, TypedDict, validator, and mock backend method, following the existing pattern throughout.

Confidence Score: 5/5

Safe to merge; changes are additive test infrastructure with no modifications to production code paths.

All production-facing changes follow established backend/tool patterns and are exercised by the new tests. The only gap is in refresh_adapter_tuples.py, which prints "validated" without checking family names or analog IDs against their registries. This affects only the developer-facing Makefile utility, not CI correctness or any runtime path.

tests/synthetic/hermes_rca/refresh_adapter_tuples.py — the validate_tuples function should mirror the semantic checks in test_surface_adapter_tuples_reference_known_families_and_analogs.
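
One way to close this gap is sketched below: a validator that checks field presence first and then checks family names and analog IDs against known sets. The signature, the field names, and the error format are assumptions; the real validate_tuples and the registries it would consult are not shown in this thread.

```python
REQUIRED_FIELDS = ("adapter", "expected_family", "expected_analog")


def validate_tuples(tuples, known_families, known_analog_ids) -> list:
    """Return a list of error strings; an empty list means the corpus is valid."""
    errors = []
    for index, entry in enumerate(tuples):
        # Structural check: every required field is a non-empty string.
        missing = [
            field for field in REQUIRED_FIELDS
            if not isinstance(entry.get(field), str) or not entry[field]
        ]
        if missing:
            errors.append(f"tuple {index}: missing or non-string fields {missing}")
            continue
        # Semantic checks: values must exist in the reference registries.
        if entry["expected_family"] not in known_families:
            errors.append(f"tuple {index}: unknown family {entry['expected_family']!r}")
        if entry["expected_analog"] not in known_analog_ids:
            errors.append(f"tuple {index}: unknown analog {entry['expected_analog']!r}")
    return errors
```

With this shape, the refresh utility would print "validated" only when the returned list is empty, so the Makefile target and the pytest check agree.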

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/synthetic/hermes_rca/refresh_adapter_tuples.py | Validates required field presence and type but skips semantic checks (valid family names and known analog IDs) that the test suite enforces, creating a misleading "validated" signal from make refresh-hermes-tuples. |
| tests/synthetic/hermes_rca/surface_scoring.py | New three-dimension scorer (family, analog, diagnostic question) with reasonable alias matching; score_diagnostic_question correctly handles multi-question outputs using any(). |
| tests/synthetic/hermes_rca/run_suite.py | Adds --write-history flag and _write_history_snapshot; test correctly patches HISTORY_DIR via monkeypatch to avoid leaking files. |
| tests/synthetic/hermes_rca/analog_registry.py | Frozen dataclass registry covering Parts 1-4 scenario IDs with no duplicates; lookup helpers are straightforward and well-tested. |
| tests/synthetic/hermes_rca/hermes_schemas.py | Correctly adds hermes_adapter_catalog evidence source, trajectory action, TypedDict, and validator following the existing pattern. |
| app/tools/HermesSessionEvidenceTool/__init__.py | Adds get_hermes_adapter_catalog tool following existing _backend_or_error pattern; correctly listed in __all__ and _TOOLS_WITHOUT_DELIBERATE_CATCH. |
| tests/e2e/hermes/meta/test_surface_sprawl.py | Parametrized e2e test scoring a deterministically constructed synthetic response against all 23 adapter tuples; marked @pytest.mark.e2e so it only runs on schedule/dispatch. |
| tests/synthetic/hermes_rca/test_benchmark_history.py | Correctly patches HISTORY_DIR with tmp_path + monkeypatch, so no real benchmark files are written during CI or local pytest runs. |
| .github/workflows/hermes-tests.yml | New workflow correctly gates e2e tests behind schedule/workflow_dispatch to avoid provider charges on every PR. |
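
To illustrate the history-snapshot isolation noted for run_suite.py and test_benchmark_history.py, here is a rough stdlib-only sketch. It passes the target directory as a parameter instead of patching a module-level HISTORY_DIR, which is a simplification; the function names, file naming scheme, and payload layout are assumptions rather than the PR's actual implementation.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def write_history_snapshot(results: dict, history_dir: Path) -> Path:
    """Write one timestamped JSON snapshot into history_dir and return its path."""
    history_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    path = history_dir / f"benchmark-{stamp}.json"
    payload = {"written_at": stamp, "results": results}
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path


def load_latest_snapshot(history_dir: Path) -> Optional[dict]:
    """Return the newest snapshot payload, or None if no snapshot exists."""
    # Timestamped names sort chronologically, so the last one is the latest.
    snapshots = sorted(history_dir.glob("benchmark-*.json"))
    if not snapshots:
        return None
    return json.loads(snapshots[-1].read_text(encoding="utf-8"))


if __name__ == "__main__":
    # A temporary directory keeps the demo from leaving files in the repository.
    with tempfile.TemporaryDirectory() as tmp:
        written = write_history_snapshot({"passed": 3, "failed": 0}, Path(tmp))
        print(written.name, load_latest_snapshot(Path(tmp))["results"])
```

Pointing the writer at a throwaway directory has the same effect as the tmp_path plus monkeypatch approach: tests exercise the real write path without touching the committed benchmark history.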

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Scenario 050 fixture loaded] --> B[run_suite.py: score_result]
    B --> C{required_keywords / root_cause_category}
    C -->|pass/fail| D[ScenarioScore]
    A --> E[surface_scoring.py: score_surface_response]
    E --> F[score_adapter_family]
    E --> G[score_analog_identification]
    E --> H[score_diagnostic_question]
    F & G & H --> I[SurfaceScore: passed_dimensions >= 2]
    I -. not wired into run_suite .-> D
    J[make refresh-hermes-tuples] --> K{validate_tuples: field presence only}
    K -->|missing semantic checks| L[false validated message]
    N[pytest test_surface_adapter_tuples] --> O{family + analog_id validated}
    O --> P[full semantic validation]

Reviews (6): Last reviewed commit: "feat(hermes): add surface attribution ev..." | Re-trigger Greptile

Comment thread tests/synthetic/hermes_rca/test_benchmark_history.py Outdated
Comment thread tests/synthetic/hermes_rca/surface_scoring.py
cerencamkiran force-pushed the feat/hermes-surface-attribution-eval branch from f6cd1c2 to cdb4b61 (June 1, 2026 14:56)
cerencamkiran marked this pull request as draft (June 1, 2026 14:57)
cerencamkiran force-pushed the feat/hermes-surface-attribution-eval branch 4 times, most recently from 0c1effc to 2c40550 (June 1, 2026 17:17)
cerencamkiran force-pushed the feat/hermes-surface-attribution-eval branch from 2c40550 to af51dd0 (June 1, 2026 17:33)
cerencamkiran (Collaborator, Author) commented:

@greptile review

cerencamkiran marked this pull request as ready for review (June 1, 2026 17:43)

cerencamkiran (Collaborator, Author) commented:

Hi Anwesh, I only updated the README files for Part 5 in this PR. I'll open separate PRs for the other parts.

muddlebee merged commit bafc44c into Tracer-Cloud:main on Jun 5, 2026 (16 checks passed)

muddlebee (Collaborator) commented:

nice work @cerencamkiran

github-actions (Bot, Contributor) commented Jun 5, 2026

🐸 Rebase? Handled. Conflicts? Squashed. CI? Vibing. @cerencamkiran touched the untouchable and lived. 🫡


👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome.

Development

Successfully merging this pull request may close these issues.

Hermes incident-identification scenarios 5/5 — Surface-sprawl meta-test + harness wiring (Makefile, CI, runbook)
