feat(hermes): add surface attribution evaluation suite by cerencamkiran · Pull Request #2692 · Tracer-Cloud/opensre

cerencamkiran · 2026-06-01T14:50:14Z

Summary

This PR completes the Hermes RCA synthetic suite by adding the final evaluation track: surface attribution.

Previous Hermes RCA scenarios focused on identifying failures within specific domains (providers, orchestration, memory, controls, runtime reliability). However, Hermes deployments increasingly combine multiple messaging adapters, LLM providers, execution backends, memory systems, and control layers. In those environments, investigations must first determine which subsystem family owns the failure before deeper RCA can begin.

This PR introduces a deterministic evaluation framework for that attribution step.

What was added

New synthetic scenario

Added:

050-surface-sprawl-unknown-adapter

The scenario evaluates whether an investigation can:

identify the failing surface family
attribute an unknown adapter to the closest known subsystem
select the closest historical analog from prior Hermes RCA scenarios
generate a targeted diagnostic follow-up question

Adapter catalog evidence

Added a new Hermes evidence source:

hermes_adapter_catalog

and corresponding:

schema validation
scenario loading support
mock backend support
investigation tool wiring

This allows attribution decisions to be grounded in an explicit catalog of registered Hermes surfaces rather than hard-coded assumptions.
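
As a rough illustration of what schema validation for a catalog payload could look like, here is a minimal sketch. The field names (`name`, `family`) and the function `validate_adapter_catalog` are assumptions made for this example and are not taken from the PR's actual `hermes_adapter_catalog` schema.

```python
from typing import Any, TypedDict


class AdapterCatalogEntry(TypedDict):
    # Hypothetical fields; the real hermes_adapter_catalog schema may differ.
    name: str
    family: str


def validate_adapter_catalog(payload: Any) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    if not isinstance(payload, list):
        return ["catalog must be a list of entries"]
    problems: list[str] = []
    for index, entry in enumerate(payload):
        if not isinstance(entry, dict):
            problems.append(f"entry {index}: expected an object")
            continue
        for field in ("name", "family"):
            value = entry.get(field)
            # Every required field must be a non-empty string.
            if not isinstance(value, str) or not value:
                problems.append(f"entry {index}: missing or empty {field!r}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a fixture author see every schema issue in a single run.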

Analog registry

Added:

analog_registry.py

The registry provides curated mappings across Parts 1–4 of the Hermes RCA suite.

Rather than treating surface attribution as an isolated task, evaluations can now compare failures against previously validated scenarios and verify attribution consistency over time.

Current coverage includes provider, runtime, orchestration, memory, and control-related failures.
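
The review below describes the registry as a frozen dataclass with lookup helpers. A minimal sketch of that shape, under assumptions, might look like the following. The scenario IDs are placeholders rather than the suite's real IDs, and the helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AnalogEntry:
    scenario_id: str  # placeholder IDs below, not real Hermes scenario IDs
    family: str


ANALOG_REGISTRY: tuple[AnalogEntry, ...] = (
    AnalogEntry("001-example-provider-timeout", "provider"),
    AnalogEntry("002-example-memory-drift", "memory"),
)


def find_analog(scenario_id: str) -> Optional[AnalogEntry]:
    """Return the registry entry for a scenario ID, or None if unknown."""
    return next((e for e in ANALOG_REGISTRY if e.scenario_id == scenario_id), None)


def analogs_for_family(family: str) -> list[AnalogEntry]:
    """Return every registered analog belonging to one failure family."""
    return [e for e in ANALOG_REGISTRY if e.family == family]


def duplicate_ids() -> list[str]:
    """Return scenario IDs that appear more than once in the registry."""
    seen: set[str] = set()
    dupes: list[str] = []
    for entry in ANALOG_REGISTRY:
        if entry.scenario_id in seen:
            dupes.append(entry.scenario_id)
        seen.add(entry.scenario_id)
    return dupes
```

Freezing the entries keeps the registry immutable at runtime, so a test cannot change an analog mapping by accident.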

Surface attribution scoring

Added:

surface_scoring.py

The scorer evaluates three independent dimensions:

Correct surface-family attribution
Correct analog selection
Quality of the diagnostic follow-up question

A response must satisfy at least two of the three dimensions to pass, reducing false-positive success cases.
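
To make the pass rule concrete, here is a simplified sketch of a three-dimension scorer. It assumes plain substring matching and treats any line ending in `?` as a question. The real `surface_scoring.py` uses alias matching and may differ in detail. The `passed_dimensions >= 2` threshold follows the review flowchart in this thread, and the signature shown for `score_surface_response` is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SurfaceScore:
    family_ok: bool
    analog_ok: bool
    question_ok: bool

    @property
    def passed_dimensions(self) -> int:
        # Booleans sum as 0/1, giving the count of satisfied dimensions.
        return sum((self.family_ok, self.analog_ok, self.question_ok))

    @property
    def passed(self) -> bool:
        return self.passed_dimensions >= 2


def score_surface_response(
    response: str,
    expected_family: str,
    expected_analog: str,
    question_terms: tuple[str, ...],
) -> SurfaceScore:
    """Score a response on family, analog, and diagnostic-question quality."""
    text = response.lower()
    # Treat each line ending in "?" as a candidate diagnostic question.
    questions = [
        line.strip().lower()
        for line in response.splitlines()
        if line.strip().endswith("?")
    ]
    return SurfaceScore(
        family_ok=expected_family.lower() in text,
        analog_ok=expected_analog.lower() in text,
        # any() lets a multi-question response pass on one good question.
        question_ok=any(t.lower() in q for q in questions for t in question_terms),
    )
```

Scoring the dimensions independently means a response that names the right family and analog still passes when its follow-up question is weak, while a response that gets only one dimension right does not.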

Adapter tuple corpus

Added a deterministic attribution corpus containing 23 adapter combinations spanning:

messaging adapters
LLM providers
execution backends
orchestration systems
memory systems
control layers

Each tuple maps to an expected family and analog scenario.

This provides repeatable attribution coverage without requiring external services.
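
One way such a corpus can drive offline checks is to build a deterministic synthetic response from each tuple and score it. The sketch below shows only the corpus shape and the response builder. The adapter names, scenario IDs, and field names are placeholders invented for illustration.

```python
from typing import NamedTuple


class AdapterTuple(NamedTuple):
    adapters: tuple[str, ...]
    expected_family: str
    expected_analog: str


# Placeholder entries; the real corpus has 23 tuples with different values.
CORPUS: tuple[AdapterTuple, ...] = (
    AdapterTuple(
        ("example-chat-adapter", "example-llm-provider"),
        "provider",
        "001-example-provider-timeout",
    ),
    AdapterTuple(
        ("example-chat-adapter", "example-memory-store"),
        "memory",
        "002-example-memory-drift",
    ),
)


def build_synthetic_response(entry: AdapterTuple) -> str:
    """Build the same response text every time for a given tuple."""
    return "\n".join(
        [
            f"Adapters: {', '.join(entry.adapters)}",
            f"Surface family: {entry.expected_family}",
            f"Closest analog: {entry.expected_analog}",
            f"Which {entry.expected_family} component handled the failing request?",
        ]
    )
```

Because the builder has no randomness and makes no network calls, the same tuple always yields the same text, which keeps the resulting scores reproducible in CI.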

Benchmark tooling

Added:

benchmark history snapshots
benchmark report generation
tuple refresh utility
Makefile integration

New commands:

make test-hermes-synthetic-only
make refresh-hermes-tuples

Benchmark snapshots can also be generated through:

python -m tests.synthetic.hermes_rca.run_suite --offline-only --write-history
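
For readers unfamiliar with the snapshot idea, here is a hedged sketch of a history writer and a reader for the latest snapshot. The file naming scheme, JSON layout, and function names are assumptions for this example and do not reproduce `run_suite.py` or `benchmark_report.py`.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def write_history_snapshot(results: dict, history_dir: Path) -> Path:
    """Write one timestamped JSON snapshot and return its path."""
    history_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = history_dir / f"benchmark-{stamp}.json"
    payload = {"generated_at": stamp, "results": results}
    path.write_text(json.dumps(payload, indent=2, sort_keys=True))
    return path


def latest_snapshot(history_dir: Path) -> Optional[dict]:
    """Return the newest snapshot, relying on sortable timestamped names."""
    snapshots = sorted(history_dir.glob("benchmark-*.json"))
    return json.loads(snapshots[-1].read_text()) if snapshots else None


def demo() -> dict:
    # Write into a throwaway directory so no real history files are touched.
    with tempfile.TemporaryDirectory() as tmp:
        history_dir = Path(tmp) / "history"
        write_history_snapshot({"passed": 3, "failed": 0}, history_dir)
        return latest_snapshot(history_dir)["results"]
```

Passing the history directory as a parameter is what makes the writer easy to redirect to a temporary path in tests.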

Validation and coverage

Added:

coverage-health validation
tuple corpus validation
benchmark-history validation
surface-scoring validation

The suite now verifies that every registered Hermes failure mode has at least one synthetic scenario.
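
A coverage-health check of this kind reduces to a set difference. The sketch below assumes each scenario is a dict with a `failure_mode` key. That key and the function name are hypothetical.

```python
def uncovered_failure_modes(
    registered_modes: list[str], scenarios: list[dict]
) -> list[str]:
    """Return registered failure modes that no scenario exercises."""
    covered = {scenario["failure_mode"] for scenario in scenarios}
    return sorted(set(registered_modes) - covered)
```

A test can then assert that the returned list is empty, which fails with the names of the uncovered modes when a new failure mode is registered without a scenario.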

Design goals

This implementation intentionally keeps attribution evaluation:

deterministic
offline-runnable
provider-independent
CI-friendly

while still exercising reasoning patterns that are required in real multi-surface Hermes deployments.

The analog registry and tuple corpus are designed to be extensible as additional Hermes adapters and execution surfaces are added in future evaluation tracks.

Tests

python -m ruff check tests/synthetic/hermes_rca app/tools/HermesSessionEvidenceTool tests/synthetic/mock_hermes_backend

python -m pytest tests/synthetic/hermes_rca -q

Current result:

25 passed

github-actions · 2026-06-01T14:50:28Z

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

greptile-apps · 2026-06-01T14:54:39Z

Greptile Summary

This PR completes the Hermes RCA synthetic suite by adding a surface attribution evaluation track — scenario 050-surface-sprawl-unknown-adapter — along with the supporting analog registry, three-dimensional scorer, adapter tuple corpus, benchmark history tooling, and a new hermes_adapter_catalog evidence source wired through the schema, loader, mock backend, and tool layer.

New evaluation framework: analog_registry.py maps 21 prior scenarios across five failure families; surface_scoring.py scores family attribution, analog identification, and diagnostic question quality independently; a tuple corpus of 23 adapter combinations provides deterministic, offline-runnable coverage.
Tooling additions: --write-history flag in run_suite.py with a correctly isolated test (uses tmp_path + monkeypatch), benchmark_report.py for reading the latest snapshot, and make refresh-hermes-tuples / make test-hermes-synthetic-only Makefile targets.
Schema / backend plumbing: hermes_adapter_catalog added as evidence source, trajectory action, TypedDict, validator, and mock backend method, following the existing pattern throughout.

Confidence Score: 5/5

Safe to merge; changes are additive test infrastructure with no modifications to production code paths.

All production-facing changes follow established backend/tool patterns and are exercised by the new tests. The only gap is in refresh_adapter_tuples.py, which prints "validated" without checking family names or analog IDs against their registries. This affects only the developer-facing Makefile utility, not CI correctness or any runtime path.

tests/synthetic/hermes_rca/refresh_adapter_tuples.py — the validate_tuples function should mirror the semantic checks in test_surface_adapter_tuples_reference_known_families_and_analogs.
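
A sketch of what the suggested semantic checks could look like is shown here. It assumes each tuple is a dict with `adapters`, `expected_family`, and `expected_analog` keys and that the known families and analog IDs are passed in as sets. The actual field names and the signature of `validate_tuples` in the PR may differ.

```python
REQUIRED_FIELDS = ("adapters", "expected_family", "expected_analog")


def validate_tuples(
    tuples: list[dict],
    known_families: set[str],
    known_analog_ids: set[str],
) -> list[str]:
    """Return validation errors; an empty list means the corpus is valid."""
    errors: list[str] = []
    for index, entry in enumerate(tuples):
        # Structural check: every required field must be present.
        missing = [field for field in REQUIRED_FIELDS if field not in entry]
        if missing:
            errors.append(f"tuple {index}: missing fields {missing}")
            continue
        # Semantic checks: values must exist in their registries.
        if entry["expected_family"] not in known_families:
            errors.append(f"tuple {index}: unknown family {entry['expected_family']!r}")
        if entry["expected_analog"] not in known_analog_ids:
            errors.append(f"tuple {index}: unknown analog {entry['expected_analog']!r}")
    return errors
```

With checks like these, the utility would print its success message only when the returned list is empty.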

Important Files Changed

| Filename | Overview |
| --- | --- |
| `tests/synthetic/hermes_rca/refresh_adapter_tuples.py` | Validates required field presence and type but skips semantic checks (valid family names and known analog IDs) that the test suite enforces, creating a misleading "validated" signal from `make refresh-hermes-tuples`. |
| `tests/synthetic/hermes_rca/surface_scoring.py` | New three-dimension scorer (family, analog, diagnostic question) with reasonable alias matching; `score_diagnostic_question` correctly handles multi-question outputs using `any()`. |
| `tests/synthetic/hermes_rca/run_suite.py` | Adds `--write-history` flag and `_write_history_snapshot`; test correctly patches `HISTORY_DIR` via monkeypatch to avoid leaking files. |
| `tests/synthetic/hermes_rca/analog_registry.py` | Frozen dataclass registry covering Parts 1-4 scenario IDs with no duplicates; lookup helpers are straightforward and well-tested. |
| `tests/synthetic/hermes_rca/hermes_schemas.py` | Correctly adds `hermes_adapter_catalog` evidence source, trajectory action, TypedDict, and validator following the existing pattern. |
| `app/tools/HermesSessionEvidenceTool/__init__.py` | Adds `get_hermes_adapter_catalog` tool following existing `_backend_or_error` pattern; correctly listed in `__all__` and `_TOOLS_WITHOUT_DELIBERATE_CATCH`. |
| `tests/e2e/hermes/meta/test_surface_sprawl.py` | Parametrized e2e test scoring a deterministically constructed synthetic response against all 23 adapter tuples; marked `@pytest.mark.e2e` so it only runs on schedule/dispatch. |
| `tests/synthetic/hermes_rca/test_benchmark_history.py` | Correctly patches `HISTORY_DIR` with `tmp_path` + `monkeypatch`, so no real benchmark files are written during CI or local pytest runs. |
| `.github/workflows/hermes-tests.yml` | New workflow correctly gates e2e tests behind `schedule`/`workflow_dispatch` to avoid provider charges on every PR. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Scenario 050 fixture loaded] --> B[run_suite.py: score_result]
    B --> C{required_keywords / root_cause_category}
    C -->|pass/fail| D[ScenarioScore]
    A --> E[surface_scoring.py: score_surface_response]
    E --> F[score_adapter_family]
    E --> G[score_analog_identification]
    E --> H[score_diagnostic_question]
    F & G & H --> I[SurfaceScore: passed_dimensions >= 2]
    I -. not wired into run_suite .-> D
    J[make refresh-hermes-tuples] --> K{validate_tuples: field presence only}
    K -->|missing semantic checks| L[false validated message]
    N[pytest test_surface_adapter_tuples] --> O{family + analog_id validated}
    O --> P[full semantic validation]

Reviews (6): Last reviewed commit: "feat(hermes): add surface attribution ev..."

cerencamkiran · 2026-06-01T17:34:07Z

@greptile review

cerencamkiran · 2026-06-01T17:50:55Z

Hi Anwesh, I only updated the README files for Part 5 in this PR. I'll open separate PRs for the other parts.

muddlebee · 2026-06-05T18:40:48Z

nice work @cerencamkiran

github-actions · 2026-06-05T18:41:02Z

🐸 Rebase? Handled. Conflicts? Squashed. CI? Vibing. @cerencamkiran touched the untouchable and lived. 🫡

👋 Join us on Discord (OpenSRE): hang out, contribute, or hunt for features and issues. Everyone's welcome.

greptile-apps Bot reviewed Jun 1, 2026

Comment thread tests/synthetic/hermes_rca/test_benchmark_history.py Outdated

Comment thread tests/synthetic/hermes_rca/surface_scoring.py

cerencamkiran force-pushed the feat/hermes-surface-attribution-eval branch from f6cd1c2 to cdb4b61 Compare June 1, 2026 14:56

cerencamkiran marked this pull request as draft June 1, 2026 14:57

cerencamkiran force-pushed the feat/hermes-surface-attribution-eval branch 4 times, most recently from 0c1effc to 2c40550 Compare June 1, 2026 17:17

feat(hermes): add surface attribution evaluation suite

af51dd0

cerencamkiran force-pushed the feat/hermes-surface-attribution-eval branch from 2c40550 to af51dd0 Compare June 1, 2026 17:33

cerencamkiran marked this pull request as ready for review June 1, 2026 17:43

fix: resolve Makefile merge conflict

7b21992

muddlebee merged commit bafc44c into Tracer-Cloud:main Jun 5, 2026
16 checks passed

muddlebee merged 2 commits into Tracer-Cloud:main from cerencamkiran:feat/hermes-surface-attribution-eval
