feat(observability): unified telemetry + analytics for latency, cost, and completion/failure rates #6642

@xinbenlv

Description

Original Request

Does Hermes have telemetry and analytics, especially for time/delay cost, token cost, money cost, and failure/completion rate? (Translated from the original Chinese request.)

Agent's Two Cents (could be wrong)

Everything below is the AI agent's best guess based on the current codebase.
Take with a grain of salt — the original request above is the only thing that came from a human.

Problem / Motivation

Hermes already tracks some usage and accounting data, but observability is fragmented. Token accounting, estimated cost, session duration, and /insights each exist in some form, yet no unified telemetry layer answers the obvious operator questions: how long requests actually take, where the latency comes from, how much they cost, and what percentage of runs end successfully versus in a timeout, abort, or reset.

This matters because performance, reliability, and spend are now first-order product concerns. Without a canonical telemetry model, regressions become anecdotal and downstream users end up reverse-engineering behavior from logs and SQLite rows.

What We Checked

  • run_agent.py already accumulates session_prompt_tokens, session_completion_tokens, session_total_tokens, session_api_calls, and session_estimated_cost_usd.
  • run_agent.py logs per-API-call latency via logger.info(... latency=%.1fs ...), but this is log-level observability, not queryable analytics.
  • gateway/run.py already exposes /usage and /insights.
  • agent/insights.py already reports sessions/messages/tool calls, token totals, estimated cost, active time, average session duration, model/platform/tool breakdowns, and activity patterns.
  • hermes_state.py already stores session end_reason, token counts, cost fields, and timestamps.
  • What still seems missing: a canonical outcome taxonomy plus analytics for success/failure/completion rate; end-to-end latency breakdowns (TTFT, tool latency, queue/wait time, per-turn wall time); and one consistent telemetry surface across CLI, gateway, cron, delegated runs, and API server.
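To illustrate the gap between log-level latency and queryable latency, here is a minimal sketch of a span recorder that keeps timings as data instead of formatting them into a log line. SpanRecorder, span, and total are hypothetical names, not existing Hermes APIs; the injectable clock is an assumption added to make the helper testable.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class SpanRecorder:
    """Hypothetical helper: collects (name, seconds) spans in memory."""

    spans: list = field(default_factory=list)

    @contextmanager
    def span(self, name, clock=time.perf_counter):
        # clock is injectable so tests can supply deterministic timestamps.
        start = clock()
        try:
            yield
        finally:
            # Record even when the wrapped call raises, so failed calls
            # still contribute latency data.
            self.spans.append((name, clock() - start))

    def total(self, name):
        """Sum of all recorded durations for one span name."""
        return sum(seconds for n, seconds in self.spans if n == name)
```

Wrapping an API call or tool call in `with recorder.span("api_call"):` would then yield rows that can be persisted and aggregated, rather than values that only appear in logs.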

Proposed Solution

Introduce a first-class Hermes telemetry/analytics subsystem with a canonical event schema and rollups for:

  • Latency: per-turn wall time, per-API-call latency, TTFT, tool-call latency, idle/wait time, optional queueing delay
  • Usage: input/output/cache/reasoning tokens, tool-call counts, message counts
  • Cost: estimated and actual cost when available, with clear status/source
  • Outcome: completed / timed_out / interrupted / reset / compressed / errored / unknown
  • Breakdowns: by source platform, model, provider, toolset, command path, and session/job type

The initial UX can be modest: improve /insights, add a machine-readable export/CLI/API surface, and store normalized telemetry rows in the session DB. Fancy dashboards can come later.
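As a rough illustration of what a normalized telemetry row could look like, the sketch below defines one per-turn record and stores it in SQLite. The table name turn_telemetry, the field names, and the cost_status values are all assumptions made for this example; they do not reflect the existing Hermes schema.

```python
import sqlite3
from dataclasses import asdict, dataclass
from typing import Optional

# Outcome values taken from the proposal above.
OUTCOMES = ("completed", "timed_out", "interrupted", "reset",
            "compressed", "errored", "unknown")


@dataclass
class TurnTelemetry:
    """Hypothetical per-turn telemetry row."""

    session_id: str
    source: str                 # e.g. "cli", "gateway", "cron", "delegate"
    model: str
    wall_time_s: float
    ttft_s: Optional[float]     # None when time-to-first-token is unavailable
    input_tokens: int
    output_tokens: int
    cost_usd: Optional[float]   # None when no cost figure is available
    cost_status: str            # assumed values: "estimated", "actual", "unknown"
    outcome: str = "unknown"

    def __post_init__(self):
        if self.outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome!r}")


SCHEMA = """
CREATE TABLE IF NOT EXISTS turn_telemetry (
    session_id    TEXT NOT NULL,
    source        TEXT NOT NULL,
    model         TEXT NOT NULL,
    wall_time_s   REAL NOT NULL,
    ttft_s        REAL,
    input_tokens  INTEGER NOT NULL,
    output_tokens INTEGER NOT NULL,
    cost_usd      REAL,
    cost_status   TEXT NOT NULL,
    outcome       TEXT NOT NULL
)
"""


def insert_turn(conn: sqlite3.Connection, row: TurnTelemetry) -> None:
    """Insert one row using named parameters derived from the dataclass."""
    data = asdict(row)
    columns = ", ".join(data)
    params = ", ".join(":" + key for key in data)
    conn.execute(
        f"INSERT INTO turn_telemetry ({columns}) VALUES ({params})", data
    )
```

Keeping cost_usd nullable with a separate cost_status column is one way to honor the "clear status/source" requirement without conflating missing cost with zero cost.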

Dependencies & Potential Blockers

  • Cross-cutting change touching the agent loop, session persistence, gateway, cron, delegated runs, and possibly API/plugin hooks.
  • Outcome semantics need a stable definition first; otherwise the numbers will be garbage with a nicer font.
  • Backward compatibility matters because the session store already holds partial accounting fields.
  • A first local, SQLite-backed version needs no major external infrastructure, so nothing external blocks it.

How to Validate

  • Run comparable conversations from CLI, Slack/Telegram gateway, cron, and delegated-task paths; confirm telemetry is recorded consistently for all of them.
  • Verify that a normal successful run increments the "completed" bucket.
  • Verify that inactivity timeout, manual interruption, session reset, and compression continuation produce distinct outcome classifications.
  • Confirm /insights (or a new export command) can report:
    • average and p95 latency
    • token totals
    • estimated cost totals
    • completion/failure rates by source and model
  • Confirm existing /usage and /insights behavior does not regress for old sessions with sparse data.
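The report items above can be sketched as a small rollup over telemetry rows. This is an illustration only: the row keys (source, model, wall_time_s, total_tokens, cost_usd, outcome) are assumed names, and p95 here uses the nearest-rank definition, which is one of several reasonable choices.

```python
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile: smallest value with >= 95% at or below it."""
    if not values:
        raise ValueError("p95 of empty sequence")
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def rollup(rows):
    """Group rows by (source, model) and compute the report fields."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["source"], row["model"])].append(row)

    report = {}
    for key, items in groups.items():
        latencies = [r["wall_time_s"] for r in items]
        completed = sum(1 for r in items if r["outcome"] == "completed")
        report[key] = {
            "runs": len(items),
            "avg_latency_s": sum(latencies) / len(latencies),
            "p95_latency_s": p95(latencies),
            "total_tokens": sum(r["total_tokens"] for r in items),
            # Rows without a cost figure are skipped rather than counted as 0.
            "estimated_cost_usd": sum(
                r["cost_usd"] for r in items if r["cost_usd"] is not None
            ),
            "completion_rate": completed / len(items),
        }
    return report
```

Old sessions with sparse data would need a policy for missing latency or outcome values before they enter a rollup like this; the sketch assumes every row has both.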

Best Validation Path

Best default path: add a deterministic integration test matrix over SessionDB + InsightsEngine + a small set of synthetic session transcripts/end reasons, then run one real smoke test per runtime path (CLI, gateway, cron, delegate) and assert that each produces the same normalized telemetry fields.
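A minimal sketch of such a matrix, using an in-memory SQLite table as a stand-in for SessionDB. The table name, column names, and the source and end_reason values are placeholders chosen for this example; a real test would go through the actual SessionDB and InsightsEngine interfaces, which are not reproduced here.

```python
import itertools
import sqlite3

# Placeholder values; the real runtime paths and end reasons may differ.
SOURCES = ("cli", "gateway", "cron", "delegate")
END_REASONS = ("completed", "timeout", "interrupted", "compression")
REQUIRED_FIELDS = ("source", "end_reason", "wall_time_s",
                   "total_tokens", "estimated_cost_usd")


def make_db():
    """Create an in-memory database with a synthetic sessions table."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.execute(
        "CREATE TABLE synthetic_sessions ("
        "source TEXT, end_reason TEXT, wall_time_s REAL, "
        "total_tokens INTEGER, estimated_cost_usd REAL)"
    )
    return conn


def seed_matrix(conn):
    """Insert one synthetic session per (source, end_reason) combination."""
    for source, reason in itertools.product(SOURCES, END_REASONS):
        conn.execute(
            "INSERT INTO synthetic_sessions VALUES (?, ?, ?, ?, ?)",
            (source, reason, 1.0, 100, 0.01),
        )


def check_matrix(conn):
    """Assert every row has all required fields; return the cells covered."""
    seen = set()
    for row in conn.execute("SELECT * FROM synthetic_sessions"):
        missing = [name for name in REQUIRED_FIELDS if row[name] is None]
        assert not missing, f"missing fields {missing} for {row['source']}"
        seen.add((row["source"], row["end_reason"]))
    return seen
```

The useful property is that a missing cell or a null field fails loudly, which is the kind of drift between runtime paths the matrix is meant to catch.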

Best Human Demo

A before-vs-after terminal demo is the cleanest proof:

  1. run 3-4 scripted sessions that intentionally end in different ways (complete, timeout, interrupt, compression continuation)
  2. run /insights 7 or a new hermes insights --json
  3. show one screen with latency, cost, and outcome breakdowns that were previously impossible to answer

Scope Estimate

large

Key Files/Modules Likely Involved

  • run_agent.py
  • hermes_state.py
  • agent/insights.py
  • gateway/run.py
  • gateway/session.py

Architecture Diagram

+-------------------+        +-------------------+
| CLI / Gateway /   |        | Cron / API server |
| Delegate / MCP    |        | / batch runners   |
+---------+---------+        +---------+---------+
          |                            |
          +------------+  +------------+
                       v  v
                +--------------+
                | AIAgent loop |
                | run_agent.py |
                +------+-------+
                       |
          +------------+-------------+
          |                          |
          v                          v
+--------------------+     +----------------------+
| tool / model events|     | outcome transitions  |
| latency, tokens,   |     | completed / timeout /|
| cost, model, etc.  |     | interrupt / reset... |
+----------+---------+     +----------+-----------+
           \                         /
            \                       /
             v                     v
             +---------------------+
             | normalized telemetry|
             | event + session     |
             | schema              |
             +----------+----------+
                        |
                        v
             +----------------------+
             | SessionDB / rollups  |
             | hermes_state.py      |
             +----------+-----------+
                        |
          +-------------+----------------+
          |                              |
          v                              v
+--------------------+       +-----------------------+
| InsightsEngine     |       | JSON/API export /     |
| human summaries    |       | downstream analysis   |
+--------------------+       +-----------------------+

Rough Implementation Sketch

  • Define a canonical telemetry schema for event-level and session-level rollups.
  • Define a canonical outcome taxonomy and map existing end_reason values into it.
  • Centralize telemetry emission in the agent/runtime paths instead of scattering ad hoc counters.
  • Extend SessionDB schema and migration logic for normalized fields and/or event tables.
  • Teach InsightsEngine to compute outcome rates, latency percentiles, and cost/usage breakdowns.
  • Expose the results in /insights, CLI output, and ideally a machine-readable export.
  • Add integration tests that intentionally exercise different termination paths.
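The taxonomy-mapping step could look roughly like the sketch below. The Outcome values come from the proposal above, but the end_reason strings on the left side of the mapping are assumptions for illustration and would need to be checked against what hermes_state.py actually stores. session_switch is deliberately left unmapped because its classification is an open question.

```python
from enum import Enum
from typing import Optional


class Outcome(str, Enum):
    COMPLETED = "completed"
    TIMED_OUT = "timed_out"
    INTERRUPTED = "interrupted"
    RESET = "reset"
    COMPRESSED = "compressed"
    ERRORED = "errored"
    UNKNOWN = "unknown"


# Assumed end_reason spellings; verify against the stored values.
_END_REASON_TO_OUTCOME = {
    "completed": Outcome.COMPLETED,
    "timeout": Outcome.TIMED_OUT,
    "interrupted": Outcome.INTERRUPTED,
    "session_reset": Outcome.RESET,
    "compression": Outcome.COMPRESSED,
    "error": Outcome.ERRORED,
}


def classify(end_reason: Optional[str]) -> Outcome:
    """Map a stored end_reason to a canonical outcome.

    Missing or unrecognized values fall back to UNKNOWN instead of being
    guessed, so old sessions with sparse data stay visible as such.
    """
    if not end_reason:
        return Outcome.UNKNOWN
    return _END_REASON_TO_OUTCOME.get(end_reason.strip().lower(), Outcome.UNKNOWN)
```

Routing every runtime path through a single function like this is one way to keep outcome semantics from drifting between CLI, gateway, cron, and delegated runs.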

Open Questions

  • Should this be session-rollup only first, or should Hermes store event-level telemetry from day one?
  • How should "complete" be defined for interactive chats where the user simply stops talking?
  • Should compression, session_reset, and session_switch count as neutral transitions or failed/completed outcomes?
  • Is there appetite for OpenTelemetry/Langfuse-style sinks later, or should v1 stay local-only?

Potential Risks or Gotchas

  • This is broad enough that piecemeal contributor patches are likely to fight each other or calcify the wrong schema.
  • Downstream or locally carried patches are especially risky here because every runtime path must agree on semantics, and drift will make the analytics actively misleading.
  • Because of the cross-cutting blast radius, this feels better suited to an author/core-maintainer-led design pass than a casual contributor feature branch.
  • If the project ships metrics before defining outcome semantics, people will trust numbers that do not mean what they think they mean.

Related Issues

Metadata

Labels

  • P3: Low (cosmetic, nice to have)
  • comp/agent: Core agent loop, run_agent.py, prompt builder
  • type/feature: New feature or request
