feat(observability): unified telemetry + analytics for latency, cost, and completion/failure rates #6642

## Original Request

> Does Hermes have telemetry and analytics, especially for time/delay cost, token cost, money cost, and failure/complete rate? (Translated from Chinese.)

## Agent's Two Cents (could be wrong)

> Everything below is the AI agent's best guess based on the current codebase.
> Take with a grain of salt — the original request above is the only thing that came from a human.

### Problem / Motivation

Hermes already tracks some usage/accounting data, but observability is fragmented. Today there are pieces of token accounting, estimated cost, session duration, and `/insights`, but there is no unified telemetry layer that answers the obvious operator questions: how long requests actually take, where the latency comes from, how much each request costs, and what percentage of runs end successfully versus ending in a timeout, abort, or reset.

This matters because performance, reliability, and spend are now first-order product concerns. Without a canonical telemetry model, regressions become anecdotal and downstream users end up reverse-engineering behavior from logs and SQLite rows.

### What We Checked

- `run_agent.py` already accumulates `session_prompt_tokens`, `session_completion_tokens`, `session_total_tokens`, `session_api_calls`, and `session_estimated_cost_usd`.
- `run_agent.py` logs per-API-call latency via `logger.info(... latency=%.1fs ...)`, but this is log-level observability, not queryable analytics.
- `gateway/run.py` already exposes `/usage` and `/insights`.
- `agent/insights.py` already reports sessions/messages/tool calls, token totals, estimated cost, active time, average session duration, model/platform/tool breakdowns, and activity patterns.
- `hermes_state.py` already stores session `end_reason`, token counts, cost fields, and timestamps.
- What still seems missing: a canonical outcome taxonomy plus analytics for success/failure/completion rate; end-to-end latency breakdowns (TTFT, tool latency, queue/wait time, per-turn wall time); and one consistent telemetry surface across CLI, gateway, cron, delegated runs, and API server.

### Proposed Solution

Introduce a first-class Hermes telemetry/analytics subsystem with a canonical event schema and rollups for:

- **Latency**: per-turn wall time, per-API-call latency, TTFT, tool-call latency, idle/wait time, optional queueing delay
- **Usage**: input/output/cache/reasoning tokens, tool-call counts, message counts
- **Cost**: estimated and actual cost when available, with clear status/source
- **Outcome**: completed / timed_out / interrupted / reset / compressed / errored / unknown
- **Breakdowns**: by source platform, model, provider, toolset, command path, and session/job type

The initial UX can be modest: improve `/insights`, add a machine-readable export/CLI/API surface, and store normalized telemetry rows in the session DB. Fancy dashboards can come later.
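
The rollup categories above could be captured in a single normalized row type. The following is a minimal sketch, not an existing Hermes API: the `Outcome` values come from the list above, while `TurnTelemetry` and every field name on it are hypothetical placeholders.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Optional


class Outcome(str, Enum):
    """Outcome buckets, copied from the proposed taxonomy."""
    COMPLETED = "completed"
    TIMED_OUT = "timed_out"
    INTERRUPTED = "interrupted"
    RESET = "reset"
    COMPRESSED = "compressed"
    ERRORED = "errored"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class TurnTelemetry:
    """One normalized row per turn. All field names are hypothetical."""
    session_id: str
    source: str                  # e.g. "cli", "gateway", "cron"
    model: str
    provider: str
    wall_time_s: float           # per-turn wall time
    ttft_s: Optional[float]      # None when TTFT was not measured
    tool_latency_s: float
    input_tokens: int
    output_tokens: int
    cost_usd: Optional[float]    # None when no cost is available
    cost_source: str             # "estimated" or "actual"
    outcome: Outcome = Outcome.UNKNOWN

    def to_json(self) -> str:
        """Serialize to a machine-readable row for export."""
        row = asdict(self)
        row["outcome"] = self.outcome.value
        return json.dumps(row, sort_keys=True)


# Illustrative values only.
example = TurnTelemetry(
    session_id="s1", source="cli", model="example-model",
    provider="example-provider", wall_time_s=2.5, ttft_s=0.4,
    tool_latency_s=1.1, input_tokens=1200, output_tokens=300,
    cost_usd=None, cost_source="estimated", outcome=Outcome.COMPLETED,
)
```

Keeping `cost_usd` and `ttft_s` optional lets a row say "not measured" explicitly instead of reporting a misleading zero.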

### Dependencies & Potential Blockers

- Cross-cutting change touching the agent loop, session persistence, gateway, cron, delegated runs, and possibly API/plugin hooks.
- Outcome semantics need a stable definition first; otherwise the numbers will be garbage with a nicer font.
- Backward compatibility matters because `sessions` already stores partial accounting fields.
- A first local/SQLite-backed version does not appear to need any major external infrastructure.
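
The backward-compatibility concern could be handled with an additive migration. This is a sketch under assumptions: the real `sessions` schema is not shown in this issue, so the table layout and the column names in `NEW_COLUMNS` are hypothetical, and `add_missing_columns` is an illustrative helper rather than existing `SessionDB` code.

```python
import sqlite3
from typing import List

# Hypothetical new columns; the real `sessions` schema may differ.
NEW_COLUMNS = {
    "outcome": "TEXT",
    "wall_time_s": "REAL",
    "ttft_s": "REAL",
}


def add_missing_columns(conn: sqlite3.Connection, table: str = "sessions") -> List[str]:
    """Add telemetry columns that are not present yet; return the ones added."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for name, sql_type in NEW_COLUMNS.items():
        if name not in existing:
            # Added columns are NULL for old rows, so existing data stays valid.
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}")
            added.append(name)
    conn.commit()
    return added


conn = sqlite3.connect(":memory:")
# Stand-in for an existing table that already has partial accounting fields.
conn.execute("CREATE TABLE sessions (id TEXT PRIMARY KEY, end_reason TEXT, outcome TEXT)")
conn.execute("INSERT INTO sessions (id, end_reason) VALUES ('old-1', 'session_reset')")
first_run = add_missing_columns(conn)
second_run = add_missing_columns(conn)  # running again adds nothing
old_row = conn.execute("SELECT outcome, wall_time_s FROM sessions").fetchone()
```

Because the new columns are nullable, old sessions read back as `NULL` for the new fields, which reporting code would need to treat as sparse data rather than as zero.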

### How to Validate

- Run comparable conversations from CLI, Slack/Telegram gateway, cron, and delegated-task paths; confirm telemetry is recorded consistently for all of them.
- Verify that a normal successful run increments the "completed" bucket.
- Verify that inactivity timeout, manual interruption, session reset, and compression continuation produce distinct outcome classifications.
- Confirm `/insights` (or a new export command) can report:
  - average and p95 latency
  - token totals
  - estimated cost totals
  - completion/failure rates by source and model
- Confirm existing `/usage` and `/insights` behavior does not regress for old sessions with sparse data.
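
The reported numbers above are simple to pin down in code. The sketch below uses the nearest-rank definition of a percentile and groups completion rates by `(source, model)`. The row layout and the helper names `percentile` and `completion_rates` are assumptions for illustration, not part of `InsightsEngine`.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; `values` must be non-empty."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def completion_rates(rows: Iterable[dict]) -> Dict[Tuple[str, str], float]:
    """Share of rows with outcome 'completed', keyed by (source, model)."""
    totals = defaultdict(int)
    completed = defaultdict(int)
    for row in rows:
        key = (row["source"], row["model"])
        totals[key] += 1
        if row["outcome"] == "completed":
            completed[key] += 1
    return {key: completed[key] / totals[key] for key in totals}


# Synthetic rows; field names and values are illustrative only.
rows = [
    {"source": "cli", "model": "model-a", "outcome": "completed", "latency_s": 1.0},
    {"source": "cli", "model": "model-a", "outcome": "timed_out", "latency_s": 9.0},
    {"source": "gateway", "model": "model-b", "outcome": "completed", "latency_s": 2.0},
    {"source": "gateway", "model": "model-b", "outcome": "completed", "latency_s": 3.0},
]
latencies = [row["latency_s"] for row in rows]
average = sum(latencies) / len(latencies)   # 3.75
p95 = percentile(latencies, 95)             # 9.0
rates = completion_rates(rows)
```

With only four samples the p95 equals the slowest run, which shows why percentile figures should be reported alongside the sample count.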

### Best Validation Path

Best default path: add a deterministic integration test matrix over `SessionDB` + `InsightsEngine` + a small set of synthetic session transcripts/end reasons, then run one real smoke test per runtime path (CLI, gateway, cron, delegate) and assert that each produces the same normalized telemetry fields.
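
One possible shape for that matrix is sketched below, using `unittest.subTest` to cover every path and outcome combination. `REQUIRED_FIELDS` and `make_record` are hypothetical stand-ins: in a real test, `make_record` would be replaced by driving the actual runtime path and reading the row back from `SessionDB`.

```python
import unittest

# Hypothetical normalized field set; the real schema is still to be defined.
REQUIRED_FIELDS = {
    "session_id", "source", "model", "outcome", "wall_time_s", "total_tokens",
}


def make_record(source: str, outcome: str) -> dict:
    """Stand-in for whatever each runtime path would persist."""
    return {
        "session_id": f"{source}-{outcome}",
        "source": source,
        "model": "example-model",
        "outcome": outcome,
        "wall_time_s": 1.0,
        "total_tokens": 100,
    }


class TelemetryMatrixTest(unittest.TestCase):
    def test_every_path_emits_same_fields(self):
        for source in ("cli", "gateway", "cron", "delegate"):
            for outcome in ("completed", "timed_out", "interrupted", "compressed"):
                # subTest reports each failing combination separately.
                with self.subTest(source=source, outcome=outcome):
                    record = make_record(source, outcome)
                    self.assertEqual(set(record), REQUIRED_FIELDS)


def run_matrix() -> unittest.TestResult:
    """Run the matrix programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TelemetryMatrixTest)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Comparing exact key sets, rather than checking for a few known keys, makes the test fail when one path quietly adds or drops a field.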

### Best Human Demo

A before-vs-after terminal demo is the cleanest proof:

1. run 3-4 scripted sessions that intentionally end in different ways (complete, timeout, interrupt, compression continuation)
2. run `/insights 7` or a new `hermes insights --json`
3. show one screen with latency, cost, and outcome breakdowns that were previously impossible to answer

### Scope Estimate

large

### Key Files/Modules Likely Involved

- `run_agent.py`
- `hermes_state.py`
- `agent/insights.py`
- `gateway/run.py`
- `gateway/session.py`

### Architecture Diagram

```text
+-------------------+        +-------------------+
| CLI / Gateway /   |        | Cron / API server |
| Delegate / MCP    |        | / batch runners   |
+---------+---------+        +---------+---------+
          |                            |
          +------------+  +------------+
                       v  v
                +--------------+
                | AIAgent loop |
                | run_agent.py |
                +------+-------+
                       |
          +------------+-------------+
          |                          |
          v                          v
+--------------------+     +----------------------+
| tool / model events|     | outcome transitions  |
| latency, tokens,   |     | completed / timeout /|
| cost, model, etc.  |     | interrupt / reset... |
+----------+---------+     +----------+-----------+
           \                         /
            \                       /
             v                     v
             +---------------------+
             | normalized telemetry|
             | event + session     |
             | schema              |
             +----------+----------+
                        |
                        v
             +----------------------+
             | SessionDB / rollups  |
             | hermes_state.py      |
             +----------+-----------+
                        |
          +-------------+----------------+
          |                              |
          v                              v
+--------------------+       +-----------------------+
| InsightsEngine     |       | JSON/API export /     |
| human summaries    |       | downstream analysis   |
+--------------------+       +-----------------------+
```

### Rough Implementation Sketch

- Define a **canonical telemetry schema** for event-level and session-level rollups.
- Define a **canonical outcome taxonomy** and map existing `end_reason` values into it.
- Centralize telemetry emission in the agent/runtime paths instead of scattering ad hoc counters.
- Extend `SessionDB` schema and migration logic for normalized fields and/or event tables.
- Teach `InsightsEngine` to compute outcome rates, latency percentiles, and cost/usage breakdowns.
- Expose the results in `/insights`, CLI output, and ideally a machine-readable export.
- Add integration tests that intentionally exercise different termination paths.
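
The `end_reason` mapping step could start as a plain lookup table with an explicit fallback. This is a sketch only: `compression`, `session_reset`, and `session_switch` are the only raw values named in this issue, so every other key below is a hypothetical placeholder, and `classify` is an illustrative helper name.

```python
from typing import Optional

# Raw `end_reason` -> canonical outcome. Keys marked "hypothetical" are
# placeholders; the actual stored values would need to be checked.
END_REASON_TO_OUTCOME = {
    "completed": "completed",            # hypothetical key
    "inactivity_timeout": "timed_out",   # hypothetical key
    "user_interrupt": "interrupted",     # hypothetical key
    "error": "errored",                  # hypothetical key
    "session_reset": "reset",
    "compression": "compressed",
    # "session_switch" is left unmapped on purpose: how to count it is
    # still an open question, so it falls through to "unknown".
}


def classify(end_reason: Optional[str]) -> str:
    """Map a raw end_reason to a canonical outcome, defaulting to 'unknown'."""
    if not end_reason:
        return "unknown"
    return END_REASON_TO_OUTCOME.get(end_reason.strip().lower(), "unknown")
```

Sending unmapped values to `unknown` keeps undecided cases visible in the rollups instead of silently counting them as successes or failures.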

### Open Questions

- Should this be session-rollup only first, or should Hermes store event-level telemetry from day one?
- How should "complete" be defined for interactive chats where the user simply stops talking?
- Should `compression`, `session_reset`, and `session_switch` count as neutral transitions or failed/completed outcomes?
- Is there appetite for OpenTelemetry/Langfuse-style sinks later, or should v1 stay local-only?

### Potential Risks or Gotchas

- This is broad enough that piecemeal contributor patches are likely to fight each other or calcify the wrong schema.
- Downstream or locally carried patches are especially risky here because every runtime path must agree on semantics, and drift will make analytics actively misleading.
- Because of the cross-cutting blast radius, this feels better suited to an **author/core-maintainer-led design pass** than a casual contributor feature branch.
- If the project ships metrics before defining outcome semantics, people will trust numbers that do not mean what they think they mean.

### Related Issues

- #1501 — [Feature]: Add Langfuse tracing for subagents and gateway sessions
- #5451 — Provider runtime health observability
- #4169 — [Feature]: Include usage, user_id in post_llm_call plugin hook
- #3988 — Bug: /insights shows wrong model name — displays Gemini instead of actual active model (GLM-5.1 via zai provider)

