feat(observability): structured session tracing with start/end timestamps

## Original Request

> is it possible to actually strutualize it to include start and end timestamps? So we can profile and improve in the future?

## Agent's Two Cents (could be wrong)

> Everything below is the AI agent's best guess based on the current codebase.
> Take with a grain of salt — the original request above is the only thing that came from a human.

### Problem / Motivation

Hermes keeps useful conversation transcripts, but the current session schema is still transcript-first rather than trace-first. That makes it annoying to answer basic performance questions like where a slow turn spent time: prompt assembly, model latency, tool dispatch, terminal startup, browser actions, or persistence.

### What We Checked

- `README.md` is English-primary, so the issue is written in English.
- `gateway/session.py` currently appends JSONL transcript rows and mirrors some fields into SQLite.
- `hermes_state.py` stores message-level fields like `role`, `content`, `tool_call_id`, `tool_calls`, `tool_name`, `timestamp`, `finish_reason`, and reasoning payloads.
- Current Hermes logs mostly have a single `timestamp` per message/tool row. Some tool outputs include ad-hoc `duration_seconds`, but this is not standardized.
- OpenClaw-style logs are more event-shaped and often include `toolCallId`, `parentId`, and per-tool `durationMs`, which makes profiling materially easier.
- Related open issues already exist for broader observability work, but they do not appear to define a concrete session-trace schema for start/end timing:
  - #6642 `feat(observability): unified telemetry + analytics for latency, cost, and completion/failure rates`
  - #1501 `Add Langfuse tracing for subagents and gateway sessions`

### Proposed Solution

Add a structured trace/span layer to Hermes session logging so every meaningful work unit can record `start_ts`, `end_ts`, and `duration_ms`, with stable IDs and parent-child relationships. Keep the existing transcript view for compatibility, but add explicit event records for profiling.

### Dependencies & Potential Blockers

- Session JSONL and SQLite schema changes need backward-compatible migration.
- We should avoid turning this into a full OpenTelemetry dependency spike on day one.
- Logging must fail open and never break normal agent execution.

### How to Validate

- A single user turn produces enough structured timing data to reconstruct a waterfall of: prompt build -> model call -> tool dispatch -> tool result -> final response.
- Tool rows include explicit timing fields instead of relying on inferred timestamps.
- Nested operations for heavy tools (at least terminal and browser) can be timed independently.
- Existing session loading/search features continue to work with old transcripts.
- New fields are stored in both JSONL and SQLite, or there is a clearly documented split of responsibilities.

### Best Validation Path

Run one CLI session that triggers at least one model call and one tool call, then inspect the session artifacts directly. The best default smoke test is: start Hermes, run a prompt that triggers `search_files` or `terminal`, then verify the resulting session log contains structured timing fields and that a small analysis helper can print a per-turn waterfall without reconstructing timing from guesswork.

### Best Human Demo

A terminal demo that prints a compact waterfall for the last session, for example:

```text
turn 7
  prompt.build      12ms
  model.call      842ms
  tool.terminal     31ms
  response.render    4ms
```

That is much more persuasive than raw JSON dumps.

### Scope Estimate

medium

### Key Files/Modules Likely Involved

- `run_agent.py`
- `gateway/session.py`
- `hermes_state.py`
- `model_tools.py`
- `tools/terminal_tool.py`
- `tools/browser_tool.py`

### Architecture Diagram

```text
User Turn
   |
   v
+------------------+
| run_agent.py     |
| agent loop       |
+------------------+
   |        | \
   |        |  \__ final response span
   |        |
   |        +---- model call span
   |
   +------------- tool dispatch span
                    |
          +---------+----------+
          |                    |
          v                    v
   +-------------+      +--------------+
   | terminal    |      | browser       |
   | subspans    |      | subspans      |
   +-------------+      +--------------+
          \                    /
           \                  /
            v                v
         +------------------------+
         | session persistence    |
         | JSONL + SQLite         |
         +------------------------+
```

### Rough Implementation Sketch

- Introduce a minimal internal span/event schema with stable IDs plus `start_ts`, `end_ts`, and `duration_ms`.
- Add helper utilities for starting/finishing spans using wall-clock timestamps plus monotonic elapsed time.
- Instrument core agent phases first: prompt build, model call, tool dispatch, tool execution, final response.
- Add nested spans inside expensive tools like terminal and browser.
- Extend SQLite schema and JSONL output to preserve these records without breaking existing transcript consumers.
- Add a small inspection/reporting utility so maintainers can actually use the new data.

### Open Questions

- Should spans live alongside transcript rows in the same JSONL file, or in a sibling trace file?
- Should SQLite store full span metadata, or just a summarized/indexed subset?
- How much nested instrumentation is worth doing in v1 versus later?
- Should this be Hermes-native only at first, or aligned with future OTel/Langfuse integration from the start?

### Potential Risks or Gotchas

- Naively logging full tool metadata could leak secrets unless redaction is applied consistently.
- Too much fine-grained tracing can create noise and write amplification.
- If timing is based only on wall clock instead of monotonic elapsed time, the data will be flaky.
- Schema churn in a hot path can silently break session restore or search if migration is sloppy.

### Maintainer Ownership Recommendation

This touches core runtime semantics, persistence schema, and potentially many tool boundaries. It is implementable as a downstream carried patch, but the long-term shape should probably get a maintainer-level design pass before upstreaming so we do not ossify a mediocre schema.

### Related Issues

- #6642 `feat(observability): unified telemetry + analytics for latency, cost, and completion/failure rates`
- #1501 `Add Langfuse tracing for subagents and gateway sessions`


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(observability): structured session tracing with start/end timestamps #6741

Original Request

Agent's Two Cents (could be wrong)

Problem / Motivation

What We Checked

Proposed Solution

Dependencies & Potential Blockers

How to Validate

Best Validation Path

Best Human Demo

Scope Estimate

Key Files/Modules Likely Involved

Architecture Diagram

Rough Implementation Sketch

Open Questions

Potential Risks or Gotchas

Maintainer Ownership Recommendation

Related Issues

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

feat(observability): structured session tracing with start/end timestamps #6741

Description

Original Request

Agent's Two Cents (could be wrong)

Problem / Motivation

What We Checked

Proposed Solution

Dependencies & Potential Blockers

How to Validate

Best Validation Path

Best Human Demo

Scope Estimate

Key Files/Modules Likely Involved

Architecture Diagram

Rough Implementation Sketch

Open Questions

Potential Risks or Gotchas

Maintainer Ownership Recommendation

Related Issues

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions