feat(observability): structured session tracing with start/end timestamps #6741

@xinbenlv

Description

Original Request

Is it possible to actually structure it to include start and end timestamps, so we can profile and improve in the future?

Agent's Two Cents (could be wrong)

Everything below is the AI agent's best guess based on the current codebase.
Take with a grain of salt — the original request above is the only thing that came from a human.

Problem / Motivation

Hermes keeps useful conversation transcripts, but the current session schema is still transcript-first rather than trace-first. That makes it annoying to answer basic performance questions like where a slow turn spent time: prompt assembly, model latency, tool dispatch, terminal startup, browser actions, or persistence.

What We Checked

  • README.md is English-primary, so the issue is written in English.
  • gateway/session.py currently appends JSONL transcript rows and mirrors some fields into SQLite.
  • hermes_state.py stores message-level fields like role, content, tool_call_id, tool_calls, tool_name, timestamp, finish_reason, and reasoning payloads.
  • Current Hermes logs mostly have a single timestamp per message/tool row. Some tool outputs include ad-hoc duration_seconds, but this is not standardized.
  • OpenClaw-style logs are more event-shaped and often include toolCallId, parentId, and per-tool durationMs, which makes profiling materially easier.
  • Related open issues already exist for broader observability work, but they do not appear to define a concrete session-trace schema for start/end timing.

Proposed Solution

Add a structured trace/span layer to Hermes session logging so every meaningful work unit can record start_ts, end_ts, and duration_ms, with stable IDs and parent-child relationships. Keep the existing transcript view for compatibility, but add explicit event records for profiling.
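As a rough illustration of what one such event record could look like, here is a minimal sketch. Only start_ts, end_ts, and duration_ms come from the proposal; the class name SpanRecord, the fields span_id, parent_id, session_id, and name, and the "type": "span" marker are assumptions made up for this example, not an existing Hermes schema.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class SpanRecord:
    """Hypothetical span record; everything except the timing field names is an assumption."""

    name: str
    session_id: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)  # stable ID
    parent_id: Optional[str] = None  # links a child span to its parent
    start_ts: float = field(default_factory=time.time)  # wall-clock epoch seconds
    end_ts: Optional[float] = None
    duration_ms: Optional[float] = None

    def to_jsonl(self) -> str:
        # One JSON object per line, tagged so transcript readers can skip it.
        return json.dumps({"type": "span", **asdict(self)}, sort_keys=True)


# A parent span for the turn and a child span for the model call.
turn = SpanRecord(name="turn", session_id="s-1")
call = SpanRecord(name="model.call", session_id="s-1", parent_id=turn.span_id)
call.end_ts = call.start_ts + 0.842
call.duration_ms = 842.0
print(call.to_jsonl())
```

Tagging each row with a type would let existing transcript consumers ignore rows they do not understand, which is one possible way to keep the transcript view compatible.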

Dependencies & Potential Blockers

  • Session JSONL and SQLite schema changes need backward-compatible migration.
  • We should avoid turning this into a full OpenTelemetry dependency spike on day one.
  • Logging must fail open and never break normal agent execution.
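The fail-open requirement could be enforced with a small decorator around every trace-writing helper. This is only a sketch: fail_open, write_span, and the logger name "hermes.trace" are hypothetical names chosen for the example.

```python
import functools
import logging

logger = logging.getLogger("hermes.trace")  # logger name is an assumption


def fail_open(fn):
    """Log and swallow any exception from a tracing helper so the agent keeps running."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            logger.warning("trace write failed in %s", fn.__name__, exc_info=True)
            return None

    return wrapper


@fail_open
def write_span(path, line):
    # Hypothetical writer: append one serialized span per line.
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")
```

With this wrapper, a missing directory or a full disk turns into a warning in the log instead of an exception in the agent loop.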

How to Validate

  • A single user turn produces enough structured timing data to reconstruct a waterfall of: prompt build -> model call -> tool dispatch -> tool result -> final response.
  • Tool rows include explicit timing fields instead of relying on inferred timestamps.
  • Nested operations for heavy tools (at least terminal and browser) can be timed independently.
  • Existing session loading/search features continue to work with old transcripts.
  • New fields are stored in both JSONL and SQLite, or there is a clearly documented split of responsibilities.
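For the SQLite side, one low-risk option is a separate table created idempotently, so existing tables and old sessions are left untouched. The sketch below assumes a table named spans and a stand-in messages table; neither the table names nor the columns are taken from the actual hermes_state.py schema.

```python
import sqlite3

# Hypothetical DDL: a sibling table, so existing transcript tables are not altered.
SPANS_DDL = """
CREATE TABLE IF NOT EXISTS spans (
    span_id     TEXT PRIMARY KEY,
    parent_id   TEXT,
    session_id  TEXT NOT NULL,
    name        TEXT NOT NULL,
    start_ts    REAL NOT NULL,
    end_ts      REAL,
    duration_ms REAL
);
CREATE INDEX IF NOT EXISTS idx_spans_session ON spans (session_id, start_ts);
"""


def ensure_spans_table(conn):
    """Safe to call on every startup; does nothing if the table already exists."""
    conn.executescript(SPANS_DDL)


conn = sqlite3.connect(":memory:")
# Stand-in for a pre-existing transcript table with old data in it.
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, role TEXT, content TEXT)")
conn.execute("INSERT INTO messages (role, content) VALUES ('user', 'hi')")

ensure_spans_table(conn)
ensure_spans_table(conn)  # second call is a no-op
conn.execute(
    "INSERT INTO spans VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("a1", None, "s-1", "model.call", 100.0, 100.842, 842.0),
)
```

Old rows stay readable because nothing in the existing table changes; span data is simply absent for sessions recorded before the migration.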

Best Validation Path

Run one CLI session that triggers at least one model call and one tool call, then inspect the session artifacts directly. The best default smoke test is: start Hermes, run a prompt that triggers search_files or terminal, then verify the resulting session log contains structured timing fields and that a small analysis helper can print a per-turn waterfall without reconstructing timing from guesswork.

Best Human Demo

A terminal demo that prints a compact waterfall for the last session, for example:

turn 7
  prompt.build      12ms
  model.call       842ms
  tool.terminal     31ms
  response.render    4ms

That is much more persuasive than raw JSON dumps.
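A helper that renders this kind of waterfall can be very small once spans carry explicit durations. The sketch below assumes each span is a dict with name, start_ts, and duration_ms keys; the function name format_waterfall and the start_ts values in the sample data are made up for illustration, while the names and durations are the ones from the example above.

```python
def format_waterfall(turn, spans):
    """Render spans for one turn in start order, one line each."""
    lines = [f"turn {turn}"]
    for span in sorted(spans, key=lambda s: s["start_ts"]):
        # Left-align the name, right-align the rounded duration.
        lines.append(f"  {span['name']:<15}{round(span['duration_ms']):>5}ms")
    return "\n".join(lines)


# Illustrative data only; start_ts values are invented offsets in seconds.
sample_spans = [
    {"name": "model.call", "start_ts": 0.012, "duration_ms": 842},
    {"name": "prompt.build", "start_ts": 0.000, "duration_ms": 12},
    {"name": "response.render", "start_ts": 0.885, "duration_ms": 4},
    {"name": "tool.terminal", "start_ts": 0.854, "duration_ms": 31},
]
print(format_waterfall(7, sample_spans))
```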

Scope Estimate

medium

Key Files/Modules Likely Involved

  • run_agent.py
  • gateway/session.py
  • hermes_state.py
  • model_tools.py
  • tools/terminal_tool.py
  • tools/browser_tool.py

Architecture Diagram

User Turn
   |
   v
+------------------+
| run_agent.py     |
| agent loop       |
+------------------+
   |        | \
   |        |  \__ final response span
   |        |
   |        +---- model call span
   |
   +------------- tool dispatch span
                    |
          +---------+----------+
          |                    |
          v                    v
   +-------------+      +--------------+
   | terminal    |      | browser      |
   | subspans    |      | subspans     |
   +-------------+      +--------------+
          \                    /
           \                  /
            v                v
         +------------------------+
         | session persistence    |
         | JSONL + SQLite         |
         +------------------------+

Rough Implementation Sketch

  • Introduce a minimal internal span/event schema with stable IDs plus start_ts, end_ts, and duration_ms.
  • Add helper utilities for starting/finishing spans using wall-clock timestamps plus monotonic elapsed time.
  • Instrument core agent phases first: prompt build, model call, tool dispatch, tool execution, final response.
  • Add nested spans inside expensive tools like terminal and browser.
  • Extend SQLite schema and JSONL output to preserve these records without breaking existing transcript consumers.
  • Add a small inspection/reporting utility so maintainers can actually use the new data.

Open Questions

  • Should spans live alongside transcript rows in the same JSONL file, or in a sibling trace file?
  • Should SQLite store full span metadata, or just a summarized/indexed subset?
  • How much nested instrumentation is worth doing in v1 versus later?
  • Should this be Hermes-native only at first, or aligned with future OTel/Langfuse integration from the start?

Potential Risks or Gotchas

  • Naively logging full tool metadata could leak secrets unless redaction is applied consistently.
  • Too much fine-grained tracing can create noise and write amplification.
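  • If timing is based only on wall clock instead of monotonic elapsed time, durations can be skewed or even negative whenever the system clock is adjusted (for example by an NTP sync or a manual change).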
  • Schema churn in a hot path can silently break session restore or search if migration is sloppy.
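For the redaction risk, one simple mitigation is to scrub tool metadata by key name before it reaches a span record. The sketch below is an assumption-heavy starting point: the function name redact and the key pattern are invented for this example, and matching on key names does not catch secrets embedded inside string values such as command lines.

```python
import re

# Hypothetical list of key-name fragments treated as sensitive.
SENSITIVE_KEY = re.compile(
    r"(api[_-]?key|token|secret|password|authorization)", re.IGNORECASE
)


def redact(value):
    """Return a copy of nested dicts/lists with sensitive-looking keys masked."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if SENSITIVE_KEY.search(str(key)) else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


metadata = {
    "cmd": "ls",
    "env": {"OPENAI_API_KEY": "sk-example", "HOME": "/home/u"},
    "headers": [{"Authorization": "Bearer example"}],
}
print(redact(metadata))
```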

Maintainer Ownership Recommendation

This touches core runtime semantics, persistence schema, and potentially many tool boundaries. It is implementable as a downstream carried patch, but the long-term shape should probably get a maintainer-level design pass before upstreaming so we do not ossify a mediocre schema.

Related Issues

Metadata

Labels

P3: Low (cosmetic, nice to have)
comp/agent: Core agent loop, run_agent.py, prompt builder
type/feature: New feature or request
