Skip to content

[Critical UX] Memory persistence, token waste from session replay, state.db corruption, and environment hallucination field report from heavy production use #5563

@JuanDragin

Description

@JuanDragin

Context & Acknowledgment

First, I want to say: Hermes is an extraordinary piece of work. The skill system, persistent memory, session search, delegate_task subagents, the gateway architecture — it's the most capable CLI AI agent I've used. I run it daily for production software development (orchestrating a 3-actor email processing pipeline with DBOS, PostgreSQL, S3, Gmail API), and it consistently delivers. The team at Nous Research has built something genuinely special.

That said, after 3 weeks of heavy daily use (8+ hours/day on Claude Opus), I've hit a cluster of interrelated issues around memory persistence and context management that together cause severe token waste and, in one case, actual hallucination about the execution environment. I'm reporting them as one issue because they compound each other.

Problem Summary

During a ~12-hour intensive session (Apr 5, 2026), I lost approximately 2.6M tokens (~69% of total consumption) to context replay overhead, and at one point Hermes hallucinated that it was running in a cloud container with "outdated information" — after hours of productive work on my local WSL2 machine.

Issue 1: Session Fragmentation Causes Exponential Token Replay

What happens

Long CLI conversations get silently fragmented into multiple sessions. Each new session replays the entire conversation history as input tokens.

Observed data (single day, Apr 5 2026)

Conversation A ("stale checker refactoring"):

  • 15 sessions with the same first user message
  • User message count grows: 10 → 11 → 15 → 20 → ... → 54 → 57
  • File sizes: 170KB → 259KB → 434KB → 576KB → 728KB (each is the FULL history)
  • ~1.9M tokens consumed, only ~190K were necessary (89% waste)

Conversation B ("checker moving emails"):

  • 9 sessions, same pattern
  • ~1.1M tokens consumed, only ~165K necessary (84% waste)

Why it happens

When a session ends (auto-review trigger, max iterations reached, error, etc.), the next session replays the full message history to maintain context. The user doesn't see this — the conversation appears continuous in the terminal.

Impact

With Claude Opus pricing, this turned a productive 12-hour workday into burning through an entire monthly API budget in one sitting. The user has no visibility into when session boundaries occur or how much replay is happening.

Suggested fixes

Issue 2: state.db Corruption Kills session_search

What happens

The SQLite state.db database becomes corrupted during normal use, making session_search completely non-functional. The PRAGMA integrity_check reports malformed B-tree pages.

Observed data

state.db (24MB) — integrity_check FAILED
  "Tree 12 page 5541: btreeInitPage() returns error code 11"
  Multiple corrupted pages in the messages table and FTS index

Recovery

Manual recovery via .dump + filtering corrupted rows + FTS rebuild recovered 110/128 sessions and 6,645 messages. 18 sessions were permanently lost from the DB (though JSON session files on disk were intact).

Why it matters

session_search is the ONLY way for Hermes to recall cross-session context. When it breaks, the agent loses all long-term recall, forcing the user to manually re-explain project context every session. For complex multi-day projects, this is devastating.

Likely cause

WAL mode + concurrent writes from CLI + gateway + subagent processes accessing the same DB file. The symlink setup (state.db → hermes-sync/state.db) may contribute.

Suggested fixes

  • Add periodic PRAGMA integrity_check and auto-repair (the JSON session files can serve as source of truth)
  • Use WAL2 mode or ensure proper locking across all processes accessing the DB
  • Add a hermes db repair CLI command
  • Consider making JSON session files the primary store with SQLite as a search index that can be rebuilt

Issue 3: MEMORY.md Size Limit (2,200 chars) is Critically Small

What happens

The persistent memory store (MEMORY.md) has a ~2,200 character limit. For a complex multi-service project (3 actors, PostgreSQL, S3, Gmail API, stale checker with 5 checks, multiple status enums, credential resolution patterns), this forces extreme compression that loses critical context.

Real example

My MEMORY.md at 90% capacity contains compressed fragments like:

PG+S3, NO KAFKA/REDIS. PALS=orchestrator locks, DBOS partition_queue(concurrency=1). 
workflows/: reader_response.py+doctype_response.py+common.py. 
stale_checker.py: pg_try_advisory_lock(900100001), 5 checks(...)

This is the ENTIRE project architecture compressed into telegram-style abbreviations. Critical details that should be in memory (like "classification_status uses 'completed' NOT 'classified'") barely fit alongside everything else.

Workaround

Skills serve as extended memory (~20KB+ each), but they're loaded on-demand and require trigger matching. They don't replace the "always present" nature of memory.

Suggested fixes

Issue 4: Environment Hallucination in Long Sessions

What happens

After hours of continuous work on a local WSL2 machine (terminal.backend: local), Hermes told the user they were running in a "cloud container with outdated information" — which was completely false.

Root cause analysis

The terminal tool description contains phrases like:

"cloud sandboxes may be cleaned up, idled out, or recreated between turns"

And execute_code runs in /tmp/hermes_sandbox_* paths. After 700K+ tokens of context, the model appears to confuse tool description warnings with its actual execution environment. Additionally, when subagents modify files but the main conversation has stale read_file results from earlier turns, the model may interpret the discrepancy as "being in a different environment" rather than "my cached context is stale."

Impact

The user spent hours working productively, only to be told (incorrectly) that none of the work was reliable because "we're in a cloud environment with outdated files." This destroyed confidence in the session's output.

Suggested fixes

  • Inject a clear, authoritative [ENVIRONMENT: local] marker in each turn (not just in the tool descriptions)
  • When terminal.backend: local, strip/modify the cloud sandbox warnings from tool descriptions
  • Add a "context freshness" indicator — flag when file reads are older than N minutes in the conversation

Environment

  • Hermes Agent v0.6.0+
  • Model: Claude Opus 4 (anthropic/claude-opus-4.6) via Anthropic API
  • OS: Ubuntu 24.04 (WSL2 on Windows 11)
  • Terminal backend: local
  • Usage pattern: 8+ hours/day, heavy delegate_task usage, complex multi-file codebase work

Metadata

Metadata

Assignees

No one assigned

    Labels

    P1High — major feature broken, no workaroundcomp/agentCore agent loop, run_agent.py, prompt buildercomp/gatewayGateway runner, session dispatch, deliverytool/memoryMemory tool and memory providerstype/bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions