fix(compression): eliminate session duplication -- adopt in-place compaction like Claude Code and Codex

## Problem

Every time Hermes compresses context, it creates a **new session** chained via `parent_session_id`. This spawns duplicate entries in the WebUI sidebar (`My Chat`, `My Chat #2`, `My Chat #3`, etc.) for a single conversation. Over time this accumulates hundreds of orphan child sessions in `state.db`.

This is not how other agentic coding tools handle it, and it causes a cascade of downstream bugs:

- #33618 -- /goal lost after compression rotates session_id
- #34089 -- session ID desync between agent and gateway after split
- #25921 -- infinite preflight compression loop from reusing parent history
- #14238 -- pending response lost at session split boundary
- #33907 -- orphan sessions missing from state.db
- #36777 -- TUI session.info doesn't update after compression fork

## How Claude Code and Codex solve this

**Both tools keep the same session/thread ID for the entire conversation.** Compression never creates a new session.

### Claude Code (Anthropic Messages API)

- Single `session_id` for the entire conversation. No `parent_session_id` field exists in the schema.
- When context fills up, Claude Code calls the API with `context_management` containing a `compact_20260112` strategy.
- The API handles it server-side: summarizes old messages into a `compaction` content block, returns it alongside the response.
- Client appends the compaction block to the messages array and keeps going in the **same session**.
- All messages before the latest compaction block are transparently ignored by the API on subsequent requests.
- Evidence: 9,131 sessions in a Claude Code install, none with parent chains. Longest session has 4,033 messages in a single session_id.

### Codex (OpenAI Responses API)

- Single `thread_id` for the entire conversation. No `parent_thread_id` field (only `thread_spawn_edges` for actual subagent delegation).
- When context fills up, Codex calls the API with `compact_threshold` set.
- The API returns a `compaction` item that the client appends to the conversation.
- Everything before the compaction item gets dropped on the next request.
- Thread ID stays the same forever. Sessions with 500M+ tokens exist as single rows in the `threads` table.
- Compaction blobs are AES-encrypted server-side so the client never even sees the summary text.

### Key Insight

Both tools leverage their provider's **native server-side compaction API**. The provider returns a compaction artifact that the client injects into the message stream. The session/thread ID never changes.

## Proposed Fix

Hermes should adopt in-place compaction for **all models**, not just providers with native compaction APIs.

### For providers with native compaction (Anthropic, OpenAI)

Use the provider's built-in compaction API (Claude's `context_management`, OpenAI's `compact_threshold`). This is the ideal path -- the provider handles summarization and the client just manages message truncation. Zero custom summarization logic needed.

### For providers without native compaction (z.ai/GLM, Google, etc.)

Replicate the same pattern in client code:

1. **Generate summary** -- use the model to summarize the conversation history (similar to what Hermes already does in `conversation_compression.py`)
2. **Inject summary as a system message** -- prepend a compaction block to the messages array
3. **Truncate old messages** -- drop all messages before the summary block from the in-memory messages array
4. **Keep the same session_id** -- do NOT end the session, do NOT create a new session row, do NOT set `end_reason="compression"`, do NOT chain via `parent_session_id`
5. **Continue conversation** -- the next user message appends after the summary block as normal

### Config changes

Add a new option to control the strategy:

```yaml
compression:
  enabled: true
  threshold: 0.95
  mode: inplace  # "inplace" (default, no session split) | "split" (legacy, creates new session)
```

This preserves backward compat if anyone relies on the split behavior, but defaults to the new in-place approach.

### What changes in the code

In `conversation_compression.py`, the `compress_context` function currently:

```
1. Generate summary
2. end_session(end_reason="compression")
3. create_session(title=f"{title} #{n}", parent_session_id=old_id)
4. Reset flush cursors
```

It should instead:

```
1. Generate summary (existing logic, keep this)
2. Inject summary as system message at start of messages array
3. Truncate all messages before the summary
4. Update session title if desired (same session_id)
5. Continue in the same session
```

Lines 381-396 (the session split logic) get replaced with the in-place truncation. Everything else (summary generation, threshold calculation, protection of recent messages) stays the same.

### Schema changes

- `parent_session_id` column becomes unused for compression. Can be deprecated or kept for backward compat with existing session chains.
- No new columns needed. The same `messages` storage in the session works -- compaction is just another set of messages in the array.

## Benefits

1. **No more duplicate sessions** -- one conversation = one session row in the DB, from first message to last
2. **Fixes all related bugs** -- session ID desync, goal loss, preflight loops, orphan sessions, gateway desync all become impossible since the session_id never changes mid-conversation
3. **Cleaner WebUI** -- sidebar shows actual conversations, not chains of numbered copies
4. **Simpler state management** -- no parent_session_id lineage to track, no session chain queries needed
5. **Consistent with industry standard** -- Claude Code and Codex both work this way
6. **No data loss** -- compaction summaries are preserved as messages in the session, not lost when old sessions get pruned

## Environment

- Hermes v0.14.0
- Claude Code v2.1.143 (for reference behavior)
- Codex CLI v0.130.0 (for reference behavior)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(compression): eliminate session duplication -- adopt in-place compaction like Claude Code and Codex #38763

Problem

How Claude Code and Codex solve this

Claude Code (Anthropic Messages API)

Codex (OpenAI Responses API)

Key Insight

Proposed Fix

For providers with native compaction (Anthropic, OpenAI)

For providers without native compaction (z.ai/GLM, Google, etc.)

Config changes

What changes in the code

Schema changes

Benefits

Environment

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

fix(compression): eliminate session duplication -- adopt in-place compaction like Claude Code and Codex #38763

Description

Problem

How Claude Code and Codex solve this

Claude Code (Anthropic Messages API)

Codex (OpenAI Responses API)

Key Insight

Proposed Fix

For providers with native compaction (Anthropic, OpenAI)

For providers without native compaction (z.ai/GLM, Google, etc.)

Config changes

What changes in the code

Schema changes

Benefits

Environment

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions