Skip to content

fix(sessions): recover from volatile-tail loop and legacy domain NOT NULL#634

Merged
Aaronontheweb merged 3 commits into
devfrom
claude-wt-memory-recall-weak
Apr 13, 2026
Merged

fix(sessions): recover from volatile-tail loop and legacy domain NOT NULL#634
Aaronontheweb merged 3 commits into
devfrom
claude-wt-memory-recall-weak

Conversation

@Aaronontheweb

Copy link
Copy Markdown
Collaborator

Summary

Two independent 0.12.0 regressions surfaced together in production Slack sessions D0AC6CKBK5K/1776051715.090089 and D0AC6CKBK5K/1776075016.334849:

  1. Tool-loop acknowledgement loop (PR feat(sessions): cache-stable message assembly with volatile User-role tail #618). SessionMessageAssembler emitted the volatile turn-context tail as a ChatRole.User message at the end of the list. During a tool loop, Qwen3's ChatML template read the trailing user-role block as a fresh user turn → model restarted its assistant response → scanned back for the last real user content → re-emitted "You're right — I had that backwards" before every tool call, until context hit 262144/262144 and force-compacted. Fix: flip the tail back to ChatRole.System, keep feat(sessions): cache-stable message assembly with volatile User-role tail #618's placement at the end of the list. llama.cpp KV cache is byte-level prefix matching, so the role tag at the end has no effect on cache stability — the cache win from feat(sessions): cache-stable message assembly with volatile User-role tail #618 is preserved.
  2. Memory curation blocked by legacy domain NOT NULL (PR refactor(memory): remove Domain concept in favor of audience isolation #588). The Domain refactor removed the column from the in-code schema but shipped no migration. CREATE TABLE IF NOT EXISTS is a no-op on existing DBs, so production databases still had domain TEXT NOT NULL on memory_anchors, memory_documents, memory_records, and memory_edges. Every new-anchor INSERT has been failing with SQLite Error 19 since the refactor — no new memories have been written since 2026-04-12 20:37 despite successful distillation proposals. Fix: one-off scripts/repair-memory-schema.sql that rename/rebuild/copy the four tables without the legacy column, preserving all rows. FTS5 virtual tables are standalone (no content= mode) and do not need touching. No formal migration because Netclaw is single-operator (explicit call from refactor(memory): remove Domain concept in favor of audience isolation #588).

Also cleans up the AcceptedDistillationProposalsRecorded dead-letter warning that fires on every curation write: the existing handler only lived inside Passivating(), so the observer's informational reply landed in dead letters during normal session states. Adds no-op Command<> registrations in Ready(), Processing(), and Compacting(), and captures Sender into a local in HandleRecordAcceptedDistillationProposals before Persist as defense against the standard Akka.NET Persist-callback Sender overwrite.

Changes

File Change
src/Netclaw.Actors/Sessions/SessionMessageAssembler.cs Flip volatile tail from ChatRole.User to ChatRole.System. One-line change + XML doc explaining why.
src/Netclaw.Actors.Tests/Sessions/SessionMessageAssemblerTests.cs Rename Volatile_tail_message_is_User_role_at_end_of_list_System_role_ and flip assertion. Update 3 existing tests that located the tail by role. Rewrite Recall_is_not_in_leading_system_prefix_even_when_resolved to walk only the leading contiguous system prefix. Add new regression test Volatile_tail_does_not_create_fake_user_turn_after_tool_result.
scripts/repair-memory-schema.sql New file. Rename/rebuild/copy memory_anchors, memory_documents, memory_records, memory_edges without domain. Drops domain-referencing indexes and recreates the non-domain ones.
src/Netclaw.Actors/Sessions/LlmSessionActor.cs No-op Command<AcceptedDistillationProposalsRecorded> handlers in Ready(), Processing(), Compacting().
src/Netclaw.Actors/Sessions/SessionMemoryObserverActor.cs Capture Sender into local replyTo before Persist in HandleRecordAcceptedDistillationProposals.

Test plan

  • dotnet test src/Netclaw.Actors.Tests/Netclaw.Actors.Tests.csproj --filter "FullyQualifiedName~SessionMessageAssemblerTests" → 8/8 passing (5 existing + new regression test)
  • dotnet test src/Netclaw.Actors.Tests/Netclaw.Actors.Tests.csproj --filter "FullyQualifiedName~Sessions" → 253/253 passing (no collateral breakage from LlmSessionActor / observer edits)
  • dotnet slopwatch analyze → 0 issues
  • Dry-run scripts/repair-memory-schema.sql against a copy of the live ~/.netclaw/netclaw.db: row counts preserved (anchors=91, documents=74, records=24, edges=0), zero domain columns remain, all indexes rebuilt
  • Apply repair to live DB and confirm daemon forms new memories after restart
  • Run a Slack session that requires multi-step tool work after a user correction; confirm no repeating-opener pattern inside a single turn
  • Confirm cache_n KV-cache telemetry remains comparable to post-feat(sessions): cache-stable message assembly with volatile User-role tail #618 baseline (not pre-feat(sessions): cache-stable message assembly with volatile User-role tail #618)

…NULL

Two independent regressions surfaced together in production Slack sessions
D0AC6CKBK5K/1776051715.090089 and D0AC6CKBK5K/1776075016.334849.

1. Tool-loop acknowledgement loop (PR #618 regression). SessionMessageAssembler
   emitted the volatile turn-context tail (memory recall, current time,
   working-context) as a ChatRole.User message at the end of the assembled
   list. During a tool loop, Qwen3's ChatML template read the trailing
   user-role block as a fresh user turn, so the model restarted its assistant
   response on every iteration, scanned back for the last real user content,
   and re-emitted "You're right - I had that backwards" before each tool call
   until context hit 262144/262144 and was force-compacted. Flip the role back
   to System; keep the placement at the end of the list so #618's cache
   stability win is preserved (llama.cpp KV cache is byte-level prefix
   matching, role tag at the end does not affect prefix stability). Update 4
   existing SessionMessageAssembler tests for the role flip and add a new
   Volatile_tail_does_not_create_fake_user_turn_after_tool_result regression
   test that builds a mid-tool-loop history and asserts the trailing message
   is System-role with no User-role after the last Tool/Assistant message.

2. Legacy memory_anchors.domain NOT NULL constraint. PR #588 removed the
   Domain concept from the in-code schema but intentionally shipped no
   migration. CREATE TABLE IF NOT EXISTS is a no-op on existing databases,
   so production DBs still had `domain TEXT NOT NULL` on memory_anchors,
   memory_documents, memory_records, and memory_edges. Every new-anchor
   INSERT has been failing with SQLite Error 19 since the refactor,
   blocking memory curation entirely - no new memories have been written
   since 2026-04-12 20:37 despite successful distillation proposals. Add
   scripts/repair-memory-schema.sql, a one-off rename/rebuild/copy repair
   that drops the legacy column from all four tables while preserving
   rows. FTS5 virtual tables are standalone (no content= mode) and do not
   need touching.

Also handle the AcceptedDistillationProposalsRecorded dead letter: the
observer correctly replies to LlmSessionActor, but the existing handler
only lived inside Passivating(). Add no-op Command<> registrations in
Ready(), Processing(), and Compacting() so the informational reply does
not land in DeadLetters during normal session states. Capture Sender into
a local in SessionMemoryObserverActor.HandleRecordAcceptedDistillationProposals
before calling Persist - the standard Akka.NET defense against Sender
being overwritten by an interleaved message before the persist callback
fires.
Review cleanup on top of the prior volatile-tail and NOT-NULL fixes.

- LlmSessionActor: extract the three identical no-op
  `Command<AcceptedDistillationProposalsRecorded>(_ => { })` registrations
  from Ready/Processing/Compacting into a single `CommandDistillationAckNoOp()`
  helper with one canonical comment explaining why the non-passivation
  states need to swallow the observer's informational reply. Net: 3
  copies of the handler and 3 near-duplicate comment blocks collapse
  to 1 helper + 1 comment + 3 single-line call sites.

- SessionMessageAssemblerTests:
  * Recall_is_not_in_leading_system_prefix_even_when_resolved: drop the
    StringBuilder in favour of the per-message / break-on-non-System
    pattern already established by AssertNoVolatileContentInSystemPrefix.
    Same invariant, less allocation, same shape as the rest of the file.
  * Prefix_extends_through_history_when_startup_layers_settled,
    Volatile_tail_message_is_System_role_at_end_of_list, and
    Working_context_update_does_not_poison_system_prefix: trim WHAT-
    narrating comments that the identifiers and assertions already
    convey. Load-bearing WHY comments on the new regression test, the
    SessionMessageAssembler XML doc, and the SessionMemoryObserverActor
    Sender-capture rationale are preserved.

No behavioural change. Verified with
`dotnet test --filter "FullyQualifiedName~Sessions"` (253/253 passing)
and `dotnet slopwatch analyze` (0 issues).
@Aaronontheweb Aaronontheweb marked this pull request as ready for review April 13, 2026 14:34
@Aaronontheweb Aaronontheweb enabled auto-merge (squash) April 13, 2026 14:34
@Aaronontheweb Aaronontheweb merged commit f7be64e into dev Apr 13, 2026
4 checks passed
@Aaronontheweb Aaronontheweb deleted the claude-wt-memory-recall-weak branch April 13, 2026 14:38
Aaronontheweb added a commit that referenced this pull request May 28, 2026
…l-loop spin (#1216) (#1218)

The per-turn volatile context block (memory recall, current time, working
context, etc.) was persisted into history as a User-role nudge AFTER the real
user message (PR #1178). Strict ChatML templates (Qwen3) read that trailing
user-role block as a fresh user turn, so the model restarted its assistant
response and re-narrated the same plan on every tool-loop iteration until
context filled and force-compacted — the production spin on D0AC6CKBK5K. Same
class of bug as #634, relocated by #1178.

Add SessionState.AddVolatileContextNudge, which inserts the volatile block
BEFORE the most recent real user message (appending only when the last entry
isn't a real user message: reminder/scheduled/cold-recovery/delivery-retry).
The real user message stays at the tail, so every chat template anchors to it
correctly. AddSystemNudge stays append-only for mid-turn corrective nudges
(empty-response, duplicate-tool, budget, delivery-retry), which need the tail
position to course-correct.

KV-cache prefix stability from #1178 is preserved: the inserted message's byte
position is fixed once added and subsequent turns only append, so byte-prefix
caching providers extend the cached prefix through it unchanged.

Affected: 0.20.3, 0.21.0, 0.21.1.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant