fix(sessions): recover from volatile-tail loop and legacy domain NOT NULL by Aaronontheweb · Pull Request #634 · netclaw-dev/netclaw

Aaronontheweb · 2026-04-13T13:35:56Z

Summary

Two independent 0.12.0 regressions surfaced together in production Slack sessions D0AC6CKBK5K/1776051715.090089 and D0AC6CKBK5K/1776075016.334849:

Tool-loop acknowledgement loop (PR feat(sessions): cache-stable message assembly with volatile User-role tail #618). SessionMessageAssembler emitted the volatile turn-context tail as a ChatRole.User message at the end of the list. During a tool loop, Qwen3's ChatML template read the trailing user-role block as a fresh user turn → model restarted its assistant response → scanned back for the last real user content → re-emitted "You're right — I had that backwards" before every tool call, until context hit 262144/262144 and force-compacted. Fix: flip the tail back to ChatRole.System, keep feat(sessions): cache-stable message assembly with volatile User-role tail #618's placement at the end of the list. llama.cpp KV cache is byte-level prefix matching, so the role tag at the end has no effect on cache stability — the cache win from feat(sessions): cache-stable message assembly with volatile User-role tail #618 is preserved.
Memory curation blocked by legacy domain NOT NULL (PR refactor(memory): remove Domain concept in favor of audience isolation #588). The Domain refactor removed the column from the in-code schema but shipped no migration. CREATE TABLE IF NOT EXISTS is a no-op on existing DBs, so production databases still had domain TEXT NOT NULL on memory_anchors, memory_documents, memory_records, and memory_edges. Every new-anchor INSERT has been failing with SQLite Error 19 since the refactor — no new memories have been written since 2026-04-12 20:37 despite successful distillation proposals. Fix: one-off scripts/repair-memory-schema.sql that rename/rebuild/copy the four tables without the legacy column, preserving all rows. FTS5 virtual tables are standalone (no content= mode) and do not need touching. No formal migration because Netclaw is single-operator (explicit call from refactor(memory): remove Domain concept in favor of audience isolation #588).

Also cleans up the AcceptedDistillationProposalsRecorded dead-letter warning that fires on every curation write: the existing handler only lived inside Passivating(), so the observer's informational reply landed in dead letters during normal session states. Adds no-op Command<> registrations in Ready(), Processing(), and Compacting(), and captures Sender into a local in HandleRecordAcceptedDistillationProposals before Persist as defense against the standard Akka.NET Persist-callback Sender overwrite.

Changes

File	Change
`src/Netclaw.Actors/Sessions/SessionMessageAssembler.cs`	Flip volatile tail from `ChatRole.User` to `ChatRole.System`. One-line change + XML doc explaining why.
`src/Netclaw.Actors.Tests/Sessions/SessionMessageAssemblerTests.cs`	Rename `Volatile_tail_message_is_User_role_at_end_of_list` → `_System_role_` and flip assertion. Update 3 existing tests that located the tail by role. Rewrite `Recall_is_not_in_leading_system_prefix_even_when_resolved` to walk only the leading contiguous system prefix. Add new regression test `Volatile_tail_does_not_create_fake_user_turn_after_tool_result`.
`scripts/repair-memory-schema.sql`	New file. Rename/rebuild/copy `memory_anchors`, `memory_documents`, `memory_records`, `memory_edges` without `domain`. Drops domain-referencing indexes and recreates the non-domain ones.
`src/Netclaw.Actors/Sessions/LlmSessionActor.cs`	No-op `Command<AcceptedDistillationProposalsRecorded>` handlers in `Ready()`, `Processing()`, `Compacting()`.
`src/Netclaw.Actors/Sessions/SessionMemoryObserverActor.cs`	Capture `Sender` into local `replyTo` before `Persist` in `HandleRecordAcceptedDistillationProposals`.

Test plan

dotnet test src/Netclaw.Actors.Tests/Netclaw.Actors.Tests.csproj --filter "FullyQualifiedName~SessionMessageAssemblerTests" → 8/8 passing (5 existing + new regression test)
dotnet test src/Netclaw.Actors.Tests/Netclaw.Actors.Tests.csproj --filter "FullyQualifiedName~Sessions" → 253/253 passing (no collateral breakage from LlmSessionActor / observer edits)
dotnet slopwatch analyze → 0 issues
Dry-run scripts/repair-memory-schema.sql against a copy of the live ~/.netclaw/netclaw.db: row counts preserved (anchors=91, documents=74, records=24, edges=0), zero domain columns remain, all indexes rebuilt
Apply repair to live DB and confirm daemon forms new memories after restart
Run a Slack session that requires multi-step tool work after a user correction; confirm no repeating-opener pattern inside a single turn
Confirm cache_n KV-cache telemetry remains comparable to post-feat(sessions): cache-stable message assembly with volatile User-role tail #618 baseline (not pre-feat(sessions): cache-stable message assembly with volatile User-role tail #618)

…NULL Two independent regressions surfaced together in production Slack sessions D0AC6CKBK5K/1776051715.090089 and D0AC6CKBK5K/1776075016.334849. 1. Tool-loop acknowledgement loop (PR #618 regression). SessionMessageAssembler emitted the volatile turn-context tail (memory recall, current time, working-context) as a ChatRole.User message at the end of the assembled list. During a tool loop, Qwen3's ChatML template read the trailing user-role block as a fresh user turn, so the model restarted its assistant response on every iteration, scanned back for the last real user content, and re-emitted "You're right - I had that backwards" before each tool call until context hit 262144/262144 and was force-compacted. Flip the role back to System; keep the placement at the end of the list so #618's cache stability win is preserved (llama.cpp KV cache is byte-level prefix matching, role tag at the end does not affect prefix stability). Update 4 existing SessionMessageAssembler tests for the role flip and add a new Volatile_tail_does_not_create_fake_user_turn_after_tool_result regression test that builds a mid-tool-loop history and asserts the trailing message is System-role with no User-role after the last Tool/Assistant message. 2. Legacy memory_anchors.domain NOT NULL constraint. PR #588 removed the Domain concept from the in-code schema but intentionally shipped no migration. CREATE TABLE IF NOT EXISTS is a no-op on existing databases, so production DBs still had `domain TEXT NOT NULL` on memory_anchors, memory_documents, memory_records, and memory_edges. Every new-anchor INSERT has been failing with SQLite Error 19 since the refactor, blocking memory curation entirely - no new memories have been written since 2026-04-12 20:37 despite successful distillation proposals. Add scripts/repair-memory-schema.sql, a one-off rename/rebuild/copy repair that drops the legacy column from all four tables while preserving rows. FTS5 virtual tables are standalone (no content= mode) and do not need touching. Also handle the AcceptedDistillationProposalsRecorded dead letter: the observer correctly replies to LlmSessionActor, but the existing handler only lived inside Passivating(). Add no-op Command<> registrations in Ready(), Processing(), and Compacting() so the informational reply does not land in DeadLetters during normal session states. Capture Sender into a local in SessionMemoryObserverActor.HandleRecordAcceptedDistillationProposals before calling Persist - the standard Akka.NET defense against Sender being overwritten by an interleaved message before the persist callback fires.

Review cleanup on top of the prior volatile-tail and NOT-NULL fixes. - LlmSessionActor: extract the three identical no-op `Command<AcceptedDistillationProposalsRecorded>(_ => { })` registrations from Ready/Processing/Compacting into a single `CommandDistillationAckNoOp()` helper with one canonical comment explaining why the non-passivation states need to swallow the observer's informational reply. Net: 3 copies of the handler and 3 near-duplicate comment blocks collapse to 1 helper + 1 comment + 3 single-line call sites. - SessionMessageAssemblerTests: * Recall_is_not_in_leading_system_prefix_even_when_resolved: drop the StringBuilder in favour of the per-message / break-on-non-System pattern already established by AssertNoVolatileContentInSystemPrefix. Same invariant, less allocation, same shape as the rest of the file. * Prefix_extends_through_history_when_startup_layers_settled, Volatile_tail_message_is_System_role_at_end_of_list, and Working_context_update_does_not_poison_system_prefix: trim WHAT- narrating comments that the identifiers and assertions already convey. Load-bearing WHY comments on the new regression test, the SessionMessageAssembler XML doc, and the SessionMemoryObserverActor Sender-capture rationale are preserved. No behavioural change. Verified with `dotnet test --filter "FullyQualifiedName~Sessions"` (253/253 passing) and `dotnet slopwatch analyze` (0 issues).

…l-loop spin (#1216) (#1218) The per-turn volatile context block (memory recall, current time, working context, etc.) was persisted into history as a User-role nudge AFTER the real user message (PR #1178). Strict ChatML templates (Qwen3) read that trailing user-role block as a fresh user turn, so the model restarted its assistant response and re-narrated the same plan on every tool-loop iteration until context filled and force-compacted — the production spin on D0AC6CKBK5K. Same class of bug as #634, relocated by #1178. Add SessionState.AddVolatileContextNudge, which inserts the volatile block BEFORE the most recent real user message (appending only when the last entry isn't a real user message: reminder/scheduled/cold-recovery/delivery-retry). The real user message stays at the tail, so every chat template anchors to it correctly. AddSystemNudge stays append-only for mid-turn corrective nudges (empty-response, duplicate-tool, budget, delivery-retry), which need the tail position to course-correct. KV-cache prefix stability from #1178 is preserved: the inserted message's byte position is fixed once added and subsequent turns only append, so byte-prefix caching providers extend the cached prefix through it unchanged. Affected: 0.20.3, 0.21.0, 0.21.1.

Aaronontheweb added 3 commits April 13, 2026 13:34

Merge branch 'dev' into claude-wt-memory-recall-weak

ef697ec

Aaronontheweb marked this pull request as ready for review April 13, 2026 14:34

Aaronontheweb enabled auto-merge (squash) April 13, 2026 14:34

Aaronontheweb merged commit f7be64e into dev Apr 13, 2026
4 checks passed

Aaronontheweb deleted the claude-wt-memory-recall-weak branch April 13, 2026 14:38

This was referenced May 28, 2026

Sessions spin: agent re-narrates the same plan on every tool-loop iteration (strict ChatML / Qwen3) #1216

Closed

fix(sessions): stop tool-loop spin from volatile-context placement #1218

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sessions): recover from volatile-tail loop and legacy domain NOT NULL#634

fix(sessions): recover from volatile-tail loop and legacy domain NOT NULL#634
Aaronontheweb merged 3 commits into
devfrom
claude-wt-memory-recall-weak

Aaronontheweb commented Apr 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Aaronontheweb commented Apr 13, 2026

Summary

Changes

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant