fix(sessions): recover from volatile-tail loop and legacy domain NOT NULL#634
Merged
Conversation
…NULL Two independent regressions surfaced together in production Slack sessions D0AC6CKBK5K/1776051715.090089 and D0AC6CKBK5K/1776075016.334849. 1. Tool-loop acknowledgement loop (PR #618 regression). SessionMessageAssembler emitted the volatile turn-context tail (memory recall, current time, working-context) as a ChatRole.User message at the end of the assembled list. During a tool loop, Qwen3's ChatML template read the trailing user-role block as a fresh user turn, so the model restarted its assistant response on every iteration, scanned back for the last real user content, and re-emitted "You're right - I had that backwards" before each tool call until context hit 262144/262144 and was force-compacted. Flip the role back to System; keep the placement at the end of the list so #618's cache stability win is preserved (llama.cpp KV cache is byte-level prefix matching, role tag at the end does not affect prefix stability). Update 4 existing SessionMessageAssembler tests for the role flip and add a new Volatile_tail_does_not_create_fake_user_turn_after_tool_result regression test that builds a mid-tool-loop history and asserts the trailing message is System-role with no User-role after the last Tool/Assistant message. 2. Legacy memory_anchors.domain NOT NULL constraint. PR #588 removed the Domain concept from the in-code schema but intentionally shipped no migration. CREATE TABLE IF NOT EXISTS is a no-op on existing databases, so production DBs still had `domain TEXT NOT NULL` on memory_anchors, memory_documents, memory_records, and memory_edges. Every new-anchor INSERT has been failing with SQLite Error 19 since the refactor, blocking memory curation entirely - no new memories have been written since 2026-04-12 20:37 despite successful distillation proposals. Add scripts/repair-memory-schema.sql, a one-off rename/rebuild/copy repair that drops the legacy column from all four tables while preserving rows. FTS5 virtual tables are standalone (no content= mode) and do not need touching. Also handle the AcceptedDistillationProposalsRecorded dead letter: the observer correctly replies to LlmSessionActor, but the existing handler only lived inside Passivating(). Add no-op Command<> registrations in Ready(), Processing(), and Compacting() so the informational reply does not land in DeadLetters during normal session states. Capture Sender into a local in SessionMemoryObserverActor.HandleRecordAcceptedDistillationProposals before calling Persist - the standard Akka.NET defense against Sender being overwritten by an interleaved message before the persist callback fires.
Review cleanup on top of the prior volatile-tail and NOT-NULL fixes.
- LlmSessionActor: extract the three identical no-op
`Command<AcceptedDistillationProposalsRecorded>(_ => { })` registrations
from Ready/Processing/Compacting into a single `CommandDistillationAckNoOp()`
helper with one canonical comment explaining why the non-passivation
states need to swallow the observer's informational reply. Net: 3
copies of the handler and 3 near-duplicate comment blocks collapse
to 1 helper + 1 comment + 3 single-line call sites.
- SessionMessageAssemblerTests:
* Recall_is_not_in_leading_system_prefix_even_when_resolved: drop the
StringBuilder in favour of the per-message / break-on-non-System
pattern already established by AssertNoVolatileContentInSystemPrefix.
Same invariant, less allocation, same shape as the rest of the file.
* Prefix_extends_through_history_when_startup_layers_settled,
Volatile_tail_message_is_System_role_at_end_of_list, and
Working_context_update_does_not_poison_system_prefix: trim WHAT-
narrating comments that the identifiers and assertions already
convey. Load-bearing WHY comments on the new regression test, the
SessionMessageAssembler XML doc, and the SessionMemoryObserverActor
Sender-capture rationale are preserved.
No behavioural change. Verified with
`dotnet test --filter "FullyQualifiedName~Sessions"` (253/253 passing)
and `dotnet slopwatch analyze` (0 issues).
This was referenced May 28, 2026
Aaronontheweb
added a commit
that referenced
this pull request
May 28, 2026
…l-loop spin (#1216) (#1218) The per-turn volatile context block (memory recall, current time, working context, etc.) was persisted into history as a User-role nudge AFTER the real user message (PR #1178). Strict ChatML templates (Qwen3) read that trailing user-role block as a fresh user turn, so the model restarted its assistant response and re-narrated the same plan on every tool-loop iteration until context filled and force-compacted — the production spin on D0AC6CKBK5K. Same class of bug as #634, relocated by #1178. Add SessionState.AddVolatileContextNudge, which inserts the volatile block BEFORE the most recent real user message (appending only when the last entry isn't a real user message: reminder/scheduled/cold-recovery/delivery-retry). The real user message stays at the tail, so every chat template anchors to it correctly. AddSystemNudge stays append-only for mid-turn corrective nudges (empty-response, duplicate-tool, budget, delivery-retry), which need the tail position to course-correct. KV-cache prefix stability from #1178 is preserved: the inserted message's byte position is fixed once added and subsequent turns only append, so byte-prefix caching providers extend the cached prefix through it unchanged. Affected: 0.20.3, 0.21.0, 0.21.1.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two independent 0.12.0 regressions surfaced together in production Slack sessions
D0AC6CKBK5K/1776051715.090089andD0AC6CKBK5K/1776075016.334849:SessionMessageAssembleremitted the volatile turn-context tail as aChatRole.Usermessage at the end of the list. During a tool loop, Qwen3's ChatML template read the trailing user-role block as a fresh user turn → model restarted its assistant response → scanned back for the last real user content → re-emitted "You're right — I had that backwards" before every tool call, until context hit 262144/262144 and force-compacted. Fix: flip the tail back toChatRole.System, keep feat(sessions): cache-stable message assembly with volatile User-role tail #618's placement at the end of the list. llama.cpp KV cache is byte-level prefix matching, so the role tag at the end has no effect on cache stability — the cache win from feat(sessions): cache-stable message assembly with volatile User-role tail #618 is preserved.domain NOT NULL(PR refactor(memory): remove Domain concept in favor of audience isolation #588). The Domain refactor removed the column from the in-code schema but shipped no migration.CREATE TABLE IF NOT EXISTSis a no-op on existing DBs, so production databases still haddomain TEXT NOT NULLonmemory_anchors,memory_documents,memory_records, andmemory_edges. Every new-anchor INSERT has been failing with SQLite Error 19 since the refactor — no new memories have been written since 2026-04-12 20:37 despite successful distillation proposals. Fix: one-offscripts/repair-memory-schema.sqlthat rename/rebuild/copy the four tables without the legacy column, preserving all rows. FTS5 virtual tables are standalone (nocontent=mode) and do not need touching. No formal migration because Netclaw is single-operator (explicit call from refactor(memory): remove Domain concept in favor of audience isolation #588).Also cleans up the
AcceptedDistillationProposalsRecordeddead-letter warning that fires on every curation write: the existing handler only lived insidePassivating(), so the observer's informational reply landed in dead letters during normal session states. Adds no-opCommand<>registrations inReady(),Processing(), andCompacting(), and capturesSenderinto a local inHandleRecordAcceptedDistillationProposalsbeforePersistas defense against the standard Akka.NET Persist-callback Sender overwrite.Changes
src/Netclaw.Actors/Sessions/SessionMessageAssembler.csChatRole.UsertoChatRole.System. One-line change + XML doc explaining why.src/Netclaw.Actors.Tests/Sessions/SessionMessageAssemblerTests.csVolatile_tail_message_is_User_role_at_end_of_list→_System_role_and flip assertion. Update 3 existing tests that located the tail by role. RewriteRecall_is_not_in_leading_system_prefix_even_when_resolvedto walk only the leading contiguous system prefix. Add new regression testVolatile_tail_does_not_create_fake_user_turn_after_tool_result.scripts/repair-memory-schema.sqlmemory_anchors,memory_documents,memory_records,memory_edgeswithoutdomain. Drops domain-referencing indexes and recreates the non-domain ones.src/Netclaw.Actors/Sessions/LlmSessionActor.csCommand<AcceptedDistillationProposalsRecorded>handlers inReady(),Processing(),Compacting().src/Netclaw.Actors/Sessions/SessionMemoryObserverActor.csSenderinto localreplyTobeforePersistinHandleRecordAcceptedDistillationProposals.Test plan
dotnet test src/Netclaw.Actors.Tests/Netclaw.Actors.Tests.csproj --filter "FullyQualifiedName~SessionMessageAssemblerTests"→ 8/8 passing (5 existing + new regression test)dotnet test src/Netclaw.Actors.Tests/Netclaw.Actors.Tests.csproj --filter "FullyQualifiedName~Sessions"→ 253/253 passing (no collateral breakage fromLlmSessionActor/ observer edits)dotnet slopwatch analyze→ 0 issuesscripts/repair-memory-schema.sqlagainst a copy of the live~/.netclaw/netclaw.db: row counts preserved (anchors=91, documents=74, records=24, edges=0), zerodomaincolumns remain, all indexes rebuiltcache_nKV-cache telemetry remains comparable to post-feat(sessions): cache-stable message assembly with volatile User-role tail #618 baseline (not pre-feat(sessions): cache-stable message assembly with volatile User-role tail #618)