feat(sessions): cache-stable message assembly with volatile User-role tail #618

Merged: Aaronontheweb merged 1 commit into dev from cache-prefix-stability on Apr 12, 2026

@Aaronontheweb (Collaborator)

Summary

Closes #608.

Netclaw's LLM message assembly previously placed memory recall and dynamic context layers as System-role messages immediately after the persisted system prompt. These contain volatile per-turn content (memory recall, current time, working-context), which meant the llama.cpp prompt-cache prefix match stopped at the boundary between the persisted prompt and the dynamic layer — ~4867 tokens — on every turn. Conversation history was never cached, even with session-sticky routing (#610) pinning requests to the same backend GPU.

The post-#610 eval baseline showed exactly this: multi_turn_python_app T4 saw only a 6% prompt_ms improvement (1959ms → 1834ms) vs the ~500ms target.

The fix

Extract a pure-function SessionMessageAssembler that partitions the outgoing message list so cache-stable content comes first and volatile content is consolidated into a single User-role tail message:

[0]      System  persisted prompt                          (unchanged)
[1]      System  static dynamic context                    NEW: OnceAtStart
                                                           layers + [session]
                                                           + [attachments]
[2..N]   User/Assistant  conversation history              cacheable byte-stable
                                                           prefix across turns
[last]   User  volatile tail                               [memory-recall],
                                                           current time,
                                                           [working-context],
                                                           slash command
                                                           content, overlay,
                                                           turn restart notice

On turn N+1, the longest common prefix with turn N's assembly now extends through all static content and all conversation history that existed at turn N. Cache misses are confined to the new user turn and its runtime context tail. Each additional turn grows the cacheable prefix by exactly one user/assistant pair.
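The partitioning above can be illustrated with a minimal, language-neutral sketch. This is not the project's C# `SessionMessageAssembler`; the `Message` type, the `assemble` and `common_prefix_len` helpers, and the sample strings are all hypothetical, and the sketch assumes the new user message is already part of the history when the tail is appended.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    role: str  # "system", "user", or "assistant"
    text: str


def assemble(persisted_prompt, static_context, history, volatile_parts):
    """Hypothetical assembler: stable content first, volatile content last."""
    messages = [Message("system", persisted_prompt)]
    if static_context:
        messages.append(Message("system", static_context))
    messages.extend(history)
    tail = "\n".join(part for part in volatile_parts if part)
    if tail:  # the tail is suppressed entirely when there is nothing volatile
        messages.append(Message("user", tail))
    return messages


def common_prefix_len(a, b):
    """Number of leading messages that are identical in both assemblies."""
    count = 0
    for left, right in zip(a, b):
        if left != right:
            break
        count += 1
    return count


# Two consecutive turns of the same (made-up) session.
prompt = "You are a helpful assistant."
static = "[session] id=example"
turn1_history = [Message("user", "hi")]
turn2_history = turn1_history + [
    Message("assistant", "hello"),
    Message("user", "what time is it?"),
]

turn1 = assemble(prompt, static, turn1_history, ["[memory-recall] marker-1"])
turn2 = assemble(prompt, static, turn2_history, ["[memory-recall] marker-2"])

shared = common_prefix_len(turn1, turn2)
```

In this sketch the shared prefix covers both System messages and the first user message; only turn 1's tail differs, which is the behaviour the description attributes to the new layout.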

Deterministic regression guard

Aaron explicitly requested a cache-poisoning assertion that runs as a fast unit test — no LLM round-trip. SessionMessageAssemblerTests adds seven pure-function tests:

  • Prefix_is_stable_across_turns_for_same_session — longest common prefix check with a turn-specific recall marker in each turn (sanity assertion that recall actually appears in the volatile tail, and that turn 1's marker does not leak into turn 2's assembly)
  • Prefix_extends_through_history_when_startup_layers_settled — steady-state cache prefix ≥ 3 messages
  • Volatile_tail_message_is_User_role_at_end_of_list
  • Static_block_contains_session_id_and_attachment_hint
  • Working_context_update_does_not_poison_system_prefix
  • Recall_is_not_in_system_prefix_even_when_resolved
  • Volatile_tail_is_suppressed_when_empty

A future change that accidentally places volatile content in an early message will fail these tests loudly.
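A guard of this kind could look roughly like the following sketch. It is written in Python for illustration rather than taken from `SessionMessageAssemblerTests`; the marker strings come from the PR text, while the helper names, the `(role, text)` tuple shape, and the sample messages are assumptions.

```python
# Markers named in this PR as volatile per-turn content.
VOLATILE_MARKERS = ("[memory-recall]", "current_utc", "[working-context]")


def leading_system_prefix(messages):
    """Texts of the System-role messages at the very start of the list."""
    prefix = []
    for role, text in messages:
        if role != "system":
            break  # stop at the first non-System message
        prefix.append(text)
    return prefix


def volatile_markers_in_system_prefix(messages):
    """Return every volatile marker found in the leading System prefix."""
    return [
        marker
        for text in leading_system_prefix(messages)
        for marker in VOLATILE_MARKERS
        if marker in text
    ]


# Hypothetical assemblies: one well-formed, one with recall leaked forward.
good = [
    ("system", "You are a helpful assistant."),
    ("system", "[session] id=example"),
    ("user", "hi"),
    ("user", "[memory-recall] something\ncurrent_utc=2026-04-12T23:21:00Z"),
]
bad = [
    ("system", "You are a helpful assistant."),
    ("system", "[memory-recall] something"),
    ("user", "hi"),
]

good_hits = volatile_markers_in_system_prefix(good)
bad_hits = volatile_markers_in_system_prefix(bad)
```

Because the check only inspects message roles and text, it needs no model call and can run alongside ordinary unit tests.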

Also added an actor-driven integration regression (Cache_prefix_is_stable_across_two_turns_in_same_session) that drives two real turns through LlmSessionActor and diffs FakeChatClient.ReceivedMessages[0] vs [1] — this catches regressions where the actor's wiring to the assembler gets broken even if the helper itself is correct.

Cleanups

  • Deleted SessionRecallManager.InjectIntoMessages (33 lines, now dead code)
  • Deleted LlmSessionActor.InjectDynamicContextLayers private method (82 lines)
  • Moved AttachmentContextHint constant from LlmSessionActor to SessionMessageAssembler so it lives with its only consumer
  • Updated 3 existing integration tests whose assertions assumed volatile content was in System-role messages (working-context test, session prompt overlay test, turn restart notice test) — they now look across all message roles

Expected impact

Rerunning the Docker eval suite against the Caddy-sticky llama.cpp endpoint should show:

  • multi_turn_python_app T4 cached_tokens growing beyond the 4867 floor (T2 caches T1, T3 caches T1+T2, T4 caches T1+T2+T3)
  • T4 prompt_ms dropping from ~1834ms toward the sub-1000ms range
  • If cached is still flat after this fix, the next lever is adding --cache-reuse N to the llama-server systemd unit (separate testlab-setup change)

Managed providers (Anthropic, OpenAI, OpenRouter) that do their own prefix caching benefit for the same reason: a stable system prefix plus cacheable conversation history means every turn after the first is mostly a cache hit.

Verification

  • ✅ 979 Actor tests pass (8 new, 971 existing)
  • ✅ Slopwatch clean
  • ✅ Build + slnx full build clean

Test plan

  • CI passes on all configurations
  • Post-merge: rebuild Docker eval image and run ./evals/run-evals.sh against the existing Caddy-sticky llama-server endpoint; confirm multi_turn_python_app T4 shows cached_tokens > 4867 and prompt_ms < 1500ms

… tail

Netclaw's LLM call assembly previously placed memory recall and dynamic
context layers as System-role messages immediately after the persisted
system prompt. These contain per-turn volatile content (memory recall,
current time, working-context), which meant the llama.cpp prompt cache
prefix match stopped at the boundary between the persisted prompt and
the dynamic layer — ~4867 tokens — on every turn. Conversation history
was never cached even when session-sticky routing pinned requests to
the same backend GPU.

Extract a pure-function SessionMessageAssembler that partitions the
outgoing message list into:

  [0]     System  persisted prompt                (unchanged)
  [1]     System  static dynamic context          OnceAtStart layers,
                                                  [session], [attachments]
  [2..N]  User/Assistant  conversation history    cacheable byte-stable prefix
  [last]  User    volatile tail                   [memory-recall], current
                                                  time, [working-context],
                                                  slash command content,
                                                  overlay, turn restart notice

Result: on turn N+1, the longest common prefix with turn N's assembly
extends through all static content and all conversation history —
cache misses are confined to the new user message and its runtime
context tail.

Add SessionMessageAssemblerTests with 7 deterministic cache-stability
assertions including the core Prefix_extends_through_history check and
structural guards that prevent future changes from accidentally placing
volatile markers (current_utc, [memory-recall]) in the System prefix.
Add one actor-driven integration test in CompactionIntegrationTests
that drives two real turns and diffs FakeChatClient.ReceivedMessages
to catch regressions in the actor's wiring to the assembler.

Delete SessionRecallManager.InjectIntoMessages and LlmSessionActor's
InjectDynamicContextLayers private method — both are now dead code.

Closes #608
@Aaronontheweb Aaronontheweb marked this pull request as ready for review April 12, 2026 23:21
@Aaronontheweb Aaronontheweb merged commit aa211dc into dev Apr 12, 2026
4 checks passed
@Aaronontheweb Aaronontheweb deleted the cache-prefix-stability branch April 12, 2026 23:21
Aaronontheweb added a commit that referenced this pull request Apr 13, 2026
…NULL

Two independent regressions surfaced together in production Slack sessions
D0AC6CKBK5K/1776051715.090089 and D0AC6CKBK5K/1776075016.334849.

1. Tool-loop acknowledgement loop (PR #618 regression). SessionMessageAssembler
   emitted the volatile turn-context tail (memory recall, current time,
   working-context) as a ChatRole.User message at the end of the assembled
   list. During a tool loop, Qwen3's ChatML template read the trailing
   user-role block as a fresh user turn, so the model restarted its assistant
   response on every iteration, scanned back for the last real user content,
   and re-emitted "You're right - I had that backwards" before each tool call
   until context hit 262144/262144 and was force-compacted. Flip the role back
   to System; keep the placement at the end of the list so #618's cache
   stability win is preserved (llama.cpp's KV cache reuse is byte-level prefix
   matching, so a role tag at the end does not affect prefix stability). Update 4
   existing SessionMessageAssembler tests for the role flip and add a new
   Volatile_tail_does_not_create_fake_user_turn_after_tool_result regression
   test that builds a mid-tool-loop history and asserts the trailing message
   is System-role with no User-role after the last Tool/Assistant message.

2. Legacy memory_anchors.domain NOT NULL constraint. PR #588 removed the
   Domain concept from the in-code schema but intentionally shipped no
   migration. CREATE TABLE IF NOT EXISTS is a no-op on existing databases,
   so production DBs still had `domain TEXT NOT NULL` on memory_anchors,
   memory_documents, memory_records, and memory_edges. Every new-anchor
   INSERT has been failing with SQLite Error 19 since the refactor,
   blocking memory curation entirely - no new memories have been written
   since 2026-04-12 20:37 despite successful distillation proposals. Add
   scripts/repair-memory-schema.sql, a one-off rename/rebuild/copy repair
   that drops the legacy column from all four tables while preserving
   rows. FTS5 virtual tables are standalone (no content= mode) and do not
   need touching.
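The rename/rebuild/copy pattern from item 2 can be sketched against an in-memory database. This is not the contents of scripts/repair-memory-schema.sql: the `id` and `title` columns, the `memory_anchors_legacy` name, and the sample rows are assumptions, and a real repair would also have to recreate any indexes or triggers on the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Legacy shape: the obsolete column is still declared NOT NULL.
conn.execute(
    "CREATE TABLE memory_anchors ("
    "id INTEGER PRIMARY KEY, title TEXT NOT NULL, domain TEXT NOT NULL)"
)
conn.execute("INSERT INTO memory_anchors (title, domain) VALUES ('old row', 'general')")
conn.commit()


def try_insert(connection, title):
    """Insert the way post-refactor code would: without the domain column."""
    try:
        connection.execute("INSERT INTO memory_anchors (title) VALUES (?)", (title,))
        connection.commit()
        return True
    except sqlite3.IntegrityError:  # NOT NULL constraint failed
        connection.rollback()
        return False


failed_before = not try_insert(conn, "new anchor")

# Rename the old table, rebuild it without the column, copy the rows, drop the old one.
conn.executescript(
    """
    BEGIN;
    ALTER TABLE memory_anchors RENAME TO memory_anchors_legacy;
    CREATE TABLE memory_anchors (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    INSERT INTO memory_anchors (id, title)
        SELECT id, title FROM memory_anchors_legacy;
    DROP TABLE memory_anchors_legacy;
    COMMIT;
    """
)

ok_after = try_insert(conn, "new anchor")
titles = [row[0] for row in conn.execute("SELECT title FROM memory_anchors ORDER BY id")]
```

Running the whole repair inside one transaction means a failure partway through leaves the original table untouched.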

Also handle the AcceptedDistillationProposalsRecorded dead letter: the
observer correctly replies to LlmSessionActor, but the existing handler
only lived inside Passivating(). Add no-op Command<> registrations in
Ready(), Processing(), and Compacting() so the informational reply does
not land in DeadLetters during normal session states. Capture Sender into
a local in SessionMemoryObserverActor.HandleRecordAcceptedDistillationProposals
before calling Persist - the standard Akka.NET defense against Sender
being overwritten by an interleaved message before the persist callback
fires.
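The tool-loop invariant from item 1 above can be expressed as a small structural check. The sketch below is in Python rather than the project's C#, and the function name, the `(role, text)` tuple shape, and the sample history are hypothetical; it is meant only for mid-tool-loop histories, since on an ordinary turn a real user message legitimately follows the last assistant message.

```python
def has_user_after_last_model_message(messages):
    """True if a user-role message follows the last assistant/tool message."""
    last_model_index = max(
        (i for i, (role, _) in enumerate(messages) if role in ("assistant", "tool")),
        default=-1,
    )
    return any(role == "user" for role, _ in messages[last_model_index + 1:])


# Hypothetical mid-tool-loop history: the model has called a tool and
# the tool result has just been appended.
mid_loop = [
    ("system", "You are a helpful assistant."),
    ("user", "fix the failing test"),
    ("assistant", "calling the test runner"),
    ("tool", "1 failed, 12 passed"),
]

volatile_text = "[memory-recall] something\n[working-context] something"

# Pre-fix shape: a User-role tail reads like a fresh user turn.
with_user_tail = mid_loop + [("user", volatile_text)]
# Post-fix shape: same position at the end of the list, System role.
with_system_tail = mid_loop + [("system", volatile_text)]

user_tail_flagged = has_user_after_last_model_message(with_user_tail)
system_tail_flagged = has_user_after_last_model_message(with_system_tail)
```

Both shapes keep the volatile content at the end of the list, so the messages before the tail are identical in either case.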
Aaronontheweb added a commit that referenced this pull request Apr 13, 2026
…NULL (#634)

* fix(sessions): recover from volatile-tail loop and legacy domain NOT NULL

* refactor(sessions): collapse distillation ack no-op and trim test noise

Review cleanup on top of the prior volatile-tail and NOT-NULL fixes.

- LlmSessionActor: extract the three identical no-op
  `Command<AcceptedDistillationProposalsRecorded>(_ => { })` registrations
  from Ready/Processing/Compacting into a single `CommandDistillationAckNoOp()`
  helper with one canonical comment explaining why the non-passivation
  states need to swallow the observer's informational reply. Net: 3
  copies of the handler and 3 near-duplicate comment blocks collapse
  to 1 helper + 1 comment + 3 single-line call sites.

- SessionMessageAssemblerTests:
  * Recall_is_not_in_leading_system_prefix_even_when_resolved: drop the
    StringBuilder in favour of the per-message / break-on-non-System
    pattern already established by AssertNoVolatileContentInSystemPrefix.
    Same invariant, less allocation, same shape as the rest of the file.
  * Prefix_extends_through_history_when_startup_layers_settled,
    Volatile_tail_message_is_System_role_at_end_of_list, and
    Working_context_update_does_not_poison_system_prefix: trim WHAT-
    narrating comments that the identifiers and assertions already
    convey. Load-bearing WHY comments on the new regression test, the
    SessionMessageAssembler XML doc, and the SessionMemoryObserverActor
    Sender-capture rationale are preserved.

No behavioural change. Verified with
`dotnet test --filter "FullyQualifiedName~Sessions"` (253/253 passing)
and `dotnet slopwatch analyze` (0 issues).

Development

Successfully merging this pull request may close these issues.

System prompt cache busting: reorder dynamic context layers + move memory recall out of system role
