Reuse FM LanguageModelSession and prewarm on focus change#336
Merged
Conversation
FuJacob
added a commit
that referenced
this pull request
May 28, 2026
- Prewarm Task uses guard let self instead of optional chain so the cache-reset await cannot be silently skipped when the coordinator is deallocated mid-shutdown. - Document the SystemLanguageModel singleton assumption in ensureSession so the cache-key omission is explicit for future readers.
FuJacob
added a commit
that referenced
this pull request
May 28, 2026
Local FM eval run on the full stack (with #336 bounded to single-turn sessions) showed two cases (codeComment "// This is a workaround for the bug in ", prose "The Swift compiler enforces optionals because ") that the model echoed verbatim instead of continuing. The normalizer correctly strips the echo, but the user-visible result is an empty suggestion. These cases passed when this PR was first measured because the unconditional session reuse left a growing transcript of prior (continue-do-not-echo) demonstrations on every later request — implicit in-context learning that masked the rule removal. Once the engine is bounded to single-turn sessions (#336 follow-up), the rule has to be in the instructions channel for every request, not implicit in transcript history. The new rule pairs positive framing ("Continue from the position immediately after the existing text") with the explicit prohibition that was removed, keeping the spirit of WWDC25's positive-identity guidance while restoring the hard constraint. Eval after this change: drift=3, midword=10, empty=0, noise=0 — same shape as the #335 baseline. A new test_sessionInstructions_forbidEchoingExistingText assertion pins both clauses so a future rewrite cannot silently drop them again.
- Prewarm Task uses guard let self instead of optional chain so the cache-reset await cannot be silently skipped when the coordinator is deallocated mid-shutdown. - Document the SystemLanguageModel singleton assumption in ensureSession so the cache-key omission is explicit for future readers.
The previous unconditional cache reuse exposed two Apple-surfaced failure modes that the pre-PR per-request-session design was structurally immune to: - Concurrent requests on the same session. Swift task cancellation is cooperative, so cancelPredictionWork() + schedulePrediction() can leave the previous stream still draining inside Apple's runtime when the new request calls streamResponse on the same cachedSession. Apple's session is single-flight and raises .concurrentRequests, which mapGenerationError surfaces as a user-visible generationFailed. - Transcript accumulation across keystrokes in one field. Each respond appends the prompt and response entries to session.transcript. Over ~15-20 keystrokes the 4096-token shared context fills up and Apple raises .exceededContextWindowSize. ensureSession now reuses the cached session only when it is both !isResponding and still at its pristine transcript count. After any respond has run, the next request builds a fresh session — matching the pre-PR single-turn behavior — while the prewarm benefit on the first keystroke after focus is preserved (prewarm leaves the transcript untouched). No protocol or test-fixture change; existing FoundationModelPromptRenderer and SuggestionEngineRouter tests still pass.
251fc76 to
293cf46
Compare
FuJacob
added a commit
that referenced
this pull request
May 28, 2026
Local FM eval run on the full stack (with #336 bounded to single-turn sessions) showed two cases (codeComment "// This is a workaround for the bug in ", prose "The Swift compiler enforces optionals because ") that the model echoed verbatim instead of continuing. The normalizer correctly strips the echo, but the user-visible result is an empty suggestion. These cases passed when this PR was first measured because the unconditional session reuse left a growing transcript of prior (continue-do-not-echo) demonstrations on every later request — implicit in-context learning that masked the rule removal. Once the engine is bounded to single-turn sessions (#336 follow-up), the rule has to be in the instructions channel for every request, not implicit in transcript history. The new rule pairs positive framing ("Continue from the position immediately after the existing text") with the explicit prohibition that was removed, keeping the spirit of WWDC25's positive-identity guidance while restoring the hard constraint. Eval after this change: drift=3, midword=10, empty=0, noise=0 — same shape as the #335 baseline. A new test_sessionInstructions_forbidEchoingExistingText assertion pins both clauses so a future rewrite cannot silently drop them again.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Cotabby's Apple Foundation Models path currently constructs a new
LanguageModelSessionon every keystroke, which throws away Apple's instruction-prefix cache and pays full instruction tokenization on each request. This change holds the session as long as the rendered instructions are unchanged, and asks the engine toprewarm()after focus arrives on a new editable surface so weight loading and prefix tokenization happen before the user types instead of inside the critical path of the firstrespond.Stacked on
eval/fm-expand-driftset— verify with that PR's expanded eval suite once both land.Validation
xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' build CODE_SIGNING_ALLOWED=NO→
** BUILD SUCCEEDED **xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' build-for-testing CODE_SIGNING_ALLOWED=NO→
** TEST BUILD SUCCEEDED **xcodebuild test ... -only-testing:CotabbyTests/FoundationModelPromptRendererTests -only-testing:CotabbyTests/SuggestionEngineRouterTests→
Executed 9 tests, with 0 failures (0 unexpected)swiftlint lint --quiet→ no violations.End-to-end FM perf check expected to be done with the eval suite from the parent PR.
Linked issues
Refs the FM-quality investigation.
Risk / rollout notes
SuggestionGeneratinggainedprewarm(for:). A default no-op extension keeps every existing conformer (llama engine, router, unavailable engine, theStubSuggestionEnginein tests) source-compatible. Only FM provides a real implementation.FoundationModelPromptRenderer.sessionInstructions(for:), so its content drives invalidation correctly without an extra cache-busting hook.resetCachedGenerationContextcontinues to clear the cache, so a field change or settings edit still produces a fresh session next request. The new prewarm Task awaits the existing cache-reset barrier (awaitCachedGenerationContextResetIfNeeded) so prewarm always runs after the reset and cannot have its session nulled out from under it.hasFocusedElementChanged), not on every focus poll, so the call site does not thrash on noisy AX updates. The engine internally skips work when the session is already cached, so duplicate calls are cheap regardless.Greptile Summary
This PR introduces two related optimisations to the Apple Foundation Models path: a bounded
LanguageModelSessioncache (keyed on rendered instructions, guarded by!isRespondingand a pristine-transcript check) so the prewarmed session is handed to the first real keystroke request, and aprewarm(for:)hook wired into the focus-change path so weight loading and instruction tokenisation happen before the user types rather than insiderespond's critical path.ensureSession): reuse is correctly limited to one request — the three-way predicate addresses theconcurrentRequestsandexceededContextWindowSizefailure modes flagged in prior reviews. Theguard let selffix inprewarmEngineForCurrentFieldand the cache-reset barrier order are both handled correctly.prewarm(for:)is added toSuggestionGeneratingwith a default no-op, keeping llama, router, stub, and unavailable engines source-compatible with zero changes.respondis replaced bystreamResponsewith a cumulative-snapshot loop, enabling cooperative cancellation mid-decode; thedidReceiveSnapshotguard correctly surfaces a zero-output stream as agenerationFailederror.Confidence Score: 4/5
Safe to merge. The session-reuse logic is well-guarded and the prewarm plumbing is correctly sequenced behind the cache-reset barrier.
The three-way
ensureSessionpredicate correctly avoids both the concurrent-stream and transcript-accumulation failure modes. One open question — whether Apple's runtime updatestranscript.countwhen a stream is cancelled mid-flight — means the "consumed once" bound is a documented assumption rather than a verified guarantee. This does not produce incorrect output in any observed path, but the behaviour under rapid keystroke cancellations relies on an undocumented Apple API contract.Cotabby/Services/Runtime/FoundationModelSuggestionEngine.swift — specifically the
ensureSessionreuse predicate and thetranscript.count == pristineTranscriptCountcondition for cancelled streams.Important Files Changed
cachedSession,ensureSessionreuse predicate,prewarm(for:), and switches fromrespondtostreamResponse. The three-way reuse guard (instructions,!isResponding,transcript.count) correctly addresses the concurrentRequests and context-overflow risks raised in prior reviews, but the "consumed once" bound implicitly relies on an undocumented Apple API behaviour for cancelled streams.prewarmEngineForCurrentFieldwith properguard let self, cache-reset barrier await, and generation-0 sentinel. Previous concern about the optional-chain bypass is fully resolved here.prewarm(for:)toSuggestionGeneratingwith a default no-op extension, keeping all existing conformers source-compatible.prewarm(for:)that routes only to the selected engine; correctly avoids allocating an FM session on the llama-selected path.Sequence Diagram
sequenceDiagram participant Coord as SuggestionCoordinator participant Engine as FoundationModelSuggestionEngine participant Apple as LanguageModelSession (Apple) Note over Coord: User focuses new editable field Coord->>Coord: cancelPredictionWork() Coord->>Coord: resetCachedGenerationContext() [schedules Task] Coord->>Coord: prewarmEngineForCurrentField() [schedules Task] Note over Coord: Reset Task runs Coord->>Engine: resetCachedGenerationContext() Engine->>Engine: "cachedSession = nil" Note over Coord: Prewarm Task runs (after barrier) Coord->>Coord: awaitCachedGenerationContextResetIfNeeded() Coord->>Engine: "prewarm(for: request[gen=0])" Engine->>Engine: ensureSession() → new CachedSession Engine->>Apple: session.prewarm() [fire-and-forget] Apple-->>Apple: loads weights, primes prefix cache Note over Coord: User types (first keystroke) Coord->>Engine: "generateSuggestion(for: request[gen=N])" Engine->>Engine: ensureSession() → reuse cached Engine->>Apple: session.streamResponse(to: prompt) Apple-->>Engine: partial snapshots (cumulative) Engine-->>Coord: SuggestionResult Note over Engine: transcript.count > pristineTranscriptCount Note over Coord: User types again (second keystroke) Coord->>Engine: "generateSuggestion(for: request[gen=N+1])" Engine->>Engine: ensureSession() → transcript diverged → new session Engine->>Apple: new LanguageModelSession(...) Engine->>Apple: session.streamResponse(to: prompt) Apple-->>Engine: partial snapshots Engine-->>Coord: SuggestionResultComments Outside Diff (1)
Cotabby/Services/Runtime/FoundationModelSuggestionEngine.swift, line 61-65 (link)LanguageModelSession.respond(to:)appends each prompt and response to the session's internalTranscript, meaning each subsequent autocomplete call within the same editing context sees all prior prompts and model outputs as conversation history. The original code explicitly guarded against this, and that concern still applies. Over a long typing session in one field, the growing transcript will progressively steer the model away from clean single-turn completions and will eventually throw.exceededContextWindowSize(mapped to a user-visiblegenerationFailederror). TheresetCachedGenerationContextpath clears the session on field change, but not between individual requests within the same field.Reviews (4): Last reviewed commit: "Trigger CI after base change to main" | Re-trigger Greptile