Reuse FM LanguageModelSession and prewarm on focus change by FuJacob · Pull Request #336 · FuJacob/cotabby

FuJacob · 2026-05-28T08:35:33Z

Summary

Cotabby's Apple Foundation Models path currently constructs a new LanguageModelSession on every keystroke, which throws away Apple's instruction-prefix cache and pays full instruction tokenization on each request. This change holds the session as long as the rendered instructions are unchanged, and asks the engine to prewarm() after focus arrives on a new editable surface so weight loading and prefix tokenization happen before the user types instead of inside the critical path of the first respond.

Stacked on eval/fm-expand-driftset — verify with that PR's expanded eval suite once both land.

Validation

xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' build CODE_SIGNING_ALLOWED=NO
→ ** BUILD SUCCEEDED **

xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' build-for-testing CODE_SIGNING_ALLOWED=NO
→ ** TEST BUILD SUCCEEDED **

xcodebuild test ... -only-testing:CotabbyTests/FoundationModelPromptRendererTests -only-testing:CotabbyTests/SuggestionEngineRouterTests
→ Executed 9 tests, with 0 failures (0 unexpected)

swiftlint lint --quiet → no violations.

End-to-end FM perf check expected to be done with the eval suite from the parent PR.

Linked issues

Refs the FM-quality investigation.

Risk / rollout notes

SuggestionGenerating gained prewarm(for:). A default no-op extension keeps every existing conformer (llama engine, router, unavailable engine, the StubSuggestionEngine in tests) source-compatible. Only FM provides a real implementation.
The cached session is keyed on the rendered instructions string. Any user-driven change that alters instructions (language override, custom rules) flows through FoundationModelPromptRenderer.sessionInstructions(for:), so its content drives invalidation correctly without an extra cache-busting hook.
resetCachedGenerationContext continues to clear the cache, so a field change or settings edit still produces a fresh session next request. The new prewarm Task awaits the existing cache-reset barrier (awaitCachedGenerationContextResetIfNeeded) so prewarm always runs after the reset and cannot have its session nulled out from under it.
Prewarm is gated on a real field-change event (hasFocusedElementChanged), not on every focus poll, so the call site does not thrash on noisy AX updates. The engine internally skips work when the session is already cached, so duplicate calls are cheap regardless.

Greptile Summary

This PR introduces two related optimisations to the Apple Foundation Models path: a bounded LanguageModelSession cache (keyed on rendered instructions, guarded by !isResponding and a pristine-transcript check) so the prewarmed session is handed to the first real keystroke request, and a prewarm(for:) hook wired into the focus-change path so weight loading and instruction tokenisation happen before the user types rather than inside respond's critical path.

Session caching (ensureSession): reuse is correctly limited to one request — the three-way predicate addresses the concurrentRequests and exceededContextWindowSize failure modes flagged in prior reviews. The guard let self fix in prewarmEngineForCurrentField and the cache-reset barrier order are both handled correctly.
Protocol extension: prewarm(for:) is added to SuggestionGenerating with a default no-op, keeping llama, router, stub, and unavailable engines source-compatible with zero changes.
Streaming: respond is replaced by streamResponse with a cumulative-snapshot loop, enabling cooperative cancellation mid-decode; the didReceiveSnapshot guard correctly surfaces a zero-output stream as a generationFailed error.

Confidence Score: 4/5

Safe to merge. The session-reuse logic is well-guarded and the prewarm plumbing is correctly sequenced behind the cache-reset barrier.

The three-way ensureSession predicate correctly avoids both the concurrent-stream and transcript-accumulation failure modes. One open question — whether Apple's runtime updates transcript.count when a stream is cancelled mid-flight — means the "consumed once" bound is a documented assumption rather than a verified guarantee. This does not produce incorrect output in any observed path, but the behaviour under rapid keystroke cancellations relies on an undocumented Apple API contract.

Cotabby/Services/Runtime/FoundationModelSuggestionEngine.swift — specifically the ensureSession reuse predicate and the transcript.count == pristineTranscriptCount condition for cancelled streams.

Important Files Changed

Filename	Overview
Cotabby/Services/Runtime/FoundationModelSuggestionEngine.swift	Core of the change: adds `cachedSession`, `ensureSession` reuse predicate, `prewarm(for:)`, and switches from `respond` to `streamResponse`. The three-way reuse guard (`instructions`, `!isResponding`, `transcript.count`) correctly addresses the concurrentRequests and context-overflow risks raised in prior reviews, but the "consumed once" bound implicitly relies on an undocumented Apple API behaviour for cancelled streams.
Cotabby/App/Coordinators/SuggestionCoordinator+Input.swift	Adds `prewarmEngineForCurrentField` with proper `guard let self`, cache-reset barrier await, and generation-0 sentinel. Previous concern about the optional-chain bypass is fully resolved here.
Cotabby/Models/SuggestionSubsystemContracts.swift	Adds `prewarm(for:)` to `SuggestionGenerating` with a default no-op extension, keeping all existing conformers source-compatible.
Cotabby/Services/Runtime/SuggestionEngineRouter.swift	Adds `prewarm(for:)` that routes only to the selected engine; correctly avoids allocating an FM session on the llama-selected path.

Sequence Diagram

sequenceDiagram
    participant Coord as SuggestionCoordinator
    participant Engine as FoundationModelSuggestionEngine
    participant Apple as LanguageModelSession (Apple)

    Note over Coord: User focuses new editable field
    Coord->>Coord: cancelPredictionWork()
    Coord->>Coord: resetCachedGenerationContext() [schedules Task]
    Coord->>Coord: prewarmEngineForCurrentField() [schedules Task]

    Note over Coord: Reset Task runs
    Coord->>Engine: resetCachedGenerationContext()
    Engine->>Engine: "cachedSession = nil"

    Note over Coord: Prewarm Task runs (after barrier)
    Coord->>Coord: awaitCachedGenerationContextResetIfNeeded()
    Coord->>Engine: "prewarm(for: request[gen=0])"
    Engine->>Engine: ensureSession() → new CachedSession
    Engine->>Apple: session.prewarm() [fire-and-forget]
    Apple-->>Apple: loads weights, primes prefix cache

    Note over Coord: User types (first keystroke)
    Coord->>Engine: "generateSuggestion(for: request[gen=N])"
    Engine->>Engine: ensureSession() → reuse cached
    Engine->>Apple: session.streamResponse(to: prompt)
    Apple-->>Engine: partial snapshots (cumulative)
    Engine-->>Coord: SuggestionResult

    Note over Engine: transcript.count > pristineTranscriptCount
    Note over Coord: User types again (second keystroke)
    Coord->>Engine: "generateSuggestion(for: request[gen=N+1])"
    Engine->>Engine: ensureSession() → transcript diverged → new session
    Engine->>Apple: new LanguageModelSession(...)
    Engine->>Apple: session.streamResponse(to: prompt)
    Apple-->>Engine: partial snapshots
    Engine-->>Coord: SuggestionResult

Comments Outside Diff (1)

Cotabby/Services/Runtime/FoundationModelSuggestionEngine.swift, line 61-65 (link)

Session transcript accumulates across every keystroke in the same field

LanguageModelSession.respond(to:) appends each prompt and response to the session's internal Transcript, meaning each subsequent autocomplete call within the same editing context sees all prior prompts and model outputs as conversation history. The original code explicitly guarded against this, and that concern still applies. Over a long typing session in one field, the growing transcript will progressively steer the model away from clean single-turn completions and will eventually throw .exceededContextWindowSize (mapped to a user-visible generationFailed error). The resetCachedGenerationContext path clears the session on field change, but not between individual requests within the same field.

_{Reviews (4): Last reviewed commit: "Trigger CI after base change to main" | Re-trigger Greptile}

- Prewarm Task uses guard let self instead of optional chain so the cache-reset await cannot be silently skipped when the coordinator is deallocated mid-shutdown. - Document the SystemLanguageModel singleton assumption in ensureSession so the cache-key omission is explicit for future readers.

Local FM eval run on the full stack (with #336 bounded to single-turn sessions) showed two cases (codeComment "// This is a workaround for the bug in ", prose "The Swift compiler enforces optionals because ") that the model echoed verbatim instead of continuing. The normalizer correctly strips the echo, but the user-visible result is an empty suggestion. These cases passed when this PR was first measured because the unconditional session reuse left a growing transcript of prior (continue-do-not-echo) demonstrations on every later request — implicit in-context learning that masked the rule removal. Once the engine is bounded to single-turn sessions (#336 follow-up), the rule has to be in the instructions channel for every request, not implicit in transcript history. The new rule pairs positive framing ("Continue from the position immediately after the existing text") with the explicit prohibition that was removed, keeping the spirit of WWDC25's positive-identity guidance while restoring the hard constraint. Eval after this change: drift=3, midword=10, empty=0, noise=0 — same shape as the #335 baseline. A new test_sessionInstructions_forbidEchoingExistingText assertion pins both clauses so a future rewrite cannot silently drop them again.

- Prewarm Task uses guard let self instead of optional chain so the cache-reset await cannot be silently skipped when the coordinator is deallocated mid-shutdown. - Document the SystemLanguageModel singleton assumption in ensureSession so the cache-key omission is explicit for future readers.

The previous unconditional cache reuse exposed two Apple-surfaced failure modes that the pre-PR per-request-session design was structurally immune to: - Concurrent requests on the same session. Swift task cancellation is cooperative, so cancelPredictionWork() + schedulePrediction() can leave the previous stream still draining inside Apple's runtime when the new request calls streamResponse on the same cachedSession. Apple's session is single-flight and raises .concurrentRequests, which mapGenerationError surfaces as a user-visible generationFailed. - Transcript accumulation across keystrokes in one field. Each respond appends the prompt and response entries to session.transcript. Over ~15-20 keystrokes the 4096-token shared context fills up and Apple raises .exceededContextWindowSize. ensureSession now reuses the cached session only when it is both !isResponding and still at its pristine transcript count. After any respond has run, the next request builds a fresh session — matching the pre-PR single-turn behavior — while the prewarm benefit on the first keystroke after focus is preserved (prewarm leaves the transcript untouched). No protocol or test-fixture change; existing FoundationModelPromptRenderer and SuggestionEngineRouter tests still pass.

Local FM eval run on the full stack (with #336 bounded to single-turn sessions) showed two cases (codeComment "// This is a workaround for the bug in ", prose "The Swift compiler enforces optionals because ") that the model echoed verbatim instead of continuing. The normalizer correctly strips the echo, but the user-visible result is an empty suggestion. These cases passed when this PR was first measured because the unconditional session reuse left a growing transcript of prior (continue-do-not-echo) demonstrations on every later request — implicit in-context learning that masked the rule removal. Once the engine is bounded to single-turn sessions (#336 follow-up), the rule has to be in the instructions channel for every request, not implicit in transcript history. The new rule pairs positive framing ("Continue from the position immediately after the existing text") with the explicit prohibition that was removed, keeping the spirit of WWDC25's positive-identity guidance while restoring the hard constraint. Eval after this change: drift=3, midword=10, empty=0, noise=0 — same shape as the #335 baseline. A new test_sessionInstructions_forbidEchoingExistingText assertion pins both clauses so a future rewrite cannot silently drop them again.

greptile-apps Bot reviewed May 28, 2026

View reviewed changes

Comment thread Cotabby/App/Coordinators/SuggestionCoordinator+Input.swift

Comment thread Cotabby/Services/Runtime/FoundationModelSuggestionEngine.swift Outdated

greptile-apps Bot reviewed May 28, 2026

View reviewed changes

Comment thread Cotabby/Services/Runtime/FoundationModelSuggestionEngine.swift

FuJacob added 4 commits May 28, 2026 03:09

Reuse FM LanguageModelSession and prewarm on focus change

91d9d43

Stream Foundation Models responses for cancellable decode (#339)

83d7cf2

FuJacob force-pushed the perf/fm-session-reuse-prewarm branch from 251fc76 to 293cf46 Compare May 28, 2026 10:09

FuJacob changed the base branch from eval/fm-expand-driftset to main May 28, 2026 10:11

Trigger CI after base change to main

e12c762

FuJacob merged commit ca7a7d1 into main May 28, 2026
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Reuse FM LanguageModelSession and prewarm on focus change#336

Reuse FM LanguageModelSession and prewarm on focus change#336
FuJacob merged 5 commits into
mainfrom
perf/fm-session-reuse-prewarm

FuJacob commented May 28, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

FuJacob commented May 28, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Validation

Linked issues

Risk / rollout notes

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Comments Outside Diff (1)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

FuJacob commented May 28, 2026 •

edited by greptile-apps Bot

Loading