Skip to content

Reuse FM LanguageModelSession and prewarm on focus change#336

Merged
FuJacob merged 5 commits into
mainfrom
perf/fm-session-reuse-prewarm
May 28, 2026
Merged

Reuse FM LanguageModelSession and prewarm on focus change#336
FuJacob merged 5 commits into
mainfrom
perf/fm-session-reuse-prewarm

Conversation

@FuJacob

@FuJacob FuJacob commented May 28, 2026

Copy link
Copy Markdown
Owner

Summary

Cotabby's Apple Foundation Models path currently constructs a new LanguageModelSession on every keystroke, which throws away Apple's instruction-prefix cache and pays full instruction tokenization on each request. This change holds the session as long as the rendered instructions are unchanged, and asks the engine to prewarm() after focus arrives on a new editable surface so weight loading and prefix tokenization happen before the user types instead of inside the critical path of the first respond.

Stacked on eval/fm-expand-driftset — verify with that PR's expanded eval suite once both land.

Validation

xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' build CODE_SIGNING_ALLOWED=NO
** BUILD SUCCEEDED **

xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' build-for-testing CODE_SIGNING_ALLOWED=NO
** TEST BUILD SUCCEEDED **

xcodebuild test ... -only-testing:CotabbyTests/FoundationModelPromptRendererTests -only-testing:CotabbyTests/SuggestionEngineRouterTests
Executed 9 tests, with 0 failures (0 unexpected)

swiftlint lint --quiet → no violations.

End-to-end FM perf check expected to be done with the eval suite from the parent PR.

Linked issues

Refs the FM-quality investigation.

Risk / rollout notes

  • SuggestionGenerating gained prewarm(for:). A default no-op extension keeps every existing conformer (llama engine, router, unavailable engine, the StubSuggestionEngine in tests) source-compatible. Only FM provides a real implementation.
  • The cached session is keyed on the rendered instructions string. Any user-driven change that alters instructions (language override, custom rules) flows through FoundationModelPromptRenderer.sessionInstructions(for:), so its content drives invalidation correctly without an extra cache-busting hook.
  • resetCachedGenerationContext continues to clear the cache, so a field change or settings edit still produces a fresh session next request. The new prewarm Task awaits the existing cache-reset barrier (awaitCachedGenerationContextResetIfNeeded) so prewarm always runs after the reset and cannot have its session nulled out from under it.
  • Prewarm is gated on a real field-change event (hasFocusedElementChanged), not on every focus poll, so the call site does not thrash on noisy AX updates. The engine internally skips work when the session is already cached, so duplicate calls are cheap regardless.

Greptile Summary

This PR introduces two related optimisations to the Apple Foundation Models path: a bounded LanguageModelSession cache (keyed on rendered instructions, guarded by !isResponding and a pristine-transcript check) so the prewarmed session is handed to the first real keystroke request, and a prewarm(for:) hook wired into the focus-change path so weight loading and instruction tokenisation happen before the user types rather than inside respond's critical path.

  • Session caching (ensureSession): reuse is correctly limited to one request — the three-way predicate addresses the concurrentRequests and exceededContextWindowSize failure modes flagged in prior reviews. The guard let self fix in prewarmEngineForCurrentField and the cache-reset barrier order are both handled correctly.
  • Protocol extension: prewarm(for:) is added to SuggestionGenerating with a default no-op, keeping llama, router, stub, and unavailable engines source-compatible with zero changes.
  • Streaming: respond is replaced by streamResponse with a cumulative-snapshot loop, enabling cooperative cancellation mid-decode; the didReceiveSnapshot guard correctly surfaces a zero-output stream as a generationFailed error.

Confidence Score: 4/5

Safe to merge. The session-reuse logic is well-guarded and the prewarm plumbing is correctly sequenced behind the cache-reset barrier.

The three-way ensureSession predicate correctly avoids both the concurrent-stream and transcript-accumulation failure modes. One open question — whether Apple's runtime updates transcript.count when a stream is cancelled mid-flight — means the "consumed once" bound is a documented assumption rather than a verified guarantee. This does not produce incorrect output in any observed path, but the behaviour under rapid keystroke cancellations relies on an undocumented Apple API contract.

Cotabby/Services/Runtime/FoundationModelSuggestionEngine.swift — specifically the ensureSession reuse predicate and the transcript.count == pristineTranscriptCount condition for cancelled streams.

Important Files Changed

Filename Overview
Cotabby/Services/Runtime/FoundationModelSuggestionEngine.swift Core of the change: adds cachedSession, ensureSession reuse predicate, prewarm(for:), and switches from respond to streamResponse. The three-way reuse guard (instructions, !isResponding, transcript.count) correctly addresses the concurrentRequests and context-overflow risks raised in prior reviews, but the "consumed once" bound implicitly relies on an undocumented Apple API behaviour for cancelled streams.
Cotabby/App/Coordinators/SuggestionCoordinator+Input.swift Adds prewarmEngineForCurrentField with proper guard let self, cache-reset barrier await, and generation-0 sentinel. Previous concern about the optional-chain bypass is fully resolved here.
Cotabby/Models/SuggestionSubsystemContracts.swift Adds prewarm(for:) to SuggestionGenerating with a default no-op extension, keeping all existing conformers source-compatible.
Cotabby/Services/Runtime/SuggestionEngineRouter.swift Adds prewarm(for:) that routes only to the selected engine; correctly avoids allocating an FM session on the llama-selected path.

Sequence Diagram

sequenceDiagram
    participant Coord as SuggestionCoordinator
    participant Engine as FoundationModelSuggestionEngine
    participant Apple as LanguageModelSession (Apple)

    Note over Coord: User focuses new editable field
    Coord->>Coord: cancelPredictionWork()
    Coord->>Coord: resetCachedGenerationContext() [schedules Task]
    Coord->>Coord: prewarmEngineForCurrentField() [schedules Task]

    Note over Coord: Reset Task runs
    Coord->>Engine: resetCachedGenerationContext()
    Engine->>Engine: "cachedSession = nil"

    Note over Coord: Prewarm Task runs (after barrier)
    Coord->>Coord: awaitCachedGenerationContextResetIfNeeded()
    Coord->>Engine: "prewarm(for: request[gen=0])"
    Engine->>Engine: ensureSession() → new CachedSession
    Engine->>Apple: session.prewarm() [fire-and-forget]
    Apple-->>Apple: loads weights, primes prefix cache

    Note over Coord: User types (first keystroke)
    Coord->>Engine: "generateSuggestion(for: request[gen=N])"
    Engine->>Engine: ensureSession() → reuse cached
    Engine->>Apple: session.streamResponse(to: prompt)
    Apple-->>Engine: partial snapshots (cumulative)
    Engine-->>Coord: SuggestionResult

    Note over Engine: transcript.count > pristineTranscriptCount
    Note over Coord: User types again (second keystroke)
    Coord->>Engine: "generateSuggestion(for: request[gen=N+1])"
    Engine->>Engine: ensureSession() → transcript diverged → new session
    Engine->>Apple: new LanguageModelSession(...)
    Engine->>Apple: session.streamResponse(to: prompt)
    Apple-->>Engine: partial snapshots
    Engine-->>Coord: SuggestionResult
Loading

Comments Outside Diff (1)

  1. Cotabby/Services/Runtime/FoundationModelSuggestionEngine.swift, line 61-65 (link)

    P1 Session transcript accumulates across every keystroke in the same field

    LanguageModelSession.respond(to:) appends each prompt and response to the session's internal Transcript, meaning each subsequent autocomplete call within the same editing context sees all prior prompts and model outputs as conversation history. The original code explicitly guarded against this, and that concern still applies. Over a long typing session in one field, the growing transcript will progressively steer the model away from clean single-turn completions and will eventually throw .exceededContextWindowSize (mapped to a user-visible generationFailed error). The resetCachedGenerationContext path clears the session on field change, but not between individual requests within the same field.

    Fix in Codex Fix in Claude Code

Fix All in Codex Fix All in Claude Code

Reviews (4): Last reviewed commit: "Trigger CI after base change to main" | Re-trigger Greptile

Comment thread Cotabby/App/Coordinators/SuggestionCoordinator+Input.swift
Comment thread Cotabby/Services/Runtime/FoundationModelSuggestionEngine.swift Outdated
FuJacob added a commit that referenced this pull request May 28, 2026
- Prewarm Task uses guard let self instead of optional chain so the
  cache-reset await cannot be silently skipped when the coordinator is
  deallocated mid-shutdown.
- Document the SystemLanguageModel singleton assumption in
  ensureSession so the cache-key omission is explicit for future
  readers.
Comment thread Cotabby/Services/Runtime/FoundationModelSuggestionEngine.swift
FuJacob added a commit that referenced this pull request May 28, 2026
Local FM eval run on the full stack (with #336 bounded to single-turn
sessions) showed two cases (codeComment "// This is a workaround for the
bug in ", prose "The Swift compiler enforces optionals because ") that
the model echoed verbatim instead of continuing. The normalizer correctly
strips the echo, but the user-visible result is an empty suggestion.

These cases passed when this PR was first measured because the
unconditional session reuse left a growing transcript of prior
(continue-do-not-echo) demonstrations on every later request — implicit
in-context learning that masked the rule removal. Once the engine is
bounded to single-turn sessions (#336 follow-up), the rule has to be in
the instructions channel for every request, not implicit in transcript
history.

The new rule pairs positive framing ("Continue from the position
immediately after the existing text") with the explicit prohibition that
was removed, keeping the spirit of WWDC25's positive-identity guidance
while restoring the hard constraint. Eval after this change: drift=3,
midword=10, empty=0, noise=0 — same shape as the #335 baseline.

A new test_sessionInstructions_forbidEchoingExistingText assertion pins
both clauses so a future rewrite cannot silently drop them again.
FuJacob added 4 commits May 28, 2026 03:09
- Prewarm Task uses guard let self instead of optional chain so the
  cache-reset await cannot be silently skipped when the coordinator is
  deallocated mid-shutdown.
- Document the SystemLanguageModel singleton assumption in
  ensureSession so the cache-key omission is explicit for future
  readers.
The previous unconditional cache reuse exposed two Apple-surfaced
failure modes that the pre-PR per-request-session design was
structurally immune to:

- Concurrent requests on the same session. Swift task cancellation is
  cooperative, so cancelPredictionWork() + schedulePrediction() can
  leave the previous stream still draining inside Apple's runtime when
  the new request calls streamResponse on the same cachedSession.
  Apple's session is single-flight and raises .concurrentRequests,
  which mapGenerationError surfaces as a user-visible generationFailed.

- Transcript accumulation across keystrokes in one field. Each respond
  appends the prompt and response entries to session.transcript. Over
  ~15-20 keystrokes the 4096-token shared context fills up and Apple
  raises .exceededContextWindowSize.

ensureSession now reuses the cached session only when it is both
!isResponding and still at its pristine transcript count. After any
respond has run, the next request builds a fresh session — matching the
pre-PR single-turn behavior — while the prewarm benefit on the first
keystroke after focus is preserved (prewarm leaves the transcript
untouched).

No protocol or test-fixture change; existing FoundationModelPromptRenderer
and SuggestionEngineRouter tests still pass.
@FuJacob FuJacob force-pushed the perf/fm-session-reuse-prewarm branch from 251fc76 to 293cf46 Compare May 28, 2026 10:09
@FuJacob FuJacob changed the base branch from eval/fm-expand-driftset to main May 28, 2026 10:11
@FuJacob FuJacob merged commit ca7a7d1 into main May 28, 2026
4 checks passed
FuJacob added a commit that referenced this pull request May 28, 2026
Local FM eval run on the full stack (with #336 bounded to single-turn
sessions) showed two cases (codeComment "// This is a workaround for the
bug in ", prose "The Swift compiler enforces optionals because ") that
the model echoed verbatim instead of continuing. The normalizer correctly
strips the echo, but the user-visible result is an empty suggestion.

These cases passed when this PR was first measured because the
unconditional session reuse left a growing transcript of prior
(continue-do-not-echo) demonstrations on every later request — implicit
in-context learning that masked the rule removal. Once the engine is
bounded to single-turn sessions (#336 follow-up), the rule has to be in
the instructions channel for every request, not implicit in transcript
history.

The new rule pairs positive framing ("Continue from the position
immediately after the existing text") with the explicit prohibition that
was removed, keeping the spirit of WWDC25's positive-identity guidance
while restoring the hard constraint. Eval after this change: drift=3,
midword=10, empty=0, noise=0 — same shape as the #335 baseline.

A new test_sessionInstructions_forbidEchoingExistingText assertion pins
both clauses so a future rewrite cannot silently drop them again.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant