Add multi-branch (beam) search to the constrained decoder by FuJacob · Pull Request #518 · FuJacob/cotabby

FuJacob · 2026-06-02T01:25:16Z

Summary

Adds a multi-branch (beam) search to the constrained decoder so it can explore several short
continuations at once and keep the highest-scoring one, instead of committing to a single greedy
token at each step (greedy argmax can pick a locally-best token that leads nowhere good).

The search is written against a small BeamLogitsProvider closure — "give me the next-token logits
for this branch's token path" — so the whole algorithm is unit-tested against a scripted closure that
returns fixed logits, with no model required. The live llama adapter is a thin KV-sync method on
the runtime (beamLogits / syncBeamSequence): before reading a branch's logits it trims the shared
sequence's KV cache back to the longest shared prefix and re-accepts the remainder, all under the
existing autocompleteLock, and the existing KV-trim defer restores prompt-only state afterward. (A
closure rather than a protocol object because the inference engine is a noncopyable C++ type that
cannot be stored in a separate adapter object.)

Each branch reuses the same machinery as the greedy path: the token profile, the no-repeat-ngram
guard, the sentence-boundary stop, single-line newline masking, and the confidence-suppression gate.
Branches are scored by cumulative log-probability and the result is ranked by mean log-probability.

Gated behind the existing developer flags: cotabbyConstrainedDecoderEnabled (default off) plus a new
cotabbyConstrainedBeamWidth (default 1 = the existing single-path greedy decode). The shipping
sampler path is untouched.

Validation

xcodebuild ... test ... CODE_SIGNING_ALLOWED=NO CODE_SIGNING_REQUIRED=NO \
  -only-testing:CotabbyTests/ConstrainedBeamSearchTests \
  -only-testing:CotabbyTests/ConstrainedSamplerTests \
  -only-testing:CotabbyTests/LlamaSuggestionEngineCancellationTests
# ** TEST SUCCEEDED **
#   ConstrainedBeamSearchTests: 8/8 — greedy width-1, beam exploring a second-best first token that
#     greedy misses, EOG stop, single-line newline not emitted, sentence-boundary stop, max-token
#     budget, plus rankedAdmissibleTokens ordering/cap
#   ConstrainedSamplerTests + cancellation tests: unchanged, green

swiftlint --strict --quiet   # exit 0 (clean)
xcodegen generate            # registered the two new files (CI drift guard passes)

Linked issues

None. Brings the constrained decoder up to a real multi-branch search.

Risk / rollout notes

Default-off, twice over. Beam only runs when the constrained decoder flag is on AND
cotabbyConstrainedBeamWidth > 1. With the defaults, behavior is byte-for-byte the existing
greedy/sampler path. No change to shipped suggestion behavior.
The algorithm is fully unit-tested; the engine adapter is intentionally thin. The only part not
exercised by tests is the ~25-line EngineBeamStepper KV-sync (it uses the same trimKV /
acceptToken / getNextTokenLogits calls the greedy decoder already relies on). Enabling beam by
default still needs on-device evaluation of completion quality and per-keystroke latency (beam width
W decodes up to ~W× the greedy token count) — that is the next step before promoting the flag.
A future optimization could add KV snapshot/restore to the inference engine to avoid the
trim-and-re-accept on each branch switch; not required for correctness.

Greptile Summary

This PR adds a multi-branch (beam) search mode to the constrained decoder, letting it explore several continuations per step and return the highest mean-log-probability one instead of committing greedily to each token. The new path is double-gated (cotabbyConstrainedDecoderEnabled AND cotabbyConstrainedBeamWidth > 1), so shipped behavior is unchanged.

ConstrainedBeamSearch.swift implements the search engine as a pure closure-based algorithm (BeamLogitsProvider), reusing the same token profile, no-repeat-ngram guard, single-line newline masking, and sentence-boundary stop as the greedy path; 8 unit tests exercise all core stop conditions against a scripted logits closure.
LlamaRuntimeCore adds runConstrainedBeamDecode and a syncBeamSequence KV-cache sync helper that trims the shared sequence to the longest common prefix of the current and target paths before re-accepting branch tokens, all under the existing autocompleteLock.
ConstrainedSampler gains rankedAdmissibleTokens (the multi-candidate analogue of selectToken) and LlamaGenerationOptions gains beamWidth: Int = 1, intentionally excluded from SamplingFingerprint to preserve KV reuse.

Confidence Score: 3/5

Safe to merge as-shipped since the beam path is double-gated and off by default, but the decode loop holds autocompleteLock without any cancellation check, which will degrade responsiveness noticeably during dogfood testing with the flags enabled.

The new beam decode path does not check Task.isCancelled between steps, unlike both the sampled and greedy decoders that break early on cancel. Under fast typing, each new keystroke supersedes the previous request; for greedy/sampled the lock is freed after 1-2 tokens, but for beam the lock is held for the full W x N step search, making every subsequent request wait. A secondary gap is that branches with no remaining admissible tokens are silently discarded rather than collected as partial completions, inconsistent with the greedy decoder behavior.

Cotabby/Services/Runtime/LlamaRuntimeCore.swift (missing cancellation in runConstrainedBeamDecode) and Cotabby/Support/ConstrainedBeamSearch.swift (silent branch loss on empty candidate list).

Important Files Changed

Filename	Overview
Cotabby/Support/ConstrainedBeamSearch.swift	New file implementing the beam search engine; algorithmically sound for the happy path, but branches that exhaust all admissible tokens are silently dropped rather than completed, inconsistent with the greedy decoder behavior.
Cotabby/Services/Runtime/LlamaRuntimeCore.swift	Adds runConstrainedBeamDecode and KV-sync helpers; logic is correct but lacks the Task.isCancelled cooperative cancellation present in both other decode paths, causing the lock to be held for the full beam budget even when the request is superseded.
Cotabby/Support/ConstrainedSampler.swift	Adds rankedAdmissibleTokens; implementation is correct and deterministic, but the docstring incorrectly claims it produces the same candidate set as selectToken when topK < vocabSize.
Cotabby/Services/Runtime/LlamaSuggestionEngine.swift	Adds constrainedBeamWidth UserDefaults knob with a safe default and threads it into LlamaGenerationOptions; clean and low-risk.
Cotabby/Models/LlamaRuntimeModels.swift	Adds beamWidth: Int = 1 to LlamaGenerationOptions with correct exclusion from SamplingFingerprint matching the existing useConstrainedDecoder approach.
CotabbyTests/ConstrainedBeamSearchTests.swift	Good unit test coverage of the main beam-search behaviors with a scripted logits provider; no test for the empty-candidates case.
Cotabby.xcodeproj/project.pbxproj	Registers ConstrainedBeamSearch.swift in the main target and ConstrainedBeamSearchTests.swift in the test target; straightforward Xcode project file update.

Sequence Diagram

sequenceDiagram
    participant SE as LlamaSuggestionEngine
    participant RC as LlamaRuntimeCore
    participant BS as ConstrainedBeamSearch
    participant KV as Engine (KV cache)

    SE->>RC: "generate(options beamWidth>1)"
    RC->>RC: obtain sequence, hold autocompleteLock
    RC->>BS: search(nextLogits closure, profile, config)
    loop each beam step
        loop each live branch
            BS->>RC: nextLogits(branch.tokenIDs)
            RC->>KV: trimKV to common prefix
            RC->>KV: acceptToken for remainder
            KV-->>RC: logits[vocabSize]
            RC-->>BS: [Float]
            BS->>BS: expand across topK candidates
            BS->>BS: route stops to completed, rest to live
        end
        BS->>BS: prune live to beamWidth
    end
    BS-->>RC: [BeamCandidate] sorted by meanLogprob
    RC->>RC: shouldSuppress check
    RC->>KV: trimKV to prompt only
    RC-->>SE: best.text

Comments Outside Diff (1)

Cotabby/Services/Runtime/LlamaRuntimeCore.swift, line 419-542 (link)

Missing cooperative cancellation in beam decode

Both the sampled decoder (runEngineSampledDecode) and the greedy constrained decoder (runConstrainedDecode) check Task.isCancelled at each token step and break early, releasing autocompleteLock promptly. runConstrainedBeamDecode holds the lock for the entire ConstrainedBeamSearch.search call with no cancellation hook anywhere in the search loop. During fast typing, every new keystroke supersedes the in-flight generation — for greedy/sampled this means only ~1–2 tokens are decoded before the lock is freed; for beam with width W and budget N, the next request waits up to W×N KV syncs before it can proceed.

_{Reviews (1): Last reviewed commit: "Add multi-branch (beam) search to the co..." | Re-trigger Greptile}

Greptile also left 2 inline comments on this PR.

Greedy argmax can commit to a locally-best token that leads nowhere good. This adds a beam search that explores several short continuations at once and keeps the highest-scoring one (by mean log-probability), reusing the same token profile, no-repeat-ngram guard, sentence-boundary stop, single-line masking, and confidence suppression as the greedy path. The algorithm runs against a BeamLogitsProvider closure so it is fully unit-tested against a scripted fake (no model needed); the live llama adapter (beamLogits/syncBeamSequence) is a thin KV-sync method that trims the shared sequence to a branch's prefix and re-accepts the rest. A closure rather than a protocol object because CotabbyInferenceEngine is a noncopyable C++ type. Gated behind the existing cotabbyConstrainedDecoderEnabled flag (default off) plus a new cotabbyConstrainedBeamWidth (default 1 = the existing greedy decode), so shipped behavior is unchanged. Enabling beam by default still needs on-device quality/latency evaluation.

greptile-apps · 2026-06-02T01:33:28Z

+        let blocked = RepetitionGuard.blockedTokens(
+            history: branch.tokenIDs,
+            ngramSize: configuration.noRepeatNgramSize
+        )
+        let candidates = ConstrainedSampler.rankedAdmissibleTokens(
+            logits: logits,
+            profile: profile,
+            admissibleTokenIDs: nil,
+            topK: configuration.topK,
+            blockedTokenIDs: blocked
+        )
+        for tokenID in candidates {
+            if profile.isEndOfGeneration(tokenID) {
+                completed.append(branch)
+                continue
+            }
+            if isSingleLine, profile.isNewline(tokenID) {
+                completed.append(branch)
+                continue
+            }
+            let tokenBytes = profile.bytes(for: tokenID)
+            let child = extend(branch, by: tokenID, tokenBytes: tokenBytes, logits: logits)
+            if Self.completesSentence(child.bytes, lastTokenBytes: tokenBytes) {
+                completed.append(child)
+            } else {
+                live.append(child)
+            }
+        }
+    }


Branch silently dropped when no admissible tokens remain

When rankedAdmissibleTokens returns an empty list (all tokens excluded or blocked by the no-repeat-ngram guard), the for tokenID in candidates loop body never executes. The branch is appended to neither live nor completed, so it disappears silently. In contrast, the greedy constrained decoder breaks with stopReason = "no_admissible_token" and returns whatever it has generated so far. A branch with partial output that cannot be extended should be treated as a completion (completed.append(branch)) rather than discarded, to stay consistent with the greedy behavior.

greptile-apps · 2026-06-02T01:33:29Z

+    /// The admissible token ids for a step, ranked highest-logit first. Survivors are the same set
+    /// `selectToken` would consider — in-range, not `profile.isExcluded`, not in `blockedTokenIDs`,
+    /// and, when `admissibleTokenIDs` is non-nil, members of that set — and at most `topK` are
+    /// returned. This is the multi-candidate form of `selectToken`: the beam search expands a branch
+    /// across these instead of committing to the single best. Ties break by lower id for determinism.


Docstring overstates equivalence with selectToken's topK semantics

The comment says "Survivors are the same set selectToken would consider", but the two functions handle topK differently. selectToken pre-filters to the top-K tokens by raw logit (via candidatePool) and then applies constraints — so a token at raw-logit rank K+1 is never seen even if it would pass all filters. rankedAdmissibleTokens filters the full vocabulary first and then takes the top-K survivors, so a token at raw-logit rank K+1 that passes all filters does appear. The beam search therefore explores a strictly broader set than greedy for any topK < vocabSize, which is intentional, but the "same set" claim is inaccurate and may confuse future maintainers comparing the two code paths.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

FuJacob merged commit 01fe1bb into main Jun 2, 2026
4 checks passed

FuJacob deleted the feat/constrained-beam-search branch June 2, 2026 01:30

greptile-apps Bot reviewed Jun 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add multi-branch (beam) search to the constrained decoder#518

Add multi-branch (beam) search to the constrained decoder#518
FuJacob merged 1 commit into
mainfrom
feat/constrained-beam-search

FuJacob commented Jun 2, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

Uh oh!

greptile-apps Bot Jun 2, 2026

Uh oh!

greptile-apps Bot Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

FuJacob commented Jun 2, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Validation

Linked issues

Risk / rollout notes

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Comments Outside Diff (1)

Uh oh!

Uh oh!

greptile-apps Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

FuJacob commented Jun 2, 2026 •

edited by greptile-apps Bot

Loading