common : inhibit lazy grammar sampler while reasoning is active (#20970) #1
Merged
Automatically creates a prerelease with the macOS ARM64 binary on every push to feature/turboquant-kv-cache.
Made-with: Cursor
Without target_commitish, softprops/action-gh-release creates tags on the default branch (master) instead of the triggering branch.
Made-with: Cursor
Without -DLLAMA_BUILD_BORINGSSL=ON, cmake picks up Homebrew OpenSSL and links it dynamically → Team ID mismatch on codesigned macOS apps.
Changes:
- Add -DLLAMA_BUILD_BORINGSSL=ON (static SSL, no dynamic dependency)
- Add -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON (apply rpath at build time)
- Switch to -DCMAKE_INSTALL_RPATH='@loader_path' (consistent with release.yml)
- Add -DLLAMA_BUILD_TOOLS=ON
- Add verification step: otool -L check fails CI if dynamic SSL found
Made-with: Cursor
LLAMA_BUILD_BORINGSSL doesn't exist in this fork's CMakeLists.txt, so the flag was silently ignored and the binary still linked Homebrew OpenSSL.
Correct approach: disable curl and OpenSSL entirely and build all libs statically. This produces a single self-contained binary with only system dylibs (libSystem, libc++, Metal frameworks).
- BUILD_SHARED_LIBS=OFF: links libllama, libggml etc. statically
- LLAMA_CURL=OFF: no curl dependency, no HF model download
- LLAMA_OPENSSL=OFF: no OpenSSL/crypto dependency
- hw.ncpu instead of hw.logicalcpu (correct macOS sysctl key)
- Verification step: fail CI if any non-system dylib found
Made-with: Cursor
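The "fail CI if any non-system dylib found" check can be sketched as a small C++ filter over `otool -L` output. This is a sketch under assumptions, not the fork's actual CI step: it assumes system libraries live under `/usr/lib/` or `/System/Library/`, and the names `is_system_dylib` and `non_system_dylibs` are hypothetical.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Assumption: libSystem, libc++ and the Metal frameworks all live under these
// two prefixes; anything else counts as a non-system dependency.
bool is_system_dylib(const std::string & path) {
    return path.rfind("/usr/lib/", 0) == 0 ||        // starts with /usr/lib/
           path.rfind("/System/Library/", 0) == 0;   // starts with /System/Library/
}

// Parses text printed by `otool -L <binary>` and returns every dependency
// that is not a system dylib (including @rpath/@loader_path entries).
std::vector<std::string> non_system_dylibs(const std::string & otool_output) {
    std::vector<std::string> offenders;
    std::istringstream in(otool_output);
    std::string line;
    bool header = true;
    while (std::getline(in, line)) {
        if (header) { header = false; continue; }    // first line is "<binary>:"
        const auto start = line.find_first_not_of(" \t");
        if (start == std::string::npos) continue;    // skip blank lines
        // The path ends where " (compatibility version ..." begins.
        const auto end = line.find(" (", start);
        const std::string path = line.substr(
            start, end == std::string::npos ? std::string::npos : end - start);
        if (!is_system_dylib(path)) offenders.push_back(path);
    }
    return offenders;
}
```

A CI step could capture `otool -L` output for each built binary, pass it to `non_system_dylibs`, and exit non-zero when the result is non-empty. Since `BUILD_SHARED_LIBS=OFF` should leave no `@rpath/...` dependencies at all, this filter deliberately flags those entries too.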
…-org#20970)
* common : inhibit grammar while reasoning budget is active
* cont : update force_pos in accept
* cont : fix tests
* cont : tweak should apply logic
* cont : return early not using grammar sampler
* Add tests
* cont : prevent backend sampling when reasoning budget enabled
* cont : fix typo
---------
Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>
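To illustrate the idea in this squash message (skip the grammar constraint while reasoning is active, returning early instead of applying it), here is a minimal C++ sketch. It is not llama.cpp's sampler API: `candidate`, `reasoning_gated_grammar`, the think-start/think-end token ids and the predicate-style grammar are all hypothetical, and the lazy-trigger behaviour of the real grammar sampler is not modeled.

```cpp
#include <cassert>
#include <functional>
#include <limits>
#include <utility>
#include <vector>

using token_id = int;  // hypothetical token id type

struct candidate {     // hypothetical (token, logit) pair
    token_id id;
    float    logit;
};

// Hypothetical grammar interface: is `tok` allowed as the next token?
using grammar_allows_fn = std::function<bool(token_id)>;

class reasoning_gated_grammar {
public:
    reasoning_gated_grammar(grammar_allows_fn allows, token_id think_start, token_id think_end)
        : allows_(std::move(allows)), think_start_(think_start), think_end_(think_end) {}

    // The grammar only constrains output outside the reasoning block.
    bool should_apply() const { return !in_reasoning_; }

    void apply(std::vector<candidate> & cands) const {
        if (!should_apply()) return;  // early return: grammar inhibited while reasoning
        for (auto & c : cands) {
            if (!allows_(c.id)) c.logit = -std::numeric_limits<float>::infinity();
        }
    }

    // Called with each sampled token to track whether reasoning is open.
    void accept(token_id tok) {
        if (tok == think_start_)    in_reasoning_ = true;
        else if (tok == think_end_) in_reasoning_ = false;
    }

private:
    grammar_allows_fn allows_;
    token_id think_start_;
    token_id think_end_;
    bool in_reasoning_ = false;
};
```

Tracking the reasoning state in `accept` and checking it in `apply` skips the mask entirely during reasoning rather than applying and undoing it. A stateful grammar would additionally need to advance its own state only for tokens accepted outside the reasoning block.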
6879672 to 7b8820c
Vect0rM pushed a commit that referenced this pull request Apr 21, 2026
Codex post-commit review found:
1. TURBO_D was QK_TURBO3 (now 32): broke turbo4 C array sizes
2. SET_ROWS kernel turbo3-specific but instantiated for turbo4
3. Tail block drop for non-128 head dims
Fixed #3 (TURBO_D). #1 and #2 don't affect turbo3+dk128 path.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
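Item 3 (tail block drop for non-128 head dims) matches the usual floor-division pitfall. A minimal C++ sketch, assuming a hypothetical per-block dimension of 128 (`kBlockDim` is an illustrative name, not the fork's `TURBO_D`), shows how `head_dim / kBlockDim` loses the remainder while ceiling division keeps a final partial block.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical block dimension; the fork's actual TURBO_D value is not assumed.
constexpr std::size_t kBlockDim = 128;

// Floor division as a loop bound: any remainder is silently dropped.
constexpr std::size_t blocks_floor(std::size_t head_dim) { return head_dim / kBlockDim; }

// Ceiling division: adds one partially filled block when there is a remainder.
constexpr std::size_t blocks_ceil(std::size_t head_dim) {
    return (head_dim + kBlockDim - 1) / kBlockDim;
}

// Trailing elements a floor-division loop never visits.
constexpr std::size_t dropped_elements(std::size_t head_dim) {
    return head_dim - blocks_floor(head_dim) * kBlockDim;
}

// With head_dim == 128 both bounds agree, so nothing is dropped.
static_assert(blocks_floor(128) == blocks_ceil(128), "dk128 has no tail");
static_assert(dropped_elements(192) == 64, "floor division drops 64 of 192");
```

Whether the fix should round up (and pad or mask the partial block) or reject unsupported head dims depends on the kernel, which this sketch does not model.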
Vect0rM pushed a commit that referenced this pull request Apr 21, 2026
Complete experiment log:
#1 4-mag LUT: 15.1 at 8K (BEST, +38%)
#2 Batched extract: 13.7 (+25%)
#3 Inline FA block: 13.5 (I-cache pressure)
#4 Deferred norm: 12.9 (loses ILP)
#5 2-pair half2: 12.0 (ternary overhead)
#6 Select chain: 11.9 (branches kill)
#7 Bit-arithmetic: 11.6 (ALU too heavy)
#8 FMA branchless: 11.4 (ALU still too heavy)
#9 Named-reg ternary: 10.3 (branches worst)
#10 Main (8-LUT): 10.95 (baseline)
#11 Non-vec FA: 10.2 (wrong kernel)
Ceiling: 24.5 (no dequant)
Apple8 hardware truth:
- 1 divergent constant read < 7 ALU ops (even with fma)
- Branches cost MORE than divergent constant reads
- Array indexing ALWAYS spills on Metal
- 4 constant addresses is the sweet spot
The 4-mag LUT is the dequant-level ceiling on Apple Silicon.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
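The log does not spell out the 4-mag layout. One plausible reading, assumed here and not confirmed by the commit, is a symmetric 3-bit codebook where bit 2 is a sign and bits 0-1 index one of four magnitudes, so the constant table shrinks from 8 entries to 4. This C++ sketch (placeholder centroid values, hypothetical names) only checks that both lookups decode identically; it says nothing about Metal performance, which is what the log actually measures.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical symmetric codebook: 4 magnitudes mirrored by a sign bit.
// Placeholder values, not the fork's actual centroids.
constexpr std::array<float, 4> kMag = {0.25f, 0.75f, 1.25f, 1.75f};

// Baseline-style table ("Main (8-LUT)" in the log, as assumed here):
// one entry per 3-bit code.
constexpr std::array<float, 8> kLut8 = {
     0.25f,  0.75f,  1.25f,  1.75f,   // codes 0..3: sign bit clear
    -0.25f, -0.75f, -1.25f, -1.75f,   // codes 4..7: sign bit set
};

// 8-LUT decode: a single read from an 8-entry table.
inline float dequant_lut8(std::uint8_t code) { return kLut8[code & 7u]; }

// 4-mag decode: read one of 4 magnitudes, then apply the sign from bit 2.
inline float dequant_mag4(std::uint8_t code) {
    const float m = kMag[code & 3u];
    return (code & 4u) ? -m : m;
}
```

Negation is exact in IEEE float, so the two decoders agree bit-for-bit on every code; the trade-off the log explores is purely how each form maps to constant reads and ALU work on Apple GPUs.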
common : inhibit grammar while reasoning budget is active
cont : update force_pos in accept
cont : fix tests
cont : tweak should apply logic
cont : return early not using grammar sampler
Add tests
cont : prevent backend sampling when reasoning budget enabled
cont : fix typo