common : inhibit lazy grammar sampler while reasoning is active (#20970) #1

Merged: Vect0rM merged 5 commits into feature/turboquant-kv-cache from fix/qwen on Mar 31, 2026

@Ooooze Ooooze commented Mar 30, 2026

  • common : inhibit grammar while reasoning budget is active

  • cont : update force_pos in accept

  • cont : fix tests

  • cont : tweak should apply logic

  • cont : return early not using grammar sampler

  • Add tests

  • cont : prevent backend sampling when reasoning budget enabled

  • cont : fix typo
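
The commit list above describes keeping a lazily triggered grammar sampler inactive while the model is still reasoning. A minimal sketch of that gating idea in Python follows. It is not the llama.cpp implementation: the class name, the `<think>`/`</think>` markers, and the `<tool_call>` trigger are hypothetical and chosen only to illustrate the control flow.

```python
from dataclasses import dataclass


@dataclass
class LazyGrammarGate:
    """Toy model of a lazy grammar that must stay off during reasoning.

    Every name here is hypothetical; this is not the llama.cpp API.
    """
    trigger: str = "<tool_call>"    # assumed text that arms the grammar
    think_open: str = "<think>"     # assumed reasoning start marker
    think_close: str = "</think>"   # assumed reasoning end marker
    reasoning_active: bool = False
    grammar_active: bool = False

    def accept(self, piece: str) -> None:
        """Update state after a piece of text has been sampled."""
        if piece == self.think_open:
            self.reasoning_active = True
        elif piece == self.think_close:
            self.reasoning_active = False
        elif piece == self.trigger and not self.reasoning_active:
            # Only a trigger seen outside the reasoning span arms the grammar.
            self.grammar_active = True

    def should_apply_grammar(self) -> bool:
        """The grammar constrains sampling only when armed and not reasoning."""
        return self.grammar_active and not self.reasoning_active


def run(pieces):
    """Feed pieces through the gate and record the decision after each one."""
    gate = LazyGrammarGate()
    states = []
    for piece in pieces:
        gate.accept(piece)
        states.append(gate.should_apply_grammar())
    return states
```

In this sketch a trigger that appears inside the reasoning span is ignored, so the grammar is armed only by a trigger emitted after reasoning has ended.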



Vect0rM and others added 5 commits March 30, 2026 12:34
Automatically creates a prerelease with the macOS ARM64 binary
on every push to feature/turboquant-kv-cache.

Made-with: Cursor

Without target_commitish, softprops/action-gh-release creates tags
on the default branch (master) instead of the triggering branch.

Made-with: Cursor

Without -DLLAMA_BUILD_BORINGSSL=ON, cmake picks up Homebrew OpenSSL
and links it dynamically, causing a Team ID mismatch in codesigned macOS apps.

Changes:
- Add -DLLAMA_BUILD_BORINGSSL=ON (static SSL, no dynamic dependency)
- Add -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON (apply rpath at build time)
- Switch to -DCMAKE_INSTALL_RPATH='@loader_path' (consistent with release.yml)
- Add -DLLAMA_BUILD_TOOLS=ON
- Add verification step: otool -L check fails CI if dynamic SSL found

Made-with: Cursor

LLAMA_BUILD_BORINGSSL doesn't exist in this fork's CMakeLists.txt, so
the flag was silently ignored and the binary still linked Homebrew OpenSSL.

Correct approach: disable curl and OpenSSL entirely, build all libs
statically. Produces a single self-contained binary with only system
dylibs (libSystem, libc++, Metal frameworks).

- BUILD_SHARED_LIBS=OFF — links libllama, libggml etc. statically
- LLAMA_CURL=OFF — no curl dependency, no HF model download
- LLAMA_OPENSSL=OFF — no OpenSSL/crypto dependency
- hw.ncpu instead of hw.logicalcpu (correct macOS sysctl key)
- Verification step: fail CI if any non-system dylib found

Made-with: Cursor
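
The verification step described above (fail CI if any non-system dylib is linked) could be sketched as a small parser over `otool -L` output. This is an illustration, not the workflow's actual script: the function name is hypothetical, and treating `/usr/lib/` and `/System/Library/` as the only system prefixes is an assumption.

```python
# Assumed definition of "system" locations; adjust to the project's policy.
SYSTEM_PREFIXES = ("/usr/lib/", "/System/Library/")


def non_system_dylibs(otool_output):
    """Return linked library paths that are not under a system prefix.

    Expects text shaped like the output of `otool -L <binary>`: a header
    line naming the binary, then indented lines of the form
    '<path> (compatibility version ..., current version ...)'.
    """
    offenders = []
    for line in otool_output.splitlines():
        if not line[:1].isspace():
            continue  # header line naming the binary, or an empty line
        stripped = line.strip()
        if not stripped:
            continue
        # The library path is everything before the " (" version suffix.
        path = stripped.split(" (", 1)[0]
        if not path.startswith(SYSTEM_PREFIXES):
            offenders.append(path)
    return offenders
```

A CI step could pipe the `otool -L` output for the built binary into this function and exit non-zero when the returned list is not empty. Note that `@rpath/...` entries are also reported under this assumption.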
…-org#20970)

* common : inhibit grammar while reasoning budget is active

* cont : update force_pos in accept

* cont : fix tests

* cont : tweak should apply logic

* cont : return early not using grammar sampler

* Add tests

* cont : prevent backend sampling when reasoning budget enabled

* cont : fix typo

---------

Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>
@Vect0rM Vect0rM force-pushed the feature/turboquant-kv-cache branch from 6879672 to 7b8820c Compare March 31, 2026 08:21
@Vect0rM Vect0rM merged commit d785414 into feature/turboquant-kv-cache Mar 31, 2026
8 of 44 checks passed
Vect0rM pushed a commit that referenced this pull request Apr 21, 2026
Codex post-commit review found:
1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes
2. SET_ROWS kernel turbo3-specific but instantiated for turbo4
3. Tail block drop for non-128 head dims

Fixed #1 (TURBO_D). #2 and #3 don't affect turbo3+dk128 path.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Vect0rM pushed a commit that referenced this pull request Apr 21, 2026
Complete experiment log:
  #1  4-mag LUT:           15.1 at 8K (BEST, +38%)
  #2  Batched extract:     13.7 (+25%)
  #3  Inline FA block:     13.5 (I-cache pressure)
  #4  Deferred norm:       12.9 (loses ILP)
  #5  2-pair half2:        12.0 (ternary overhead)
  #6  Select chain:        11.9 (branches kill)
  #7  Bit-arithmetic:      11.6 (ALU too heavy)
  #8  FMA branchless:      11.4 (ALU still too heavy)
  #9  Named-reg ternary:   10.3 (branches worst)
  #10 Main (8-LUT):        10.95 (baseline)
  #11 Non-vec FA:          10.2 (wrong kernel)
  Ceiling:                 24.5 (no dequant)

Apple8 hardware truth:
  1 divergent constant read < 7 ALU ops (even with fma)
  Branches cost MORE than divergent constant reads
  Array indexing ALWAYS spills on Metal
  4 constant addresses is the sweet spot

The 4-mag LUT is the dequant-level ceiling on Apple Silicon.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
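
The percentages in the experiment log above can be reproduced from the listed values. The sketch below is only arithmetic over numbers copied from the log. The log does not state the unit, and higher is assumed to be better because the top entry is marked BEST. The function names are illustrative.

```python
BASELINE = 10.95  # "#10 Main (8-LUT)" from the log
CEILING = 24.5    # "Ceiling (no dequant)" from the log

# A subset of entries copied from the log (unit not stated in the source).
results = {
    "4-mag LUT": 15.1,
    "Batched extract": 13.7,
    "Inline FA block": 13.5,
    "Deferred norm": 12.9,
}


def gain_vs_baseline(value, baseline=BASELINE):
    """Relative change versus the baseline, in percent."""
    return (value / baseline - 1.0) * 100.0


def fraction_of_ceiling(value, ceiling=CEILING):
    """Share of the no-dequant ceiling reached by a given result."""
    return value / ceiling


for name, value in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name:16s} {value:6.2f} {gain_vs_baseline(value):+6.1f}% "
          f"{fraction_of_ceiling(value):.0%} of ceiling")
```

With these inputs the 4-mag LUT entry rounds to +38% and the batched extract entry to +25%, matching the figures quoted in the log.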
3 participants