forked from ggml-org/llama.cpp
Experiment: Layer-adaptive turbo3 (LA-1) on Metal #7
Open
Description
Hypothesis
Keeping the KV cache of the first 4 and last 4 layers at q8_0, with the remaining layers at turbo3, improves perplexity (PPL) while maintaining compression on Apple Silicon.
Background
buun validated this configuration on CUDA (RTX 3090): PPL 5.769 (1.17% better than q8_0) at 3.5x compression, with 99.6% prefill speed and 97.5% decode speed. It is the recommended config for contexts up to 65K.
What to test
- `TURBO_LAYER_ADAPTIVE=1` env var (if implemented) or manual per-layer cache type selection
- PPL comparison vs uniform turbo3 and q8_0
- Decode speed at 2K, 8K, and 32K context
- Prefill speed
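A possible shape for the measurement matrix, written as a dry-run script that prints the commands to run. The `-m`/`-f`/`-ctk`/`-ctv` flags are real llama.cpp options, but `turbo3` as a cache type and `TURBO_LAYER_ADAPTIVE` are assumed fork-specific here, and the exact `llama-bench` flags for decode-at-depth vary by version; treat this as a sketch, not the fork's documented workflow.

```shell
#!/bin/sh
# Dry-run sketch of the test plan above: print the command matrix.
# "turbo3" and TURBO_LAYER_ADAPTIVE are assumed fork-specific.
MODEL=model.gguf

for kv in q8_0 turbo3; do
    # PPL comparison per uniform cache type
    echo "llama-perplexity -m $MODEL -f wiki.test.raw -ctk $kv -ctv $kv"
    # prefill/decode speed (check your llama-bench version for
    # the right flag to measure decode at 2K/8K/32K depth)
    echo "llama-bench -m $MODEL -ctk $kv -ctv $kv -p 2048,8192,32768 -n 128"
done

# layer-adaptive run, if the env var is implemented in the fork
echo "TURBO_LAYER_ADAPTIVE=1 llama-perplexity -m $MODEL -f wiki.test.raw"
```

Comparing the three PPL numbers (uniform q8_0, uniform turbo3, layer-adaptive) directly tests the hypothesis; the bench lines cover the speed columns.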
Expected outcome
Similar PPL improvement on Metal. No kernel changes needed — just config.
Priority
High — zero code changes, highest expected value.
Source
AutoRepl: cexp_55ab69 (buun, fork_dc582a)