Skip to content

Experiment: Layer-adaptive turbo3 (LA-1) on Metal #7

@TheTom

Description

@TheTom

Hypothesis

First 4 + last 4 layers at q8_0, rest turbo3, improves PPL while maintaining compression on Apple Silicon.

Background

buun validated on CUDA (RTX 3090): PPL 5.769 which is 1.17% better than q8_0. 3.5x compression. 99.6% prefill speed, 97.5% decode speed. Recommended config for contexts up to 65K.

What to test

  • TURBO_LAYER_ADAPTIVE=1 env var (if implemented) or manual per-layer cache type selection
  • PPL comparison vs uniform turbo3 and q8_0
  • Decode speed at 2K, 8K, 32K
  • Prefill speed

Expected outcome

Similar PPL improvement on Metal. No kernel changes needed — just config.

Priority

High — zero code changes, highest expected value.

Source

AutoRepl: cexp_55ab69 (buun, fork_dc582a)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions