forked from ggml-org/llama.cpp
Experiment: turbo2 (2-bit) format on Metal #8
Closed
Description
Hypothesis
2-bit quantization (4 centroids) trades quality for maximum compression, enabling 70B+ models on consumer hardware.
Background
Use 4 centroids instead of 8. On Metal's constrained constant cache, fewer lookup-table reads (2 vs. 4) should yield a bigger relative win. Quality will degrade, but the format may still be useful under extreme memory pressure.
What to test
- Implement 2-bit codebook (4 centroids + norm)
- Perplexity (PPL) comparison vs. turbo3 and q4_0
- Decode speed (fewer LUT reads should help on M2 Pro especially)
- Needle-in-a-haystack (NIAH) retrieval at 2-bit
- Memory savings at 128K context
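To make the proposed format concrete, here is a minimal CPU-side sketch of a 2-bit codebook block: 4 centroids, a per-block norm, and packed 2-bit indices. The block size (32), struct layout, and codebook values are illustrative assumptions, not the actual turbo2 format, and a real implementation would decode in a Metal kernel rather than on the CPU.

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical 2-bit block: 32 weights -> 8 bytes of indices + one norm.
 * Codebook values are placeholders, not actual turbo2 centroids. */
#define T2_BLOCK 32

static const float t2_codebook[4] = { -1.0f, -0.25f, 0.25f, 1.0f };

typedef struct {
    float scale;               /* per-block norm (fp16 in a real format) */
    uint8_t idx[T2_BLOCK / 4]; /* four 2-bit indices packed per byte */
} t2_block;

static void t2_quantize(const float *w, t2_block *b) {
    float amax = 0.0f;
    for (int i = 0; i < T2_BLOCK; i++) {
        float a = fabsf(w[i]);
        if (a > amax) amax = a;
    }
    b->scale = amax;
    const float inv = amax > 0.0f ? 1.0f / amax : 0.0f;
    for (int i = 0; i < T2_BLOCK / 4; i++) b->idx[i] = 0;
    for (int i = 0; i < T2_BLOCK; i++) {
        const float x = w[i] * inv;
        /* pick the nearest of the 4 centroids */
        int best = 0;
        float bestd = fabsf(x - t2_codebook[0]);
        for (int j = 1; j < 4; j++) {
            const float d = fabsf(x - t2_codebook[j]);
            if (d < bestd) { bestd = d; best = j; }
        }
        b->idx[i / 4] |= (uint8_t)(best << (2 * (i % 4)));
    }
}

static void t2_dequantize(const t2_block *b, float *w) {
    for (int i = 0; i < T2_BLOCK; i++) {
        const int j = (b->idx[i / 4] >> (2 * (i % 4))) & 3;
        w[i] = b->scale * t2_codebook[j];
    }
}
```

Decode touches only the 4-entry table plus one norm per block, which is where the hoped-for constant-cache win on Metal would come from.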
Expected outcome
~7x compression; significant PPL degradation, but potentially usable for large models under tight memory budgets.
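The ~7x figure is roughly consistent with the storage arithmetic, assuming 2 bits per weight plus one fp16 norm per block (the block size B is an assumption; the issue does not specify one):

```latex
\text{bits/weight} = 2 + \frac{16}{B}
\quad\Rightarrow\quad
\frac{16}{2 + 16/32} = 6.4\times, \qquad
\frac{16}{2 + 16/64} \approx 7.1\times
```

relative to fp16 weights, so a block size around 64 would be needed to reach the stated ~7x.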
Priority
Low — exploratory.
Source
AutoRepl: TODO-010 (buun, fork_dc582a)