Skip to content

Experiment: turbo2 (2-bit) format on Metal #8

@TheTom

Description

@TheTom

Hypothesis

2-bit quantization (4 centroids) trades quality for maximum compression, enabling 70B+ models on consumer hardware.

Background

4 centroids instead of 8. On Metal's constrained constant cache, fewer reads (2 vs 4) = bigger relative win. Quality will degrade but useful for extreme memory pressure.

What to test

  • Implement 2-bit codebook (4 centroids + norm)
  • PPL comparison vs turbo3 and q4_0
  • Decode speed (fewer LUT reads should help on M2 Pro especially)
  • NIAH at 2-bit
  • Memory savings at 128K context

Expected outcome

~7x compression, significant PPL degradation but potentially usable for large models.

Priority

Low — exploratory.

Source

AutoRepl: TODO-010 (buun, fork_dc582a)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions