
Add bitnet1.58 with custom metal kernel #219

Merged
awni merged 42 commits into ml-explore:main from Blaizzy:pc/add-bitnet
Jul 2, 2025

Conversation

@Blaizzy
Contributor

@Blaizzy Blaizzy commented Jun 8, 2025

This PR adds support for bitnet1.58 and implements a custom Metal kernel that performs matrix multiplication directly on the packed weights. This eliminates the need to store unpacked weights in memory. Additionally, it allows you to quantize these 1.58-bit models using N-bit quants for better performance, read more here.
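To make the kernel's idea concrete, here is a pure-Python sketch (not the actual Metal source; the function names are illustrative) of multiplying directly on packed 2-bit ternary weights, unpacking each code on the fly so the full weight matrix is never materialized:

```python
def pack_ternary(weights):
    """Pack ternary weights {-1, 0, +1} four-per-byte as 2-bit codes {0, 1, 2}."""
    assert len(weights) % 4 == 0
    packed = []
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            byte |= (w + 1) << (2 * j)  # map -1/0/+1 -> 0/1/2
        packed.append(byte)
    return packed

def matmul_packed(packed_rows, scale, x):
    """Multiply directly on the packed rows: each 2-bit code is decoded
    on the fly, so no unpacked weight matrix ever lives in memory."""
    out = []
    for row in packed_rows:
        acc = 0.0
        k = 0
        for byte in row:
            for j in range(4):
                w = ((byte >> (2 * j)) & 0b11) - 1  # recover -1/0/+1
                acc += w * x[k]
                k += 1
        out.append(acc * scale)  # apply the per-tensor weight scale once
    return out

# Tiny example: one output row over 4 inputs.
row = pack_ternary([-1, 0, 1, 1])
y = matmul_packed([row], scale=0.5, x=[1.0, 2.0, 3.0, 4.0])
# (-1*1 + 0*2 + 1*3 + 1*4) * 0.5 = 3.0
```

The real kernel does the same decode inside the Metal threadgroup loop, but the memory-saving argument is identical: only the packed bytes and one scale are resident.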

Models supported:

w/o the kernel
Screenshot 2025-06-08 at 11 17 17 PM

w/ the custom metal kernel
Screenshot 2025-06-08 at 11 17 34 PM

Note: I removed the N-bit input quantization because it's slower (-6 tokens/s) and doesn't provide significant memory savings, even with fused kernels. I believe it's best to use KV quant instead.

@Blaizzy
Contributor Author

Blaizzy commented Jun 8, 2025

Small but mighty!
Screenshot 2025-06-09 at 12 25 28 AM

@Blaizzy
Contributor Author

Blaizzy commented Jun 8, 2025

There is probably room for further optimizations; I have seen slightly faster inference for this model.

But I believe it's a good start!

@Blaizzy
Contributor Author

Blaizzy commented Jun 9, 2025

Implemented kernel caching and reduced precision from float32 to float16, achieving significant performance gains across all metrics:

Key Improvements:

  • Throughput: 2x increase overall
  • Peak TFLOPS: 45.86 TFLOPS (up from 20.19) — 127% improvement
  • MLP Forward Pass: 40% faster execution
  • Prompt Processing: 135.046 tokens/sec (up from 51) — 165% improvement
  • Generation Speed: 46.209 tokens/sec (up from 43.830) — 5% improvement
  • Peak Memory: 1.308 GB (down from 1.322 GB) — 1% reduction
Screenshot 2025-06-09 at 4 47 08 PM

@Blaizzy
Contributor Author

Blaizzy commented Jun 9, 2025

Fusing qkv can probably get us to around ~50 tokens/s, but I couldn't find a way to join the scales and keep model coherence.

@younesbelkada
Contributor

younesbelkada commented Jun 10, 2025

Hi @Blaizzy - the results look impressive, this could potentially be easily ported to other existing Bitnet models such as Falcon-E or Falcon3-1.58 since we can use the same BitNetLinear kernel and are based on Llama architecture. I am happy to work on integrating these models as well, would the BitLinear class be re-usable by other models? How do you think we should approach this?

@younesbelkada
Contributor

Currently the way to infer if the models use bitnet for these models is to check this attribute: https://huggingface.co/tiiuae/Falcon-E-3B-Instruct/blob/main/config.json#L27

@Blaizzy
Contributor Author

Blaizzy commented Jun 10, 2025

Hi @Blaizzy - the results look impressive. This could potentially be ported easily to other existing Bitnet models such as Falcon-E or Falcon3-1.58, since they can use the same BitNetLinear kernel and are based on the Llama architecture. I am happy to work on integrating these models as well. Would the BitLinear class be re-usable by other models? How do you think we should approach this?

Thank you @younesbelkada!

Yes, I can put all the BitLinear layer logic into a separate file that can be re-used, much like the existing switch_layers.py.

Funny you mention Falcon H1; I worked on an MLX version a few weeks back, but in a separate repo. I would love to collaborate with you on it.

@Blaizzy
Contributor Author

Blaizzy commented Jun 10, 2025

Currently the way to infer if the models use bitnet for these models is to check this attribute: https://huggingface.co/tiiuae/Falcon-E-3B-Instruct/blob/main/config.json#L27

Awesome, that makes things easier!

@younesbelkada
Contributor

@Blaizzy - great to hear that! I'll reach out to you separately by email (the one you put on your GH profile). I would love to collaborate on H1 and Bitnet as well! Sending you an email now.

@Blaizzy
Contributor Author

Blaizzy commented Jun 10, 2025

@Blaizzy - great to hear that! I'll reach out to you separately by email (the one you put on your GH profile). I would love to collaborate on H1 and Bitnet as well! Sending you an email now.

Perfect!

I have moved BitLinear layer logic to a separate file for easier reusability across models and projects.

@younesbelkada
Contributor

Thank you @Blaizzy, I'll give it a try and either open a PR against your branch, or against main if this PR gets merged quickly.

@Blaizzy
Contributor Author

Blaizzy commented Jun 10, 2025

My pleasure!

You can send a PR to my branch 👌🏽

@younesbelkada
Contributor

Got a working version: Blaizzy#1

Screenshot 2025-06-10 at 2 35 45 PM

@younesbelkada
Contributor

Also works on previous Bitnet models, e.g. https://huggingface.co/tiiuae/Falcon3-7B-Instruct-1.58bit :

Screenshot 2025-06-10 at 2 41 36 PM

@Blaizzy
Contributor Author

Blaizzy commented Jun 10, 2025

Perfect, great job @younesbelkada ! 🚀

@Blaizzy
Contributor Author

Blaizzy commented Jun 10, 2025

I was about to text you saying I tried Falcon 3 but the hidden states were exploding to infinity 😂

But the inverse change fixes it ✅

Screenshot 2025-06-10 at 12 46 31 PM

@Blaizzy
Contributor Author

Blaizzy commented Jun 10, 2025

Left a comment in your PR @younesbelkada after that is resolved we can merge.

@younesbelkada
Contributor

Thank you @Blaizzy ! Also sent you an email for collab 🙏
Looking forward to seeing this merged 🚀

@younesbelkada
Contributor

Performance of the 1B model:

Screenshot 2025-06-10 at 4 44 48 PM

@Blaizzy
Contributor Author

Blaizzy commented Jun 10, 2025

Just added support for N-bit quants for the 1.58-bit model; it further reduces peak memory and is faster.

@awni @angeloskath
Screenshot 2025-06-10 at 3 48 45 PM

@Blaizzy
Contributor Author

Blaizzy commented Jun 12, 2025

This last commit is small but makes a huge difference!

MLX is now officially faster than bitnet.cpp, without even using N-bit quants that add +10 tokens/s.

And I believe with fused qkv kernels we can get an additional 5-10 tokens/s :)

Screenshot 2025-06-12 at 3 56 41 AM

2T = 2 threads
4T = 4 threads

@guillaume-osmo

@Blaizzy your code looks really nice. One question: your 0,1,2 to -1,0,1 shift is one additional operation; can we avoid it? Second question: do you plan to include the fused qkv kernels too?

@Blaizzy
Contributor Author

Blaizzy commented Jun 12, 2025

I'm afraid we can't avoid it.

  • 2-bit packing – 4 weights per byte means each weight is an unsigned 2-bit code (00–11).
  • Ternary encoding – store {-1, 0, +1} as {00, 01, 10} and recover with bits - 1.
  • Mat-mul runs in FP16/FP32 – inputs are floats and Apple GPUs multiply-accumulate only in FP16/32 (or INT8 after unpacking), so every weight must be promoted to FP anyway.
  • Negligible cost – float(bits) - 1 is a single fused ALU op; any alternate encoding adds more instructions or an extra reduction.

@Blaizzy
Contributor Author

Blaizzy commented Jun 12, 2025

Yes, I plan to include the fused QKV kernels as soon as I find the best way to aggregate the weight scales for QKV.
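One plausible way to fuse QKV without joining the scales (a sketch of the general idea only, not this PR's implementation; all names are illustrative) is to concatenate the three ternary matrices along the output dimension, run a single fused matmul, and apply each projection's own scale to its slice of the output:

```python
def matvec(rows, x):
    """Plain matrix-vector product over lists of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]

def fused_qkv(wq, wk, wv, sq, sk, sv, x):
    """Fuse Q, K, V into one matmul while keeping per-projection scales.

    The ternary matrices are stacked along the output dimension, so there
    is one kernel launch; scales stay separate and are applied per slice,
    avoiding the need to join them into a single coherent scale.
    """
    fused = wq + wk + wv                  # one weight matrix, one matmul
    y = matvec(fused, x)
    nq, nk = len(wq), len(wk)
    q = [v * sq for v in y[:nq]]
    k = [v * sk for v in y[nq:nq + nk]]
    v_out = [v * sv for v in y[nq + nk:]]
    return q, k, v_out

# Tiny example: 1-row Q, K, V projections over a 2-dim input.
q, k, v = fused_qkv([[1, -1]], [[0, 1]], [[1, 1]], 0.5, 2.0, 1.0, [3.0, 4.0])
```

This keeps each projection numerically identical to the unfused version, at the cost of three small scale multiplies after the fused matmul rather than one.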

@Blaizzy
Contributor Author

Blaizzy commented Jun 12, 2025

Fused QKV kernel is coming soon 🚀

@guillaume-osmo

guillaume-osmo commented Jun 12, 2025

So on an M3 128 GB it generates at 89.3 +/- 0.1 tok/sec.
Why is the first prompt call always slower (model loading maybe)?
The first prompt runs at 52.5 tok/sec and subsequent ones at 162.2 +/- 0.1 tok/sec,
using:

mlx_lm.generate --model microsoft/bitnet-b1.58-2B-4T  --prompt "implement bubble sort from scratch" --max-tokens 10 --temp 0.7

with --max-tokens 100 => 81 +/-0.2 tok/sec

@guillaume-osmo

Q: Do you know where I can find a good example script to fine-tune a Bitnet or Falcon model using MLX?

@younesbelkada
Contributor

younesbelkada commented Jun 12, 2025

Q: Do you know where I can find a good example script to fine-tune a Bitnet or Falcon model using MLX?

Not sure if this is implemented in MLX, but for either the Microsoft bitnet or Falcon-E you will need to download the pre-quantized weights (for bitnet they're in a separate repo: https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-bf16 - and for Falcon-E they're on a separate prequantized revision), then use another implementation of BitLinear tailored for training.
The implementation for MS-Bitnet is here: https://github.com/huggingface/transformers/blob/27459025b8f77d53631f7961cc967fa659d43f7e/src/transformers/integrations/bitnet.py#L307 (you need to consider online_quant).
The implementation of the Falcon-E bitnet layer is here: https://github.com/tiiuae/onebitllms/blob/main/src/onebitllms/layers/bitnet.py#L20
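For reference, the training-time BitLinear layers linked above quantize the weights online on each forward pass. A minimal sketch of the BitNet b1.58 absmean scheme (illustrative only; the real implementations also route gradients through a straight-through estimator):

```python
def quantize_ternary(w):
    """Absmean quantization: scale weights by 1 / mean(|w|), then round
    and clip to the ternary set {-1, 0, +1}. Returns the quantized
    weights and the scale needed to dequantize them later."""
    eps = 1e-5
    mean_abs = sum(abs(v) for v in w) / len(w)
    scale = 1.0 / max(mean_abs, eps)
    wq = [max(-1, min(1, round(v * scale))) for v in w]
    return wq, scale

wq, scale = quantize_ternary([0.9, -0.05, 0.4, -1.1])
# every quantized weight is ternary
assert all(v in (-1, 0, 1) for v in wq)
```

Because the quantization happens online during training, the checkpoints you download must already be pre-quantized (or carry the bf16 master weights, as in the linked Microsoft repo) before they can be used for inference-only setups like this PR's kernel.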

@awni awni force-pushed the pc/add-bitnet branch from 2ffcb79 to 00842d2 on July 2, 2025
@awni awni force-pushed the pc/add-bitnet branch from 00842d2 to 7e1666b on July 2, 2025
@awni
Member

awni commented Jul 2, 2025

Ok so I posted some updates:

  • General cleanup
  • Removed support for Bitnet quantizing the Llama models. Let's revisit that in a follow-on if we want to support it, but I want to get this landed and keep the diff self-contained.
  • Improved the bitnet layer. There was some inefficient casting happening there, so it's even a bit faster now.

This is good to merge. Thanks for the contributions everyone!!

@Blaizzy
Contributor Author

Blaizzy commented Jul 2, 2025

My pleasure!

The llama changes are there to allow for Falcon bitnet. I can make a separate PR if you don’t mind.

@Blaizzy
Contributor Author

Blaizzy commented Jul 2, 2025

There was some inefficient casting happening there so it's even a bit faster now.

Thanks a lot, I noticed the speed-up! (100 → 115 tok/s)

static_cast<T> fixes the bug I was having 🙌🏽

@awni
Member

awni commented Jul 2, 2025

The llama changes are there to allow for Falcon bitnet. I can make a separate PR if you don’t mind.

Yes, I gathered as much. You can send a PR. I removed it from this one because the added complexity wasn't great. I don't yet have a good suggestion for how to do it in a simpler way, but that's something that would be good to think through.

@awni awni merged commit 5fa62eb into ml-explore:main Jul 2, 2025
4 checks passed
davidkoski pushed a commit to ml-explore/mlx-swift-examples that referenced this pull request Jul 3, 2025
* Port mlx-lm bitnet1.58 ml-explore/mlx-lm#219

* update: Update relu2 function to use compile for shapeless input

* refactor: Update BitLinear & Format code

* update: Add quantization parameters to BaseConfiguration

* update: Improve error handling during weight pre-loading in ContentView

* update: Add bitnet_b1_58_2b_4t_4bit model configuration to LLMModelFactory

* update: Rename relu2 function to reluSquared and refactor implementation

* update: ACKNOWLEDGMENTS.md

* Improve the bitnet kernel

* remove: eliminate reluSquared function from Bitnet.swift

* refactor: update kernel
dojoteef pushed a commit to dojoteef/mlx-lm that referenced this pull request Aug 1, 2025
* add bitnet

* update activation to relu2

* working bitnet

* remove artifacts

* remove logging

* add custom post quant

* fix dtype and add compile

* fixed weight unpack

* add custom kernel to avoid memory overhead

* compile relu2

* fix weight scale

* remove unused

* add tests and update tuner utils

* update acknowledgements

* add kernel caching

* add act_quant and set float16 as default dtype

* use mx.add and move scaling to kernel

* remove act quant

* move bitlinear layers to separate file

* feat: add falcon-e and other bitnet support

* refactor: address comments

* add support for 1.58bit N-bit quants

* 43.85% speedup in generation performance (M3 max)

* refactor utils

* remove masking (2% gen speed improvement)

* add quantization config

* test llama bitnet

* refactor apply_hf_quant

* default threadgroup: 64 -> 32

* add comment

* fix prompt processing perf

* remove modulo

* compile kernel in the constructor

* Improve the bitnet kernel

* remove benchmark

* refactor bitlinear swap

* format

* remove llama changes

* revert utils

* faster + cleanup

* not trainable

* fix tests

---------

Co-authored-by: younesbelkada <younes.belkada@tii.ae>
Co-authored-by: Angelos Katharopoulos <a_katharopoulos@apple.com>
Co-authored-by: Awni Hannun <awni@apple.com>
davidkoski pushed a commit to ml-explore/mlx-swift-lm that referenced this pull request Nov 3, 2025
* Port mlx-lm bitnet1.58 ml-explore/mlx-lm#219

* update: Update relu2 function to use compile for shapeless input

* refactor: Update BitLinear & Format code

* update: Add quantization parameters to BaseConfiguration

* update: Improve error handling during weight pre-loading in ContentView

* update: Add bitnet_b1_58_2b_4t_4bit model configuration to LLMModelFactory

* update: Rename relu2 function to reluSquared and refactor implementation

* update: ACKNOWLEDGMENTS.md

* Improve the bitnet kernel

* remove: eliminate reluSquared function from Bitnet.swift

* refactor: update kernel
7 participants