
Draft MLX go backend for new engine #9118

Closed

dhiltgen wants to merge 2 commits into ollama:main from dhiltgen:next-mlx

Conversation

@dhiltgen
Collaborator

@dhiltgen dhiltgen commented Feb 14, 2025

Replaces #8490 on main.

Carries #9115 which should merge first.

A few key points:

  • Q/K tensor adjustments are applied globally, which is incorrect (they should be model specific)
  • The cache implementation on MLX is partially functional, but seems to drift after multiple forward passes and needs more work
  • Only llama3 fp16 models load currently - more work is needed to get the other models working and to support more quantizations
  • A temporary env var, OLLAMA_BACKEND, toggles which backend is used; set it to ggml or mlx

To see it working:

cmake -S . -B build
cmake --build build -j 
go build .
OLLAMA_NEW_ENGINE=1 OLLAMA_BACKEND=mlx ollama serve

Then

ollama run llama3.1:8b-instruct-fp16

@kconner kconner mentioned this pull request Feb 18, 2025
@dhiltgen dhiltgen force-pushed the next-mlx branch 4 times, most recently from c2d784a to f1fe325 Compare March 3, 2025 19:11
@dhiltgen dhiltgen force-pushed the next-mlx branch 4 times, most recently from ffaab1a to f656e79 Compare March 12, 2025 18:45
@RafaAguilar

ICYMI, the new gemma3 architecture is on the main branch but it is not supported here; I just tried:

time=2025-03-12T22:33:14.590+01:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/Users/what_are_you_looking_for?/.ollama/models/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 gpu=0 parallel=4 available=51539607552 required="21.4 GiB"
time=2025-03-12T22:33:14.591+01:00 level=INFO source=server.go:97 msg="system memory" total="64.0 GiB" free="23.6 GiB" free_swap="0 B"
time=2025-03-12T22:33:14.592+01:00 level=INFO source=server.go:130 msg=offload library=metal layers.requested=-1 layers.model=63 layers.offload=63 layers.split="" memory.available="[48.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="21.4 GiB" memory.required.partial="21.4 GiB" memory.required.kv="3.9 GiB" memory.required.allocations="[21.4 GiB]" memory.weights.total="18.2 GiB" memory.weights.repeating="17.1 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="1.3 GiB" memory.graph.partial="1.3 GiB"
time=2025-03-12T22:33:14.594+01:00 level=INFO source=server.go:380 msg="starting llama server" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/what_are_you_looking_for?/.ollama/models/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 --ctx-size 8192 --batch-size 512 --n-gpu-layers 63 --threads 8 --parallel 4 --port 61315"
time=2025-03-12T22:33:14.596+01:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-03-12T22:33:14.596+01:00 level=INFO source=server.go:557 msg="waiting for llama runner to start responding"
time=2025-03-12T22:33:14.597+01:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server error"
time=2025-03-12T22:33:14.628+01:00 level=INFO source=runner.go:858 msg="starting ollama engine"
time=2025-03-12T22:33:14.629+01:00 level=INFO source=runner.go:914 msg="Server listening on 127.0.0.1:61315"
time=2025-03-12T22:33:14.690+01:00 level=WARN source=ggml.go:136 msg="key not found" key=general.name default=""
time=2025-03-12T22:33:14.691+01:00 level=WARN source=ggml.go:136 msg="key not found" key=general.description default=""
time=2025-03-12T22:33:14.691+01:00 level=INFO source=ggml.go:97 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36
ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-icelake.so
ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-haswell.so
ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-alderlake.so
ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-sandybridge.so
ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-skylakex.so
time=2025-03-12T22:33:14.694+01:00 level=INFO source=ggml.go:121 msg=gpu device.name=Metal device.description="Apple M1 Max" device.kind=gpu device.free="48.0 GiB" device.total="48.0 GiB"
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets    = false
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = false
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256          (not supported)
time=2025-03-12T22:33:14.848+01:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256      (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
time=2025-03-12T22:33:14.855+01:00 level=INFO source=ggml.go:112 msg=cpu device.name=CPU device.description="Apple M1 Max" device.kind=cpu device.free="0 B" device.total="0 B"
time=2025-03-12T22:33:15.307+01:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server not responding"
time=2025-03-12T22:33:18.985+01:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
panic: unsupported model architecture "gemma3"

goroutine 5 [running]:
github.com/ollama/ollama/runner/ollamarunner.(*Server).loadModel(0x140000a4f80, {0x16efdf432?, 0x0?}, {0x8, 0x0, 0x3f, {0x0, 0x0, 0x0}, 0x0}, ...)
	/Users/runner/work/ollama/ollama/runner/ollamarunner/runner.go:786 +0x324
created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
	/Users/runner/work/ollama/ollama/runner/ollamarunner/runner.go:888 +0x804
time=2025-03-12T22:33:28.634+01:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server not responding"
time=2025-03-12T22:33:30.695+01:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: exit status 2"

@dhiltgen
Collaborator Author

dhiltgen commented Mar 12, 2025

@RafaAguilar only llama3 fp16 models are currently working with the MLX backend. I need to refine the loading code to support more models, debug the other model definitions, as well as add quantization support before this can come out of draft state.

@RafaAguilar

Gotcha, if you need testers I would gladly help you.

I will try to get into the code, but I'm not sure I'll have the time to delve into the matrices in the coming weeks, although I'll try.

@dhiltgen dhiltgen force-pushed the next-mlx branch 4 times, most recently from b087393 to b10df34 Compare March 13, 2025 22:48
@dhiltgen
Collaborator Author

I've added some quant compatibility to be able to load more models with Q4_0, Q6_0, and Q8_0 tensors, however they're all converted to FP16 at load time so they don't provide quantization "benefits." I should be able to implement proper Q4 and Q8 support (with the benefits of reduced VRAM usage) once we get the new raw weight model loading implemented.
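For reference, converting a quantized tensor to FP16 at load time amounts to dequantizing each block. A minimal sketch for the standard GGML Q4_0 layout (blocks of 32 weights sharing one scale) — the function name is hypothetical and, for brevity, the scale is taken as an already-decoded float32 even though it is stored as IEEE fp16 on disk:

```go
package main

import "fmt"

// dequantQ4_0 expands one Q4_0 block: 32 4-bit values packed into
// 16 bytes, each decoded as d * (q - 8) with a shared scale d.
// Low nibbles hold elements 0..15, high nibbles elements 16..31.
func dequantQ4_0(d float32, qs [16]byte) [32]float32 {
	var out [32]float32
	for i, b := range qs {
		out[i] = d * (float32(b&0x0F) - 8)  // low nibble -> first half
		out[i+16] = d * (float32(b>>4) - 8) // high nibble -> second half
	}
	return out
}

func main() {
	// A block where every packed nibble is 8 decodes to all zeros.
	var qs [16]byte
	for i := range qs {
		qs[i] = 0x88
	}
	fmt.Println(dequantQ4_0(0.5, qs)[0]) // 0
}
```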

Functional implementation on the latest backend and caching code
Still has some debugging that needs rebasing/cleanup

one unit test fails which still needs work...
The cache still has some bugs.
) ml.Tensor {
a = a.Reshape(ctx, append([]int{1}, a.Shape()...)...).Permute(ctx, 0, 2, 1, 3).(*Array)
// TODO figure out how to get offset wired up
offset := 0
Member
Not using positionIDs is probably also part of the issue where things fall apart after a few forward passes. The prompt will get correctly RoPEd since it starts at offset 0 and is all in one batch. However, every token after that will be at position 0 as well.

From a quick look earlier, it seems offset is only a scalar in the C interface of MLX, but it can be a vector in the Python version, the same as we have here:
https://ml-explore.github.io/mlx/build/html/python/_autosummary/mlx.core.fast.rope.html
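The point above — each decoded token must be rotated at its true absolute position, not position 0 — can be sketched with a simplified rotary embedding on a single dimension pair. All names here are hypothetical for illustration; this is not MLX's API or this PR's code:

```go
package main

import (
	"fmt"
	"math"
)

// ropePair rotates one (even, odd) dimension pair by the RoPE angle
// for absolute position pos: theta = pos * base^(-2i/d).
func ropePair(x0, x1 float64, pos, pairIdx, headDim int, base float64) (float64, float64) {
	theta := float64(pos) * math.Pow(base, -2*float64(pairIdx)/float64(headDim))
	sin, cos := math.Sincos(theta)
	return x0*cos - x1*sin, x0*sin + x1*cos
}

func main() {
	// During decode, token i of the batch sits at cacheOffset+i.
	// Passing offset 0 on every step (the bug discussed above) would
	// rotate every generated token identically, so positions must
	// advance with the cache.
	cacheOffset := 7 // tokens already in the KV cache (illustrative)
	for i := 0; i < 3; i++ {
		pos := cacheOffset + i
		x0, x1 := ropePair(1, 0, pos, 0, 64, 10000)
		fmt.Printf("pos %d -> (%.3f, %.3f)\n", pos, x0, x1)
	}
}
```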

@crazyi

crazyi commented Mar 22, 2025

@dhiltgen Thanks for your work. I wonder when MLX for Ollama will work out of the box?

@nercone-dev

nercone-dev commented Mar 22, 2025

I don't know if it will help, but applying the following file in some way may resolve the error in the action:

array.h by ml-explore/mlx

@fluffypony

Really excited about this - thanks for your effort!

@fakebizprez

Also very excited to see this shipped!

@michaelmarziani

Thank you for Ollama, and looking forward to MLX support. I'd be happy to test on my M4 Pro.

@region23

Hi! Are there still any plans to integrate MLX support into Ollama?

@fakebizprez

Any updates on this PR?

@ghGarnik

ghGarnik commented Jul 9, 2025

Any plans to move this forward?

@mircoporetti

Hi! Waiting for this a lot 🙂 any news?

@BrennonTWilliams

Any updates? Can't wait for MLX support in Ollama!

@hungngodev

Really need mlx support!! Thank youu

@alex-pradas

alex-pradas commented Sep 11, 2025

Any updates? Can't wait for MLX support in Ollama!
Thanks for your efforts

@rbc33

rbc33 commented Sep 14, 2025

ollama!! c'mon with mlx! YOU CAN DO IT!!

@ashokgelal
Contributor

ashokgelal commented Sep 14, 2025

If anyone wants to use MLX while we wait for Ollama to support it natively, we recently added experimental MLX support to Msty (1), with an interface similar to Ollama's - downloading and managing MLX models, using them in split chats and in RAG, and much more.

1: https://msty.ai

@CamilleHbp

If anyone wants to use MLX while we wait for Ollama to support it natively, we recently added experimental MLX support to Msty (1), with an interface similar to Ollama's - downloading and managing MLX models, using them in split chats and in RAG, and much more.

1: https://msty.ai

Sadly this is a big no from me since the project is not open-source.

@TomLucidor

TomLucidor commented Nov 3, 2025

@CamilleHbp what about Jan - is it MLX compatible? We know that LM Studio is not FOSS, but it has a lot of traction.

Also, how is this taking six months for Ollama to adopt? Why can't we just take GGUF (with quantization, no less) and convert it to something convenient for Mac GPUs? Watching GGUF hog unified memory is a bit nuts.

@pranavkafle

Ollama! Come on now! You'll change the game with MLX support.

@TomLucidor

@pranavkafle even the alternatives are not safe containers/ramalama#2104

@itinance

Hey. What's the current status with that? I'm hungry for mlx support :)

@rbc33

rbc33 commented Dec 12, 2025

we demand mlx support now! if it's possible

@TG-Techie

Is this something an extra brain and/or hands would help with?

@TomLucidor

@TG-Techie update the draft to match current codebase, and also put up with "procedural issues".

@KholkinDmitrii

It would be ideal to have mlx support!

@dhiltgen
Collaborator Author

dhiltgen commented Jan 8, 2026

This is now replaced by #13648 which includes an experimental MLX backend and initial image generation support for Mac and Linux.
