
Draft MLX go backend for new engine #9118

Closed

dhiltgen wants to merge 2 commits into ollama:main from dhiltgen:next-mlx

Conversation

@dhiltgen
Collaborator

@dhiltgen dhiltgen commented Feb 14, 2025

Replaces #8490 on main.

Carries #9115 which should merge first.

A few key points:

  • Q/K tensor adjustments are applied globally, which is incorrect (they should be model specific)
  • The cache implementation on MLX is partially functional, but seems to drift after multiple forward passes and needs more work
  • Only llama3 fp16 models load currently - more work is needed to get the other models working and to support more quantizations
  • A temporary env var, OLLAMA_BACKEND, toggles which backend is used; set it to ggml or mlx

To see it working:

cmake -S . -B build
cmake --build build -j 
go build .
OLLAMA_NEW_ENGINE=1 OLLAMA_BACKEND=mlx ollama serve

Then

ollama run llama3.1:8b-instruct-fp16

@kconner kconner mentioned this pull request Feb 18, 2025
@dhiltgen dhiltgen force-pushed the next-mlx branch 4 times, most recently from c2d784a to f1fe325 Compare March 3, 2025 19:11
@dhiltgen dhiltgen force-pushed the next-mlx branch 4 times, most recently from ffaab1a to f656e79 Compare March 12, 2025 18:45
@RafaAguilar

ICYMI, the new gemma3 architecture is on the main branch but it is not supported here; I just tried:

time=2025-03-12T22:33:14.590+01:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/Users/what_are_you_looking_for?/.ollama/models/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 gpu=0 parallel=4 available=51539607552 required="21.4 GiB"
time=2025-03-12T22:33:14.591+01:00 level=INFO source=server.go:97 msg="system memory" total="64.0 GiB" free="23.6 GiB" free_swap="0 B"
time=2025-03-12T22:33:14.592+01:00 level=INFO source=server.go:130 msg=offload library=metal layers.requested=-1 layers.model=63 layers.offload=63 layers.split="" memory.available="[48.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="21.4 GiB" memory.required.partial="21.4 GiB" memory.required.kv="3.9 GiB" memory.required.allocations="[21.4 GiB]" memory.weights.total="18.2 GiB" memory.weights.repeating="17.1 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="1.3 GiB" memory.graph.partial="1.3 GiB"
time=2025-03-12T22:33:14.594+01:00 level=INFO source=server.go:380 msg="starting llama server" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/what_are_you_looking_for?/.ollama/models/blobs/sha256-afa0ea2ef463c87a1eebb9af070e76a353107493b5d9a62e5e66f65a65409541 --ctx-size 8192 --batch-size 512 --n-gpu-layers 63 --threads 8 --parallel 4 --port 61315"
time=2025-03-12T22:33:14.596+01:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-03-12T22:33:14.596+01:00 level=INFO source=server.go:557 msg="waiting for llama runner to start responding"
time=2025-03-12T22:33:14.597+01:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server error"
time=2025-03-12T22:33:14.628+01:00 level=INFO source=runner.go:858 msg="starting ollama engine"
time=2025-03-12T22:33:14.629+01:00 level=INFO source=runner.go:914 msg="Server listening on 127.0.0.1:61315"
time=2025-03-12T22:33:14.690+01:00 level=WARN source=ggml.go:136 msg="key not found" key=general.name default=""
time=2025-03-12T22:33:14.691+01:00 level=WARN source=ggml.go:136 msg="key not found" key=general.description default=""
time=2025-03-12T22:33:14.691+01:00 level=INFO source=ggml.go:97 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=36
ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-icelake.so
ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-haswell.so
ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-alderlake.so
ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-sandybridge.so
ggml_backend_load_best: failed to load /Applications/Ollama.app/Contents/Resources/libggml-cpu-skylakex.so
time=2025-03-12T22:33:14.694+01:00 level=INFO source=ggml.go:121 msg=gpu device.name=Metal device.description="Apple M1 Max" device.kind=gpu device.free="48.0 GiB" device.total="48.0 GiB"
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets    = false
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = false
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256          (not supported)
time=2025-03-12T22:33:14.848+01:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256      (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
time=2025-03-12T22:33:14.855+01:00 level=INFO source=ggml.go:112 msg=cpu device.name=CPU device.description="Apple M1 Max" device.kind=cpu device.free="0 B" device.total="0 B"
time=2025-03-12T22:33:15.307+01:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server not responding"
time=2025-03-12T22:33:18.985+01:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
panic: unsupported model architecture "gemma3"

goroutine 5 [running]:
github.com/ollama/ollama/runner/ollamarunner.(*Server).loadModel(0x140000a4f80, {0x16efdf432?, 0x0?}, {0x8, 0x0, 0x3f, {0x0, 0x0, 0x0}, 0x0}, ...)
	/Users/runner/work/ollama/ollama/runner/ollamarunner/runner.go:786 +0x324
created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
	/Users/runner/work/ollama/ollama/runner/ollamarunner/runner.go:888 +0x804
time=2025-03-12T22:33:28.634+01:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server not responding"
time=2025-03-12T22:33:30.695+01:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: exit status 2"

@dhiltgen
Collaborator Author

dhiltgen commented Mar 12, 2025

@RafaAguilar only llama3 fp16 models are currently working with the MLX backend. I need to refine the loading code to support more models, debug the other model definitions, as well as add quantization support before this can come out of draft state.

@RafaAguilar

Gotcha, if you need testers I would gladly help you.

I will try to get into the code, but I'm not sure I'll have the time to delve into the matrices in the coming weeks, although I'll try.

@dhiltgen dhiltgen force-pushed the next-mlx branch 4 times, most recently from b087393 to b10df34 Compare March 13, 2025 22:48
@dhiltgen
Collaborator Author

I've added some quant compatibility to be able to load more models with Q4_0, Q6_0, and Q8_0 tensors, however they're all converted to FP16 at load time so they don't provide quantization "benefits." I should be able to implement proper Q4 and Q8 support (with the benefits of reduced VRAM usage) once we get the new raw weight model loading implemented.
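For reference, converting a quantized tensor to FP16 at load time amounts to dequantizing each block. A minimal sketch for the standard GGML Q4_0 layout (blocks of 32 weights sharing one scale) — the function name is hypothetical and, for brevity, the scale is taken as an already-decoded float32 even though it is stored as IEEE fp16 on disk:

```go
package main

import "fmt"

// dequantQ4_0 expands one Q4_0 block: 32 4-bit values packed into
// 16 bytes, each decoded as d * (q - 8) with a shared scale d.
// Low nibbles hold elements 0..15, high nibbles elements 16..31.
func dequantQ4_0(d float32, qs [16]byte) [32]float32 {
	var out [32]float32
	for i, b := range qs {
		out[i] = d * (float32(b&0x0F) - 8)  // low nibble -> first half
		out[i+16] = d * (float32(b>>4) - 8) // high nibble -> second half
	}
	return out
}

func main() {
	// A block where every packed nibble is 8 decodes to all zeros.
	var qs [16]byte
	for i := range qs {
		qs[i] = 0x88
	}
	fmt.Println(dequantQ4_0(0.5, qs)[0]) // 0
}
```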

Functional implementation on the latest backend and caching code
Still has some debugging that needs rebasing/cleanup

one unit test fails which still needs work...
The cache still has some bugs.
) ml.Tensor {
a = a.Reshape(ctx, append([]int{1}, a.Shape()...)...).Permute(ctx, 0, 2, 1, 3).(*Array)
// TODO figure out how to get offset wired up
offset := 0
Member
Not using positionIDs is probably also part of the issue where things fall apart after a few forward passes. The prompt will get correctly RoPEd since it starts at offset 0 and is all in one batch. However, every token after that will be at position 0 as well.

From a quick look earlier, it seems offset is only a scalar in the C interface of MLX, but it can be a vector in the Python version, the same as we have here:
https://ml-explore.github.io/mlx/build/html/python/_autosummary/mlx.core.fast.rope.html
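The point above — each decoded token must be rotated at its true absolute position, not position 0 — can be sketched with a simplified rotary embedding on a single dimension pair. All names here are hypothetical for illustration; this is not MLX's API or this PR's code:

```go
package main

import (
	"fmt"
	"math"
)

// ropePair rotates one (even, odd) dimension pair by the RoPE angle
// for absolute position pos: theta = pos * base^(-2i/d).
func ropePair(x0, x1 float64, pos, pairIdx, headDim int, base float64) (float64, float64) {
	theta := float64(pos) * math.Pow(base, -2*float64(pairIdx)/float64(headDim))
	sin, cos := math.Sincos(theta)
	return x0*cos - x1*sin, x0*sin + x1*cos
}

func main() {
	// During decode, token i of the batch sits at cacheOffset+i.
	// Passing offset 0 on every step (the bug discussed above) would
	// rotate every generated token identically, so positions must
	// advance with the cache.
	cacheOffset := 7 // tokens already in the KV cache (illustrative)
	for i := 0; i < 3; i++ {
		pos := cacheOffset + i
		x0, x1 := ropePair(1, 0, pos, 0, 64, 10000)
		fmt.Printf("pos %d -> (%.3f, %.3f)\n", pos, x0, x1)
	}
}
```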

@crazyi

crazyi commented Mar 22, 2025

@dhiltgen Thanks for your work. I wonder when MLX for Ollama will work out of the box?

@nercone-dev

nercone-dev commented Mar 22, 2025

I don't know if it will help, but applying the following file in some way may resolve the error in the action:

array.h by ml-explore/mlx

@fluffypony

Really excited about this - thanks for your effort!

@fakebizprez

Also very excited to see this shipped!

@michaelmarziani

Thank you for Ollama, and looking forward to MLX support. I'd be happy to test on my M4 Pro.

@region23

Hi! Are there still any plans to integrate MLX support into Ollama?

@fakebizprez

Any updates on this PR?

@ghGarnik

ghGarnik commented Jul 9, 2025

Any plans to move this forward?

@mircoporetti

Hi! Waiting for this a lot 🙂 any news?

@BrennonTWilliams

Any updates? Can't wait for MLX support in Ollama!

@hungngodev

Really need mlx support!! Thank youu

@alex-pradas

alex-pradas commented Sep 11, 2025

Any updates? Can't wait for MLX support in Ollama!
Thanks for your efforts

@rbc33

rbc33 commented Sep 14, 2025

ollama!! c'mon with mlx! YOU CAN DO IT!!

@ashokgelal
Contributor

ashokgelal commented Sep 14, 2025

If anyone wants to use MLX while we wait for Ollama to support it natively, we recently added experimental MLX support to Msty (1), with an interface similar to Ollama's - downloading and managing MLX models, using them in split chats and in RAG, and much more.

1: https://msty.ai

@CamilleHbp

If anyone wants to use MLX while we wait for Ollama to support it natively, we recently added experimental MLX support to Msty (1), with an interface similar to Ollama's - downloading and managing MLX models, using them in split chats and in RAG, and much more.

1: https://msty.ai

Sadly this is a big no from me since the project is not open-source.

@TomLucidor

TomLucidor commented Nov 3, 2025

@CamilleHbp what about Jan - is it MLX compatible? We know that LM Studio is not FOSS, but it has a lot of traction.

Also, how is this taking six months for Ollama to adopt? Why can't we just take GGUF (with quantization, no less) and convert it to something convenient for Mac GPUs? Watching GGUF hog unified memory is a bit nuts.

@pranavkafle

Ollama! Come on now! You'll change the game with MLX support.

@TomLucidor

@pranavkafle even the alternatives are not safe containers/ramalama#2104

@itinance

Hey. What's the current status with that? I'm hungry for mlx support :)

@rbc33

rbc33 commented Dec 12, 2025

we demand mlx support now! if it's possible

@TG-Techie

Is this something an extra brain and/or hands would help with?

@TomLucidor

@TG-Techie update the draft to match current codebase, and also put up with "procedural issues".

@KholkinDmitrii

It would be ideal to have mlx support!

@dhiltgen
Collaborator Author

dhiltgen commented Jan 8, 2026

This is now replaced by #13648 which includes an experimental MLX backend and initial image generation support for Mac and Linux.
