
Add support for vulkan dylibs #892

Merged

aittalam merged 139 commits into main from add-vulkan on Mar 26, 2026
Conversation

@aittalam
Member

@aittalam aittalam commented Feb 25, 2026

Brings support for Vulkan dylibs into llamafile.

Tested on:

  • Linux (Ubuntu 24.04)
  • Steam Deck (SteamOS 3.7.20 Build: 20260108.1)
  • Podman container running on MacOS

A Windows build script will be created and tested in a follow-up PR.

aittalam and others added 12 commits March 15, 2026 20:00
* Added back support for --image in CLI tool
* Added tests for multimodal cli
* Added optional mmproj parameter to TUI tests too
* Addressed review comments
* Added test to check multiple markers/images on cli
* Updated index.md

* Moar updates to index.md

* Updated quickstart.md

* Updated support + example llamafiles

* Added example files and examples + minor fixes

* Updated structure

* Removed security

* Updated source installation

* Updated README_0.10.0, now frozen doc

* Removed ref to new_build_wip in whisperfile, make setup installs cosmocc

* Apply suggestion from @dpoulopoulos

Co-authored-by: Dimitris Poulopoulos <dimitris.a.poulopoulos@gmail.com>

* Addressed review comments

* Addressed review comments #2

---------

Co-authored-by: Dimitris Poulopoulos <dimitris.a.poulopoulos@gmail.com>
* Added per-mode help + nologo/ascii support
* If model is missing, bump to help for respective mode
* Updated skill not to use new_build_wip + improved it
* Removed stray new_build_wip reference
Base automatically changed from new_build_wip to main March 19, 2026 11:13
@aittalam
Member Author

aittalam commented Mar 23, 2026

Update: I just merged some code to fix the following:

  • --fit error (the issue was related to how ggml buffers were freed and we fixed it the same way we did with Metal, see these old notes)
  • fit ignoring the mmproj model weights (so if a multimodal model is loaded, the context size is not reduced enough to fit both the model's and the projector's weights, resulting in an OOM + coredump)
  • an underflow in the calculation of free memory in the table shown at exit

We should do some tests to try this on different platforms, at least:

  • Linux with AMD GPU (steam deck)
  • Linux with CUDA
  • Windows
  • containerized linux (@dpoulopoulos I know the answer already ;-), just wanted to sync the merge of this PR with some pointers to your work 🙏 )

@wingenlit if you have some time to try it, let us know how it works with your setup! Happy to iterate on this until we are confident enough this works reliably.

@wingenlit
Contributor

Hello @aittalam, thanks for fixing.

I made a build here. So far things work; I'll report issues if I see any. Could you add the Windows DLL build, or the recompile flag (as in the past)? I had no luck with the AI-migrated scripts at linking the core libs.

@aittalam
Member Author

> Hello @aittalam, thanks for fixing.
>
> I made a build here. So far things work; I'll report issues if I see any. Could you add the Windows DLL build, or the recompile flag (as in the past)? I had no luck with the AI-migrated scripts at linking the core libs.

Wooohoo I love the sparkfile idea 💙
We are (still, sorry!) working on the win dll build - but it's great to know you'll be willing to test it afterwards 🙏😇
I think we'll first try to make sure we have a stable build outside of the APE, then reintroduce the rebuild option if people are happy with it (just out of curiosity, were you an active user of it?).

@wingenlit
Contributor

> Hello @aittalam, thanks for fixing. I made a build here. So far things work; I'll report issues if I see any. Could you add the Windows DLL build, or the recompile flag (as in the past)? I had no luck with the AI-migrated scripts at linking the core libs.

> Wooohoo I love the sparkfile idea 💙 We are (still, sorry!) working on the win dll build - but it's great to know you'll be willing to test it afterwards 🙏😇 I think we'll first try to make sure we have a stable build outside of the APE, then reintroduce the rebuild option if people are happy with it (just out of curiosity, were you an active user of it?).

Hi, I've been on llamafile since tinyblas came out. What I like is how llamafile hides a lot of OS/arch complexity, so it's easy to share with less technical people to run GGUF models. TinyBLAS out of the box on GPU gives a nice baseline acceleration. However, to take advantage of platform acceleration (like cuBLAS) without writing lengthy how-to instructions, the recompile option is the one important trick. It's been a good way to get near-full performance while keeping things lightweight.

@aittalam
Member Author

  • Worked properly on a Linux pod, after rebuilding the library from scratch. Performance is not quite on par with CUDA+cuBLAS, but looks quite good anyway.

GPU: L40
Model: Qwen3.5-9B-Q5_K_S
CUDA (with cuBLAS, necessary as Qwen3.5 models require cuBLAS TRSM, which is not available with TinyBLAS):
[image: CUDA benchmark screenshot]
Vulkan:
[image: Vulkan benchmark screenshot]

  • Tested with the library built on the Steam Deck in a distrobox: the library can be ported to
    ubuntu:24.04, as it shares the same libs as the current version of SteamOS

Steam Deck setup:

# create an ubuntu container to build the lib
distrobox create --name devbuntu --image ubuntu:24.04
distrobox enter devbuntu

# install all the required deps
sudo apt update && sudo apt install git glslc build-essential libvulkan-dev glslang-tools

@aittalam aittalam marked this pull request as ready for review March 26, 2026 12:03
@aittalam
Member Author

Code review

Found 4 issues:

  1. Cross-module delete on host buffer struct that may be allocated by the main executable. ggml_backend_vk_host_buffer_type_alloc_buffer calls ggml_backend_cpu_buffer_from_ptr, which allocates the ggml_backend_buffer struct via the main executable's new operator. The patch then sets free_struct = ggml_backend_vk_buffer_free_struct, which calls delete buffer from the dylib's allocator. This is a heap mismatch -- the regular Vulkan buffer path is correct (dylib allocates and frees), but the host buffer path crosses module boundaries.

@@ -13388,6 +13397,7 @@ static ggml_backend_buffer_t ggml_backend_vk_host_buffer_type_alloc_buffer(ggml_
     ggml_backend_buffer_t buffer = ggml_backend_cpu_buffer_from_ptr(ptr, size);
     buffer->buft = buft;
     buffer->iface.free_buffer = ggml_backend_vk_host_buffer_free_buffer;
+    buffer->iface.free_struct = ggml_backend_vk_buffer_free_struct;
     return buffer;

  2. Missing log suppression before ggml_backend_register() in ImportVulkanImpl(). In cuda.c, there is an explicit pattern (lines 165-169) that suppresses GGML logging before backend registration to prevent noisy device enumeration output when --verbose is not set. The Vulkan implementation registers the backend without this suppression.

// Register the Vulkan backend with GGML
if (g_vulkan.backend_reg) {
    ggml_backend_reg_t reg = g_vulkan.backend_reg();
    if (reg) {
        ggml_backend_register(reg);
        if (FLAG_verbose)
            fprintf(stderr, "vulkan: Vulkan backend registered with GGML\n");
    }
}

Compare with the CUDA pattern:

llamafile/llamafile/cuda.c

Lines 165 to 175 in 3a76415

// Suppress DSO's ggml logging before backend registration, which triggers
// ggml_cuda_init() inside the DSO. Without this, CUDA device enumeration
// messages appear even when --verbose is not set.
if (!FLAG_verbose && g_cuda.log_set)
    g_cuda.log_set(llamafile_log_callback_null, NULL);

// Register the CUDA backend with GGML
if (g_cuda.backend_reg) {
    ggml_backend_reg_t reg = g_cuda.backend_reg();
    if (reg) {
        ggml_backend_register(reg);

  3. llamafile_vulkan_log_set() is defined and declared but never called in main.cpp or chatbot_main.cpp. Both files call llamafile_metal_log_set and llamafile_cuda_log_set to suppress verbose GPU logging when --verbose is off, but neither calls llamafile_vulkan_log_set. The same gap was flagged and fixed for CUDA in PR #859 (Add cuda support).

void llamafile_vulkan_log_set(llamafile_log_callback log_callback, void *user_data) {
    if (!llamafile_has_vulkan())
        return;
    if (g_vulkan.log_set)
        g_vulkan.log_set(log_callback, user_data);
}

  4. llamafile_vulkan_log_set() calls llamafile_has_vulkan() first (line 208), which triggers the full initialization path including ggml_backend_register() before the log callback is set. This violates the contract documented in metal.c (line 663: "This must be set BEFORE llamafile_has_metal() is called"). The Metal implementation uses a pending callback pattern -- Vulkan should do the same.

void llamafile_vulkan_log_set(llamafile_log_callback log_callback, void *user_data) {
    if (!llamafile_has_vulkan())
        return;
    if (g_vulkan.log_set)
        g_vulkan.log_set(log_callback, user_data);
}

Compare with the Metal pattern:

llamafile/llamafile/metal.c

Lines 661 to 672 in 3a76415

void llamafile_metal_log_set(llamafile_log_callback log_callback, void *user_data) {
    // Store as pending callback - will be applied when dylib loads
    // This must be set BEFORE llamafile_has_metal() is called
    g_metal_pending_log.callback = log_callback;
    g_metal_pending_log.user_data = user_data;
    g_metal_pending_log.is_set = true;
    // If dylib is already loaded, apply immediately
    if (g_metal.lib_handle && g_metal.log_set) {
        g_metal.log_set(log_callback, user_data);
    }
}

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@aittalam
Member Author

  • Issue 1 (Cross-module memory management): The DSO compiles its own copy of ggml-backend.cpp (via compile_ggml_core in build-functions.sh), so ggml_backend_cpu_buffer_from_ptr resolves to the DSO's own copy and both allocation and deallocation happen within the same module

  • Issue 2 (Missing log suppression): good catch! Added log suppression when --verbose is not set, matching the CUDA pattern

  • Issue 3 (Unimplemented log callback): added llamafile_vulkan_log_set() calls in all three entry points (main.cpp, chatbot_main.cpp, and server.cpp)

  • Issue 4 (Premature initialization): updated llamafile_vulkan_log_set() to use a pending callback mechanism matching Metal's pattern. It now stores the callback without triggering initialization

(see commit f3d1428)

@aittalam aittalam removed the request for review from dpoulopoulos March 26, 2026 14:48
@aittalam aittalam merged commit c7d7e3e into main Mar 26, 2026
2 checks passed
@aittalam aittalam deleted the add-vulkan branch March 26, 2026 14:49
@wingenlit
Contributor

wingenlit commented Mar 26, 2026

speed test: model used Qwen3.5-9B-Q8_0.gguf

llama-benchy (0.3.5)

uv run llama-benchy --base-url local_url --model model --tokenizer Qwen/Qwen3.5-9B --concurrency 1

llama.cpp
commit 7f5ee549683d600ad41db6a295a232cdd2d8eb9f (HEAD, tag: b8198)

Results

GB10: (updated)
current llamafile vulkan

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| model | pp2048 | 586.63 ± 2.11 | | 3494.88 ± 12.60 | 3492.89 ± 12.60 | 3494.92 ± 12.59 |
| model | tg32 | 17.26 ± 1.35 | 18.00 ± 1.41 | | | |

native llama.cpp vulkan

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| model | pp2048 | 1244.57 ± 12.28 | | 1648.47 ± 16.46 | 1646.24 ± 16.46 | 1648.50 ± 16.45 |
| model | tg32 | 17.45 ± 0.00 | 18.00 ± 0.00 | | | |

AMD MI50:
current llamafile vulkan

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| model | pp2048 | 261.20 ± 0.46 | | 7846.12 ± 15.55 | 7843.44 ± 15.55 | 7846.15 ± 15.55 |
| model | tg32 | 19.03 ± 1.47 | 20.00 ± 1.41 | | | |

native llama.cpp vulkan

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| model | pp2048 | 419.78 ± 0.64 | | 4883.93 ± 7.43 | 4881.14 ± 7.43 | 4883.96 ± 7.43 |
| model | tg32 | 26.94 ± 2.29 | 27.67 ± 2.36 | | | |

Reference llamafile CUDA GB10

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| model | pp2048 | 1537.04 ± 4.47 | | 1335.27 ± 3.89 | 1333.09 ± 3.89 | 1335.34 ± 3.90 |
| model | tg32 | 17.18 ± 0.24 | 18.67 ± 0.94 | | | |

@aittalam
Member Author

> speed test: model used Qwen3.5-9B-Q8_0.gguf

Uuuuh thanks for running this! So the take-home messages are:

  • text generation on the DGX's GB10 is not much lower than native llama.cpp (good)
  • on AMD MI50 the difference is much larger (bad for llamafile, but also why? Would it be better with ROCm as CUDA is better on the GB10?)
  • in general, pp2048 is where llamafile's vulkan suffers the most - I'll need to check out what happens and how I can improve this...

This information is golden, thank you so much @wingenlit ! 🙏

@aittalam aittalam mentioned this pull request Mar 31, 2026