
Add support for vulkan dylibs #892

Merged

aittalam merged 139 commits into main from add-vulkan on Mar 26, 2026
Conversation

@aittalam
Member

@aittalam aittalam commented Feb 25, 2026

Brings support for Vulkan dylibs into llamafile.

Tested on:

  • Linux (Ubuntu 24.04)
  • Steam Deck (SteamOS 3.7.20 Build: 20260108.1)
  • Podman container running on MacOS

A Windows build script will be created and tested in a follow-up PR.

aittalam and others added 12 commits March 15, 2026 20:00
* Added back support for --image in CLI tool
* Added tests for multimodal cli
* Added optional mmproj parameter to TUI tests too
* Addressed review comments
* Added test to check multiple markers/images on cli
* Updated index.md

* Moar updates to index.md

* Updated quickstart.md

* Updated support + example llamafiles

* Added example files and examples + minor fixes

* Updated structure

* Removed security

* Updated source installation

* Updated README_0.10.0, now frozen doc

* Removed ref to new_build_wip in whisperfile, make setup installs cosmocc

* Apply suggestion from @dpoulopoulos

Co-authored-by: Dimitris Poulopoulos <dimitris.a.poulopoulos@gmail.com>

* Addressed review comments

* Addressed review comments #2

---------

Co-authored-by: Dimitris Poulopoulos <dimitris.a.poulopoulos@gmail.com>
* Added per-mode help + nologo/ascii support
* If model is missing, bump to help for respective mode
* Updated skill not to use new_build_wip + improved it
* Removed stray new_build_wip reference
Base automatically changed from new_build_wip to main March 19, 2026 11:13
@aittalam
Member Author

aittalam commented Mar 23, 2026

Update: I just merged some code to fix the following:

  • --fit error (the issue was related to how ggml buffers were freed and we fixed it the same way we did with Metal, see these old notes)
  • fit ignoring the mmproj model weights (so if a multimodal model is loaded, the context size is not reduced enough to fit both the model's and the projector's weights, resulting in an OOM + coredump)
  • an underflow in the calculation of free memory in the table shown at exit

We should do some tests to try this on different platforms, at least:

  • Linux with AMD GPU (steam deck)
  • Linux with CUDA
  • Windows
  • containerized linux (@dpoulopoulos I know the answer already ;-), just wanted to sync the merge of this PR with some pointers to your work 🙏 )

@wingenlit if you have some time to try it, let us know how it works with your setup! Happy to iterate on this until we are confident enough this works reliably.

@wingenlit
Contributor

Hello @aittalam, thanks for fixing.

I made a build here. So far things work; I'll report issues if I see any. Could you add the Windows DLL build, or the recompile flag (as in the past)? I had no luck with the AI-migrated scripts at linking the core libs.

@aittalam
Member Author

> Hello @aittalam, thanks for fixing.
>
> I made a build here. So far things work; I'll report issues if I see any. Could you add the Windows DLL build, or the recompile flag (as in the past)? I had no luck with the AI-migrated scripts at linking the core libs.

Wooohoo I love the sparkfile idea 💙
We are (still, sorry!) working on the win dll build - but it's great to know you'll be willing to test it afterwards 🙏😇
I think we'll first try to make sure we have a stable build outside of the APE, then reintroduce the rebuild option if people are happy with it (just out of curiosity, were you an active user of it?).

@wingenlit
Contributor

> Hello @aittalam, thanks for fixing. I made a build here. So far things work; I'll report issues if I see any. Could you add the Windows DLL build, or the recompile flag (as in the past)? I had no luck with the AI-migrated scripts at linking the core libs.

> Wooohoo I love the sparkfile idea 💙 We are (still, sorry!) working on the win dll build - but it's great to know you'll be willing to test it afterwards 🙏😇 I think we'll first try to make sure we have a stable build outside of the APE, then reintroduce the rebuild option if people are happy with it (just out of curiosity, were you an active user of it?).

Hi, I've been on llamafile since tinyblas came out. What I like is how llamafile hides a lot of OS/arch complexity, so it's easy to share with less technical people to run GGUF models. TinyBLAS out of the box on GPU gives a nice baseline acceleration. However, to take advantage of platform acceleration (like cuBLAS) without writing lengthy how-to instructions, the recompile option is the one important trick. It's been a good way to get near-full performance while keeping things lightweight.

@aittalam
Member Author

  • Worked properly on a Linux pod, after rebuilding the library from scratch. Performance is not quite on par with CUDA+cuBLAS, but looks quite good anyway.

GPU: L40
Model: Qwen3.5-9B-Q5_K_S
CUDA (with cuBLAS, necessary as Qwen3.5 models require cuBLAS TRSM, which is not available with TinyBLAS):
[image: CUDA benchmark screenshot]
Vulkan:
[image: Vulkan benchmark screenshot]

  • Tested with the library built on the Steam Deck in a distrobox: the library can be ported to
    ubuntu:24.04, as it shares the same libs as the current version of SteamOS

Steam Deck setup:

# create an ubuntu container to build the lib
distrobox create --name devbuntu --image ubuntu:24.04
distrobox enter devbuntu

# install all the required deps
sudo apt update && sudo apt install git glslc build-essential libvulkan-dev glslang-tools

@aittalam aittalam marked this pull request as ready for review March 26, 2026 12:03
@aittalam
Member Author

Code review

Found 4 issues:

  1. Cross-module delete on host buffer struct that may be allocated by the main executable. ggml_backend_vk_host_buffer_type_alloc_buffer calls ggml_backend_cpu_buffer_from_ptr, which allocates the ggml_backend_buffer struct via the main executable's new operator. The patch then sets free_struct = ggml_backend_vk_buffer_free_struct, which calls delete buffer from the dylib's allocator. This is a heap mismatch -- the regular Vulkan buffer path is correct (dylib allocates and frees), but the host buffer path crosses module boundaries.

@@ -13388,6 +13397,7 @@ static ggml_backend_buffer_t ggml_backend_vk_host_buffer_type_alloc_buffer(ggml_
     ggml_backend_buffer_t buffer = ggml_backend_cpu_buffer_from_ptr(ptr, size);
     buffer->buft = buft;
     buffer->iface.free_buffer = ggml_backend_vk_host_buffer_free_buffer;
+    buffer->iface.free_struct = ggml_backend_vk_buffer_free_struct;
     return buffer;

  2. Missing log suppression before ggml_backend_register() in ImportVulkanImpl(). In cuda.c, there is an explicit pattern (lines 165-169) that suppresses GGML logging before backend registration to prevent noisy device enumeration output when --verbose is not set. The Vulkan implementation registers the backend without this suppression.

// Register the Vulkan backend with GGML
if (g_vulkan.backend_reg) {
    ggml_backend_reg_t reg = g_vulkan.backend_reg();
    if (reg) {
        ggml_backend_register(reg);
        if (FLAG_verbose)
            fprintf(stderr, "vulkan: Vulkan backend registered with GGML\n");
    }
}

Compare with the CUDA pattern:

llamafile/llamafile/cuda.c

Lines 165 to 175 in 3a76415

// Suppress DSO's ggml logging before backend registration, which triggers
// ggml_cuda_init() inside the DSO. Without this, CUDA device enumeration
// messages appear even when --verbose is not set.
if (!FLAG_verbose && g_cuda.log_set)
    g_cuda.log_set(llamafile_log_callback_null, NULL);

// Register the CUDA backend with GGML
if (g_cuda.backend_reg) {
    ggml_backend_reg_t reg = g_cuda.backend_reg();
    if (reg) {
        ggml_backend_register(reg);

  3. llamafile_vulkan_log_set() is defined and declared but never called in main.cpp or chatbot_main.cpp. Both files call llamafile_metal_log_set and llamafile_cuda_log_set to suppress verbose GPU logging when --verbose is off, but neither calls llamafile_vulkan_log_set. The same gap was flagged and fixed for CUDA in PR #859 (Add cuda support).

void llamafile_vulkan_log_set(llamafile_log_callback log_callback, void *user_data) {
    if (!llamafile_has_vulkan())
        return;
    if (g_vulkan.log_set)
        g_vulkan.log_set(log_callback, user_data);
}

  4. llamafile_vulkan_log_set() calls llamafile_has_vulkan() first (line 208), which triggers the full initialization path including ggml_backend_register() before the log callback is set. This violates the contract documented in metal.c (line 663: "This must be set BEFORE llamafile_has_metal() is called"). The Metal implementation uses a pending callback pattern -- Vulkan should do the same.

void llamafile_vulkan_log_set(llamafile_log_callback log_callback, void *user_data) {
    if (!llamafile_has_vulkan())
        return;
    if (g_vulkan.log_set)
        g_vulkan.log_set(log_callback, user_data);
}

Compare with the Metal pattern:

llamafile/llamafile/metal.c

Lines 661 to 672 in 3a76415

void llamafile_metal_log_set(llamafile_log_callback log_callback, void *user_data) {
    // Store as pending callback - will be applied when dylib loads
    // This must be set BEFORE llamafile_has_metal() is called
    g_metal_pending_log.callback = log_callback;
    g_metal_pending_log.user_data = user_data;
    g_metal_pending_log.is_set = true;
    // If dylib is already loaded, apply immediately
    if (g_metal.lib_handle && g_metal.log_set) {
        g_metal.log_set(log_callback, user_data);
    }
}

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@aittalam
Member Author

  • Issue 1 (Cross-module memory management): The DSO compiles its own copy of ggml-backend.cpp (via compile_ggml_core in build-functions.sh), so ggml_backend_cpu_buffer_from_ptr resolves to the DSO's own copy and both allocation and deallocation happen within the same module

  • Issue 2 (Missing log suppression): good catch! Added log suppression when --verbose is not set, matching the CUDA pattern

  • Issue 3 (Unimplemented log callback): added llamafile_vulkan_log_set() calls in all three entry points (main.cpp, chatbot_main.cpp, and server.cpp)

  • Issue 4 (Premature initialization): updated llamafile_vulkan_log_set() to use a pending callback mechanism matching Metal's pattern. It now stores the callback without triggering initialization

(see commit f3d1428)

@aittalam aittalam removed the request for review from dpoulopoulos March 26, 2026 14:48
@aittalam aittalam merged commit c7d7e3e into main Mar 26, 2026
2 checks passed
@aittalam aittalam deleted the add-vulkan branch March 26, 2026 14:49
@wingenlit
Contributor

wingenlit commented Mar 26, 2026

speed test: model used Qwen3.5-9B-Q8_0.gguf

llama-benchy (0.3.5)

uv run llama-benchy --base-url local_url --model model --tokenizer Qwen/Qwen3.5-9B --concurrency 1

llama.cpp
commit 7f5ee549683d600ad41db6a295a232cdd2d8eb9f (HEAD, tag: b8198)

Results

GB10: (updated)
current llamafile vulkan

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| model | pp2048 | 586.63 ± 2.11 | | 3494.88 ± 12.60 | 3492.89 ± 12.60 | 3494.92 ± 12.59 |
| model | tg32 | 17.26 ± 1.35 | 18.00 ± 1.41 | | | |

native llama.cpp vulkan

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| model | pp2048 | 1244.57 ± 12.28 | | 1648.47 ± 16.46 | 1646.24 ± 16.46 | 1648.50 ± 16.45 |
| model | tg32 | 17.45 ± 0.00 | 18.00 ± 0.00 | | | |

AMD MI50:
current llamafile vulkan

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| model | pp2048 | 261.20 ± 0.46 | | 7846.12 ± 15.55 | 7843.44 ± 15.55 | 7846.15 ± 15.55 |
| model | tg32 | 19.03 ± 1.47 | 20.00 ± 1.41 | | | |

native llama.cpp vulkan

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| model | pp2048 | 419.78 ± 0.64 | | 4883.93 ± 7.43 | 4881.14 ± 7.43 | 4883.96 ± 7.43 |
| model | tg32 | 26.94 ± 2.29 | 27.67 ± 2.36 | | | |

Reference llamafile CUDA GB10

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| model | pp2048 | 1537.04 ± 4.47 | | 1335.27 ± 3.89 | 1333.09 ± 3.89 | 1335.34 ± 3.90 |
| model | tg32 | 17.18 ± 0.24 | 18.67 ± 0.94 | | | |

@aittalam
Member Author

> speed test: model used Qwen3.5-9B-Q8_0.gguf

Uuuuh thanks for running this! So the take-home messages are:

  • text generation on the DGX's GB10 is not much lower than native llama.cpp (good)
  • on AMD MI50 the difference is much larger (bad for llamafile, but also why? Would it be better with ROCm as CUDA is better on the GB10?)
  • in general, pp2048 is where llamafile's vulkan suffers the most - I'll need to check out what happens and how I can improve this...

This information is golden, thank you so much @wingenlit ! 🙏

@aittalam aittalam mentioned this pull request Mar 31, 2026