vulkan: Fix ErrorOutOfHostMemory on Intel GPU when loading large models with --no-mmap#20059
Merged
0cc4m merged 10 commits into ggml-org:master on Mar 12, 2026
Conversation
Contributor
Author
Hi @HumerousGorgon, thanks for testing. I'm not sure how you tested this, but on my side it is working with d1dd814 using this model. Could you provide more details (how you executed it, on which environment, with what model)?
llama-cli output for Qwen3.5-35B-A3B-Q4_K_M.gguf
0cc4m reviewed Mar 6, 2026
Tested on Arrow Lake (140T). b8191 | pr
Contributor
Author
Currently seeing the following validation error on e1f8ce0. Compared to 29a1a01, which focused on fixing the async transfer path, the new approach has a much wider influence and is causing issues in other phases. This fix will probably take more time, since we need to thoroughly review the command buffer/fence relationship over the entire execution.
Update: I was able to fix this by reusing command buffers more consistently in a0fecda
tekintian
added a commit
to tekintian/llama.cpp
that referenced
this pull request
Mar 12, 2026
* 'master' of github.com:ggml-org/llama.cpp: (33 commits)
  convert : better mtp check and fix return [no ci] (ggml-org#20419)
  vulkan: fix SSM_CONV PP scaling with large ubatch sizes (ggml-org#20379)
  New conversations now auto-select the first loaded model (ggml-org#20403)
  ggml-virtgpu: Fix some build commands (ggml-org#20341)
  metal : avoid divisions in bin kernel (ggml-org#20426)
  ci: Setup self-hosted CI for Intel Linux Vulkan backend (ggml-org#20154)
  vulkan: fix l2_norm epsilon handling (ggml-org#20350)
  vulkan: fix OOB check in flash_attn_mask_opt (ggml-org#20296)
  vulkan: Fix ErrorOutOfHostMemory on Intel GPU when loading large models with --no-mmap (ggml-org#20059)
  opencl: use larger workgroup size for get_rows (ggml-org#20316)
  opencl: add cumsum op (ggml-org#18981)
  hip: compile debug builds with -O2 on hip to avoid a compiler bug (ggml-org#20392)
  common/parser: add GigaChatV3/3.1 models support (ggml-org#19931)
  model : add support for Phi4ForCausalLMV (ggml-org#20168)
  graph : add optional scale parameter to build_lora_mm [no ci] (ggml-org#20427)
  common : fix --n-cpu-moe, --cpu-moe for models with fused gate + up (ggml-org#20416)
  ggml-webgpu: Add supports for `GGML_OP_REPEAT` (ggml-org#20230)
  llama : enable chunked fused GDN path (ggml-org#20340)
  llama : whitespace cleanup (ggml-org#20422)
  ggml : add NVFP4 quantization type support (ggml-org#19769)
  ...

Fixes #19420.
Overview
We were hitting an internal maximum (16383) on the number of command buffers in Intel's Windows GPU driver, causing ErrorOutOfHostMemory when loading large models (1 MB per transfer × 16383 ≈ 16 GB of weights or more). This PR fixes the issue by reusing command buffers that have finished transferring data.
Test Results
llama-cli.exe -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --no-mmap shows no crash on both the Intel iGPU and the NVIDIA dGPU. Chat results are correct as well.
test-backend-ops.exe passes on both the MTL iGPU and the NVIDIA dGPU.
Benchmark Results
Test environment
llama-cli.exe -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --no-mmap -> ask "tell me your name"
Test Results (498ff28)
test-backend-ops log (partial)
Intel, NVIDIA results (Windows)
llama-cli logs with validation ON
MTL iGPU log
RTX 4090 Laptop GPU log
llama-bench Results (498ff28)
MTL iGPU
No change in pp512 (45-46 t/s) or tg128 (22-23 t/s) performance.
-mmp 0 crashes on b8253, so no data.
Before PR
After PR
NVIDIA RTX 4090 Laptop GPU
-ngl 43, since it crashes with ErrorOutOfDeviceMemory on llama-bench
Before PR (b8253)
After PR
AI Disclosure
AI (GPT-5.3-Codex) was used for partial PoC coding, refactoring, and analysis.