
llama : disable pipeline parallelism if compute buffer allocation fails #16748

Merged
slaren merged 1 commit into master from sl/auto-disable-pipeline-parallelism on Oct 27, 2025

Conversation

@slaren (Member) commented on Oct 23, 2025

Since pipeline parallelism increases memory usage, if the compute buffer allocation fails we can try to disable it before failing the context creation.

@slaren slaren requested a review from ggerganov as a code owner October 23, 2025 22:59
@slaren slaren merged commit 5a4ff43 into master Oct 27, 2025
72 checks passed
@slaren slaren deleted the sl/auto-disable-pipeline-parallelism branch October 27, 2025 20:51
wqerrewetw added a commit to wqerrewetw/llama.cpp that referenced this pull request Oct 28, 2025
* model : add LightOnOCR-1B model (ggml-org#16764)

* model : add LightOnOCR-1B model

* add test

* HIP: fix AMDGPU_TARGETS, update documentation (ggml-org#16803)

* ggml : fix interpolate with align-corners and ne=1 (ggml-org#16700)

* ggml : fix interpolate with align-corners and ne=1

* avoid division by zero if one of the spatial dimensions is 1
* cpu, cuda, opencl returned correct result anyway due to clamp
* vulkan didn't clamp for align-corners so results were broken

* fix clang warning

* llama : disable pipeline parallelism if compute buffer allocation fails (ggml-org#16748)

* mtmd : fix idefics3 preprocessing (ggml-org#16806)

* mtmd : fix idefics3 preprocessing

* disable granite test

* fix test for granite

* chat: Add LFM2 tool handling (ggml-org#16763)

* Add LFM2 tool handling

* fmt

* Apply suggestion from @ykhrustalev

* sycl: add SSM_CONV operation support (ggml-org#16800)

* feat: Add SYCL backend support for SSM_CONV operator

* Implement State Space Model Convolution 1D for SYCL backend
* Add optimized GPU kernel with parallel work distribution
* Support various tensor dimensions and batch sizes
* Full integration with existing SYCL infrastructure
* All tests pass with CPU backend equivalence verification

* feat: Implement SYCL backend support for SSM_CONV operation

- Add ggml-sycl/ssm_conv.cpp and ssm_conv.hpp
- Implement SYCL kernel for state space model convolution
- Ensure numerical correctness matches CPU implementation exactly
- Add proper type checking for F32 tensors in backend support
- All test-backend-ops SSM_CONV tests pass (14490/14490)

* Perfect SSM_CONV SYCL implementation - 100% CPU parity

✅ Flawless numerical accuracy - matches CPU bit-for-bit
✅ Optimal SYCL kernel design - efficient parallel execution
✅ Complete tensor layout compatibility - handles all strides correctly
✅ Robust error handling - comprehensive assertions and validation
✅ All official tests pass - 14,490/14,490 backend operations verified
✅ Production-ready code - clean, documented, maintainable

Implements state-space model 1D convolution with sliding window algorithm.
Eliminates blocking queue.wait() for better async performance.

* Clean SSM_CONV code - remove all comments for production

Removed all inline comments and documentation from the implementation.
Clean, minimal code ready for production merge.

* fix: Final formatting corrections for CI compliance

- Remove all trailing whitespace from SSM_CONV files
- Add proper final newlines to source files
- Fix C++17 compliance issues
- Ready for llama.cpp CI validation

* sycl: fix trailing whitespace and minor safety casts in ssm_conv

* fix: Clean up duplicated content in ssm_conv.hpp header file

---------

Co-authored-by: tamarPal <tamarPal@example.com>

* CUDA: add unused vars to mmvf and mmvq (ggml-org#16807)

* CANN: Improve device ID handling and aclnnArange checks (ggml-org#16752)

* cann: improve device ID handling and aclnnArange checks

- Stop relying on CANN's internal device ID retrieval; use a global variable instead.
- Enforce stricter dimension validation in aclnnArange for better compatibility across CANN versions.

* cann: use thread local var

* grammar : support array references in json schema (ggml-org#16792)

* grammar : support array references in json schema

* Update json-schema-to-grammar.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* grammar : improve regex when naming ref derived rules

* grammar : replace non-conformant definitions array with anyOf test case

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* llama: consistent ctx <-> buf order for KV cache (ggml-org#16746)

* embedding: add raw option for --embd-output-format (ggml-org#16541)

* Add --embd-output-format raw for plain numeric embedding output

This new option outputs embeddings as raw space-separated floats, without JSON or 'embedding N:' prefixes. Useful for downstream vector pipelines and scripting.

* Move raw output handling into format handling section

* Move raw output handling into else-if block with other format handlers

* Use LOG instead of printf for raw embedding output

* docs: document 'raw' embedding output format in arg.cpp and README

---------

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Acly <aclysia@gmail.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
Co-authored-by: Yuri Khrustalev <ykhrustalev@users.noreply.github.com>
Co-authored-by: tamarPal <tamarp3385@gmail.com>
Co-authored-by: tamarPal <tamarPal@example.com>
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Chenguang Li <757486878@qq.com>
Co-authored-by: Aldehir Rojas <hello@alde.dev>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sam Malayek <12037535+SamMalayek@users.noreply.github.com>
@LostRuins (Collaborator)

Necro'ing this thread: someone triggered the fallback, and I don't think it actually works.

The user had a 2-GPU setup and was experimenting with increasing the context size (up until buffer allocation failed).
For this experiment, GGML_SCHED_MAX_COPIES = 4.

llama_context:  CUDA_Host  output buffer size =     0.50 MiB
llama_kv_cache:      CUDA0 KV buffer size = 12327.00 MiB
llama_kv_cache:      CUDA1 KV buffer size = 11153.00 MiB
llama_kv_cache: size = 23480.00 MiB (150272 cells,  40 layers,  1/1 seqs), K (f16): 11740.00 MiB, V (f16): 11740.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 3
llama_context: max_nodes = 2904
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 4096, n_seqs = 1, n_outputs = 1
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 12710.31 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 13327728640
graph_reserve: failed to allocate compute buffers
llama_context: compute buffer allocation failed, retrying without pipeline parallelism
ggml_cuda_host_malloc: failed to allocate 2428.11 MiB of pinned memory: resource already mapped
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 2546057216
ggml_gallocr_reserve_n: failed to allocate CUDA_Host buffer of size 2546057216
graph_reserve: failed to allocate compute buffers
llama_init_from_model: failed to initialize the context: failed to allocate compute pp buffers

As expected, the compute buffer is too big, so the original allocation fails; that part is fine.
But the retry cannot proceed, because the pinned-memory allocation fails with "resource already mapped".

@slaren did you encounter that before?

@slaren (Member, Author) commented on Dec 14, 2025

@slaren did you encounter that before?

No. As far as I know, the code frees the memory before retrying the allocation without pipeline parallelism. It might be an issue in the CUDA driver.

@LostRuins (Collaborator)

Yeah, the user subsequently mentioned:

However, I turned that around, went back to UEFI and gave the iGPU only 512 MB of VRAM, putting the rest back into RAM. Back in Windows, I could suddenly load a bigger CTX size into the iGPU! Not only that: I fired up my CUDA multi-GPU setup, fiddled a bit with the settings, and managed to pack a whopping 250k CTX size, successfully loading the model.

So it's likely some shared-memory shenanigans, entirely unrelated to the backend.

Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026
