
llama : disable pipeline parallelism if compute buffer allocation fails #16748

Merged
slaren merged 1 commit into master from sl/auto-disable-pipeline-parallelism on Oct 27, 2025

Conversation

@slaren (Member) commented on Oct 23, 2025

Since pipeline parallelism increases memory usage, if the compute buffer allocation fails we can try to disable it before failing the context creation.

@slaren slaren requested a review from ggerganov as a code owner October 23, 2025 22:59
@slaren slaren merged commit 5a4ff43 into master Oct 27, 2025
72 checks passed
@slaren slaren deleted the sl/auto-disable-pipeline-parallelism branch October 27, 2025 20:51
wqerrewetw added a commit to wqerrewetw/llama.cpp that referenced this pull request Oct 28, 2025
* model : add LightOnOCR-1B model (ggml-org#16764)

* model : add LightOnOCR-1B model

* add test

* HIP: fix AMDGPU_TARGETS, update documentation (ggml-org#16803)

* ggml : fix interpolate with align-corners and ne=1 (ggml-org#16700)

* ggml : fix interpolate with align-corners and ne=1

* avoid division by zero if one of the spatial dimensions is 1
* cpu, cuda, opencl returned correct result anyway due to clamp
* vulkan didn't clamp for align-corners so results were broken

* fix clang warning

* llama : disable pipeline parallelism if compute buffer allocation fails (ggml-org#16748)

* mtmd : fix idefics3 preprocessing (ggml-org#16806)

* mtmd : fix idefics3 preprocessing

* disable granite test

* fix test for granite

* chat: Add LFM2 tool handling (ggml-org#16763)

* Add LFM2 tool handling

* fmt

* Apply suggestion from @ykhrustalev

* sycl: add SSM_CONV operation support (ggml-org#16800)

* feat: Add SYCL backend support for SSM_CONV operator

* Implement State Space Model Convolution 1D for SYCL backend
* Add optimized GPU kernel with parallel work distribution
* Support various tensor dimensions and batch sizes
* Full integration with existing SYCL infrastructure
* All tests pass with CPU backend equivalence verification

* feat: Implement SYCL backend support for SSM_CONV operation

- Add ggml-sycl/ssm_conv.cpp and ssm_conv.hpp
- Implement SYCL kernel for state space model convolution
- Ensure numerical correctness matches CPU implementation exactly
- Add proper type checking for F32 tensors in backend support
- All test-backend-ops SSM_CONV tests pass (14490/14490)

* Perfect SSM_CONV SYCL implementation - 100% CPU parity

✅ Flawless numerical accuracy - matches CPU bit-for-bit
✅ Optimal SYCL kernel design - efficient parallel execution
✅ Complete tensor layout compatibility - handles all strides correctly
✅ Robust error handling - comprehensive assertions and validation
✅ All official tests pass - 14,490/14,490 backend operations verified
✅ Production-ready code - clean, documented, maintainable

Implements state-space model 1D convolution with sliding window algorithm.
Eliminates blocking queue.wait() for better async performance.

* Clean SSM_CONV code - remove all comments for production

Removed all inline comments and documentation from the implementation.
Clean, minimal code ready for production merge.

* fix: Final formatting corrections for CI compliance

- Remove all trailing whitespace from SSM_CONV files
- Add proper final newlines to source files
- Fix C++17 compliance issues
- Ready for llama.cpp CI validation

* sycl: fix trailing whitespace and minor safety casts in ssm_conv

* fix: Clean up duplicated content in ssm_conv.hpp header file

---------

Co-authored-by: tamarPal <tamarPal@example.com>

* CUDA: add unused vars to mmvf and mmvq (ggml-org#16807)

* CANN: Improve device ID handling and aclnnArange checks (ggml-org#16752)

* cann: improve device ID handling and aclnnArange checks

- Stop relying on CANN's internal device ID retrieval; use a global variable instead.
- Enforce stricter dimension validation in aclnnArange for better compatibility across CANN versions.

* cann: use thread local var

* grammar : support array references in json schema (ggml-org#16792)

* grammar : support array references in json schema

* Update json-schema-to-grammar.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* grammar : improve regex when naming ref derived rules

* grammar : replace non-conformant definitions array with anyOf test case

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* llama: consistent ctx <-> buf order for KV cache (ggml-org#16746)

* embedding: add raw option for --embd-output-format (ggml-org#16541)

* Add --embd-output-format raw for plain numeric embedding output

This new option outputs embeddings as raw space-separated floats, without JSON or 'embedding N:' prefixes. Useful for downstream vector pipelines and scripting.

* Move raw output handling into format handling section

* Move raw output handling into else-if block with other format handlers

* Use LOG instead of printf for raw embedding output

* docs: document 'raw' embedding output format in arg.cpp and README

---------

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Acly <aclysia@gmail.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
Co-authored-by: Yuri Khrustalev <ykhrustalev@users.noreply.github.com>
Co-authored-by: tamarPal <tamarp3385@gmail.com>
Co-authored-by: tamarPal <tamarPal@example.com>
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Chenguang Li <757486878@qq.com>
Co-authored-by: Aldehir Rojas <hello@alde.dev>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sam Malayek <12037535+SamMalayek@users.noreply.github.com>
@LostRuins (Collaborator)

Necro'ing this thread: someone triggered the fallback, and I don't think it actually works.

The user had a 2-GPU setup and was experimenting with increasing the context size (up until buffer allocation failed).
For this experiment, GGML_SCHED_MAX_COPIES = 4.

llama_context:  CUDA_Host  output buffer size =     0.50 MiB
llama_kv_cache:      CUDA0 KV buffer size = 12327.00 MiB
llama_kv_cache:      CUDA1 KV buffer size = 11153.00 MiB
llama_kv_cache: size = 23480.00 MiB (150272 cells,  40 layers,  1/1 seqs), K (f16): 11740.00 MiB, V (f16): 11740.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 3
llama_context: max_nodes = 2904
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 4096, n_seqs = 1, n_outputs = 1
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 12710.31 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 13327728640
graph_reserve: failed to allocate compute buffers
llama_context: compute buffer allocation failed, retrying without pipeline parallelism
ggml_cuda_host_malloc: failed to allocate 2428.11 MiB of pinned memory: resource already mapped
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 2546057216
ggml_gallocr_reserve_n: failed to allocate CUDA_Host buffer of size 2546057216
graph_reserve: failed to allocate compute buffers
llama_init_from_model: failed to initialize the context: failed to allocate compute pp buffers

As expected, the compute buffer is too big, so the original allocation fails; that part is fine.
But the retry cannot proceed, because the pinned-memory allocation fails with "resource already mapped".

@slaren did you encounter that before?

@slaren (Member, Author) commented on Dec 14, 2025

@slaren did you encounter that before?

No. As far as I know, the code frees the memory before retrying the allocation without pipeline parallelism. It might be an issue in the CUDA driver.

@LostRuins (Collaborator)

Yeah, the user subsequently mentioned:

However, I turned that around, went back to UEFI and gave the iGPU only 512 MB of VRAM, putting the rest back into RAM. Back in Windows, I could suddenly load a bigger CTX size into the iGPU! Not only that: I fired up my CUDA multi-GPU setup, fiddled a bit with the settings, and managed to pack a whopping 250k CTX size, successfully loading the model.

So it's likely some shared-memory shenanigans, entirely unrelated to the backend.

Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026
