server : do not default to multiple slots with speculative decoding #17017

Merged
ggerganov merged 2 commits into master from server/fix-draft-slots
Nov 5, 2025
Conversation

@ggerganov
Member

fix #16980

The current implementation of speculative decoding in the llama-server requires a separate draft llama_context for each slot. Combined with the new defaults from #16736 this results in extra draft contexts being allocated, increasing memory usage.

This PR updates the logic so that the default number of server slots is not increased when a draft model is specified.

@pockers21
Contributor

Multiple PRs are hitting WebGPU-related CI failures; perhaps there is some issue in the master branch code.

@ggerganov merged commit 13b339b into master Nov 5, 2025
64 of 71 checks passed
@ggerganov deleted the server/fix-draft-slots branch November 5, 2025 12:33
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Nov 5, 2025
* origin/master: (21 commits)
vulkan: Fix GGML_VULKAN_CHECK_RESULTS to better handle fusion (ggml-org#16919)
examples(gguf): GGUF example outputs (ggml-org#17025)
mtmd: allow QwenVL to process larger image by default (ggml-org#17020)
server : do not default to multiple slots with speculative decoding (ggml-org#17017)
mtmd: improve struct initialization (ggml-org#16981)
docs: Clarify the endpoint that webui uses (ggml-org#17001)
model : add openPangu-Embedded (ggml-org#16941)
ggml webgpu: minor set rows optimization (ggml-org#16810)
sync : ggml
ggml : fix conv2d_dw SVE path (ggml/1380)
CUDA: update ops.md (ggml-org#17005)
opencl: update doc (ggml-org#17011)
refactor: replace sprintf with snprintf for safer string handling in dump functions (ggml-org#16913)
vulkan: remove the need for the dryrun (ggml-org#16826)
server : do context shift only while generating (ggml-org#17000)
readme : update hot topics (ggml-org#17002)
ggml-cpu : bicubic interpolation (ggml-org#16891)
ci : apply model label to models (ggml-org#16994)
chore : fix models indent after refactor (ggml-org#16992)
Fix garbled output with REPACK at high thread counts (ggml-org#16956)
...
Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
…gml-org#17017)

* server : do not default to multiple slots with speculative decoding

* cont : fix
blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026
…#17017)

* server : do not default to multiple slots with speculative decoding

* cont : fix

Successfully merging this pull request may close these issues.

Eval bug: Dense model with draft model cause crash