Conversation

@ggerganov ggerganov commented Jan 2, 2026

cont #17617

In some cases we know that a graph reallocation would be necessary (see #17617). Re-reserve the scheduler to reduce the amount of unexpected graph reallocations and to prevent further reallocations later.

Also, the number of active samplers when using backend sampling with llama-server is now properly configured. Before, for a server with N slots, we were always running N samplers, regardless of how many slots were actually active. Now, thanks to the new reserve logic, we disable the samplers for the inactive slots.

TODOs:

  • Handle backend sampler changes

Base automatically changed from gg/metal-adjust-fa-extra-size to master January 2, 2026 17:02
@ggerganov ggerganov force-pushed the gg/llama-reserve branch 2 times, most recently from cf2b3ca to 4b74410 Compare January 11, 2026 15:49
@ggerganov
Member Author

@ngxson PTAL at the server changes when you get the chance. They are relatively minor.

Comment on lines 160 to 161
return type != SERVER_TASK_TYPE_EMBEDDING &&
       type != SERVER_TASK_TYPE_RERANK;
Collaborator

I would prefer having the reversed logic here: type == SERVER_TASK_TYPE_COMPLETION || type == SERVER_TASK_TYPE_INFILL

Also note that SERVER_TASK_TYPE_INFILL will be removed soon, because it's technically just a completion task with a special chat template

Member Author

Addressed here: ffa0d15

Comment on lines 2598 to 2602
for (auto & slot : slots) {
    if (!slot.is_processing() || !slot.smpl) {
        llama_set_sampler(ctx, slot.id, nullptr);
    }
}
Collaborator

Should this be moved to slot.release()? If I understand it correctly, this means we set the sampler to nullptr if the slot is not processing anything.

Member Author

Yes, good idea.

If I understand it correctly, this means we set the sampler to nullptr if the slot is not processing anything

Yes, this prevents llama_context from adding dummy sampling nodes to the graph.

Member Author

I moved it to server_slot.reset() in d9146ed

Also a bit of refactoring:

  • Rename server_slot.clear() -> server_slot.prompt_clear()
  • Remove the redundant slot.reset() from launch_slot_with_task(). The assumption is that every slot is reset when it is released, so there is no need to reset it again on launch.

@ggerganov
Member Author

Should be good to merge. @ngxson Let me know if you want to take one more look

@ggerganov ggerganov merged commit 39173bc into master Jan 15, 2026
74 of 76 checks passed
@ggerganov ggerganov deleted the gg/llama-reserve branch January 15, 2026 14:39