Conversation

@ggerganov ggerganov commented Jan 2, 2026

cont #17617

In some cases we know that a graph reallocation would be necessary (see #17617). Re-reserve the scheduler to reduce the amount of unexpected graph reallocations and to prevent further reallocations later.

Also, the number of active samplers when using backend sampling with llama-server is now properly configured. Before, for a server with N slots, we were always running N samplers, regardless of how many slots were actually active. Now, thanks to the new reserve logic, we disable the samplers for the inactive slots.

TODOs:

  • Handle backend sampler changes

Base automatically changed from gg/metal-adjust-fa-extra-size to master January 2, 2026 17:02
@ggerganov ggerganov force-pushed the gg/llama-reserve branch 2 times, most recently from cf2b3ca to 4b74410 Compare January 11, 2026 15:49
@ggerganov
Member Author

@ngxson PTAL at the server changes when you get the chance. They are relatively minor.

Comment on lines 160 to 161
return type != SERVER_TASK_TYPE_EMBEDDING &&
       type != SERVER_TASK_TYPE_RERANK;
Collaborator

I would prefer having the reversed logic here: type == SERVER_TASK_TYPE_COMPLETION || type == SERVER_TASK_TYPE_INFILL

Also note that SERVER_TASK_TYPE_INFILL will be removed soon, because it's technically just a completion task with a special chat template

Member Author

Addressed here: ffa0d15

Comment on lines 2598 to 2602
for (auto & slot : slots) {
    if (!slot.is_processing() || !slot.smpl) {
        llama_set_sampler(ctx, slot.id, nullptr);
    }
}
Collaborator

Should this be moved to slot.release()? If I understand it correctly, this means we set the sampler to nullptr if the slot is not processing anything.

Member Author

Yes, good idea.

If I understand it correctly, this means we set the sampler to nullptr if the slot is not processing anything

Yes, this prevents llama_context from adding dummy sampling nodes to the graph.

Member Author

I moved it to server_slot.reset() in d9146ed

Also a bit of refactoring:

  • Rename server_slot.clear() -> server_slot.prompt_clear()
  • Remove the redundant slot.reset() from launch_slot_with_task(). The assumption is that every slot is reset when it is released, so there is no need to reset it again on launch.

@ggerganov
Member Author

Should be good to merge. @ngxson Let me know if you want to take one more look

@ggerganov ggerganov merged commit 39173bc into master Jan 15, 2026
74 of 76 checks passed
@ggerganov ggerganov deleted the gg/llama-reserve branch January 15, 2026 14:39