-
Notifications
You must be signed in to change notification settings - Fork 14.5k
context : reserve new scheduler when graph topology changes #18547
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
400466c to
bd5de6b
Compare
89d19e0 to
c92df39
Compare
cf2b3ca to
4b74410
Compare
4b74410 to
b579b97
Compare
|
@ngxson PTAL at the server changes when you get the chance. They are relatively minor. |
tools/server/server-task.h
Outdated
| return type != SERVER_TASK_TYPE_EMBEDDING && | ||
| type != SERVER_TASK_TYPE_RERANK; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer having the reversed logic here: type == SERVER_TASK_TYPE_COMPLETION || type == SERVER_TASK_TYPE_INFILL
Also note that SERVER_TASK_TYPE_INFILL will be removed soon, because it's technically just a completion task with a special chat template
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed here: ffa0d15
tools/server/server-context.cpp
Outdated
| for (auto & slot : slots) { | ||
| if (!slot.is_processing() || !slot.smpl) { | ||
| llama_set_sampler(ctx, slot.id, nullptr); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be moved to slot.release()? If I understand it correctly, this means we set the sampler to nullptr if the slot is not processing anything
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good idea.
If I understand it correctly, this means we set the sampler to nullptr if the slot is not processing anything
Yes, this prevents from the llama_context adding dummy sampling nodes to the graph.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved it to server_slot.reset() in d9146ed
Also a bit of refactoring:
- Rename
server_slot.clear()->server_slot.prompt_clear() - Remove redundant
slot.reset()fromlaunch_slot_with_task(). Assumption is that all slots will be reset when they are released, so no need to do it again upon launch
|
Should be good to merge. @ngxson Let me know if you want to take one more look |
cont #17617
In some cases we know that a graph reallocation would be necessary (see #17617). Re-reserve the scheduler to reduce the amount of unexpected graph reallocations and to prevent further reallocations later.
Also, the number of active samplers when using backend sampling with
llama-serveris now properly configured. Before, for a server with N slots, we were always running N samplers, regardless of how many slots are active. Now, thanks to the new reserve logic, we disable the samplers for the inactive slots.TODOs: