[MP][Feat] support worker-affinity in the MQ thread pool #2842
ApostaC merged 5 commits into LMCache:dev
Conversation
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request refactors the multiprocess message queue server's concurrency model by introducing specialized thread pools. The primary goal is to improve performance and prevent deadlocks in GPU-bound operations by ensuring that requests from a single vLLM instance are processed sequentially on a dedicated worker thread. The change also lets the server manage diverse workloads more efficiently and gives finer-grained control over resource allocation through new configuration options.
Code Review
This pull request introduces a significant architectural improvement by implementing an affinity-based thread pool for GPU-bound operations. This elegantly removes the need for a transfer_lock, which should improve concurrency and performance. The changes are well-structured, separating the new AffinityThreadPool into its own module and cleanly refactoring the MessageQueueServer to use distinct pools for GPU and CPU tasks. The removal of the default thread pool in favor of explicit assignment is a strong design choice that enhances clarity and robustness. The accompanying documentation and test updates are thorough. My feedback includes suggestions to add validation for worker counts to improve robustness against invalid user input.
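For readers unfamiliar with the pattern, a minimal sketch of what such an affinity pool can look like is below. The class name echoes the PR's AffinityThreadPool, but the body is purely illustrative, assuming the simplest design of one single-thread executor per slot; it is not the actual implementation.

from concurrent.futures import Future, ThreadPoolExecutor

class AffinityThreadPool:
    # Illustrative sketch: tasks submitted with the same key always land on
    # the same single-thread executor, so they run in submission order.

    def __init__(self, num_workers: int):
        if num_workers <= 0:
            raise ValueError(f"num_workers must be positive, got {num_workers}")
        # One single-thread executor per slot serializes same-key work.
        self._workers = [
            ThreadPoolExecutor(max_workers=1) for _ in range(num_workers)
        ]

    def submit(self, key: bytes, fn, *args, **kwargs) -> Future:
        # Same key -> same slot -> same worker thread.
        slot = hash(key) % len(self._workers)
        return self._workers[slot].submit(fn, *args, **kwargs)

    def shutdown(self, wait: bool = True) -> None:
        for worker in self._workers:
            worker.shutdown(wait=wait)

In the server, the key would be the client's zmq identity, so all STORE/RETRIEVE requests from one vLLM instance serialize on a single thread without needing a transfer_lock.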
base = args.max_workers
max_gpu = args.max_gpu_workers if args.max_gpu_workers is not None else base
max_cpu = args.max_cpu_workers if args.max_cpu_workers is not None else base
The argument parsing logic allows for non-positive values for worker counts (e.g., --max-workers 0), which will cause a ValueError when creating the thread pools. It would be more user-friendly to validate these values here and provide a specific error message.
base = args.max_workers
max_gpu = args.max_gpu_workers if args.max_gpu_workers is not None else base
max_cpu = args.max_cpu_workers if args.max_cpu_workers is not None else base
if base <= 0:
    raise ValueError(f"--max-workers must be a positive integer, but got {base}")
if max_gpu <= 0:
    raise ValueError(f"Resolved GPU worker count must be positive, but got {max_gpu}")
if max_cpu <= 0:
    raise ValueError(f"Resolved CPU worker count must be positive, but got {max_cpu}")
Signed-off-by: ApostaC <yihua98@uchicago.edu>
* [add] thread affinity executor
* Add initialization and docs
* deprecate old blend server
What this PR does / why we need it:
Introduces thread affinity routing for GPU-bound request handlers in the multiprocess message queue server. All STORE/RETRIEVE requests from the same vLLM instance (identified by zmq identity) are always dispatched to the same worker thread. This eliminates the need for gpu_context.transfer_lock, since same-instance GPU transfers are now inherently serialized.

Also cleans up pool naming:

* The affinity pool dispatches GPU-bound handlers via hash(zmq_identity) % N, so same-client work is serialized on one thread.
* The normal pool handles CPU-bound work with a plain ThreadPoolExecutor.
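As a quick illustration of that routing rule (a sketch only; the helper name below is made up), the mapping is deterministic within a process, so every request carrying the same zmq identity lands on the same worker slot:

def affinity_slot(zmq_identity: bytes, num_workers: int) -> int:
    # hash() is stable within a single process, so a given identity always
    # resolves to the same slot; with one thread per slot, that client's
    # requests execute in submission order.
    return hash(zmq_identity) % num_workers

# Same identity always maps to the same slot.
assert affinity_slot(b"vllm-instance-0", 4) == affinity_slot(b"vllm-instance-0", 4)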
New CLI args (backward compatible):

* --max-workers sets both pools (existing behavior preserved)
* --max-gpu-workers overrides the affinity pool size
* --max-cpu-workers overrides the normal pool size
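A rough sketch of how these flags and the fallback could be wired up with argparse; the option names come from the PR, but the parser setup and the default value here are illustrative, not necessarily the actual code:

import argparse

parser = argparse.ArgumentParser(description="MQ server worker pools (sketch)")
parser.add_argument("--max-workers", type=int, default=8,
                    help="default size for both pools")
parser.add_argument("--max-gpu-workers", type=int, default=None,
                    help="override the affinity (GPU) pool size")
parser.add_argument("--max-cpu-workers", type=int, default=None,
                    help="override the normal (CPU) pool size")

args = parser.parse_args(["--max-workers", "4", "--max-gpu-workers", "2"])

# --max-workers is the fallback when a pool-specific flag is omitted.
max_gpu = args.max_gpu_workers if args.max_gpu_workers is not None else args.max_workers
max_cpu = args.max_cpu_workers if args.max_cpu_workers is not None else args.max_workers
assert (max_gpu, max_cpu) == (2, 4)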
Special notes for your reviewers:

* blend_server.py is deprecated but updated minimally to avoid breakage
* MPCacheEngine.lock (used only in clear()) is unrelated and left as-is