[Draft] feat: Mooncake support layerwise kv cache transfer #19931
zhangxiaolei123456 wants to merge 4 commits into sgl-project:main
Conversation
Summary of Changes

This pull request introduces a significant enhancement to the Mooncake disaggregation system by enabling asynchronous, per-layer KV cache transfer. The change optimizes the data transfer pipeline, particularly benefiting models with GQA and Mamba architectures, by allowing KV and state tensors to be transferred as soon as they are ready rather than waiting for an entire batch. It leverages CUDA events for fine-grained synchronization, potentially reducing latency and improving overall throughput in disaggregated inference setups.
Activity
Code Review
This pull request introduces asynchronous, per-layer KV cache transfer for Mooncake, which is a significant feature for improving performance in disaggregated serving. The implementation uses an asynchronous submission mechanism and hooks into the model's forward pass to trigger layer-wise transfers, which is a solid approach to overlap computation and communication. The code is generally well-structured, but my review includes suggestions to improve robustness and maintainability, particularly around error handling and code duplication.
Note: Security Review did not run due to the size of the PR.
```python
except Exception as e:
    import traceback

    traceback.print_exc()
    logger.info(f"Error in put_kvcache_thread: {e}")
    import os

    os._exit(1)
```
Using os._exit(1) in a worker thread is unsafe for a server application. It terminates the entire process abruptly, bypassing cleanup handlers, which can lead to resource leaks, corrupted state, and difficult debugging. A more graceful shutdown should be implemented, for example by signaling the main thread. For now, I'll suggest replacing this with proper exception logging to avoid crashing the whole server process on a single thread's error.
```diff
-except Exception as e:
-    import traceback
-    traceback.print_exc()
-    logger.info(f"Error in put_kvcache_thread: {e}")
-    import os
-    os._exit(1)
+except Exception:
+    logger.exception("Unhandled exception in _put_kvcache_func worker thread.")
```
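Beyond logging, the graceful shutdown mentioned above can be implemented by having the worker signal the main thread instead of calling `os._exit`. A minimal sketch of that pattern, assuming a queue-driven worker; names like `shutdown_event` and `put_kvcache_worker` are illustrative, not from this PR:

```python
import logging
import queue
import threading

logger = logging.getLogger("kv_transfer")

# Main thread polls this event and performs an orderly shutdown when it is set.
shutdown_event = threading.Event()
work_queue: "queue.Queue[object]" = queue.Queue()


def process(item: object) -> None:
    # Placeholder for the actual KV transfer work.
    if item == "bad":
        raise RuntimeError("simulated transfer failure")


def put_kvcache_worker() -> None:
    """Worker loop: on an unrecoverable error, log and signal the main thread."""
    while not shutdown_event.is_set():
        try:
            item = work_queue.get(timeout=0.1)
        except queue.Empty:
            continue
        try:
            process(item)
        except Exception:
            logger.exception("Unhandled exception in put_kvcache worker")
            shutdown_event.set()  # let the main thread run cleanup and exit
            return
```

The main thread keeps full control of process teardown (flushing logs, releasing RDMA resources) rather than being killed mid-flight by a background thread.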
```python
try:
    import torch

    if torch.cuda.is_available():
        event.synchronize()
except Exception:
    pass
```
Swallowing exceptions with a broad except Exception: pass is risky. If event.synchronize() fails, it will be silently ignored, which could lead to race conditions or hard-to-debug issues. The exception should be logged to provide visibility into potential problems.
```python
except Exception as e:
    logger.warning(f"Failed to synchronize CUDA event: {e}")
```

```python
for field in vars(mamba_cache):
    if field in ("intermediate_ssm", "intermediate_conv_window"):
        continue
    value = getattr(mamba_cache, field)
    if isinstance(value, list):
        state_tensors.extend(value)
    else:
        state_tensors.append(value)
self._mamba_state_tensors_per_layer = len(state_tensors)
```
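The loop above gathers every state tensor held by the Mamba cache while skipping the transient intermediates that should not be transferred. A self-contained sketch of the same pattern, with a stand-in `MambaCache` dataclass (the real class lives in SGLang; field names here are illustrative):

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class MambaCache:
    # Stand-in for SGLang's Mamba cache; tensor fields are plain objects here.
    conv_state: Any = "conv"
    ssm_state: Any = "ssm"
    extra_states: list = field(default_factory=lambda: ["a", "b"])
    intermediate_ssm: Any = "transient"          # excluded from transfer
    intermediate_conv_window: Any = "transient"  # excluded from transfer


def collect_state_tensors(mamba_cache: MambaCache) -> list:
    state_tensors: list = []
    for name in vars(mamba_cache):
        if name in ("intermediate_ssm", "intermediate_conv_window"):
            continue
        value = getattr(mamba_cache, name)
        if isinstance(value, list):
            state_tensors.extend(value)  # flatten list-valued fields
        else:
            state_tensors.append(value)
    return state_tensors
```

Iterating `vars(...)` keeps the collection robust if new state fields are added, at the cost of needing an explicit skip-list for transient fields.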
/tag-and-rerun-ci
@zhangxiaolei123456 Hello, could you please provide the performance test data? I understand that enabling this feature will reduce transmission time.
I re-ran the PD layerwise-KV pipeline experiment under TCP transport.

For attribution, the runtime measurements are also consistent. Overall, I think this PR is worth looking into, especially for deployments where only TCP transport is available, since that is exactly where the overlap benefit is clearest.
@UNIDY2002 Yeah, I agree. Did you test this PR under an extremely heavy workload?
```python
try:
    if self.is_generation:
```
I'm just not quite sure about this part. I'm worried that it will make the scheduler code hard to read and maintain (and hard to debug, because of the try block here), even when async transfer is not enabled. Is there a cleaner solution to achieve this?
I agree. I think we need to improve the PR's implementation.
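One cleaner direction, in line with the "standard scheduler hook" approach the later commits describe, is to give the KV manager a per-layer hook that the scheduler calls unconditionally, so no try block or capability check is needed in the scheduler itself. A hedged sketch under that assumption (class and method names are illustrative, not the PR's actual identifiers):

```python
class KVManager:
    """Base manager: the layerwise hook is a no-op by default."""

    def on_layer_transfer_ready(self, layer_id: int) -> None:
        # Synchronous managers do nothing per layer.
        return None


class LayerwiseAsyncKVManager(KVManager):
    """Async manager: records which layers were submitted for transfer."""

    def __init__(self) -> None:
        self.submitted_layers: list = []

    def on_layer_transfer_ready(self, layer_id: int) -> None:
        # In a real implementation this would record a CUDA event and
        # enqueue the layer's KV tensors for asynchronous transfer.
        self.submitted_layers.append(layer_id)


def run_forward(manager: KVManager, num_layers: int) -> None:
    # The scheduler calls the hook unconditionally: no isinstance
    # checks and no try block are needed at the call site.
    for layer_id in range(num_layers):
        manager.on_layer_transfer_ready(layer_id)
```

The default no-op keeps the synchronous path free of branches, while the async subclass overrides the hook with real behavior.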
Will test with heavier workloads.
Thanks for the suggestion. We did run an extremely heavy-load comparison with matched config. Setup:
Result (mean TTFT):
So async-ON is consistently better under this heavy long-prompt workload.
Co-authored-by: UNIDY2002 <unidy2002@outlook.com>
These flags are effectively fixed to defaults; remove dead env parsing and stale comments.
Implement layerwise async KV as an optional path in MooncakeKVManager and expose a standard scheduler hook, removing the async-manager subclass and ad-hoc capability checks.
Move the layerwise async Mooncake transfer state and helpers into a dedicated mixin so the base KV manager stays smaller while preserving the validated async overlap path.
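The mixin split described in that commit can be sketched as follows; class and attribute names are illustrative, not the PR's actual identifiers:

```python
class LayerwiseAsyncMixin:
    """Holds layerwise-async transfer state, kept out of the base manager."""

    def _init_layerwise_async(self, enabled: bool) -> None:
        self._async_enabled = enabled
        self._pending_layer_events: dict = {}

    def record_layer_event(self, layer_id: int, event: object) -> None:
        # Only track events when the async path is enabled.
        if self._async_enabled:
            self._pending_layer_events[layer_id] = event


class MooncakeKVManager(LayerwiseAsyncMixin):
    """Base manager stays small; all async state lives in the mixin."""

    def __init__(self, async_kv: bool = False) -> None:
        self._init_layerwise_async(async_kv)
```

This keeps the validated async overlap path intact while the synchronous manager remains unchanged when the flag is off.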
Motivation
Co-authored-by: UNIDY2002
Modifications
Models: Qwen3.5
Prefill
```shell
SGLANG_ASYNC_KV_MISSING_WAIT_MS=100 SGLANG_ASYNC_KV_GQA_PER_LAYER_EVENT_SYNC=1 SGLANG_ASYNC_KV_MAMBA_PER_LAYER_EVENT_SYNC=1 SGLANG_MOONCAKE_ASYNC_KV=1 GLOO_SOCKET_IFNAME=eth0 NCCL_MIN_NCHANNELS=24 NCCL_IB_QPS_PER_CONNECTION=8 SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 SGLANG_DISAGGREGATION_THREAD_POOL_SIZE=128 SGLANG_DISAGGREGATION_QUEUE_SIZE=128 python -m sglang.launch_server --model-path /data00/models/Qwen3.5-397B-A17B-FP8 --port 8000 --tp-size 8 --mem-fraction-static 0.85 --reasoning-parser qwen3 --tool-call-parser qwen3_coder --mamba-ssm-dtype float16 --kv-cache-dtype fp8_e4m3 --disaggregation-mode prefill --disaggregation-ib-device "mlx5_1,mlx5_2,mlx5_3,mlx5_4" --host 0.0.0.0 --port 30300 --disable-radix-cache --max-running-requests 64 --chunked-prefill-size 0 --max-prefill-tokens 16384 --page-size 64
```

Decode without MTP
```shell
SGLANG_ASYNC_KV_GQA_PER_LAYER_EVENT_SYNC=1 SGLANG_ASYNC_KV_MAMBA_PER_LAYER_EVENT_SYNC=1 SGLANG_MOONCAKE_ASYNC_KV=1 GLOO_SOCKET_IFNAME=eth0 NCCL_MIN_NCHANNELS=24 NCCL_IB_QPS_PER_CONNECTION=8 SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=128 SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 SGLANG_DISAGGREGATION_THREAD_POOL_SIZE=128 SGLANG_DISAGGREGATION_QUEUE_SIZE=128 python -m sglang.launch_server --model-path /data00/models/Qwen3.5-397B-A17B-FP8 --port 8000 --tp-size 8 --ep-size 8 --mem-fraction-static 0.75 --context-length 131072 --reasoning-parser qwen3 --tool-call-parser qwen3_coder --cuda-graph-bs 1 8 16 32 64 --max-running-requests 256 --mamba-ssm-dtype float16 --kv-cache-dtype fp8_e4m3 --disaggregation-mode decode --disaggregation-ib-device "mlx5_1,mlx5_2,mlx5_3,mlx5_4" --moe-runner-backend deep_gemm --moe-a2a-backend deepep --deepep-mode low_latency --host 0.0.0.0 --port 30300 --enable-metrics --disable-radix-cache --page-size 64
```

Decode with MTP
```shell
SGLANG_ASYNC_KV_GQA_PER_LAYER_EVENT_SYNC=1 SGLANG_ASYNC_KV_MAMBA_PER_LAYER_EVENT_SYNC=1 SGLANG_MOONCAKE_ASYNC_KV=1 GLOO_SOCKET_IFNAME=eth0 NCCL_MIN_NCHANNELS=24 NCCL_IB_QPS_PER_CONNECTION=8 SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=128 SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600 SGLANG_DISAGGREGATION_THREAD_POOL_SIZE=128 SGLANG_DISAGGREGATION_QUEUE_SIZE=128 python -m sglang.launch_server --model-path /data00/models/Qwen3.5-397B-A17B-FP8 --port 8000 --tp-size 8 --ep-size 8 --mem-fraction-static 0.75 --context-length 131072 --reasoning-parser qwen3 --tool-call-parser qwen3_coder --cuda-graph-bs 1 8 16 32 64 --max-running-requests 256 --mamba-ssm-dtype float16 --kv-cache-dtype fp8_e4m3 --disaggregation-mode decode --disaggregation-ib-device "mlx5_1,mlx5_2,mlx5_3,mlx5_4" --moe-runner-backend deep_gemm --moe-a2a-backend deepep --deepep-mode low_latency --host 0.0.0.0 --port 30300 --enable-metrics --disable-radix-cache --page-size 64 --speculative-algo EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
```

Accuracy Tests
Dataset: gsm8k
PD without MTP
PD with MTP
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label,/rerun-failed-ci,/tag-and-rerun-ci