
[Bug] KIMI-K2.5 can't use context parallel #22692

@zhaotyer

Description

Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

Deploying KIMI-K2.5 with context parallelism enabled crashes during decoding: the sampling step hits a CUDA device-side assert (`probability tensor contains either inf, nan or element < 0`) and the scheduler ranks abort.
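For context, this assertion matches the input check inside torch.multinomial (the sampling step): on CUDA it is performed with an async device-side assert, so any non-finite or negative probability aborts with exactly this message. A minimal, sglang-independent sketch that reproduces the same assertion (illustrative only; the NaN injection is a hypothetical stand-in for whatever the attn-CP path is producing):

    import torch

    # torch.multinomial validates its probabilities with a device-side assert
    # on CUDA; any inf/nan or negative entry fires the message in the log below.
    probs = torch.softmax(torch.randn(1, 8, device="cuda"), dim=-1)
    probs[0, 0] = float("nan")  # hypothetical: simulates bad logits from the CP path
    torch.multinomial(probs, num_samples=1)  # -> cudaErrorAssert at TensorCompare.cu:112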

[2026-04-13 10:35:43 PP1 ATTN_CP2 TP2] Decode batch, #running-req: 1, #token: 25088, token usage: 0.02, cuda graph: True, gen throughput (token/s): 21.39, #queue-req: 0
[2026-04-13 10:35:43 PP1 ATTN_CP3 TP3] Decode batch, #running-req: 1, #token: 25088, token usage: 0.02, cuda graph: True, gen throughput (token/s): 21.39, #queue-req: 0
[2026-04-13 10:35:43 PP1 ATTN_CP6 TP6] Decode batch, #running-req: 1, #token: 25088, token usage: 0.02, cuda graph: True, gen throughput (token/s): 21.38, #queue-req: 0
/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:112: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either `inf`, `nan` or element < 0` failed.
/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:112: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either `inf`, `nan` or element < 0` failed.
/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:112: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either `inf`, `nan` or element < 0` failed.
/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:112: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either `inf`, `nan` or element < 0` failed.
/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:112: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either `inf`, `nan` or element < 0` failed.
[2026-04-13 10:35:45 PP1 ATTN_CP4 TP4] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3616, in run_scheduler_process
    scheduler.run_event_loop()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1300, in run_event_loop
    dispatch_event_loop(self)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3495, in dispatch_event_loop
    scheduler.event_loop_pp()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler_pp_mixin.py", line 122, in event_loop_pp
    d2h_event.synchronize()
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/streams.py", line 231, in synchronize
    super().synchronize()
torch.AcceleratorError: CUDA error: device-side assert triggered
Search for `cudaErrorAssert' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


(The identical traceback was raised concurrently by ranks PP1 ATTN_CP0 TP0, PP1 ATTN_CP6 TP6, PP1 ATTN_CP7 TP7, and PP1 ATTN_CP1 TP1, and each process printed `terminate called after throwing an instance of 'c10::AcceleratorError'`.)


  what():  CUDA error: device-side assert triggered
Search for `cudaErrorAssert' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f2a8f57cb80 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x11fb7 (0x7f2a8f90efb7 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x1db4e (0x7f2a8f91ab4e in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1fac2 (0x7f2a8f91cac2 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x4827af (0x7f2a8130b7af in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #5: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f2a8f559d69 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #6: <unknown function> + 0x7cb658 (0x7f2a81654658 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x7cb9c5 (0x7f2a816549c5 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #8: sglang::scheduler_PP1_ATTN_CP4_TP4() [0x575eae]
frame #9: sglang::scheduler_PP1_ATTN_CP4_TP4() [0x575bfc]
frame #10: sglang::scheduler_PP1_ATTN_CP4_TP4() [0x579982]
frame #11: sglang::scheduler_PP1_ATTN_CP4_TP4() [0x59edf9]
frame #12: sglang::scheduler_PP1_ATTN_CP4_TP4() [0x558da1]
frame #13: sglang::scheduler_PP1_ATTN_CP4_TP4() [0x610665]
frame #14: sglang::scheduler_PP1_ATTN_CP4_TP4() [0x610675]
frame #15: sglang::scheduler_PP1_ATTN_CP4_TP4() [0x610675]
frame #16: sglang::scheduler_PP1_ATTN_CP4_TP4() [0x610675]
frame #17: sglang::scheduler_PP1_ATTN_CP4_TP4() [0x610675]
frame #18: sglang::scheduler_PP1_ATTN_CP4_TP4() [0x55331b]
frame #19: sglang::scheduler_PP1_ATTN_CP4_TP4() [0x59ef53]
frame #20: _PyEval_EvalFrameDefault + 0x502d (0x5db4ad in sglang::scheduler_PP1_ATTN_CP4_TP4)
frame #21: PyEval_EvalCode + 0x15b (0x5d543b in sglang::scheduler_PP1_ATTN_CP4_TP4)
frame #22: PyRun_StringFlags + 0xd3 (0x6084b3 in sglang::scheduler_PP1_ATTN_CP4_TP4)
frame #23: PyRun_SimpleStringFlags + 0x3e (0x6b3d0e in sglang::scheduler_PP1_ATTN_CP4_TP4)
frame #24: Py_RunMain + 0x481 (0x6bc9d1 in sglang::scheduler_PP1_ATTN_CP4_TP4)
frame #25: Py_BytesMain + 0x2d (0x6bc3ed in sglang::scheduler_PP1_ATTN_CP4_TP4)
frame #26: <unknown function> + 0x2a1ca (0x7f2c11d9a1ca in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #27: __libc_start_main + 0x8b (0x7f2c11d9a28b in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #28: _start + 0x25 (0x6576c5 in sglang::scheduler_PP1_ATTN_CP4_TP4)

Fatal Python error: Aborted

Thread 0x00007efce3fff6c0 (most recent call first):
  File "/sgl-workspace/sglang/python/sglang/srt/utils/watchdog.py", line 147 in _watchdog_once
  File "/sgl-workspace/sglang/python/sglang/srt/utils/watchdog.py", line 127 in _watchdog_thread
  File "/usr/lib/python3.12/threading.py", line 1010 in run
  File "/usr/lib/python3.12/threading.py", line 1073 in _bootstrap_inner
  File "/usr/lib/python3.12/threading.py", line 1030 in _bootstrap

Thread 0x00007efceffff6c0 (most recent call first):
  File "/usr/lib/python3.12/threading.py", line 359 in wait
  File "/usr/lib/python3.12/queue.py", line 180 in get
  File "/sgl-workspace/sglang/python/sglang/srt/managers/cache_controller.py", line 1047 in backup_thread_func
  File "/usr/lib/python3.12/threading.py", line 1010 in run
  File "/usr/lib/python3.12/threading.py", line 1073 in _bootstrap_inner
  File "/usr/lib/python3.12/threading.py", line 1030 in _bootstrap
terminate called after throwing an instance of 'c10::AcceleratorError'

Thread 0x00007efcfbffd6c0 (most recent call first):
  File "/usr/lib/python3.12/threading.py", line 359 in wait
  File "/usr/lib/python3.12/queue.py", line 180 in get
  File "/sgl-workspace/sglang/python/sglang/srt/managers/cache_controller.py", line 886 in prefetch_io_aux_func
  File "/usr/lib/python3.12/threading.py", line 1010 in run
  File "/usr/lib/python3.12/threading.py", line 1073 in _bootstrap_inner
  File "/usr/lib/python3.12/threading.py", line 1030 in _bootstrap

Thread 0x00007efcffffe6c0 (most recent call first):
  File "/usr/lib/python3.12/threading.py", line 359 in wait
  File "/usr/lib/python3.12/queue.py", line 180 in get
  File "/sgl-workspace/sglang/python/sglang/srt/managers/cache_controller.py", line 950 in prefetch_thread_func
  File "[2026-04-13 10:35:45] Received sigquit from a child process. It usually means the child failed.
/usr/lib/python3.12/threading.py", line 1010 in run
  File "/usr/lib/python3.12/threading.py", line 1073 in _bootstrap_inner
  File "/usr/lib/python3.12/threading.py", line 1030 in _bootstrap

Thread 0x00007f15e7fff6c0 (most recent call first):
  File "/usr/lib/python3.12/threading.py", line 359 in wait
  File "/usr/lib/python3.12/threading.py", line 655 in wait
  File "/usr/local/lib/python3.12/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.12/threading.py", line 1073 in _bootstrap_inner
  File "/usr/lib/python3.12/threading.py", line 1030 in _bootstrap

Current thread 0x00007f2c11d6f300 (most recent call first):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3621 in run_scheduler_process
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108 in run
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314 in _bootstrap
  File "/usr/lib/python3.12/multiprocessing/spawn.py", line 135 in _main
  File "/usr/lib/python3.12/multiprocessing/spawn.py", line 122 in spawn_main
  File "<string>", line 1 in <module>
(Rank PP1 ATTN_CP7 TP7 then aborted with the same `what():  CUDA error: device-side assert triggered` message and an equivalent c10 frame dump, ending in `Fatal Python error: Aborted`.)

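Per the trace's own advice, the Python stack above is likely displaced by asynchronous error reporting; a hedged first debugging step is to relaunch with synchronous kernel launches so the assert surfaces at the op that actually failed (flags otherwise identical to the start command under Reproduction):

    # Sketch: same launch as below, run with synchronous kernel launches so
    # the failing kernel is reported at its real call site.
    CUDA_LAUNCH_BLOCKING=1 python3 -m sglang.launch_server \
      --model /models/Kimi-K2.5 \
      ...   # remaining flags unchanged from the start command below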
Reproduction

Start command:

    python3 -m sglang.launch_server \
      --model /models/Kimi-K2.5 \
      --dist-init-addr $LWS_LEADER_ADDRESS:20000 \
      --tensor-parallel-size 8 \
      --pp-size 2 \
      --nnodes $LWS_GROUP_SIZE \
      --node-rank $LWS_WORKER_INDEX \
      --trust-remote-code \
      --host 0.0.0.0 \
      --port 8000 \
      --dist-timeout 7200 \
      --enable-metrics \
      --reasoning-parser kimi_k2 \
      --tool-call-parser kimi_k2 \
      --mem-fraction-static 0.85 \
      --log-requests --log-requests-level 1 \
      --kv-cache-dtype fp8_e4m3 \
      --enable-hierarchical-cache \
      --hicache-ratio 1 \
      --hicache-write-policy write_through \
      --hicache-storage-backend mooncake \
      --page-size 64 \
      --served-model-name kimi-2.5 \
      --enable-cache-report \
      --allow-auto-truncate \
      --preferred-sampling-params '{"max_new_tokens": 8192}' \
      --dp-size 1 --moe-dense-tp-size 1 \
      --attn-cp-size 8 \
      --enable-prefill-context-parallel
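
To confirm that context parallelism is the trigger (as the title suggests), a hypothetical A/B baseline is the identical launch with the two CP flags removed; if the assert disappears, the non-finite sampling probabilities are being produced somewhere in the attn-CP path:

    # Hypothetical baseline: same command as above, minus these two flags:
    #   --attn-cp-size 8 \
    #   --enable-prefill-context-parallel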

Environment

Python: 3.12.3 (main, Mar  3 2026, 12:15:18) [GCC 13.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 570.124.06
PyTorch: 2.9.1+cu129
sglang: 0.5.10
sglang-kernel: 0.4.1
flashinfer_python: 0.6.7.post2
flashinfer_cubin: 0.6.7.post2
flashinfer_jit_cache: 0.6.7.post2+cu129
triton: 3.5.1
transformers: 5.3.0
torchao: 0.9.0
numpy: 2.3.5
aiohttp: 3.13.5
fastapi: 0.135.3
huggingface_hub: 1.9.0
interegular: 0.3.3
modelscope: 1.35.3
orjson: 3.11.8
outlines: 0.1.11
packaging: 26.0
psutil: 7.2.2
pydantic: 2.12.5
python-multipart: 0.0.22
pyzmq: 27.1.0
uvicorn: 0.43.0
uvloop: 0.22.1
vllm: Module Not Found
xgrammar: 0.1.32
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.89.0
litellm: Module Not Found
torchcodec: 0.9.1
NVIDIA Topology: 
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    PIX     PIX     NODE    NODE    SYS     SYS     SYS     SYS     0-47,96-143     0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     0-47,96-143     0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    NODE    NODE    PIX     PIX     SYS     SYS     SYS     SYS     0-47,96-143     0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     0-47,96-143     0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     SYS     SYS     PIX     PIX     NODE    NODE    48-95,144-191   1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    48-95,144-191   1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     SYS     SYS     NODE    NODE    PIX     PIX     48-95,144-191   1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    48-95,144-191   1               N/A
NIC0    PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      PIX     NODE    NODE    SYS     SYS     SYS     SYS
NIC1    PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     PIX      X      NODE    NODE    SYS     SYS     SYS     SYS
NIC2    NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     NODE    NODE     X      PIX     SYS     SYS     SYS     SYS
NIC3    NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     NODE    NODE    PIX      X      SYS     SYS     SYS     SYS
NIC4    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      PIX     NODE    NODE
NIC5    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     PIX      X      NODE    NODE
NIC6    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     NODE    NODE     X      PIX
NIC7    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     NODE    NODE    PIX      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7

