Support l3 cache (mooncake store) for hiradix cache#7211

Merged
xiezhq-hermann merged 111 commits into sgl-project:main from AniZpZ:support_L3_cache_for_hiredix_cache
Jul 31, 2025

Conversation

@huangtingwei9988
Collaborator

huangtingwei9988 commented Jun 15, 2025

Motivation

Based on the hiradix cache, we implemented an L3 cache with mooncake store as the backend. For the DeepSeek-R1 (tp=8) model, TTFT for 4k-token inputs that hit the L3 cache improved by nearly 50%.

Similar to LMCache, mooncake store can share the KV cache across sglang instances.

Related issue: #6836

Modifications

todo list:

  • Tensor->bytes zero-copy optimization
  • More detailed benchmark data
  • SSD offload cache
  • Better L3 cache load and write implementation

mooncake_store.json

{
    "local_hostname": "localhost",
    "metadata_server": "P2PHANDSHAKE",
    "master_server_address": "127.0.0.1:50051",
    "protocol": "rdma"
}
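Note that the file must be strict JSON (no trailing commas). As a quick sanity check, it can be loaded the same way the server will see it, via MOONCAKE_CONFIG_PATH. This is a minimal sketch: the required-field list mirrors the example above, not an official schema, and the function name is mine.

```python
import json
import os

# Fields present in the example config above; illustrative, not an official schema.
REQUIRED_FIELDS = ("local_hostname", "metadata_server", "master_server_address", "protocol")


def load_mooncake_config(path=None):
    """Load and minimally validate a mooncake_store.json file.

    Falls back to the MOONCAKE_CONFIG_PATH environment variable when no
    path is given, matching how the launch command below points at it.
    """
    path = path or os.environ["MOONCAKE_CONFIG_PATH"]
    with open(path) as f:
        config = json.load(f)  # strict JSON: a trailing comma raises JSONDecodeError
    missing = [k for k in REQUIRED_FIELDS if k not in config]
    if missing:
        raise ValueError(f"mooncake_store.json missing fields: {missing}")
    return config
```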

launch sglang server

export MOONCAKE_CONFIG_PATH=/path/to/mooncake_store.json && \
python -m sglang.launch_server \
    --host 0.0.0.0 \
    --dtype auto \
    --mem-fraction-static 0.93 \
    --tp-size 8 \
    --max-running-requests 2 \
    --trust-remote-code \
    --enable-cache-report \
    --log-level info \
    --context-length 65536 \
    --quantization fp8 \
    --enable-torch-compile \
    --cuda-graph-max-bs 1 \
    --torch-compile-max-bs 1 \
    --model-path /path/to/DeepSeek-R1 \
    --port 8188 \
    --enable-hierarchical-cache \
    --hicache-ratio 2 \
    --enable-mooncake-store-l3-cache \
    --attention-backend flashinfer \
    --page-size 64

Related mooncake store changes: pr428, issues380, pr511

This PR is not yet complete; we will test the cache on SSD & 3fs offload later.

Co-authors: @zhangzuo21 @zhaoyongke @xinranwang17 @AniZpZ

@xiezhq-hermann xiezhq-hermann merged commit d904959 into sgl-project:main Jul 31, 2025
127 of 169 checks passed
huangzhilin-hzl pushed a commit to huangzhilin-hzl/sglang that referenced this pull request Aug 1, 2025
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
Co-authored-by: AniZpZ <zhuangsen.zp@antgroup.com>
Co-authored-by: zuoyuan <zhangzuo21@mails.tsinghua.edu.cn>
Co-authored-by: @wangyueneng.wyn <wangyueneng.wyn@antgroup.com>
Co-authored-by: JinYan Su <jinyansu792@gmail.com>
@wqlxx

wqlxx commented Aug 1, 2025

Is there a plan to support storing and loading the KV cache during decode? @xiezhq-hermann @huangtingwei9988

@ykwd
Contributor

ykwd commented Aug 1, 2025

Is there a plan to support storing and loading the KV cache during decode? @xiezhq-hermann @huangtingwei9988

We have a roadmap for that. You're very welcome to share your ideas and feedback here: #8210

TianQiLin666666 pushed a commit to TianQiLin666666/sglang that referenced this pull request Aug 1, 2025
@wqlxx

wqlxx commented Aug 2, 2025

Can metadata_server in mooncake_store.json be changed to Redis? It is currently P2PHANDSHAKE.

@stmatengss
Collaborator

Can metadata_server in mooncake_store.json be changed to Redis? It is currently P2PHANDSHAKE.

To set up a centralized metadata server, use a prefix like redis/, etcd/, or http/ followed by the metadata server's IP address and port in the metadata_server configuration. Note that the current mooncake release only supports the HTTP metadata server. To use a Redis metadata server, you need to build the mooncake project from source.
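As an illustration of the prefix rule above, here is a small hypothetical helper; the function name and parsing logic are mine, not mooncake's API (the real parsing lives inside the transfer engine):

```python
def classify_metadata_server(value: str) -> str:
    """Classify a metadata_server setting by its prefix.

    "P2PHANDSHAKE" means no centralized metadata server; otherwise a
    scheme prefix (redis/, etcd/, http/) selects the backend.
    Illustrative only -- not mooncake's actual implementation.
    """
    if value == "P2PHANDSHAKE":
        return "p2p"
    for scheme in ("redis", "etcd", "http"):
        # accept both "redis/host:port" and full "redis://host:port" forms
        if value.startswith(scheme + "/") or value.startswith(scheme + "://"):
            return scheme
    return "unknown"
```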

@wqlxx

wqlxx commented Aug 2, 2025

Can metadata_server in mooncake_store.json be changed to Redis? It is currently P2PHANDSHAKE.

To set up a centralized metadata server, use a prefix like redis/, etcd/, or http/ followed by the metadata server's IP address and port in the metadata_server configuration. Note that the current mooncake release only supports the HTTP metadata server. To use a Redis metadata server, you need to build the mooncake project from source.

@stmatengss Thanks, I will try rebuilding mooncake to use Redis.

@wqlxx

wqlxx commented Aug 5, 2025

If I use dp=8 to run DeepSeek-R1 with hicache and mooncake store, requests are sent to different DP groups and the cache hit rate is very low. My mooncake_store.json is

{
        "local_hostname":"172.16.106.102",
        "metadata_server":"172.16.106.102:2379",
        "master_server_address":"172.16.106.102:50051",
        "protocol":"rdma",
        "device_name":"mlx5_4",
        "local_buffer_size":32212254720,
        "global_segment_size":5368709120
}

run sglang with

--enable-hierarchical-cache \
--hicache-ratio 10 \
--hicache-write-policy "write_back" \
--hicache-io-backend "direct" \
--hicache-mem-layout "layer_first" \
--hicache-storage-backend "mooncake" \

@xiezhq-hermann
Collaborator

If I use dp=8 to run DeepSeek-R1 with hicache and mooncake store, requests are sent to different DP groups and the cache hit rate is very low. My mooncake_store.json is

{
        "local_hostname":"172.16.106.102",
        "metadata_server":"172.16.106.102:2379",
        "master_server_address":"172.16.106.102:50051",
        "protocol":"rdma",
        "device_name":"mlx5_4",
        "local_buffer_size":32212254720,
        "global_segment_size":5368709120
}

run sglang with

--enable-hierarchical-cache \
--hicache-ratio 10 \
--hicache-write-policy "write_back" \
--hicache-io-backend "direct" \
--hicache-mem-layout "layer_first" \
--hicache-storage-backend "mooncake" \

Hi @wqlxx you might want to check out the router for DP: https://docs.sglang.ai/router/router.html

@skyCreateXian

skyCreateXian commented Aug 8, 2025

Startup crash occurred

MOONCAKE_MASTER=10.94.16.2:50051 MOONCAKE_PROTOCOL="rdma" \
MOONCAKE_DEVICE="mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4" \
MOONCAKE_LOCAL_BUFFER_SIZE=25769803776 \
MOONCAKE_GLOBAL_SEGMENT_SIZE=25769803776 \
LOCAL_HOSTNAME=10.94.16.2 \
MOONCAKE_TE_META_DATA_SERVER=P2PHANDSHAKE python \
    -m sglang.launch_server \
    --host 0.0.0.0 \
    --dtype auto --mem-fraction-static 0.93 \
    --tp-size 1 \
    --max-running-requests 2 \
    --trust-remote-code \
    --enable-cache-report \
    --log-level info \
    --context-length 65536 \
    --enable-torch-compile \
    --cuda-graph-max-bs 1 \
    --torch-compile-max-bs 1 \
    --model-path /path/to/Qwen2-7B \
    --port 8188 \
    --enable-hierarchical-cache \
    --hicache-ratio 2 \
    --hicache-write-policy "write_back" \
    --hicache-mem-layout "layer_first" \
    --hicache-storage-backend "mooncake" \
    --page-size 64

It seems that the mooncake store failed to put; the reported failure reason is -800.

I0808 06:16:19.779994 12301 client.cpp:189] transport_type=rdma
I0808 06:16:19.792959 12301 rdma_context.cpp:423] Find best gid index: 3 on mlx5_0/
I0808 06:16:19.795940 12301 rdma_context.cpp:125] RDMA device: mlx5_0, LID: 0, GID: (GID_Index 3) 00:00:00:00:00:00:00:00:00:00:ff:ff:0a:5e:10:02
I0808 06:16:19.799877 12301 rdma_context.cpp:423] Find best gid index: 3 on mlx5_1/
I0808 06:16:19.802433 12301 rdma_context.cpp:125] RDMA device: mlx5_1, LID: 0, GID: (GID_Index 3) 00:00:00:00:00:00:00:00:00:00:ff:ff:1a:0a:a1:32
I0808 06:16:19.806269 12301 rdma_context.cpp:423] Find best gid index: 3 on mlx5_2/
I0808 06:16:19.808501 12301 rdma_context.cpp:125] RDMA device: mlx5_2, LID: 0, GID: (GID_Index 3) 00:00:00:00:00:00:00:00:00:00:ff:ff:1a:0a:a5:32
I0808 06:16:19.813170 12301 rdma_context.cpp:423] Find best gid index: 3 on mlx5_3/
I0808 06:16:19.815325 12301 rdma_context.cpp:125] RDMA device: mlx5_3, LID: 0, GID: (GID_Index 3) 00:00:00:00:00:00:00:00:00:00:ff:ff:1a:0a:a9:32
I0808 06:16:19.819001 12301 rdma_context.cpp:423] Find best gid index: 3 on mlx5_4/
I0808 06:16:19.820580 12301 rdma_context.cpp:125] RDMA device: mlx5_4, LID: 0, GID: (GID_Index 3) 00:00:00:00:00:00:00:00:00:00:ff:ff:1a:0a:ad:32
I0808 06:16:25.002265 12301 store_py.cpp:249] Mounting segment: 25769803776 bytes, 25769803776 of 25769803776
[2025-08-08 06:16:30] Connect to Mooncake store successfully.
E0808 06:16:30.926716 12301 transfer_metadata_plugin.cpp:835] SocketHandShakePlugin: connect()10.94.16.2:13098: Connection refused [111]
E0808 06:16:30.926792 12301 transfer_task.cpp:469] Failed to open segment 10.94.16.2:13098
E0808 06:16:30.926802 12301 client.cpp:1112] Failed to submit transfer operation
E0808 06:16:30.926898 12301 store_py.cpp:392] Put operation failed with error: TRANSFER_FAIL
[2025-08-08 06:16:30] An error occurred while loading the configuration: 
[2025-08-08 06:16:30] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2531, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 393, in __init__
    self.init_memory_pool_and_cache()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 599, in init_memory_pool_and_cache
    self.tree_cache = HiRadixCache(
                      ^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/hiradix_cache.py", line 76, in __init__
    self.cache_controller = HiCacheController(
                            ^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/managers/cache_controller.py", line 256, in __init__
    self.storage_backend = MooncakeStore()
                           ^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/storage/mooncake_store/mooncake_store.py", line 129, in __init__
    self.warmup()
  File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/storage/mooncake_store/mooncake_store.py", line 144, in warmup
    assert self.store.is_exist(warmup_key) == 1
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

@xiaguan
Contributor

xiaguan commented Aug 8, 2025

@skyCreateXian It seems the P2P handshake did not succeed. You could refer to this (https://kvcache-ai.github.io/Mooncake/mooncake-store-api/python-binding.html) to switch to the HTTP metadata server (included in the wheel), and change MOONCAKE_TE_META_DATA_SERVER accordingly.

By the way, there are some performance optimization PRs on the way (#8651 (comment)).

@ykwd
Contributor

ykwd commented Aug 8, 2025

I0808 07:37:38.205292 81860 store_py.cpp:249] Mounting segment: 25769803776 bytes, 25769803776 of 25769803776
E0808 07:37:42.245995 81860 transfer_metadata_plugin.cpp:835] SocketHandShakePlugin: connect()10.94.16.2:12770: Connection refused [111]
E0808 07:37:42.246037 81860 transfer_task.cpp:469] Failed to open segment 10.94.16.2:12770

@skyCreateXian Thanks for reporting the issue.

From the error log, it seems the handshake failed with a “Connection refused” message, which typically indicates one of the following potential causes:

  1. Network connectivity issue – the target IP/port may be unreachable from the current process.
  2. Metadata server not started – if you're using a metadata server, please double-check that it has been properly started.
  3. Incorrect IP configuration – please confirm that the IP addresses and ports configured are correct and consistent across nodes.
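For cause 1, reachability can be checked from the failing node with a quick TCP probe. This is a generic sketch, not part of mooncake; the function name is mine.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    A "Connection refused" error (as in the log above) surfaces here as
    ConnectionRefusedError, a subclass of OSError, and yields False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `tcp_reachable("10.94.16.2", 13098)` from the client node would show whether the refused port in the log is reachable at all.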

@SzymonOzog
Contributor

I'm having an issue where the worker crashes when trying to store to the cache:
Logs:

root@gh-3714u06:/sgl-workspace/sglang# mooncake_master
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0811 14:17:16.430603 10307 master.cpp:78] Master service started on port 50051, enable_gc=0, max_threads=4, enable_metric_reporting=1, metrics_port=9003, default_kv_lease_ttl=5000, default_kv_soft_pin_ttl=1800000, allow_evict_soft_pinned_objects=1, eviction_ratio=0.1, eviction_high_watermark_ratio=1, enable_ha=0, etcd_endpoints=, client_ttl=10, rpc_thread_num=0, rpc_port=0, rpc_address=0.0.0.0, rpc_conn_timeout_seconds=0, rpc_enable_tcp_no_delay=1, cluster_id=mooncake_cluster, memory_allocator=offset
I0811 14:17:16.446893 10307 rpc_service.cpp:180] HTTP metrics server started on port 9003
I0811 14:17:16.447414 10318 rpc_service.cpp:48] Master Metrics: Storage: 0.00 B / 0.00 B | Keys: 0 (soft-pinned: 0) | Requests (Success/Total): Put=0/0, Get=0/0, Exist=0/0, Del=0/0, DelAll=0/0,  | Eviction: Success/Attempts=0/0, keys=0, size=0.00 B
...
I0811 14:18:46.448449 10318 rpc_service.cpp:48] Master Metrics: Storage: 10.00 MB / 4.00 GB (0.2%) | Keys: 1 (soft-pinned: 0) | Requests (Success/Total): Put=2/2, Get=1/1, Exist=1/1, Del=0/1, DelAll=0/0,  | Eviction: Success/Attempts=0/0, keys=0, size=0.00 B
...
I0811 14:19:46.449074 10318 rpc_service.cpp:48] Master Metrics: Storage: 906.00 MB / 4.00 GB (22.1%) | Keys: 458753 (soft-pinned: 0) | Requests (Success/Total): Put=2/2, Get=1/1, Exist=1/1, Del=0/1, DelAll=0/0,  | Eviction: Success/Attempts=0/0, keys=0, size=0.00 B
I0811 14:19:56.449198 10318 rpc_service.cpp:48] Master Metrics: Storage: 906.00 MB / 4.00 GB (22.1%) | Keys: 458753 (soft-pinned: 0) | Requests (Success/Total): Put=2/2, Get=1/1, Exist=1/1, Del=0/1, DelAll=0/0,  | Eviction: Success/Attempts=0/0, keys=0, size=0.00 B
root@gh-3714u06:/sgl-workspace/sglang# python -m mooncake.http_metadata_server
2025-08-11 14:17:18,331 - root - INFO - HTTP Metadata Server started on 0.0.0.0:8080
2025-08-11 14:18:43,824 - aiohttp.access - INFO - 127.0.0.1 [11/Aug/2025:14:18:43 +0000] "PUT /metadata?key=mooncake%2Frpc_meta%2Flocalhost%3A12625 HTTP/1.1" 200 176 "-" "-"
2025-08-11 14:18:43,827 - aiohttp.access - INFO - 127.0.0.1 [11/Aug/2025:14:18:43 +0000] "PUT /metadata?key=mooncake%2Fram%2Flocalhost%3A12625 HTTP/1.1" 200 176 "-" "-"
2025-08-11 14:18:43,831 - aiohttp.access - INFO - 127.0.0.1 [11/Aug/2025:14:18:43 +0000] "PUT /metadata?key=mooncake%2Fram%2Flocalhost%3A12625 HTTP/1.1" 200 176 "-" "-"
2025-08-11 14:18:43,832 - aiohttp.access - INFO - 127.0.0.1 [11/Aug/2025:14:18:43 +0000] "PUT /metadata?key=mooncake%2Fram%2Flocalhost%3A12625 HTTP/1.1" 200 176 "-" "-"
2025-08-11 14:18:43,845 - aiohttp.access - INFO - 127.0.0.1 [11/Aug/2025:14:18:43 +0000] "GET /metadata?key=mooncake%2Frpc_meta%2Flocalhost%3A12625 HTTP/1.1" 200 194 "-" "-"
2025-08-11 14:18:43,853 - aiohttp.access - INFO - 127.0.0.1 [11/Aug/2025:14:18:43 +0000] "PUT /metadata?key=mooncake%2Fram%2Flocalhost%3A12625 HTTP/1.1" 200 176 "-" "-"

root@gh-3714u06:/sgl-workspace/sglang# MOONCAKE_TE_META_DATA_SERVER="http://127.0.0.1:8080/metadata" \
MOONCAKE_GLOBAL_SEGMENT_SIZE=4294967296 \
MOONCAKE_LOCAL_BUFFER_SIZE=134217728 \
MOONCAKE_MASTER=127.0.0.1:50051 \
python -m sglang.launch_server \
    --enable-hierarchical-cache \
    --hicache-storage-backend mooncake \
    --model-path /scratch/Qwen3-1.7B \
    --port 42410
W0811 14:17:26.174000 10329 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0811 14:17:26.174000 10329 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[2025-08-11 14:17:26] server_args=ServerArgs(model_path='/scratch/Qwen3-1.7B', tokenizer_path='/scratch/Qwen3-1.7B', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=42410, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.874, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, device='cuda', tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=860942763, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, api_key=None, served_model_name='/scratch/Qwen3-1.7B', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, 
max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend=None, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=True, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend='mooncake', hicache_storage_prefetch_policy='best_effort', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, cuda_graph_max_bs=None, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, 
tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, enable_triton_kernel_moe=False, enable_flashinfer_mxfp4_moe=False, scheduler_recv_interval=1, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, pdlb_url=None, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, tool_server=None, enable_ep_moe=False, enable_deepep_moe=False)
[2025-08-11 14:17:26] Using default HuggingFace chat template with detected content format: string
W0811 14:17:30.981000 10527 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0811 14:17:30.981000 10527 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0811 14:17:31.073000 10528 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0811 14:17:31.073000 10528 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[2025-08-11 14:17:31] Attention backend not explicitly specified. Use flashinfer backend by default.
[2025-08-11 14:17:31] Init torch distributed begin.
[W811 14:17:31.818152819 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-08-11 14:17:31] Init torch distributed ends. mem usage=0.00 GB
[2025-08-11 14:17:32] Load weight begin. avail mem=78.66 GB
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.50it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.50it/s]

[2025-08-11 14:17:33] Load weight end. type=Qwen3ForCausalLM, dtype=torch.bfloat16, avail mem=75.22 GB, mem usage=3.44 GB.
[2025-08-11 14:17:33] KV Cache is allocated. #tokens: 611462, K size: 32.66 GB, V size: 32.66 GB
[2025-08-11 14:17:33] Memory pool end. avail mem=9.17 GB
[2025-08-11 14:17:33] Capture cuda graph begin. This can take up to several minutes. avail mem=8.47 GB
[2025-08-11 14:17:34] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160]
Capturing batches (bs=1 avail_mem=7.92 GB): 100%|██████████| 23/23 [00:01<00:00, 19.05it/s]
[2025-08-11 14:17:35] Capture cuda graph end. Time elapsed: 1.56 s. mem usage=0.58 GB. avail mem=7.90 GB.
[2025-08-11 14:17:35] max_total_num_tokens=611462, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4096, context_len=40960, available_gpu_mem=7.90 GB
[2025-08-11 14:17:35] Allocating 140.25 GB host memory for hierarchical KV cache.
[2025-08-11 14:18:43] Mooncake Configuration loaded from env successfully.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0811 14:18:43.820552 10527 transfer_engine.cpp:422] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I0811 14:18:43.820778 10527 client.cpp:43] client_id=11621013090028596196-18064563466103885725
I0811 14:18:43.822118 10527 client.cpp:236] Storage root directory is not set. persisting data is disabled.
I0811 14:18:43.822124 10527 client.cpp:86] auto discovery set by env MC_MS_AUTO_DISC
I0811 14:18:43.822127 10527 client.cpp:116] whitelist filters: mlx5_bond_0, mlx5_bond_1, mlx5_bond_2, mlx5_bond_3
I0811 14:18:43.822134 10527 transfer_engine.cpp:44] Transfer Engine starting. Server: localhost:12625, Metadata: http://127.0.0.1:8080/metadata, ip_or_host_name: localhost, rpc_port: 12625
I0811 14:18:43.822136 10527 transfer_engine.cpp:63] Transfer Engine parseHostNameWithPort. server_name: localhost port: 12625
I0811 14:18:43.822871 10527 transfer_metadata_plugin.cpp:1053] Found active interface bond0 with IP *
I0811 14:18:43.822878 10527 transfer_metadata_plugin.cpp:1053] Found active interface docker0 with IP *
I0811 14:18:43.822880 10527 transfer_metadata_plugin.cpp:1053] Found active interface ib0.0065 with IP *
I0811 14:18:43.822885 10527 transfer_metadata_plugin.cpp:1053] Found active interface br-b8fb2a5138a1 with IP *
I0811 14:18:43.822896 10527 transfer_engine.cpp:114] Transfer Engine RPC using new RPC mapping, listening on *:16430
I0811 14:18:43.824179 10527 transfer_engine.cpp:138] Auto-discovering topology...
I0811 14:18:43.827481 10527 transfer_engine.cpp:153] Topology discovery complete. Found 0 HCAs.
I0811 14:18:43.827837 10527 tcp_transport.cpp:250] TcpTransport: listen on port 15111
I0811 14:18:43.828161 10527 client.cpp:192] transport_type=tcp
W0811 14:18:43.828166 10527 transfer_engine.cpp:207] Transport tcp already installed
I0811 14:18:43.831763 10527 store_py.cpp:249] Mounting segment: 4294967296 bytes, 4294967296 of 4294967296
[2025-08-11 14:18:43] Connect to Mooncake store successfully.
[2025-08-11 14:18:43] Mooncake store warmup successfully.
[2025-08-11 14:18:43] HiCache storage prefetch policy: best_effort
[2025-08-11 14:18:45] INFO:     Started server process [10329]
[2025-08-11 14:18:45] INFO:     Waiting for application startup.
[2025-08-11 14:18:45] Ignoring mcp import error
[2025-08-11 14:18:45] Ignoring mcp import error
[2025-08-11 14:18:45] INFO:     Application startup complete.
[2025-08-11 14:18:45] INFO:     Uvicorn running on http://127.0.0.1:42410 (Press CTRL+C to quit)
[2025-08-11 14:18:46] INFO:     127.0.0.1:37160 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-08-11 14:18:46] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-08-11 14:18:46] INFO:     127.0.0.1:37162 - "POST /generate HTTP/1.1" 200 OK
[2025-08-11 14:18:46] The server is fired up and ready to roll!
[2025-08-11 14:18:53] Prefill batch. #new-seq: 1, #new-token: 8192, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-08-11 14:18:53] Prefill batch. #new-seq: 1, #new-token: 3816, #cached-token: 0, token usage: 0.01, #running-req: 0, #queue-req: 0,
[2025-08-11 14:18:53] Decode batch. #running-req: 1, #token: 12041, token usage: 0.02, cuda graph: True, gen throughput (token/s): 4.63, #queue-req: 0,
[2025-08-11 14:18:53] Decode batch. #running-req: 1, #token: 12081, token usage: 0.02, cuda graph: True, gen throughput (token/s): 323.65, #queue-req: 0,
[2025-08-11 14:18:53] Decode batch. #running-req: 1, #token: 12121, token usage: 0.02, cuda graph: True, gen throughput (token/s): 326.67, #queue-req: 0,
[2025-08-11 14:18:53] Decode batch. #running-req: 1, #token: 12161, token usage: 0.02, cuda graph: True, gen throughput (token/s): 327.06, #queue-req: 0,
[2025-08-11 14:18:54] Decode batch. #running-req: 1, #token: 12201, token usage: 0.02, cuda graph: True, gen throughput (token/s): 327.10, #queue-req: 0,
[2025-08-11 14:18:54] INFO:     127.0.0.1:51830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-08-11 14:19:36] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 12007, token usage: 0.02, #running-req: 0, #queue-req: 0,
[2025-08-11 14:19:36] Decode batch. #running-req: 1, #token: 12041, token usage: 0.02, cuda graph: True, gen throughput (token/s): 0.93, #queue-req: 0,
[2025-08-11 14:19:37] Decode batch. #running-req: 1, #token: 12081, token usage: 0.02, cuda graph: True, gen throughput (token/s): 325.79, #queue-req: 0,
[2025-08-11 14:19:37] Decode batch. #running-req: 1, #token: 12121, token usage: 0.02, cuda graph: True, gen throughput (token/s): 326.51, #queue-req: 0,
[2025-08-11 14:19:37] Decode batch. #running-req: 1, #token: 12161, token usage: 0.02, cuda graph: True, gen throughput (token/s): 325.95, #queue-req: 0,
[2025-08-11 14:19:37] Decode batch. #running-req: 1, #token: 12201, token usage: 0.02, cuda graph: True, gen throughput (token/s): 325.00, #queue-req: 0,
[2025-08-11 14:19:37] INFO:     127.0.0.1:34658 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Fatal Python error: Aborted

Thread 0x00007eea0bfff640 (most recent call first):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2117 in watchdog_thread
  File "/usr/lib/python3.12/threading.py", line 1012 in run
  File "/usr/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/usr/lib/python3.12/threading.py", line 1032 in _bootstrap

Current thread 0x00007eeaefffe640 (most recent call first):
  File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/storage/mooncake_store/mooncake_store.py", line 252 in _put_batch_zero_copy_impl
  File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/storage/mooncake_store/mooncake_store.py", line 191 in batch_set
  File "/sgl-workspace/sglang/python/sglang/srt/managers/cache_controller.py", line 739 in mooncake_page_backup
  File "/sgl-workspace/sglang/python/sglang/srt/managers/cache_controller.py", line 775 in backup_thread_func
  File "/usr/lib/python3.12/threading.py", line 1012 in run
  File "/usr/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/usr/lib/python3.12/threading.py", line 1032 in _bootstrap

Thread 0x00007eeb0fffe640 (most recent call first):
  File "/usr/lib/python3.12/threading.py", line 359 in wait
  File "/usr/lib/python3.12/queue.py", line 180 in get
  File "/sgl-workspace/sglang/python/sglang/srt/managers/cache_controller.py", line 595 in prefetch_io_aux_func
  File "/usr/lib/python3.12/threading.py", line 1012 in run
  File "/usr/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/usr/lib/python3.12/threading.py", line 1032 in _bootstrap

Thread 0x00007eeb5ffff640 (most recent call first):
  File "/usr/lib/python3.12/threading.py", line 359 in wait
  File "/usr/lib/python3.12/queue.py", line 180 in get
  File "/sgl-workspace/sglang/python/sglang/srt/managers/cache_controller.py", line 624 in prefetch_thread_func
  File "/usr/lib/python3.12/threading.py", line 1012 in run
  File "/usr/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/usr/lib/python3.12/threading.py", line 1032 in _bootstrap

Thread 0x00007eeb9ffff640 (most recent call first):
  File "/usr/lib/python3.12/threading.py", line 359 in wait
  File "/usr/lib/python3.12/threading.py", line 655 in wait
  File "/sgl-workspace/sglang/python/sglang/srt/managers/cache_controller.py", line 470 in load_thread_func_layer_by_layer
  File "/usr/lib/python3.12/threading.py", line 1012 in run
  File "/usr/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/usr/lib/python3.12/threading.py", line 1032 in _bootstrap

Thread 0x00007eead7fff640 (most recent call first):
  File "/usr/lib/python3.12/threading.py", line 359 in wait
  File "/usr/lib/python3.12/queue.py", line 180 in get
  File "/sgl-workspace/sglang/python/sglang/srt/managers/cache_controller.py", line 447 in write_thread_func_direct
  File "/usr/lib/python3.12/threading.py", line 1012 in run
  File "/usr/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/usr/lib/python3.12/threading.py", line 1032 in _bootstrap

Thread 0x00007f31f7fff640 (most recent call first):
  File "/usr/lib/python3.12/threading.py", line 355 in wait
  File "/usr/lib/python3.12/queue.py", line 171 in get
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 153 in forward_thread_func_
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120 in decorate_context
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 141 in forward_thread_func
  File "/usr/lib/python3.12/threading.py", line 1012 in run
  File "/usr/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/usr/lib/python3.12/threading.py", line 1032 in _bootstrap

Thread 0x00007f3215fff640 (most recent call first):
  File "/usr/lib/python3.12/threading.py", line 359 in wait
  File "/usr/lib/python3.12/threading.py", line 655 in wait
  File "/usr/local/lib/python3.12/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/usr/lib/python3.12/threading.py", line 1032 in _bootstrap

Thread 0x00007f431dfff640 (most recent call first):
  File "/usr/lib/python3.12/threading.py", line 359 in wait
  File "/usr/lib/python3.12/threading.py", line 655 in wait
  File "/usr/local/lib/python3.12/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/usr/lib/python3.12/threading.py", line 1032 in _bootstrap

Thread 0x00007f455605e640 (most recent call first):
  File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 61 in _recv_msg
  File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 195 in _read_thread
  File "/usr/lib/python3.12/threading.py", line 1012 in run
  File "/usr/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/usr/lib/python3.12/threading.py", line 1032 in _bootstrap

Thread 0x00007f4b6eb47480 (most recent call first):
  File "/usr/local/lib/python3.12/dist-packages/zmq/sugar/socket.py", line 989 in recv_pyobj
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1320 in recv_requests
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 814 in event_loop_overlap
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120 in decorate_context
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2571 in run_scheduler_process
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108 in run
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314 in _bootstrap
  File "/usr/lib/python3.12/multiprocessing/spawn.py", line 135 in _main
  File "/usr/lib/python3.12/multiprocessing/spawn.py", line 122 in spawn_main
  File "<string>", line 1 in <module>

Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, pybase64._pybase64, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, uvloop.loop, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, psutil._psutil_linux, psutil._psutil_posix, zmq.backend.cython._zmq, PIL._imaging, setproctitle._setproctitle, yaml._yaml, regex._regex, markupsafe._speedups, PIL._imagingft, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._pcg64, numpy.random._mt19937, numpy.random._generator, numpy.random._philox, numpy.random._sfc64, numpy.random.mtrand, _cffi_backend, scipy._lib._ccallback_c, scipy.linalg._fblas, scipy.linalg._flapack, _cyutility, scipy._cyutility, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_schur_sqrtm, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._slsqplib, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy._lib._uarray._uarray, 
scipy.special._ufuncs_cxx, scipy.special._ellip_harm_2, scipy.special._special_ufuncs, scipy.special._gufuncs, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._hausdorff, scipy.spatial._distance_wrap, scipy.spatial.transform._rotation, scipy.spatial.transform._rigid_transform, scipy.optimize._direct, sentencepiece._sentencepiece, cuda.bindings._bindings.cydriver, cuda.bindings.cydriver, cuda.bindings.driver, cuda.bindings._lib.utils, cuda.bindings._bindings.cyruntime_ptds, cuda.bindings._bindings.cyruntime, cuda.bindings._lib.cyruntime.utils, cuda.bindings._lib.cyruntime.cyruntime, cuda.bindings.cyruntime, cuda.bindings.runtime, cuda.bindings.utils._get_handle, cuda.bindings._bindings.cynvrtc, cuda.bindings.cynvrtc, cuda.bindings.nvrtc, msgspec._core, cuda_utils, __triton_launcher (total: 115)

@xiaguan
Contributor

xiaguan commented Aug 12, 2025

@SzymonOzog
Could you share your mooncake configuration? I noticed you've enabled some flags.

Regarding the segmentation fault, we have insufficient information to pinpoint where the error occurs (whether in sglang or mooncake). However, we'll consider how to improve this situation in Mooncake.
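As an aside, the Python-side thread dumps earlier in this thread ("Fatal Python error: Aborted" followed by per-thread tracebacks) are the kind of output the stdlib `faulthandler` module produces on fatal signals. A minimal sketch of enabling it, useful for getting at least the Python stacks when a native extension aborts:

```python
import faulthandler
import sys

# Dump the Python stacks of all threads to stderr when the process
# receives a fatal signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS).
faulthandler.enable(file=sys.stderr, all_threads=True)

print(faulthandler.is_enabled())  # → True
```

This only explains the Python side of the crash; pinpointing the native frame (sglang vs. Mooncake) still requires a core dump or gdb, as shown below in this thread.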

@SzymonOzog
Contributor

@xiaguan Other than MOONCAKE_TE_META_DATA_SERVER="http://127.0.0.1:8080/metadata" MOONCAKE_GLOBAL_SEGMENT_SIZE=4294967296 MOONCAKE_LOCAL_BUFFER_SIZE=134217728 MOONCAKE_MASTER=127.0.0.1:50051 from the launch command, which are mapped in:

`def load_from_env() -> "MooncakeStoreConfig":`

I have everything at defaults. I'm also investigating at the moment and will keep you updated once I find something.
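For readers unfamiliar with that mapping: the environment variables above are read once at startup into a config object. A minimal sketch of how such an env-based loader typically looks (field names and fallbacks here are illustrative, not the actual `MooncakeStoreConfig` implementation):

```python
import os
from dataclasses import dataclass


@dataclass
class StoreConfig:
    # Illustrative fields only; the real MooncakeStoreConfig may differ.
    master_server_address: str
    metadata_server: str
    global_segment_size: int
    local_buffer_size: int

    @staticmethod
    def load_from_env() -> "StoreConfig":
        return StoreConfig(
            master_server_address=os.environ["MOONCAKE_MASTER"],
            metadata_server=os.environ["MOONCAKE_TE_META_DATA_SERVER"],
            # Sizes fall back to the defaults quoted in this thread.
            global_segment_size=int(
                os.environ.get("MOONCAKE_GLOBAL_SEGMENT_SIZE", 4294967296)
            ),
            local_buffer_size=int(
                os.environ.get("MOONCAKE_LOCAL_BUFFER_SIZE", 134217728)
            ),
        )


os.environ.setdefault("MOONCAKE_MASTER", "127.0.0.1:50051")
os.environ.setdefault("MOONCAKE_TE_META_DATA_SERVER", "http://127.0.0.1:8080/metadata")
cfg = StoreConfig.load_from_env()
print(cfg.master_server_address)  # → 127.0.0.1:50051
```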

@SzymonOzog
Contributor

@xiaguan

It seems the problem is an interaction between CUDA and the TCP transport; switching to the RDMA backend fixed it for me. I've opened a pull request to Mooncake with one fix but ran into more issues. Here is the backtrace from the last segfault:

(gdb) bt
#0  0x00007f6e1ffc7944 in _Unwind_GetRegionStart () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#1  0x00007f6e199c9a49 in __gxx_personality_v0 () from /lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x00007f6b84274fe9 in __libunwind_Unwind_Resume () from /lib/x86_64-linux-gnu/libunwind.so.8
#3  0x00007f678c7d7448 in asio::detail::do_throw_error (err=std::error_code = {std::_V2::error_category: 99}, location=location@entry=0x7f678c9ba7d7 "connect") at /usr/local/include/asio/detail/impl/throw_error.ipp:63
#4  0x00007f678c7d128b in asio::detail::throw_error (location=0x7f678c9ba7d7 "connect", err=std::error_code = {std::_V2::error_category: 99}) at /usr/local/include/asio/detail/throw_error.hpp:50
#5  asio::connect<asio::ip::tcp, asio::any_io_executor, asio::ip::basic_resolver_results<asio::ip::tcp> > (endpoints=..., s=...) at /usr/local/include/asio/impl/connect.hpp:114
#6  mooncake::TcpTransport::startTransfer (this=<optimized out>, slice=0x7f4749e95f60) at /sgl-workspace/sglang/Mooncake/mooncake-transfer-engine/src/transport/tcp_transport/tcp_transport.cpp:474
#7  0x00007f678c7d1ae9 in mooncake::TcpTransport::submitTransferTask (this=0x3fff0ba0, task_list=std::vector of length 1, capacity 1 = {...}) at /sgl-workspace/sglang/Mooncake/mooncake-transfer-engine/src/transport/tcp_transport/tcp_transport.cpp:432
#8  0x00007f678c992f69 in mooncake::MultiTransport::submitTransfer (this=0x39a395d0, batch_id=batch_id@entry=139944159436480, entries=std::vector of length 1, capacity 1 = {...}) at /sgl-workspace/sglang/Mooncake/mooncake-transfer-engine/src/multi_transport.cpp:104
#9  0x00007f678c8f299b in mooncake::TransferEngine::submitTransfer (entries=std::vector of length 1, capacity 1 = {...}, batch_id=139944159436480, this=<optimized out>) at /sgl-workspace/sglang/Mooncake/mooncake-transfer-engine/include/transfer_engine.h:124
#10 mooncake::TransferSubmitter::submitTransferEngineOperation (this=0x40037a40, handles=..., slices=std::vector of length 1, capacity 1 = {...}, op_code=mooncake::Transport::TransferRequest::WRITE) at /sgl-workspace/sglang/Mooncake/mooncake-store/src/transfer_task.cpp:505
#11 0x00007f678c8f3282 in mooncake::TransferSubmitter::submit (this=0x40037a40, replica=..., slices=std::vector of length 1, capacity 1 = {...}, op_code=op_code@entry=mooncake::Transport::TransferRequest::WRITE) at /sgl-workspace/sglang/Mooncake/mooncake-store/src/transfer_task.cpp:415
#12 0x00007f678c80a102 in mooncake::Client::SubmitTransfers (this=0x402f2a10, ops=...) at /sgl-workspace/sglang/Mooncake/mooncake-store/src/client.cpp:708
#13 0x00007f678c80e3fe in mooncake::Client::BatchPut (this=0x402f2a10, keys=std::vector of length 134792, capacity 134792 = {...}, batched_slices=std::vector of length 134792, capacity 134792 = {...}, config=...) at /sgl-workspace/sglang/Mooncake/mooncake-store/src/client.cpp:944
#14 0x00007f678c951abd in mooncake::PyClient::batch_put_from_internal (this=0x4053c3a0, keys=std::vector of length 134792, capacity 134792 = {...}, buffers=std::vector of length 134792, capacity 134792 = {...}, sizes=std::vector of length 134792, capacity 134792 = {...}, config=...) at /usr/include/c++/11/bits/shared_ptr_base.h:1295
#15 0x00007f678c951b8e in mooncake::PyClient::batch_put_from (this=this@entry=0x4053c3a0, keys=std::vector of length 134792, capacity 134792 = {...}, buffers=std::vector of length 134792, capacity 134792 = {...}, sizes=std::vector of length 134792, capacity 134792 = {...}, config=...) at /sgl-workspace/sglang/Mooncake/mooncake-store/src/pybind_client.cpp:760
#16 0x00007f678c7a7716 in operator() (__closure=<optimized out>, config=..., sizes=std::vector of length 134792, capacity 134792 = {...}, buffer_ptrs=std::vector of length 134792, capacity 134792 = {...}, keys=std::vector of length 134792, capacity 134792 = {...}, self=...) at /sgl-workspace/sglang/Mooncake/mooncake-integration/store/store_py.cpp:505
#17 pybind11::detail::argument_loader<mooncake::MooncakeStorePyWrapper&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<unsigned long, std::allocator<unsigned long> > const&, mooncake::ReplicateConfig const&>::call_impl<std::vector<int>, mooncake::pybind11_i
nit_store(pybind11::module_&)::<lambda(mooncake::MooncakeStorePyWrapper&, const std::vector<std::__cxx11::basic_string<char> >&, const std::vector<long unsigned int>&, const std::vector<long unsigned int>&, const mooncake::ReplicateConfig&)>&, 0, 1, 2, 3, 4, pybind11::detail::void_type> (f=..., this=0x7f477bffd180) at /sgl-workspace/sglang/Mooncake/extern/pybind11/include/pybind11/cast.h:1631
#18 pybind11::detail::argument_loader<mooncake::MooncakeStorePyWrapper&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<unsigned long, std::allocator<unsigned long> > const&, mooncake::ReplicateConfig const&>::call<std::vector<int>, pybind11::detail::void_ty
pe, mooncake::pybind11_init_store(pybind11::module_&)::<lambda(mooncake::MooncakeStorePyWrapper&, const std::vector<std::__cxx11::basic_string<char> >&, const std::vector<long unsigned int>&, const std::vector<long unsigned int>&, const mooncake::ReplicateConfig&)>&> (f=..., this=0x7f477bffd180) at /sgl-workspace/sglang/Mooncake/extern/pybind11/include/pybind11/cast.h:1600
#19 operator() (__closure=0x0, call=...) at /sgl-workspace/sglang/Mooncake/extern/pybind11/include/pybind11/pybind11.h:278
#20 _FUN () at /sgl-workspace/sglang/Mooncake/extern/pybind11/include/pybind11/pybind11.h:249
#21 0x00007f678c7c3299 in pybind11::cpp_function::dispatcher (self=<optimized out>, args_in=0x7f6776960220, kwargs_in=0x0) at /sgl-workspace/sglang/Mooncake/extern/pybind11/include/pybind11/pybind11.h:971
#22 0x000000000056ac3d in ?? ()
#23 0x000000000053d2ab in _PyObject_MakeTpCall ()
#24 0x0000000000547f11 in _PyEval_EvalFrameDefault ()
#25 0x00000000005975fd in ?? ()
#26 0x00000000005971c6 in ?? ()
#27 0x000000000054cc4c in _PyEval_EvalFrameDefault ()
#28 0x00000000005975fd in ?? ()
#29 0x00000000005971c6 in ?? ()
#30 0x000000000054cc4c in _PyEval_EvalFrameDefault ()
#31 0x000000000053fb08 in _PyObject_FastCallDictTstate ()
#32 0x000000000057c2f9 in _PyObject_Call_Prepend ()
#33 0x000000000066735d in ?? ()
#34 0x000000000057ef23 in _PyObject_Call ()
#35 0x00000000006a5709 in ?? ()
#36 0x00000000006a56b8 in ?? ()
#37 0x00007f6e21572ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#38 0x00007f6e21603a04 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100
(gdb) frame 4
#4  0x00007f678c7d128b in asio::detail::throw_error (location=0x7f678c9ba7d7 "connect", err=std::error_code = {std::_V2::error_category: 99}) at /usr/local/include/asio/detail/throw_error.hpp:50
50          do_throw_error(err, location ASIO_SOURCE_LOCATION_ARG);
(gdb) location
Undefined command: "location".  Try "help".
(gdb) list
45          const asio::error_code& err,
46          const char* location
47          ASIO_SOURCE_LOCATION_DEFAULTED_PARAM)
48      {
49        if (err)
50          do_throw_error(err, location ASIO_SOURCE_LOCATION_ARG);
51      }
52
53      } // namespace detail
54      } // namespace asio
(gdb) frame 3
#3  0x00007f678c7d7448 in asio::detail::do_throw_error (err=std::error_code = {std::_V2::error_category: 99}, location=location@entry=0x7f678c9ba7d7 "connect") at /usr/local/include/asio/detail/impl/throw_error.ipp:63
63      }
(gdb) list
58        // boostify: non-boost code starts here
59      #endif // defined(ASIO_MSVC)
60             //   && defined(ASIO_HAS_STD_SYSTEM_ERROR)
61             //   && (_MSC_VER < 1800)
62        // boostify: non-boost code ends here
63      }
64
65      } // namespace detail
66      } // namespace asio
67

Given that the crash happens inside asio while it is throwing an error, I'd assume a memory leak somewhere else. For now I'll just switch fully to RDMA, as it's probably faster anyway, but if you want to debug this further I'm happy to help.
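For reference, the workaround described here is a one-field change in the store config from the PR description. Assuming the same `mooncake_store.json` layout, switching the transport from TCP to RDMA looks like:

```json
{
    "local_hostname": "localhost",
    "metadata_server": "P2PHANDSHAKE",
    "master_server_address": "127.0.0.1:50051",
    "protocol": "rdma"
}
```

With `"protocol": "rdma"`, the transfer engine needs usable RDMA NICs (the log above shows "Found 0 HCAs" on a TCP-only host, in which case this setting cannot take effect).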

yuan-luo pushed a commit to antgroup/sglang that referenced this pull request Sep 18, 2025
Merge branch 'sglang_public_tracker of git@code.alipay.com:Theta/SGLang.git into main

https://code.alipay.com/Theta/SGLang/pull_requests/192


Reviewed-by: 得泽 <zhangkaihong.zkh@antgroup.com>


* fix duplicate args in schedule_batch (sgl-project#7816)
* [AMD] Fail gracefully when AITER is unavailable gfx90a GPUs (sgl-project#7187)
* docs: update README (sgl-project#7821)
* [theta] add py-spy deps
* feat: support DeepSeek-R1-W4AFP8 model with ep-moe mode (sgl-project#7762)
* Enable ModelOpt Llama4 fp8 checkpoint deployment in SGLang (sgl-project#7129)
* [Minor] Fix sporadic CI timeout caused by underestimated tests. (sgl-project#7850)
* [Bugfix] Fix two batch overlap with auto DeepEP Dispatch (sgl-project#7853)
* Fix cache modules of triton import error (sgl-project#7832)
* [router] forward stream_options in request (sgl-project#7860)
* Fix illegal memory in trtllm allreduce fusion (sgl-project#7864)
* Fix llama4 vision (sgl-project#7840)
* Support Mimo-VL (sgl-project#7579)
* fix: Handles input_embeds in GenerateReqInput when n>1 (sgl-project#7830)
* [Multimodal][Perf] Use `pybase64` instead of `base64` (sgl-project#7724)
* Bump xgrammar's version to 0.1.20 (sgl-project#7866)
* [CPU]convert topk_weights to fp32 for INT8 and FP8 paths (for llama4) and fix LmHead weight pack (sgl-project#7818)
* [PD] Add guidance for prefill bootstrap timeout (sgl-project#7846)
* Update native_api doc to match the change in the `get_model_info` endpoint (sgl-project#7660)
* Revert "Embedding parallel by attn_tp (sgl-project#7623)" (sgl-project#7880)
* chore: bump v0.4.9.post1 (sgl-project#7882)
* Fixes typo in assertion message (sgl-project#7895)
* [CI] Add deepep tests to CI (sgl-project#7872)
* [CPU] [FP8] set SGLANG_CPU_FP8_CVT_FTZ in CMakeLists.txt (sgl-project#7885)
* [CPU][Qwen3 MoE] Enable fused_topk CPU fusion and enhance FP8 TP padding (sgl-project#7838)
* Remove unused imports (sgl-project#7898)
* [router] Update metrics when request completes (sgl-project#7899)
* [feature] Add start step profile argument in /start_profile (sgl-project#7608)
* [bugfix] add pd router policy validation (sgl-project#7904)
* vlm: support video as an input modality (sgl-project#5888)
* Feat: Support Phi-3.5-MoE in SGLang (sgl-project#7907)
* add sentencepiece as dependency explicitly (sgl-project#7922)
* Fix bug of deepseek-v3 under DP+EP mode with large batchsize/seqlen (sgl-project#6449)
* [feature]Ascend quantization support (sgl-project#7791)
* [ready b200] fuse allreduce+add_rmsnorm in prepare_attention + mlp module (sgl-project#7775)
* Support Kimi K2 (sgl-project#7940)
* [feature] kv transfer support of ascend npu (sgl-project#7795)
* fix: minor fix for modelopt weight load compatibility (sgl-project#7953)
* temporarily disable deepep-8-gpu and activate two small tests (sgl-project#7961)
* [fix]Update unitest for fp8_blockwise_scaled_grouped_mm kernel (sgl-project#7932)
* chore: bump sgl-kernel v0.2.5 (sgl-project#7964)
* Revert "[PD Disaggregation] replace transfer with batch transfer for better performance (sgl-project#7236)" (sgl-project#7968)
* chore: upgrade xgrammar 0.1.21 (sgl-project#7962)
* delete uselese code caused by fuse allreduce+add_rmsnorm pr (sgl-project#7970)
* Fix wrong gemm branch cause 250us slower (sgl-project#7969)
* [router] add worker abstraction (sgl-project#7960)
* chore: upgrade sgl-kernel 0.2.5 (sgl-project#7971)
* chore: bump v0.4.9.post2 (sgl-project#7963)
* [minor fix] llama4 hybrid memory (sgl-project#7950)
* [minor fix] SWA missing methods (sgl-project#7972)
* [script] update loogle test (sgl-project#7975)
* perf: add kimi k2 fused_moe tuning config for h20_3e
* [theta] perf: add kimi k2 fused_moe tuning config for h200
* [minor fix] SWA missing methods (sgl-project#7972)
* [script] update loogle test (sgl-project#7975)
* perf: add kimi k2 fused_moe tuning config for h30_3e
* docs: update README (sgl-project#7985)
* Overlap the gating function with shared experts in DeepSeek (sgl-project#7978)
* [BugFix] fix pre_reorder_triton_kernel default int32 issue (sgl-project#7814)
* [minor] Add server_args check for Llama4 with hybrid (sgl-project#7988)
* Tiny fix mooncake log warning wrong output (sgl-project#7952)
* [BugFix] add verify logit_bias to avoid crash because of IndexError  (sgl-project#7749)
* SWA Prefix Cache (sgl-project#7367)
* chore: remove unnecessary limits on quantization methods in test script (sgl-project#7997)
* Refactor dynamic LoRA update to fix incorrect handling of variant weight shapes (sgl-project#7844)
* Support for Phi-1.5 & Phi-2 models (sgl-project#7862)
* [Dockerfile] Multi-arch support for ROCm (sgl-project#7902)
* [CPU] fix no attribute 'can_fuse_mlp_allreduce' error (sgl-project#8010)
* perf: add kimi k2 fused_moe tuning config for h30_3e (sgl-project#8021)
* [ci] CI supports use cached models (sgl-project#7874)
* [Minor] Remove redundant print (sgl-project#8005)
* [Feature]TP Group Switching for PD-Multiplexing (sgl-project#7653)
* [Feature] CUDA Green Context Support (sgl-project#7649)
* Fix flaky CI: test_vlm_models (sgl-project#8006)
* Fix Bug 'get_cpu_copy not Implemented' in pd offloading mode (sgl-project#7982)
* prevent server crash from potential invalid grammar (sgl-project#7897)
* Setup workflow for releasing mi300x and mi350x dockers. (sgl-project#8035)
* fix: modality length mismatch with image_data (sgl-project#7887)
* Update CODEOWNERS (sgl-project#8044)
* perf: add qwen3-30b-a3b fused moe tuning config for h20
* [feat]Support fusion kernel for constructing quant input and scale factor for fp8_blockwise_scaled_grouped_mm (sgl-project#8023)
* feat: update multimodal data handling in engine entrypoint (sgl-project#8002)
* fix: remove redundant rotary embedding cache recomputation in MiniCPM (sgl-project#8022)
* Fix the input tools format and history tool_calls in OpenAI API  (sgl-project#6556)
* fix: resolve arm build issue (sgl-project#8052)
* concurrently load weights of DeepseekV2ForCausalLM (sgl-project#7943)
* H20 tune config for Kimi (sgl-project#8047)
* Update amd docker image. (sgl-project#8045)
* feat: replace Decord with video_reader-rs (sgl-project#5163)
* remove kv_a.congigous in DeepseekV2AttentionMLA (sgl-project#8058)
* update transformers to 4.53.2 (sgl-project#8029)
* Fix different device type adjustment in PP (sgl-project#7760)
* Use device_group for all_gather when disabling overlap scheduling (sgl-project#8001)
* Revert "feat: replace Decord with video_reader-rs" (sgl-project#8077)
* Fix CI xeon test with triton 3.3.1 (sgl-project#8086)
* fix greenctx stream compability (sgl-project#8090)
* [misc] update nvshmem and pin deepEP commit hash (sgl-project#8098)
* [Feature] Layer-wise Prefill (sgl-project#7634)
* [1/n] chore: decouple quantization implementation from vLLM dependency (sgl-project#7992)
* refactor: unify names of the feature field of MultimodalDataItem (sgl-project#8075)
* feat: add tp_rank, pp_rank and dp_rank labels for scheduler metrics (sgl-project#7597)
* [ci] limit cmake build nproc (sgl-project#8100)
* [ci] disable memory imbalance check for draft worker (sgl-project#8108)
* [Fix] ensure DeepGEMM is only enabled for FP8_W8A8 models (sgl-project#8110)
* [ci] recover 8-gpu deepep test (sgl-project#8105)
* Refactor: move all quantization-related code to `srt/layer/quantization` (sgl-project#7989)
* [kernel] opt moe align block kernel by block/warp scan algorithm (sgl-project#7884)
* Super tiny fix typo (sgl-project#8046)
* fix: update HostKVCache init to report correct msg when available memory is not enough (sgl-project#8102)
* [Hunyuan]: Fix Dense Model Support (sgl-project#8117)
* feat: add production metric for retracted requests due to insufficient kvcache (sgl-project#7030)
* refactor: simply MultimodalTokens logic (sgl-project#7924)
* [Fix][Ready]Fix register spilling in cutlass nvfp4 gemm kernel on Blackwell (sgl-project#8127)
* Feat: Support Granite 3.0 MoE in SGLang (sgl-project#7959)
* load draft model fix (sgl-project#7506)
* [CPU][Llama4] Fix Llama4 MoE inputs with "apply_router_weight_on_input"  (sgl-project#7889)
* [Quantization][w8a8_int8] Fix weight loading issue for w8a8_int8 path with "ignore" layer list in quantization config (sgl-project#7820)
* Hicache Storage Layer Prototype (sgl-project#7704)
* Revert "Fix different device type adjustment in PP" (sgl-project#8141)
* feat: enchance green context stream creation robust with backward compatibility (sgl-project#8136)
* fix compressed tensors WNA16 imports (sgl-project#8142)
* [Bugfix] Fix w8a8_int8 import error on NPU (sgl-project#8147)
* [3/n] chore: decouple AWQ implementation from vLLM dependency (sgl-project#8113)
* [router] Refactor router and policy traits with dependency injection (sgl-project#7987)
* [AMD] Add triton awq_dequantize kernel to support AWQ on ROCm (sgl-project#7661)
* [Doc] Steps to add a new attention backend (sgl-project#8155)
* chore: tune mem fraction static for vlm (sgl-project#6881)
* Support NVFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs (sgl-project#7302)
* Feat: Support audio in Phi4-mm model (sgl-project#8048)
* [PD] Support non-MLA models PD different TP with DP attention (sgl-project#7931)
* [health_generate] fix: fix the /health_generate always success bug (sgl-project#8028)
* [router] router metrics cleanup (sgl-project#8158)
* [router] allow router to have empty workers (sgl-project#8160)
* Add GB200 wide-EP docker (sgl-project#8157)
* [1/N] MoE Refactor: refactor `select_experts` (sgl-project#7966)
* chore: bump sgl-kernel v0.2.6 (sgl-project#8165)
* chore: upgrade sgl-kernel 0.2.6 (sgl-project#8166)
* [theta] sync bailing
* Fix suffix mismatch for the metrics. (sgl-project#8168)
* Update README.md (sgl-project#8171)
* Clean up server args (sgl-project#8161)
* Fix LoRA buffer contamination during adapter eviction (sgl-project#8103)
* Fix Dockerfile.gb200 (sgl-project#8169)
* [router] add ut for worker and errors (sgl-project#8170)
* bugfix: fix sglang crash in NVIDIA MIG container (sgl-project#8167)
* Support start up LoRA server without initial adapters (sgl-project#8019)
* Clean warning logs for gate_proj loading in Lora (sgl-project#8172)
* Fix tuning_fused_moe_triton.py (sgl-project#8175)
* [Feature] Simple Improve Health Check Mechanism for Production-Grade Stability (sgl-project#8115)
* Add bf16 output option for dsv3_router_gemm kernel (sgl-project#7999)
* Enable FlashInfer support encoder models and add head_dim padding workaround (sgl-project#6230)
* Add get_hidden_dim to qwen3.py for correct lora (sgl-project#7312)
* feat: add h200 tp 16 kimi k2 moe config (sgl-project#8176)
* feat: add b200 tp 16 kimi k2 moe config (sgl-project#8178)
* fix moe gate dtype, fix tbo, fix fake dispatch (sgl-project#7825)
* Revert "[Feature] Simple Improve Health Check Mechanism for Production-Grade Stability" (sgl-project#8181)
* feat: update nccl 2.27.6 (sgl-project#8182)
* Feat: Support for Persimmon Model (sgl-project#7983)
* feat: add h200 tp 16 kimi k2 moe config (sgl-project#8183)
* Fix eagle3 cuda graph (sgl-project#8163)
* fix: fix the bug of loading Internvl3 (sgl-project#8067)
* Fix dtype error in CI (sgl-project#8197)
* Cherry-pick commit 2dc5de40 "perf: add bailing mo..." into the current branch
* [router] add ut for pd request, metrics and config (sgl-project#8184)
* [feature] enable NPU CI (sgl-project#7935)
* [fix] fix modelopt fp4 on b200 (sgl-project#8195)
* chore: bump sgl-kernel v0.2.6.post1 (sgl-project#8200)
* Apply fused sorted token ids padding (sgl-project#8193)
* [Refactor] simplify multimodal data processing (sgl-project#8107)
* [theta] feat vl name
* [router] add ut for pd router (sgl-project#8208)
* [router] upgrade router version to 0.1.6 (sgl-project#8209)
* Remove router gemm output dtype conversion (sgl-project#8204)
* chore: upgrade sgl-kernel 0.2.6.post1 (sgl-project#8202)
* [Feature] Add a test for Layer-wise Prefill (sgl-project#8231)
* docs: update 2025 h2 roadmap (sgl-project#8237)
* fix: retrieve mm token by modality, raise error if none (sgl-project#8221)
* [AMD] Remove vllm's scaled_fp8_quant and moe_sum when SGLANG_USE_AITER=1 (sgl-project#7484)
* [theta] tune h20 config for qwen3 235b
* [theta] tune h20 config for qwen3 235b
* fix: sgl-router remove dead code (sgl-project#8257)
* [fix] benchmark : routed_scaling_factor is None (sgl-project#8059)
* [Benchmark] add disable-auto-run param for hicache/bench_multiturn (sgl-project#7822)
* Preliminary Support for Qwen3XMLDetector (sgl-project#8260)
* chore: bump v0.4.9.post3 (sgl-project#8265)
* PullRequest: 178 perf: add qwen235b h20-3e fused moe kernel config
* [theta] tune h20 config for qwen3 480b
* Skip llama4 vision module loading when multimodal disabled (sgl-project#8272)
* PullRequest: 180 Add Fused MoE Triton configs for Qwen480B and Qwen235B on NVIDIA H20-3e
* Fix sgl-kernel ci test (sgl-project#8284)
* [theta] tune h200 config for qwen3 480b
* Introduce Stable LoRA ID System for Overlapped Updates and Prefix Caching (sgl-project#8261)
* Hicache IO kernel refactoring (sgl-project#8264)
* bug fix and tag (sgl-project#8282)
* HiCache Fix (sgl-project#8288)
* [sgl-kernel] Opt per_token_quant_fp8 with warp reduce (sgl-project#8130)
* [router] add common ut infra to mock worker and app (sgl-project#8295)
* fix: workaround for deepgemm warmup issue (sgl-project#8302)
* [Performance][PD Disaggregation] optimize TokenToKVPoolAllocator by sorting free pages (sgl-project#8133)
* Fix the issue of incorrect finish reason in final stream response chunk returned during tool call (sgl-project#7708)
* fix: match chat-template for internvl3 (sgl-project#8262)
* Fix gemma3n with hybrid swa (sgl-project#8240)
* chore: upgrade sgl-kernel 0.2.7 (sgl-project#8304)
* fix: prevent crashes due to logit bias dimension mismatch (sgl-project#7685)
* feat(function call): complete utility method for KimiK2Detector and enhance documentation (sgl-project#8043)
* Fix incomplete tool call capture issue in streaming response of DeepSeek-V3 when enable MTP  (sgl-project#7562)
* [AMD] Pull latest image for AMD CI (sgl-project#8070)
* Pin the version of petit kernel to fix the APIs (sgl-project#8235)
* [bug] fix pd completion protocol for batching support (sgl-project#8317)
* [router] fix pd model completion request (sgl-project#8303)
* fix bug when eos_ids==0 (sgl-project#8315)
* [router] add endpoint unit test (sgl-project#8298)
* [code style] Clean dead triton kernel code in fused_moe and useless vllm_ops import (sgl-project#8310)
* chore: upgrade flashinfer v0.2.9rc1 (sgl-project#8301)
* [router] add streaming unit test (sgl-project#8299)
* [router] add request format unit test (sgl-project#8300)
* HiCache Storage TP Refinement (sgl-project#8307)
* breakdown kernel update (sgl-project#8334)
* support idle batch for TBO (sgl-project#8233)
* [Feature] Integrate quick allreduce and select the best allreduce implementation (sgl-project#6619)
* DP Enhancement (sgl-project#8280)
* fix: Fix failed functional tests https://github.com/meta-llama/llama-stack-evals (sgl-project#8266)
* [AMD] Add silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, and gelu_quick kernels for AMD GPUs (sgl-project#7135)
* [CPU] Add tutorial docs for SGL on CPU (sgl-project#8000)
* chore: upgrade mooncake 0.3.5 (sgl-project#8341)
* [torch.compile bug] avoid biased_grouped_topk_impl func repeatedly triggering `torch.compile` in forward pass (sgl-project#8353)
* [P/D] Support ipv6 in P/D scenario (sgl-project#7858)
* Add H20-3e fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct (sgl-project#8344)
* [Bugfix][Feat] Add XML-ish grammar in EBNFComposer and fix misc bugs in Qwen3 detector (sgl-project#8357)
* Clean up server_args, triton cache manager (sgl-project#8332)
* fix: upgrade nccl version (sgl-project#8359)
* [Feat] Add reasoning parser for Qwen/Qwen3-235B-A22B-Thinking-2507 (sgl-project#8363)
* fix: kimi k2 xgrammar crash (sgl-project#8367)
* Fix FP4 MoE accuracy from missing routed_scaling_factor (sgl-project#8333)
* [CI] Fix flaky threshold (sgl-project#8370)
* chore: bump v0.4.9.post4 (sgl-project#8305)
* Fix test_moe_fused_gate_combined sgl-kernel ci test (sgl-project#8374)
* Update Dockerfile.gb200 to latest sglang (sgl-project#8356)
* chore: improve mmmu benchmark (sgl-project#7000)
* Save peak memory in logits processor (sgl-project#8343)
* Extract update_weights from RL Engine to SGLang to keep simplicity and fix torch reduce (sgl-project#8267)
* chore: improvements on mm_utils (sgl-project#7737)
* vlm: optimize tensor transport (sgl-project#6003)
* Tiny assert EPLB is used together with expert parallel (sgl-project#8381)
* model: support intern-s1 (sgl-project#8350)
* Add perf tests for LoRA (sgl-project#8314)
* Remove slot usage in code to be backward-compatible with python 3.9 (sgl-project#8396)
* Add docker release flow for gb200 (sgl-project#8394)
* HiCache, check before terminate prefetching (sgl-project#8372)
* Add nvfp4 scaled mm benchmark. (sgl-project#8401)
* Urgent Fix: intern-s1 chat-template matching (sgl-project#8403)
* Tool to dump and compare internal activation tensors (sgl-project#7976)
* Minor tool for comparison of benchmark results (sgl-project#7974)
* Fix bench script making input data on L2 cache (sgl-project#7739)
* [NVIDIA] Add Flashinfer MoE blockscale fp8 backend (sgl-project#8036)
* Update Cutlass in sgl-kernel to v4.1 (sgl-project#8392)
* fix: minor fix TransportProxyTensor under tp (sgl-project#8382)
* [router] add different policies for p node and d node (sgl-project#8395)
* Add A800 fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct (sgl-project#8351)
* fix: fix the missing metrics on non-rank0 nodes (sgl-project#7720)
* [2/N] MoE Refactor: Unify weight loader and quant methods (sgl-project#8397)
* Use FlashInfer FP4 gemm. (sgl-project#8241)
* Support precomputed_embeddings for Llama 4 (sgl-project#8156)
* [hotfix] fix merge conflicts in FlashInferEPMoE (sgl-project#8405)
* chore: update CODEOWNERS (sgl-project#8407)
* chore: upgrade flashinfer v0.2.9rc2 (sgl-project#8406)
* Support triton kernels v3.4.0 for fused_moe (sgl-project#8258)
* [Bugfix] Prevent PD server crash from invalid grammar (sgl-project#8062)
* Change to use native arm runner (sgl-project#8414)
* Support overlapped lora updates  (sgl-project#8213)
* Support ue8m0 for triton quant kernel (sgl-project#7603)
* Fix: Improve test_openai_function_calling unit test and fix reasoning_parser.py think_start_token logic (sgl-project#8316)
* bugfix: Fix multiple finish_reason chunks and tool_calls finish reason check (sgl-project#8417)
* Fix test_openai_server (sgl-project#8419)
* Fix docker buildx push error (sgl-project#8425)
* bugfix: Fix XGrammar backend to use model's EOS tokens for constrained generation (sgl-project#8422)
* [router] improve router logs and request id header (sgl-project#8415)
* [feat] Support different attention backends for prefill and decode  (sgl-project#6338)
* chore: bump transformer to 4.54.0 (sgl-project#8416)
* [PD] Fix abort_request for PD disaggregation (sgl-project#8352)
* GLM-4.5 Model Support (sgl-project#8224)
* Remove zstd compression for building Dockerfile.gb200 (sgl-project#8442)
* doc: add bench_one_batch_server in the benchmark doc (sgl-project#8441)
* GLM-4.5 Model Support Follow-up (sgl-project#8445)
* fix GLM4_MOE launch with compressed_tensor quant model (sgl-project#8456)
* Fix per_token_group_quant_8bit when hidden_dim // group_size is not divided by 4. (sgl-project#8449)
* Revert "[kernel] opt moe align block kernel by block/warp scan algorithm" (sgl-project#8457)
* chore: bump v0.4.9.post5 (sgl-project#8458)
* fix: reorder topk experts to ensure shared expert replaces minimal score (sgl-project#8125)
* perf: add kimi k2 h200 fused moe config (extracted from theta-asap-sglang-049)
* Cherry-pick commit 4a75e015 "Add draft model fuse..." into the current branch
* Update PR template (sgl-project#8465)
* feat: throttle requests at scheduler based on --max_queued_requests (sgl-project#7565)
* [theta] tuning script for glm4 moe
* perf: add fused moe kernel config glm4.5,h20-3e,tp8
* [theta] tuning script for glm4 moe h20
* fix: update dep (sgl-project#8467)
* [NVIDIA] Change to use `num_local_experts` (sgl-project#8453)
* Fix parsing ChatCompletionMessage (sgl-project#7273)
* [3/N] MoE Refactor: Simplify DeepEP Output (sgl-project#8421)
* feat: support glm4 tuning (sgl-project#8473)
* Fix DEEPEP BF16 compatibility for Deepseek Style model like GLM 4.5 (sgl-project#8469)
* Update codeowner (sgl-project#8476)
* chore: add glm4 fp8 tp8 config (sgl-project#8478)
* chore: add glm 4.5 fp8 tp4 config (sgl-project#8480)
* [CI]Add genai-bench Performance Validation for PD Router (sgl-project#8477)
* Update CODEOWNERS (sgl-project#8485)
* Rename the last step in pr-test.yml as pr-test-finish (sgl-project#8486)
* Reduce memory usage for fp4 moe (sgl-project#8413)
* Tiny add warnings for DeepEP when it is suboptimal (sgl-project#8426)
* Support colocating requests (sgl-project#7973)
* Fix incorrect KV cache allocation for MTP models. (sgl-project#8482)
* Add PVC and update resource limits in k8s config (sgl-project#8489)
* chore: bump v0.4.9.post6 (sgl-project#8517)
* Always trigger pr-test (sgl-project#8527)
* Update README.md (sgl-project#8528)
* [sgl-kernel performance] fix fp8 quant kernels dispatch __nv_fp8_e4m3 bug to improve performance 10%-20% (sgl-project#8499)
* Update cutlass_moe.py (sgl-project#8535)
* Fix moe align kernel test (sgl-project#8531)
* Split the scheduler into multiple mixin classes to reduce the file size (sgl-project#8483)
* bring back kimi vl ci (sgl-project#8537)
* fix: temporarily disable cuda-ipc for mm data tensor (sgl-project#8431)
* Support EPLB in FusedMoE (sgl-project#8448)
* feat(hicache): support file backend reading directory config from env. (sgl-project#8498)
* feature(pd-hicache): Prefill instances support reusing the RemoteStorage Cache via HiCache. (sgl-project#8516)
* [router] allow longer time out for router e2e (sgl-project#8560)
* Update cutlass_moe.py (sgl-project#8545)
* Update CODEOWNERS (sgl-project#8562)
* [feature] [sgl-router] Add a dp-aware routing strategy (sgl-project#6869)
* [Hot-Fix] moe_aligned_block_size CI failed in AMD (sgl-project#8461)
* Cherry-pick commit 4fdc06a9 "add fp8a8 kimi-k2 dr..." into the current branch
* [Model] Add support for Arcee Foundational Model (sgl-project#8154)
* Revert "Fix the input tools format and history tool_calls in OpenAI API  (sgl-project#6556)" (sgl-project#8584)
* Add hf3fs support for hicache storage (based on sgl-project#7704) (sgl-project#7280)
* [router] migrate router from actix to axum (sgl-project#8479)
* [Fix]Fix index oob in get_group_gemm_starts kernel. (sgl-project#8564)
* Bump transfomers to 4.54.1 to fix Gemma cache issue. (sgl-project#8541)
* Add GKE's default CUDA runtime lib location to PATH and LD_LIBRARY_PATH. (sgl-project#8544)
* Bug: Fix google gemma3n-mm audio input not working bug (sgl-project#8365)
* update sgl-kernel for EP: kernel part  (sgl-project#8514)
* chore: bump sgl-kernel v0.2.8 (sgl-project#8599)
* [bugfix] Fix 2 minor bugs in the hicache storage layer (sgl-project#8404)
* fix incorrect increase of hit count (sgl-project#8533)
* Support l3 cache (mooncake store) for hiradix cache (sgl-project#7211)
* [theta] Conditionally import HiCacheHF3FS sgl-project#8598
* update sgl-kernel for EP: python part (sgl-project#8550)
* add SVG logo (sgl-project#8603)
* [4/N] MoE Refactor: Unified Triton Kernel for FusedMoE and EPMoE (sgl-project#8515)
* fix: fork should not run pypi router (sgl-project#8604)
* model: support Step3V (sgl-project#8583)
* [Feature] Hybrid EP and TP (sgl-project#8590)
* chore: bump v0.4.10 (sgl-project#8608)
* [PD] Use batch transfer for rdma transport and add notes for mnnvl usage (sgl-project#8595)
* [bugfix] Qwen-1M context support[2/3]: use the current cuda stream in the DCA's kernel. (sgl-project#8611)
* Fix hf3fs_fuse import error (sgl-project#8623)
* Update step3v default config (sgl-project#8626)
* [ci] fix genai-bench execution cmd (sgl-project#8629)
* [router] update router pypi version (sgl-project#8628)
* [Optimization][Perf] Disable the GC during CUDA graph capture to speed up by up to 3x (sgl-project#8577)
* Fix typos in py_test/test_launch_server.py (sgl-project#6227)
* misc: Remove debug print to logger.info (sgl-project#8633)
* SGLang HiCache NIXL Connector (sgl-project#8488)
* [bug] remove pdlb from minilb since it's no longer available (sgl-project#8634)
* [bugfix] Fix flashinfer cutlass EP moe after MoE refactor (sgl-project#8630)
* Conditionally import HiCacheHF3FS (sgl-project#8598)
* TRTLLM Gen MLA Decode Kernel Integration (same as sgl-project#7938) (sgl-project#8632)
* Fix nan value generated after custom all reduce (sgl-project#8532)
* Revert "Fix nan value generated after custom all reduce (sgl-project#8532)" (sgl-project#8642)
* Feature/modelscope model download (sgl-project#8083)
* chore: speedup NPU CI by cache (sgl-project#8270)
* [Bugfix] fix w8a8_int8 load issue (sgl-project#8308)
* [bugfix] fix router python parser for pd urls (sgl-project#8644)
* [router] add basic usage doc (sgl-project#8640)
* [router] upgrade router version to 0.1.8 (sgl-project#8645)
* [NVIDIA] Enable Flashinfer MoE blockscale fp8 backend for TP MoE (sgl-project#8450)
* HiCache, fixing hash value indexing (sgl-project#8636)
* Interface change for kvcache io to support page first layout (sgl-project#8318)
* Update batch size limitation of dsv3_router_gemm kernel to 16 (sgl-project#8051)
* chore: bump v0.4.10.post1 (sgl-project#8652)
* Add hf3fs_utils.cpp to package-data (sgl-project#8653)
* Fix chat template handling for OpenAI serving (sgl-project#8635)
* Bug: apply final_hidden_states*=self.routed_scaling_factor at MoE lay… (sgl-project#8511)
* [5/N] MoE Refactor: Update MoE parallelism arguments (sgl-project#8658)
* Increase tolerance to address CI failures (sgl-project#8643)
* [Kimi K2] dsv3_router_gemm supports NUM_EXPERTS == 384 (sgl-project#8013)
* [DOC] Update sgl-kernel README (sgl-project#8665)
* fix per token cuda kernel hidden dim cannot divide by 16 (sgl-project#8543)
* fix arg typo for --disaggregation-transfer-backend (sgl-project#8664)
* [fix] fix pd disagg error of vlms (sgl-project#8094)
* Disable tp for shared experts under expert parallelism for GLM4.5 model (sgl-project#8647)
* [bugfix] Fix page size for create_flashmla_kv_indices_triton() for cutlass mla (sgl-project#8685)
* [bug] limit bootstrap room to [0, 2^63 - 1] (sgl-project#8684)
* Update CODEOWNERS (sgl-project#8686)
* Fix deepgemm masked grouped gemm jit compile (sgl-project#8679)
* Fix FP8 block quantization when N or K is not multiples of 128 (sgl-project#8648)
* bugfix(hicache): Fix 'MooncakeStore' not defined error. (sgl-project#8668)
* upgrade xgrammar 0.1.22 (sgl-project#8522)
* [bugfix] Add 'disaggregation_mode' parameter to warmup function when compile deep_gemm manually (sgl-project#8618)
* Add support for NCCL symmetric memory for TP allreduces (sgl-project#8238)
* [1/2] sgl-kernel: Fuse routed scaling factor into select_experts (sgl-project#8364)
* chore(gb200): update dockerfile to handle fp4 disaggregation (sgl-project#8694)
* [bugfix] Apply routed scaling factor to cutlass_fused_experts_fp8 (sgl-project#8688)
* Fix: resolve prefill of retracted request out-of-memory issue when ignore_eos is enabled (sgl-project#7434)
* model: adapt mllama4 to VisionAttention (sgl-project#8512)
* Add tensor.detach() back to update weight util (sgl-project#8691)
* [Doc] Polish sgl-kernel readme for cu126 build error (sgl-project#8704)
* [theta] merge 0802-3
* Revert "[1/2] sgl-kernel: Fuse routed scaling factor into select_experts" (sgl-project#8706)
* [router] minor code clean-up and refactoring (sgl-project#8711)
* [Bug] fix green context's incompatibility with `cuda < 12.4` (sgl-project#8701)
* chore: bump sgl-kernel v0.2.9 (sgl-project#8713)
* Remove assertions about per group quant fp8 (sgl-project#8717)
* [FIX] Fix the nightly CI by disabling swa mem pool for gemma2 (sgl-project#8693)
* Fix triton moe error caused by TopK refactor (sgl-project#8705)
* [router] Implement HTTP Dependency Injection Pattern for Router System (sgl-project#8714)
* [Feature] Radix Tree in C++ (sgl-project#7369)
* [Perf]Use Cooperative Schedule for H100 & H200 & H800 in fp8_blockwise_scaled_grouped_mm (sgl-project#8722)
* Fix fused MoE when `routed_scaling_factor is None` (sgl-project#8709)
* Tiny fix CI pytest error (sgl-project#8524)
* [hotfix] fix mixtral with tensor-level compressed-tensor quantization (sgl-project#8721)
* Support limiting max loaded loras in CPU. (sgl-project#8650)
* Reduce memory accumulation in long-running server (sgl-project#8306)
* HiCache storage, style change and bug fix (sgl-project#8719)
* [feat] support minimum token load balance in dp attention (sgl-project#7379)
* Do layernorm before allgather for DP attention (sgl-project#8631)
* [fix] Fix divide by zero error for llama4. (sgl-project#8683)
* feat: Add new moe triton for NVIDIA RTX 6000 Ada (sgl-project#8547)
* [Improvements] Merge health check route (sgl-project#8444)
* chore: bump sgl-kernel 0.3.0 with torch 2.8.0 (sgl-project#8718)
* Save cuda graph memory for fa3 (sgl-project#8567)
* [CUDA Graph] save cuda graph memory by using next_token_logits_buffer (sgl-project#8579)
* [DP] fix the compatibility issue between DP attention and `--attention-backend triton` (sgl-project#8723)
* chore: bump v0.4.10.post2 (sgl-project#8727)
* feat: Support DP Attention for step3_vl (sgl-project#8699)
* [RL] fix update weight for FusedMoE with EP (sgl-project#8676)
* use fp32 for e_score_correction_bias in GLM-4.5 (sgl-project#8729)
* Fix triton kernels topk with keyword arguments (sgl-project#8732)
* feat: support cutlass_moe_fp8 kernel for fusedmoe in sm90 (sgl-project#8678)
* Fix the missing 'lof' choice of --schedule-policy server args (sgl-project#7114)
* fix args typo in memory_pool_host (sgl-project#8662)
* [CI] Do not trigger pd-disaggregation CI in draft PR (sgl-project#8737)
* [MoE] Enable `renormalize=False` in Triton kernels (sgl-project#8735)
* Replace torch.jit.script with torch.compile in get_masked_input_and_mask to fix benchmark underreporting (sgl-project#8733)
* Fix bug of refactoring TopKOutput in w4afp8 (sgl-project#8745)
* Rename lora_path to lora_id in batches (sgl-project#8437)
* [sgl-kernel] avoid per_token_quant_fp8.cu hardcode sm_count (sgl-project#8738)
* [CI] Ascend NPU CI enhancement (sgl-project#8294)
* [bugfix] fix import path in HiCacheController (sgl-project#8749)
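The list above includes the merge of this PR's mooncake-store L3 backend (sgl-project#7211). A minimal launch sketch based on the config file and flags shown in the PR description; the master address, model path, and flag names (`--enable-hierarchical-cache`, `--enable-mooncake-store-l3-cache`) are taken from that description, and everything else is a placeholder to adjust for your cluster:

```shell
# Sketch only: write the mooncake store config, then point SGLang at it.
# Note the PR description's JSON had a trailing comma, which strict JSON
# parsers reject; it is removed here.
cat > mooncake_store.json <<'EOF'
{
    "local_hostname": "localhost",
    "metadata_server": "P2PHANDSHAKE",
    "master_server_address": "127.0.0.1:50051",
    "protocol": "rdma"
}
EOF

# Enable the hierarchical (host) cache plus the mooncake-store L3 backend.
export MOONCAKE_CONFIG_PATH=$PWD/mooncake_store.json
python -m sglang.launch_server \
    --model-path /path/to/DeepSeek-R1 \
    --tp-size 8 --page-size 64 \
    --enable-hierarchical-cache --hicache-ratio 2 \
    --enable-mooncake-store-l3-cache
```

Because mooncake store shares kv cache across sglang instances, several servers started this way against the same master can reuse each other's prefixes.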

Labels

high priority ready-to-merge The PR is ready to merge after the CI is green.
