[Feature] Performance regression and cache behavior change between sglang 0.5.5 and 0.5.6+ with Mooncake #16797

@dcosmos

Motivation

Title

Performance regression between sglang 0.5.5 and 0.5.6+ when using Mooncake

Description

I would like to report a performance difference observed across different versions of sglang when used together with Mooncake.

We are running an online service with sglang + Mooncake and applying sustained benchmark traffic.

Observed Behavior

sglang 0.5.5 + Mooncake 0.3.7

Under continuous benchmark workload:

  • TTFT is relatively low
  • Memory usage of the scheduler process keeps increasing over time

sglang 0.5.6 / 0.5.7 + Mooncake 0.3.7

With the Mooncake version unchanged:

  • Scheduler memory no longer shows continuous growth
  • TTFT increases significantly
  • Cache hit rate remains consistently low

The behavior observed in sglang 0.5.7 is consistent with 0.5.6.
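The scheduler memory growth described above could be quantified with something like the following minimal sketch. It is not part of the original report: it assumes a Linux host, that the scheduler process ID has been found separately (for example with `ps`), and the function names are hypothetical. It samples the resident set size from `/proc` and fits a simple least-squares slope.

```python
import time


def read_rss_kib(pid):
    """Read the resident set size (KiB) of a process from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    raise RuntimeError(f"VmRSS not found for pid {pid}")


def sample_rss(pid, interval_s=10.0, count=30):
    """Collect (elapsed_seconds, rss_kib) samples for one process."""
    start = time.monotonic()
    samples = []
    for _ in range(count):
        samples.append((time.monotonic() - start, read_rss_kib(pid)))
        time.sleep(interval_s)
    return samples


def growth_kib_per_min(samples):
    """Least-squares slope of (seconds, rss_kib) samples, in KiB per minute."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one timestamp")
    return num / den * 60.0
```

Running `growth_kib_per_min(sample_rss(scheduler_pid))` under the same benchmark traffic on each version would give a comparable growth figure; a slope near zero on 0.5.6+ and a clearly positive slope on 0.5.5 would match the behavior reported here.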

Reproduction Conditions

Environment Variables

export MOONCAKE_TE_META_DATA_SERVER="etcd://etcdn1.mooncake-c1.dns.org:2379;etcd://etcdn2.mooncake-c1.dns.org:2379;etcd://etcdn3.mooncake-c1.dns.org:2379;etcd://etcdn4.mooncake-c1.dns.org:2379;etcd://etcdn5.mooncake-c1.dns.org:2379"
export MOONCAKE_MASTER="etcd://etcdn1.mooncake-c1.dns.org:2379;etcd://etcdn2.mooncake-c1.dns.org:2379;etcd://etcdn3.mooncake-c1.dns.org:2379;etcd://etcdn4.mooncake-c1.dns.org:2379;etcd://etcdn5.mooncake-c1.dns.org:2379"
export MOONCAKE_PROTOCOL="tcp"
export MOONCAKE_DEVICE=""
export MOONCAKE_GLOBAL_SEGMENT_SIZE=0
nohup python3 -m sglang.launch_server \
      --model-path Qwen/Qwen1.5-1.8B-Chat \
      --trust-remote-code \
      --host 0.0.0.0 \
      --port 30000 \
      --mem-fraction-static 0.85 \
      --enable-hierarchical-cache \
      --hicache-size 100 \
      --page-size 64 \
      --hicache-io-backend kernel \
      --hicache-mem-layout page_first \
      --hicache-storage-backend mooncake &

Benchmark Results Comparison

Below are benchmark results collected under identical workload, environment, and configuration settings.

sglang 0.5.5 + Mooncake 0.3.7

Round 0:  Average TTFT = 0.29s, Cache Hit Rate = 0.000000 (48 requests)
Round 1:  Average TTFT = 0.29s, Cache Hit Rate = 0.495324 (48 requests)
Round 2:  Average TTFT = 0.40s, Cache Hit Rate = 0.664739 (48 requests)
Round 3:  Average TTFT = 0.41s, Cache Hit Rate = 0.744979 (48 requests)
Round 4:  Average TTFT = 0.51s, Cache Hit Rate = 0.715385 (48 requests)
Round 5:  Average TTFT = 0.93s, Cache Hit Rate = 0.557634 (48 requests)
Round 6:  Average TTFT = 1.20s, Cache Hit Rate = 0.599504 (48 requests)
Round 7:  Average TTFT = 0.98s, Cache Hit Rate = 0.634969 (48 requests)
Round 8:  Average TTFT = 0.61s, Cache Hit Rate = 0.888961 (48 requests)
Round 9:  Average TTFT = 0.54s, Cache Hit Rate = 0.899401 (48 requests)
Round 10: Average TTFT = 0.94s, Cache Hit Rate = 0.909295 (48 requests)
Round 11: Average TTFT = 1.05s, Cache Hit Rate = 0.916223 (48 requests)

Overall, cache hit rate trends upward: it dips to about 0.56 in Round 5, then recovers and reaches about 0.92 by Round 11. TTFT stays between 0.29s and 1.20s throughout.

sglang 0.5.6 + Mooncake 0.3.7

Round 0:  Average TTFT = 0.29s, Cache Hit Rate = 0.000000 (48 requests)
Round 1:  Average TTFT = 0.29s, Cache Hit Rate = 0.494886 (48 requests)
Round 2:  Average TTFT = 0.41s, Cache Hit Rate = 0.663523 (48 requests)
Round 3:  Average TTFT = 0.90s, Cache Hit Rate = 0.409318 (48 requests)
Round 4:  Average TTFT = 1.88s, Cache Hit Rate = 0.051335 (48 requests)
Round 5:  Average TTFT = 2.38s, Cache Hit Rate = 0.000000 (48 requests)
Round 6:  Average TTFT = 1.81s, Cache Hit Rate = 0.317494 (48 requests)
Round 7:  Average TTFT = 2.37s, Cache Hit Rate = 0.287596 (48 requests)
Round 8:  Average TTFT = 2.66s, Cache Hit Rate = 0.304387 (48 requests)
Round 9:  Average TTFT = 2.88s, Cache Hit Rate = 0.321815 (48 requests)
Round 10: Average TTFT = 3.33s, Cache Hit Rate = 0.310904 (48 requests)
Round 11: Average TTFT = 5.54s, Cache Hit Rate = 0.197224 (48 requests)

Compared to 0.5.5, TTFT rises from 0.90s in Round 3 to 5.54s in Round 11. Cache hit rate falls from about 0.66 in Round 2 to 0.00 in Round 5 and does not exceed about 0.32 in any later round.
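To compare the two runs numerically, the round lines can be parsed and summarized. The following is a minimal sketch, not the benchmark tool itself: the regular expression assumes the exact log format shown above, the helper names are hypothetical, and only Rounds 3 to 5 of each run are embedded as a sample.

```python
import re
from statistics import mean

# Matches lines such as:
# "Round 3:  Average TTFT = 0.41s, Cache Hit Rate = 0.744979 (48 requests)"
ROUND_RE = re.compile(
    r"Round\s+(\d+):\s+Average TTFT = ([\d.]+)s, Cache Hit Rate = ([\d.]+)"
)


def parse_rounds(log_text):
    """Return a list of (round, ttft_seconds, hit_rate) tuples."""
    return [
        (int(r), float(t), float(h))
        for r, t, h in ROUND_RE.findall(log_text)
    ]


def summarize(rounds, first_round=3):
    """Mean and max TTFT plus mean hit rate from first_round onward."""
    tail = [(t, h) for r, t, h in rounds if r >= first_round]
    return {
        "mean_ttft_s": mean(t for t, _ in tail),
        "max_ttft_s": max(t for t, _ in tail),
        "mean_hit_rate": mean(h for _, h in tail),
    }


# Sample: Rounds 3 to 5 copied from the results above.
LOG_055 = """
Round 3:  Average TTFT = 0.41s, Cache Hit Rate = 0.744979 (48 requests)
Round 4:  Average TTFT = 0.51s, Cache Hit Rate = 0.715385 (48 requests)
Round 5:  Average TTFT = 0.93s, Cache Hit Rate = 0.557634 (48 requests)
"""

LOG_056 = """
Round 3:  Average TTFT = 0.90s, Cache Hit Rate = 0.409318 (48 requests)
Round 4:  Average TTFT = 1.88s, Cache Hit Rate = 0.051335 (48 requests)
Round 5:  Average TTFT = 2.38s, Cache Hit Rate = 0.000000 (48 requests)
"""

for name, log in (("0.5.5", LOG_055), ("0.5.6", LOG_056)):
    print(name, summarize(parse_rounds(log)))
```

Pasting the full twelve rounds into each string gives per-version means over the same window, which makes the divergence from Round 3 onward easier to compare than reading the raw lines.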

Expected / Questions

I am opening this issue mainly to ask the following two questions:

1. Cause of behavioral differences

What causes the behavioral differences between sglang 0.5.5 and sglang 0.5.6+ in this setup?

In particular, are there any known changes related to the scheduler, hierarchical cache, or Mooncake integration that could explain:

  • the disappearance of scheduler memory growth, and
  • the significant increase in TTFT and drop in cache hit rate?

2. Configuration or tuning for newer versions

Is it possible to use sglang 0.5.6+ while achieving TTFT comparable to older versions (e.g., 0.5.5)?

If so, are there recommended configuration changes or tuning options that should be applied when upgrading?

Additional Context

  • The issue is observed under long-running, sustained benchmark traffic, not short or bursty tests.
  • The benchmark workload, runtime environment, and startup configuration are identical across all tested versions.

Related resources

No response

Metadata

Labels

bug (Something isn't working)
