
[Bug]: E0306 11:12:10.776882 116419 replica.h:281] Invalid replica type: DISK #1618

@WingEdge777

Bug Report

I was testing SGLang with HiCache + Mooncake L3 (DFS enabled).

The test deployment is an sgl router in front of two SGLang servers, each with HiCache + Mooncake L3 (plus persistent storage). I ran this test inside a Docker container.

Server setup:

# mooncake master
mooncake_master --enable_http_metadata_server=true --http_metadata_server_port=8080 --eviction_high_watermark_ratio=0.95  --root_fs_dir /mnt/data-cbs

export MOONCAKE_TE_META_DATA_SERVER="http://127.0.0.1:8080/metadata" 
export MOONCAKE_MASTER="127.0.0.1:50051" 
export MOONCAKE_PROTOCOL="tcp" 
export MOONCAKE_DEVICE="" 
export MOONCAKE_GLOBAL_SEGMENT_SIZE="16gb" # each sglang server contributes 16 GB

# sglang server
CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server --model-path Qwen3-8B --tp 1 --mem-fraction-static 0.6 --watchdog-timeout 1000 --host 0.0.0.0 --port 30001 --enable-hierarchical-cache --hicache-storage-backend mooncake
CUDA_VISIBLE_DEVICES=1 python -m sglang.launch_server --model-path Qwen3-8B --tp 1 --mem-fraction-static 0.6 --watchdog-timeout 1000 --host 0.0.0.0 --port 30002 --enable-hierarchical-cache --hicache-storage-backend mooncake

# router
python -m sglang_router.launch_router \
  --worker-urls http://127.0.0.1:30001 http://127.0.0.1:30002 \
  --policy round_robin \
  --host 0.0.0.0 --port 30000

benchmark clients:

# cd /sgl-workspace/sglang/benchmark/hicache
python bench_multiturn.py --model-path /root/workspace/data_dir/Qwen3-8B --host 127.0.0.1 --disable-auto-run --request-rate 16 --request-length 1024 --output-length 64

The master service error log (it seems to be reported while discarding replicas):

E0306 11:12:10.776895 116419 replica.h:281] Invalid replica type: DISK
E0306 11:12:10.776903 116419 replica.h:281] Invalid replica type: DISK
E0306 11:12:10.776911 116419 replica.h:281] Invalid replica type: DISK
E0306 11:12:10.776917 116419 replica.h:281] Invalid replica type: DISK
E0306 11:12:10.776937 116419 replica.h:281] Invalid replica type: DISK
E0306 11:12:10.776943 116419 replica.h:281] Invalid replica type: DISK
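For anyone triaging a longer master log, a minimal sketch of grouping lines like the ones above by source location and message. It assumes only the standard glog prefix layout (severity letter, MMDD, time with microseconds, thread id, `file:line]`); the names `GLOG_RE` and `summarize_glog` are hypothetical helpers and not part of Mooncake.

```python
import re
from collections import Counter

# glog prefix: severity letter, MMDD, HH:MM:SS.micros, thread id, file:line]
GLOG_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<mmdd>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[^\s:]+:\d+)\] (?P<msg>.*)$"
)

def summarize_glog(lines):
    """Count log lines per (severity, source location, message)."""
    counts = Counter()
    for line in lines:
        m = GLOG_RE.match(line.strip())
        if m:  # lines that do not look like glog output are skipped
            counts[(m["sev"], m["src"], m["msg"])] += 1
    return counts

# Two lines copied from the master log above.
sample = [
    "E0306 11:12:10.776895 116419 replica.h:281] Invalid replica type: DISK",
    "E0306 11:12:10.776903 116419 replica.h:281] Invalid replica type: DISK",
]

for (sev, src, msg), n in summarize_glog(sample).items():
    print(f"{n}x [{sev}] {src}: {msg}")
```

Running this over the full master log would show whether `replica.h:281` is the only error site or whether other messages appear around the same timestamps.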

client-side log:

#Output tokens: 65536
 78%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎                                        | 995/1280 [03:36<00:52,  5.48it/s]Request failed: Response payload is not completed: <TransferEncodingError: 400, message='Not enough data to satisfy transfer length header.'>
 79%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                                      | 1010/1280 [03:41<01:09,  3.86it/s]Request failed: Response payload is not completed: <TransferEncodingError: 400, message='Not enough data to satisfy transfer length header.'>
Request failed: Response payload is not completed: <TransferEncodingError: 400, message='Not enough data to satisfy transfer length header.'>
Request failed: Response payload is not completed: <TransferEncodingError: 400, message='Not enough data to satisfy transfer length header.'>
Request failed: Response payload is not completed: <TransferEncodingError: 400, message='Not enough data to satisfy transfer length header.'>
Request failed: Response payload is not completed: <TransferEncodingError: 400, message='Not enough data to satisfy transfer length header.'>
Request failed: Response payload is not completed: <TransferEncodingError: 400, message='Not enough data to satisfy transfer length header.'>
Error processing response for client 222: Request failed with error: Response payload is not completed: <TransferEncodingError: 400, message='Not enough data to satisfy transfer length header.'>
Request failed: Response payload is not completed: <TransferEncodingError: 400, message='Not enough data to satisfy transfer length header.'>
Error processing response for client 103: Request failed with error: Response payload is not completed: <TransferEncodingError: 400, message='Not enough data to satisfy transfer length header.'>
Request failed: Response payload is not completed: <TransferEncodingError: 400, message='Not enough data to satisfy transfer length header.'>
Error processing response for client 17: Request failed with error: Response payload is not completed: <TransferEncodingError: 400, message='Not enough data to satisfy transfer length header.'>
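To quantify the client-side impact, a rough sketch that counts failed requests and collects the affected client ids from benchmark output like the above. It assumes only the two message shapes visible in this log; `summarize_client_log` is a hypothetical helper and not part of `bench_multiturn.py`.

```python
import re

# Matches the per-client error lines seen in the benchmark output.
CLIENT_RE = re.compile(r"Error processing response for client (\d+):")
FAILED_MARKER = "Request failed: Response payload is not completed"

def summarize_client_log(text):
    """Return (number of failed requests, sorted list of affected client ids)."""
    # Substring counting also catches failures printed on the same line
    # as a progress bar.
    failed_requests = text.count(FAILED_MARKER)
    client_ids = sorted({int(cid) for cid in CLIENT_RE.findall(text)})
    return failed_requests, client_ids

# Shortened excerpt of the client-side log above.
sample_log = "\n".join([
    "Request failed: Response payload is not completed: <TransferEncodingError: 400, message='Not enough data to satisfy transfer length header.'>",
    "Error processing response for client 222: Request failed with error: Response payload is not completed: <TransferEncodingError: 400, message='Not enough data to satisfy transfer length header.'>",
    "Request failed: Response payload is not completed: <TransferEncodingError: 400, message='Not enough data to satisfy transfer length header.'>",
    "Error processing response for client 103: Request failed with error: Response payload is not completed: <TransferEncodingError: 400, message='Not enough data to satisfy transfer length header.'>",
])

failed, clients = summarize_client_log(sample_log)
print(f"failed requests: {failed}, affected clients: {clients}")
```

Comparing the timestamps of these failures with the master's `Invalid replica type: DISK` errors would help confirm whether the two are related.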

Is this error expected? The client side also reports errors, which harms the user experience.
Also, I did not limit the maximum SSD segment size usage, so why is disk storage being discarded?

Before submitting...

  • Ensure you searched for relevant issues and read the [documentation]

Labels

bug (Something isn't working)
