Checklist
Describe the bug
When the SGLang server runs with --enable-hierarchical-cache, it fails during a bench_multiturn run with AttributeError: 'MHATokenToKVPoolHost' object has no attribute 'start_layer'. The error is raised in the cache loading thread (load_thread_func_layer_by_layer, cache_controller.py, line 338) when that thread calls get_flat_data_by_layer on the host memory pool (memory_pool.py, line 955).
Bug Description:
The issue persists even when data parallelism is set to 1 (--dp=1). The server runs successfully when --enable-hierarchical-cache and its related flags are removed. This points to the hierarchical caching code path as the source of the problem.
Expected Behavior:
The SGLang server should handle the bench_multiturn workload without errors when hierarchical caching is enabled.
Observed Behavior (Error Traceback):
The log contains four interleaved copies of the same traceback:

Exception in thread Thread-13 (load_thread_func_layer_by_layer):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/root/sglang/python/sglang/srt/managers/cache_controller.py", line 338, in load_thread_func_layer_by_layer
    flat_data = self.mem_pool_host.get_flat_data_by_layer(
  File "/root/sglang/python/sglang/srt/mem_cache/memory_pool.py", line 955, in get_flat_data_by_layer
    return self.kv_buffer[:, layer_id - self.start_layer, indices]
AttributeError: 'MHATokenToKVPoolHost' object has no attribute 'start_layer'
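The failing line subtracts self.start_layer from layer_id. This presumably maps a global layer index to the pool's local layer slot, for example when a pool holds only a contiguous slice of the model's layers. The toy model below reproduces the same failure pattern in plain Python. The class names are hypothetical stand-ins, not SGLang classes. The "patched" variant, which assigns start_layer in __init__ and defaults it to 0, is only an assumption about what a fix could look like; it is not the upstream change.

```python
class HostPoolMissingOffset:
    """Hypothetical stand-in for a host KV pool whose __init__ never sets start_layer."""

    def __init__(self, num_layers: int, num_tokens: int):
        # kv_buffer[kv][layer][token], with kv=0 for K and kv=1 for V.
        self.kv_buffer = [
            [[f"{'KV'[kv]}{layer}:{tok}" for tok in range(num_tokens)] for layer in range(num_layers)]
            for kv in range(2)
        ]

    def get_flat_data_by_layer(self, indices, layer_id):
        # Same indexing as memory_pool.py line 955: kv_buffer[:, layer_id - self.start_layer, indices]
        local_layer = layer_id - self.start_layer  # raises AttributeError: start_layer was never assigned
        return [[kv[local_layer][i] for i in indices] for kv in self.kv_buffer]


class HostPoolWithOffset(HostPoolMissingOffset):
    """Hypothetical patched variant that records the first global layer the pool holds."""

    def __init__(self, num_layers: int, num_tokens: int, start_layer: int = 0):
        super().__init__(num_layers, num_tokens)
        self.start_layer = start_layer  # assumption: 0 when the pool holds every layer


if __name__ == "__main__":
    try:
        HostPoolMissingOffset(num_layers=2, num_tokens=3).get_flat_data_by_layer([0, 2], layer_id=1)
    except AttributeError as exc:
        print("reproduced:", exc)
    # Layer 1, tokens 0 and 2, for both K and V.
    print(HostPoolWithOffset(num_layers=2, num_tokens=3).get_flat_data_by_layer([0, 2], layer_id=1))
```

The patched variant prints [['K1:0', 'K1:2'], ['V1:0', 'V1:2']]. A pool constructed with start_layer=4 would serve layer_id=5 from its local slot 1.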
Additional Context:
- Confirmed that running without --enable-hierarchical-cache (using only --tp=4) works correctly.
- Confirmed that the error occurs with both --dp=1 and --dp=2 when hierarchical cache is enabled.
Reproduction
- Check out SGLang commit c5645e928f0bf989510dcd707d31249c63c57e37.
- Have the Qwen3-14B model available (e.g., at ~/models/Qwen3-14B).
- Launch the server:
    python3 -m sglang.launch_server --model-path ~/models/Qwen3-14B --port 30000 \
      --enable-hierarchical-cache \
      --mem-fraction-static 0.8 \
      --hicache-ratio 2 \
      --enable-cache-report \
      --enable-metrics \
      --tp=4 \
      --dp=1
- Run the multi-turn benchmark against it:
    python3 bench_multiturn.py --model-path ~/models/Qwen3-14B \
      --dataset-path ~/models/ShareGPT_V3_unfiltered_cleaned_split/ShareGPT_V3_unfiltered_cleaned_split.json
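The comparisons listed under Additional Context can be re-run without retyping the launch command by building it programmatically. The sketch below uses a hypothetical helper, build_launch_cmd, which is not part of SGLang. It uses only the flags from the command above and toggles the hierarchical-cache flags and --dp. It assumes --hicache-ratio is the only flag tied to --enable-hierarchical-cache. It also omits --dp for the run without hierarchical cache, because the report gives only --tp=4 for that run.

```python
import os
import shlex
from typing import List, Optional

MODEL_PATH = os.path.expanduser("~/models/Qwen3-14B")


def build_launch_cmd(hicache: bool, tp: int = 4, dp: Optional[int] = 1, port: int = 30000) -> List[str]:
    """Hypothetical helper: argv for sglang.launch_server built from the flags in this report."""
    cmd = ["python3", "-m", "sglang.launch_server", "--model-path", MODEL_PATH, "--port", str(port)]
    if hicache:
        # Assumption: --hicache-ratio is the only flag that belongs with --enable-hierarchical-cache.
        cmd += ["--enable-hierarchical-cache", "--hicache-ratio", "2"]
    cmd += ["--mem-fraction-static", "0.8", "--enable-cache-report", "--enable-metrics", f"--tp={tp}"]
    if dp is not None:
        cmd.append(f"--dp={dp}")
    return cmd


# The three configurations described under Additional Context.
CONFIGS = [
    {"hicache": False, "dp": None},  # reported to work (only --tp=4 given)
    {"hicache": True, "dp": 1},      # reported to fail
    {"hicache": True, "dp": 2},      # reported to fail
]

if __name__ == "__main__":
    for cfg in CONFIGS:
        # Print a shell-ready command line for each configuration.
        print(shlex.join(build_launch_cmd(**cfg)))
```

Each printed line can be pasted into a shell, or the list can be passed directly to subprocess.run.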
Environment
SGLang Version/Commit: c5645e928f0bf989510dcd707d31249c63c57e37
Model: Qwen3-14B
PyTorch Version: 2.6.0
CUDA Version: 12.4
GPU Model: NVIDIA A10
Operating System: Ubuntu 22.04.5 LTS