Describe the bug

I am observing that GPU memory usage grows by roughly 20 GB beyond the server's post-initialization footprint when SGLang is left running for about 7 days and serves a large number of requests (between 500k and 1 million). I am trying to track down the root cause. Should I be clearing the radix cache periodically somehow?
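For what it's worth, this is the kind of periodic cleanup I have in mind; it is only a sketch and assumes the server exposes the /flush_cache HTTP endpoint on the default host/port of my deployment (I have not verified that this actually releases the leaked memory):

import requests

SERVER = "http://localhost:30000"  # assumed default sglang host/port

def flush_radix_cache():
    # Ask the scheduler to drop its prefix/radix cache entries. I would only
    # call this when the server is otherwise idle, since a flush while
    # requests are in flight may be refused.
    resp = requests.post(f"{SERVER}/flush_cache")
    resp.raise_for_status()
    print(resp.text)

if __name__ == "__main__":
    flush_radix_cache()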
I used the Torch memory profiler to capture snapshots of the process that accumulates memory, and I only found a large number of small allocations with the following call stack that never get deallocated:
48594 Addr: b'7408ef6eaa00_3, Size: 6.1KiB (6280 bytes) allocation, Total memory used after allocation: 43.2GiB (46407519728 bytes), stream 0, timestamp Thu May 01 2025 17:12:03 GMT+0300 (Eastern European Summer Time)
CUDACachingAllocator.cpp:0:c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::malloc(signed char, unsigned long, CUstream_st*)
:0:c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::malloc(void**, signed char, unsigned long, CUstream_st*)
:0:c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::allocate(unsigned long)
:0:at::TensorBase at::detail::_empty_strided_generic<c10::ArrayRef<long> >(c10::ArrayRef<long>, c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType)
??:0:at::detail::empty_strided_generic(c10::ArrayRef<long>, c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType)
??:0:at::detail::empty_strided_cuda(c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ScalarType, std::optional<c10::Device>)
??:0:at::detail::empty_strided_cuda(c10::ArrayRef<long>, c10::ArrayRef<long>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>)
??:0:at::native::empty_strided_cuda(c10::ArrayRef<long>, c10::ArrayRef<long>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>)
RegisterCUDA.cpp:0:at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__empty_strided(c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>)
RegisterCUDA.cpp:0:c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__empty_strided>, at::Tensor, c10::guts::typelist::typelist<c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool> > >, at::Tensor (c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>)
??:0:at::_ops::empty_strided::redispatch(c10::DispatchKeySet, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>)
RegisterBackendSelect.cpp:0:c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>), &at::(anonymous namespace)::empty_strided>, at::Tensor, c10::guts::typelist::typelist<c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool> > >, at::Tensor (c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>)
??:0:at::_ops::empty_strided::call(c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>)
??:0:at::native::clone(at::Tensor const&, std::optional<c10::MemoryFormat>)
RegisterCompositeExplicitAutograd.cpp:0:c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, std::optional<c10::MemoryFormat>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__clone>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, std::optional<c10::MemoryFormat> > >, at::Tensor (at::Tensor const&, std::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, std::optional<c10::MemoryFormat>)
??:0:at::_ops::clone::redispatch(c10::DispatchKeySet, at::Tensor const&, std::optional<c10::MemoryFormat>)
VariableType_1.cpp:0:torch::autograd::VariableType::(anonymous namespace)::clone(c10::DispatchKeySet, at::Tensor const&, std::optional<c10::MemoryFormat>)
VariableType_1.cpp:0:c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, std::optional<c10::MemoryFormat>), &torch::autograd::VariableType::(anonymous namespace)::clone>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, std::optional<c10::MemoryFormat> > >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, std::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, std::optional<c10::MemoryFormat>)
??:0:at::_ops::clone::call(at::Tensor const&, std::optional<c10::MemoryFormat>)
python_variable_methods.cpp:0:torch::autograd::THPVariable_clone(_object*, _object*, _object*)
/usr/local/src/conda/python-3.11.11/Objects/descrobject.c:364:method_vectorcall_VARARGS_KEYWORDS
/home/ubuntu/tts.cpp/sglang_repo/python/sglang/srt/mem_cache/radix_cache.py:189:cache_finished_req
/home/ubuntu/tts.cpp/sglang_repo/python/sglang/srt/managers/scheduler_output_processor_mixin.py:237:process_batch_result_decode
/home/ubuntu/tts.cpp/sglang_repo/python/sglang/srt/managers/scheduler.py:1453:process_batch_result
/home/ubuntu/tts.cpp/sglang_repo/python/sglang/srt/managers/scheduler.py:656:event_loop_overlap
/home/ubuntu/miniconda3/envs/sgl/lib/python3.11/site-packages/torch/utils/_contextlib.py:116:decorate_context
/home/ubuntu/tts.cpp/sglang_repo/python/sglang/srt/managers/scheduler.py:2057:run_scheduler_process
/home/ubuntu/miniconda3/envs/sgl/lib/python3.11/multiprocessing/process.py:108:run
/home/ubuntu/miniconda3/envs/sgl/lib/python3.11/multiprocessing/process.py:314:_bootstrap
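For reference, this is roughly how I capture the snapshots, using the private torch.cuda.memory history APIs; exactly where this is hooked into the scheduler process (which owns the allocations above) is omitted here, and the file name and entry count are just placeholders:

import torch

# Start recording allocator events, including a stack trace per alloc/free.
# This is a private API in recent PyTorch releases and may change.
torch.cuda.memory._record_memory_history(max_entries=200_000)

# ... serve traffic for a while inside the scheduler process ...

# Dump a snapshot that https://pytorch.org/memory_viz can load.
torch.cuda.memory._dump_snapshot("scheduler_snapshot.pickle")

# Stop recording when done.
torch.cuda.memory._record_memory_history(enabled=None)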
Here is the Torch Profiler snapshot if helpful in any way: https://drive.google.com/file/d/1oe9FjuEukhzhcNQwWVcmWmj-xd4ff9si/view?usp=sharing
You can use https://pytorch.org/memory_viz to inspect it.

Reproduction

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --mem-fraction-static 0.55 --quantization fp8 --schedule-conservativeness 0.3
I also set return_hidden_state to true, but I observe the same growth even without it.
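To approximate the traffic that triggers the growth, I drive the server with a simple loop like the one below. The request volume, prompt contents, and sampling parameters are placeholders; the payload shape follows the native /generate API:

import requests

SERVER = "http://localhost:30000"  # assumed default sglang host/port

# Placeholder prompt; real traffic uses many distinct prompts, which is what
# fills the radix (prefix) cache with a large number of entries.
PROMPT = "Summarize the following text: ..."

for i in range(500_000):  # roughly the request volume served over ~7 days
    resp = requests.post(
        f"{SERVER}/generate",
        json={
            "text": f"{PROMPT} (request {i})",
            "sampling_params": {"max_new_tokens": 64, "temperature": 0.7},
        },
    )
    resp.raise_for_status()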
Environment

Python: 3.11.11 (main, Dec 11 2024, 16:28:39) [GCC 11.2.0]
CUDA available: True
GPU 0: NVIDIA H100 80GB HBM3
GPU 0 Compute Capability: 9.0
CUDA_HOME: /usr
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.127.05
PyTorch: 2.5.1+cu124
sglang: 0.4.5
sgl_kernel: 0.0.8
flashinfer: Module Not Found
triton: 3.1.0
transformers: 4.51.0
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.12
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.30.2
interegular: 0.3.3
modelscope: 1.23.0
orjson: 3.10.15
outlines: 0.1.11
packaging: 24.2
psutil: 5.9.8
pydantic: 2.10.6
multipart: Module Not Found
zmq: Module Not Found
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.7.2
xgrammar: 0.1.17
openai: 1.63.2
tiktoken: 0.9.0
anthropic: 0.46.0
litellm: 1.61.11
NVIDIA Topology:
        GPU0    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     0-25            0               N/A
NIC0    PHB      X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
Hypervisor vendor: KVM
ulimit soft: 1048576