Conversation
It's amazing! Happy new year!
DeepSeek MLA is not supported yet, and an error is reported when starting the model:
Thank you @lambert0312 for pointing this out. Yes, this feature is still at an early stage and currently only supports MHA- and GQA-style memory pools. I will keep you posted once MLA is supported, which should be soon.
Thanks @xiezhq-hermann
@lambert0312 just FYI, there is a PR from the community supporting MLA with hierarchical caching; it will be merged soon, but feel free to check it out: #4009
@xiezhq-hermann Thanks, but I've encountered a problem. I just experimented with #4009 and found that there is indeed a concurrency issue when TP>1: the program deadlocks. Please follow up. Thank you!
Besides
Right now it allocates a host memory pool that is 4 times the size of the device memory pool by default, so nothing else needs to be set; more options will be added.
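The default sizing described above can be sketched as follows (the helper name and its `ratio` parameter are illustrative, not SGLang's actual API):

```python
# Hypothetical sketch of the default host pool sizing described above;
# the function name and parameters are illustrative, not SGLang's API.
def host_pool_bytes(device_pool_bytes: int, ratio: int = 4) -> int:
    """The host (L2) pool defaults to `ratio` times the device (L1) pool."""
    return device_pool_bytes * ratio

# An 8 GiB device pool would get a 32 GiB host backup pool by default.
print(host_pool_bytes(8 * 1024**3) // 1024**3)
```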
Hi, I'm wondering when you are planning to support the L3 cache? I think it's reasonable to make L3 caches pluggable, which would encourage storage providers to implement L3 caches suited to their products' features. What you would need to do is define a set of KV cache APIs for getting/putting/evicting KV cache chunks/items and provide a demo implementation using something like a local SSD.
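The pluggable API suggested above might look something like this minimal sketch (all names here — `L3KVBackend`, `LocalSSDBackend`, and the get/put/evict signatures — are hypothetical, not an actual SGLang interface):

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Optional

class L3KVBackend(ABC):
    """Hypothetical pluggable L3 interface; storage providers would subclass it."""

    @abstractmethod
    def put(self, key: str, kv_chunk: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]: ...

    @abstractmethod
    def evict(self, key: str) -> None: ...

class LocalSSDBackend(L3KVBackend):
    """Demo implementation backed by a local directory (e.g. on an SSD)."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, key: str, kv_chunk: bytes) -> None:
        # One file per KV chunk, keyed by the chunk's identifier.
        (self.root / key).write_bytes(kv_chunk)

    def get(self, key: str) -> Optional[bytes]:
        path = self.root / key
        return path.read_bytes() if path.exists() else None

    def evict(self, key: str) -> None:
        (self.root / key).unlink(missing_ok=True)
```

A provider-specific backend (object store, distributed cache, etc.) would implement the same three methods against its own storage.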
This is in the works, @wangyibin-gh!
When do you expect this feature to be merged? And btw, is there any documentation about it, especially w.r.t. the APIs?

Motivation
While RadixTree-based context caching provides significant performance benefits, these gains are not always fully realized. A key bottleneck is the capacity limit of GPU memory. Currently, SGLang stores historical KV caches exclusively in GPU memory; whenever more memory is required for batch execution, existing caches are discarded.
To address this issue, we propose a hierarchical caching mechanism for LLM serving, treating GPU memory as an L1 cache, host memory as an L2 cache, and disk as an L3 cache (future). This PR introduces such a mechanism in SGLang through a separate host memory pool that backs up KV caches, allowing them to be reloaded into GPU memory when needed.
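As a rough sketch of the L1/L2 idea described above (plain dicts stand in for the GPU and host memory pools; the class and method names are illustrative, not this PR's implementation):

```python
from collections import OrderedDict

class TwoTierKVCache:
    """Illustrative two-tier cache: a capacity-limited "GPU" tier (L1) backed
    by a "host" tier (L2). Evicted L1 entries survive in L2 and are reloaded
    on a later hit, instead of being discarded outright."""

    def __init__(self, l1_capacity: int) -> None:
        self.l1 = OrderedDict()  # GPU tier, LRU-ordered and capacity-limited
        self.l2 = {}             # host tier, backup copies
        self.l1_capacity = l1_capacity

    def put(self, key, kv) -> None:
        self.l1[key] = kv
        self.l1.move_to_end(key)
        self.l2[key] = kv  # write through to the host backup
        while len(self.l1) > self.l1_capacity:
            # Evict from the GPU tier only; the host copy survives.
            self.l1.popitem(last=False)

    def get(self, key):
        if key in self.l1:
            self.l1.move_to_end(key)
            return self.l1[key]
        if key in self.l2:
            # L1 miss but L2 hit: reload the KV cache into the GPU tier.
            self.put(key, self.l2[key])
            return self.l1[key]
        return None
```

The real mechanism additionally has to manage device/host memory addresses and synchronize asynchronous transfers, but the reload-on-miss control flow is the core of the hierarchy.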
Modifications
Introduces `HiRadixCache`, which extends `RadixCache` with host memory addresses and synchronization mechanisms. Todo:
Checklist