
[optimize] boost local_disk_backend submit_put_task performance#912

Closed
llc-kc wants to merge 1 commit into LMCache:dev from llc-kc:llc_test

Conversation

@llc-kc
Contributor

@llc-kc llc-kc commented Jun 26, 2025

The cache-store execution time mainly consists of three parts: alloc, CPU-GPU copy, and submit_put_task.
The existing local_disk_backend submit_put_task does a lot of work that typically takes more than 1 ms, which reduces throughput significantly. In the vLLM serving benchmark, the CPU + local-disk offload run is typically longer than the CPU-only offload run.
This PR uses a thread to do the put task so that submit_put_task returns immediately, which results in a similar vLLM serving benchmark duration for CPU + local-disk offload and CPU-only offload.
Note that submit_put_task no longer returns a Future; the Future was never actually used, so returning None is fine.
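The approach described above can be sketched as follows. This is a minimal, self-contained illustration with hypothetical names (DiskBackendSketch, written), not the actual PR code: submit_put_task only enqueues into an OrderedDict and returns None, while a background thread drains the queue in FIFO order and performs the (here simulated) disk write.

```python
import threading
from collections import OrderedDict


class DiskBackendSketch:
    """Minimal sketch of a thread-based put path (hypothetical, not the real LMCache code)."""

    def __init__(self) -> None:
        self.put_tasks: "OrderedDict[str, bytes]" = OrderedDict()
        self.lock = threading.Lock()
        self.has_work = threading.Event()
        self.closed = False
        self.written: dict = {}  # stands in for the on-disk store
        self.worker = threading.Thread(target=self._put_task_worker, daemon=True)
        self.worker.start()

    def submit_put_task(self, key: str, obj: bytes) -> None:
        # Enqueue and return immediately; no Future is returned
        # because callers never used it.
        with self.lock:
            self.put_tasks[key] = obj
        self.has_work.set()

    def _put_task_worker(self) -> None:
        while not self.closed:
            self.has_work.wait(timeout=0.1)
            while True:
                with self.lock:
                    if not self.put_tasks:
                        self.has_work.clear()
                        break
                    key, obj = self.put_tasks.popitem(last=False)  # FIFO
                # Stand-in for the actual disk write.
                self.written[key] = obj

    def close(self) -> None:
        self.closed = True
        self.has_work.set()
        self.worker.join()
```

The caller's critical path now costs only a lock acquire and a dict insert, which is the effect the put_time numbers below show.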

@llc-kc llc-kc force-pushed the llc_test branch 5 times, most recently from 85e71b2 to 49f3715 on June 26, 2025 13:16
@YaoJiayi
Collaborator

@llc-kc Can you show some numbers?

@llc-kc
Contributor Author

llc-kc commented Jun 27, 2025

@YaoJiayi here are the benchmark method and results:

Benchmark method

Launch vLLM serving:

rm -r /DATA/disk1/lmcache_tmp
mkdir /DATA/disk1/lmcache_tmp

export LMCACHE_CONFIG_FILE=lmcache-config.yaml

vllm serve Qwen/Qwen3-32B-FP8 \
    -tp 8 \
    --disable-log-requests \
    --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both","kv_connector_extra_config": {"discard_partial_chunks": true}}'

lmcache-config.yaml

local_cpu: True
max_local_cpu_size: 20
local_disk: /DATA/disk1/lmcache_tmp/
max_local_disk_size: 500
save_unfull_chunk: false

Run the benchmark:

python3 -m sglang.bench_serving \
    --backend vllm \
    --model Qwen/Qwen3-32B-FP8 \
    --dataset-name random \
    --dataset-path /root/ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 1024 \
    --random-input 2048 \
    --random-output 256 \
    --random-range-ratio 1.0 \
    --request-rate 256 \
    --max-concurrency 256 \
    --base-url http://localhost:8000

Results with no offload

Backend:                                 vllm
Traffic request rate:                    256.0
Max request concurrency:                 256
Successful requests:                     1024
Benchmark duration (s):                  83.59
Total input tokens:                      2097152
Total generated tokens:                  262144
Total generated tokens (retokenized):    262124
Request throughput (req/s):              12.25
Input token throughput (tok/s):          25089.91
Output token throughput (tok/s):         3136.24
Total token throughput (tok/s):          28226.14
Concurrency:                             252.69
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   20626.27
Median E2E Latency (ms):                 20731.77
---------------Time to First Token----------------
Mean TTFT (ms):                          2326.08
Median TTFT (ms):                        712.27
P99 TTFT (ms):                           13910.68
---------------Inter-Token Latency----------------
Mean ITL (ms):                           71.82
Median ITL (ms):                         25.77
P95 ITL (ms):                            239.69
P99 ITL (ms):                            241.73
Max ITL (ms):                            721.82

Results with CPU offload only

Backend:                                 vllm
Traffic request rate:                    256.0
Max request concurrency:                 256
Successful requests:                     1024
Benchmark duration (s):                  91.29
Total input tokens:                      2097152
Total generated tokens:                  262144
Total generated tokens (retokenized):    262131
Request throughput (req/s):              11.22
Input token throughput (tok/s):          22971.18
Output token throughput (tok/s):         2871.40
Total token throughput (tok/s):          25842.58
Concurrency:                             252.93
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   22550.03
Median E2E Latency (ms):                 22410.29
---------------Time to First Token----------------
Mean TTFT (ms):                          2436.57
Median TTFT (ms):                        780.83
P99 TTFT (ms):                           14953.89
---------------Inter-Token Latency----------------
Mean ITL (ms):                           78.92
Median ITL (ms):                         26.36
P95 ITL (ms):                            263.63
P99 ITL (ms):                            268.47
Max ITL (ms):                            938.36

LMCache INFO: Store 1792 tokens takes: 2.2906 ms, throughput: 23.8751 GB/s; alloc_time: 0.1726 ms, offload_time: 2.2708 ms, put_time: 0.0198 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 1792 tokens takes: 2.3308 ms, throughput: 23.4632 GB/s; alloc_time: 0.1748 ms, offload_time: 2.3106 ms, put_time: 0.0202 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 1792 tokens takes: 3.0179 ms, throughput: 18.1208 GB/s; alloc_time: 0.1885 ms, offload_time: 2.9986 ms, put_time: 0.0193 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 2048 tokens takes: 2.0685 ms, throughput: 30.2145 GB/s; alloc_time: 0.1996 ms, offload_time: 2.0426 ms, put_time: 0.0259 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 1792 tokens takes: 3.0898 ms, throughput: 17.6996 GB/s; alloc_time: 0.1728 ms, offload_time: 3.0655 ms, put_time: 0.0243 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 1792 tokens takes: 2.6011 ms, throughput: 21.0245 GB/s; alloc_time: 0.1802 ms, offload_time: 2.5816 ms, put_time: 0.0195 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 2048 tokens takes: 2.0933 ms, throughput: 29.8571 GB/s; alloc_time: 0.1755 ms, offload_time: 2.0728 ms, put_time: 0.0205 ms (cache_engine.py:225:lmcache.v1.cache_engine)

LMCache INFO: Store 256 tokens takes: 0.5753 ms, throughput: 13.5796 GB/s; alloc_time: 0.1689 ms, offload_time: 0.5664 ms, put_time: 0.0089 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 256 tokens takes: 0.5625 ms, throughput: 13.8884 GB/s; alloc_time: 0.1702 ms, offload_time: 0.5531 ms, put_time: 0.0094 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 256 tokens takes: 0.5566 ms, throughput: 14.0352 GB/s; alloc_time: 0.1755 ms, offload_time: 0.5473 ms, put_time: 0.0093 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 256 tokens takes: 0.6428 ms, throughput: 12.1534 GB/s; alloc_time: 0.1714 ms, offload_time: 0.6340 ms, put_time: 0.0088 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 256 tokens takes: 0.7112 ms, throughput: 10.9849 GB/s; alloc_time: 0.1769 ms, offload_time: 0.7018 ms, put_time: 0.0094 ms (cache_engine.py:225:lmcache.v1.cache_engine)

Results with CPU + disk offload, without the optimization

Backend:                                 vllm
Traffic request rate:                    256.0
Max request concurrency:                 256
Successful requests:                     1024
Benchmark duration (s):                  123.62
Total input tokens:                      2097152
Total generated tokens:                  262144
Total generated tokens (retokenized):    262132
Request throughput (req/s):              8.28
Input token throughput (tok/s):          16965.12
Output token throughput (tok/s):         2120.64
Total token throughput (tok/s):          19085.76
Concurrency:                             253.66
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   30621.50
Median E2E Latency (ms):                 31144.50
---------------Time to First Token----------------
Mean TTFT (ms):                          3640.03
Median TTFT (ms):                        1092.86
P99 TTFT (ms):                           22376.24
---------------Inter-Token Latency----------------
Mean ITL (ms):                           105.88
Median ITL (ms):                         26.52
P95 ITL (ms):                            386.40
P99 ITL (ms):                            576.30
Max ITL (ms):                            1945.37

LMCache INFO: Store 1792 tokens takes: 4.7224 ms, throughput: 11.5805 GB/s; alloc_time: 0.8652 ms, offload_time: 3.6815 ms, put_time: 1.0409 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 1792 tokens takes: 7.6667 ms, throughput: 7.1331 GB/s; alloc_time: 0.7455 ms, offload_time: 3.2387 ms, put_time: 4.4280 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 768 tokens takes: 3.3960 ms, throughput: 6.9016 GB/s; alloc_time: 0.6998 ms, offload_time: 2.2951 ms, put_time: 1.1009 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 1792 tokens takes: 8.4138 ms, throughput: 6.4997 GB/s; alloc_time: 0.5617 ms, offload_time: 6.8491 ms, put_time: 1.5648 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 1792 tokens takes: 7.1763 ms, throughput: 7.6206 GB/s; alloc_time: 3.1958 ms, offload_time: 4.1645 ms, put_time: 3.0118 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 1792 tokens takes: 5.3449 ms, throughput: 10.2317 GB/s; alloc_time: 2.1824 ms, offload_time: 4.2883 ms, put_time: 1.0566 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 768 tokens takes: 7.9802 ms, throughput: 2.9370 GB/s; alloc_time: 0.3123 ms, offload_time: 6.9432 ms, put_time: 1.0370 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 256 tokens takes: 2.3368 ms, throughput: 3.3432 GB/s; alloc_time: 1.0244 ms, offload_time: 1.9469 ms, put_time: 0.3899 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 256 tokens takes: 2.8265 ms, throughput: 2.7640 GB/s; alloc_time: 1.3651 ms, offload_time: 2.3192 ms, put_time: 0.5074 ms (cache_engine.py:225:lmcache.v1.cache_engine)

Results with CPU + disk offload, with the optimization

Backend:                                 vllm
Traffic request rate:                    256.0
Max request concurrency:                 256
Successful requests:                     1024
Benchmark duration (s):                  92.42
Total input tokens:                      2097152
Total generated tokens:                  262144
Total generated tokens (retokenized):    262130
Request throughput (req/s):              11.08
Input token throughput (tok/s):          22691.72
Output token throughput (tok/s):         2836.47
Total token throughput (tok/s):          25528.19
Concurrency:                             252.97
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   22831.25
Median E2E Latency (ms):                 22689.21
---------------Time to First Token----------------
Mean TTFT (ms):                          2469.73
Median TTFT (ms):                        787.25
P99 TTFT (ms):                           15292.39
---------------Inter-Token Latency----------------
Mean ITL (ms):                           79.90
Median ITL (ms):                         26.43
P95 ITL (ms):                            267.05
P99 ITL (ms):                            285.64
Max ITL (ms):                            801.35

LMCache INFO: Store 2048 tokens takes: 3.6619 ms, throughput: 17.0675 GB/s; alloc_time: 0.2059 ms, offload_time: 3.6161 ms, put_time: 0.0458 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 2048 tokens takes: 3.5421 ms, throughput: 17.6447 GB/s; alloc_time: 0.2087 ms, offload_time: 3.4940 ms, put_time: 0.0481 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 2048 tokens takes: 3.6094 ms, throughput: 17.3157 GB/s; alloc_time: 0.2188 ms, offload_time: 3.5627 ms, put_time: 0.0468 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 2048 tokens takes: 4.0662 ms, throughput: 15.3705 GB/s; alloc_time: 0.2238 ms, offload_time: 4.0157 ms, put_time: 0.0506 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 2048 tokens takes: 4.8939 ms, throughput: 12.7711 GB/s; alloc_time: 0.1776 ms, offload_time: 4.8492 ms, put_time: 0.0447 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 2048 tokens takes: 4.9371 ms, throughput: 12.6593 GB/s; alloc_time: 0.1785 ms, offload_time: 4.8899 ms, put_time: 0.0471 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 256 tokens takes: 0.9619 ms, throughput: 8.1224 GB/s; alloc_time: 0.1864 ms, offload_time: 0.9429 ms, put_time: 0.0189 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 256 tokens takes: 0.9669 ms, throughput: 8.0800 GB/s; alloc_time: 0.2019 ms, offload_time: 0.9474 ms, put_time: 0.0195 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 256 tokens takes: 0.9654 ms, throughput: 8.0925 GB/s; alloc_time: 0.2115 ms, offload_time: 0.9440 ms, put_time: 0.0214 ms (cache_engine.py:225:lmcache.v1.cache_engine)

@llc-kc
Contributor Author

llc-kc commented Jun 27, 2025

Therefore, this PR effectively optimizes disk-offload performance by reducing the put time.

Collaborator

@YaoJiayi YaoJiayi left a comment


Good work! Just left some comments/questions:)


self.loop = loop
self.put_tasks: List[CacheEngineKey] = []
self.put_tasks: OrderedDict[CacheEngineKey, MemoryObj] = OrderedDict()
Collaborator


Why ordered dict here?

Contributor Author


First, we need a dict to store both the key and its MemoryObj; second, an OrderedDict lets us process entries in put order.
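The ordering point can be shown in isolation. With an OrderedDict, popitem(last=False) yields entries in insertion order, so a worker draining the dict writes caches in the order they were put (the key/value names below are illustrative, not the PR's types):

```python
from collections import OrderedDict

put_tasks = OrderedDict()
put_tasks["key_a"] = "obj_a"
put_tasks["key_b"] = "obj_b"
put_tasks["key_c"] = "obj_c"

# Re-putting an existing key updates the value but keeps its
# original position, so the oldest put is still drained first.
put_tasks["key_a"] = "obj_a2"

drain_order = []
while put_tasks:
    key, obj = put_tasks.popitem(last=False)  # FIFO: oldest put first
    drain_order.append(key)

print(drain_order)  # ['key_a', 'key_b', 'key_c']
```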

)

def _put_task_worker(self):
while not self.closed:
Collaborator


A question: why is threading faster than asyncio here?

Collaborator


Shouldn't asyncio be lighter than threads?

Contributor Author

@llc-kc llc-kc Jul 1, 2025


The reasons this optimization works are: first, there is some processing before async_save_bytes_to_disk; second, asyncio.run_coroutine_threadsafe may add extra overhead. Your comments remind me of another solution: move all the work in submit_put_task into the async function. But that solution's performance is not good; its benchmark duration is similar to no optimization.
The code for the other solution:

    async def async_put_task(
        self,
        key: CacheEngineKey,
        memory_obj: MemoryObj,
    ) -> None:
        # Update cache recency
        evict_keys, put_status = self.evictor.update_on_put(
            self.dict, memory_obj.get_physical_size()
        )
        if put_status == PutStatus.ILLEGAL:
            return None
        # evict caches
        for evict_key in evict_keys:
            self.remove(evict_key)
        if self.lookup_server is not None:
            self.lookup_server.batched_remove(evict_keys)

        memory_obj.ref_count_up()

        self.disk_lock.acquire()
        self.put_tasks.append(key)
        self.disk_lock.release()

        await self.async_save_bytes_to_disk(key, memory_obj)

    def submit_put_task(
        self,
        key: CacheEngineKey,
        memory_obj: MemoryObj,
    ) -> Optional[Future]:
        assert memory_obj.tensor is not None

        future = asyncio.run_coroutine_threadsafe(
            self.async_put_task(key, memory_obj), self.loop
        )
        return future
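The suspected submit-side overhead of asyncio.run_coroutine_threadsafe versus a plain thread-safe queue handoff can be probed with a rough micro-benchmark. This sketch is not from the PR; the numbers are machine-dependent and only indicative:

```python
import asyncio
import queue
import threading
import timeit

# Event loop running in a background thread, as in the asyncio variant.
loop = asyncio.new_event_loop()
loop_thread = threading.Thread(target=loop.run_forever, daemon=True)
loop_thread.start()


async def noop():
    pass


q: "queue.Queue[int]" = queue.Queue()


def submit_via_asyncio() -> None:
    # Cost on the caller side: create a coroutine and schedule it
    # onto the loop thread (does not wait for it to run).
    asyncio.run_coroutine_threadsafe(noop(), loop)


def submit_via_queue() -> None:
    # Cost on the caller side: a lock acquire and a deque append.
    q.put(1)


n = 1000
asyncio_s = timeit.timeit(submit_via_asyncio, number=n)
queue_s = timeit.timeit(submit_via_queue, number=n)
print(f"run_coroutine_threadsafe: {asyncio_s / n * 1e6:.1f} us/call")
print(f"queue.put:                {queue_s / n * 1e6:.1f} us/call")

loop.call_soon_threadsafe(loop.stop)
```

On the machines I have tried, the queue handoff is considerably cheaper per call, which is consistent with the thread-based approach winning in the benchmark above.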

@llc-kc llc-kc force-pushed the llc_test branch 3 times, most recently from 36af5c0 to 64fdeac on July 2, 2025 08:14
@llc-kc
Contributor Author

llc-kc commented Jul 4, 2025

@YaoJiayi Hi, what do you think about this PR? Currently one check is failing, but I can't see the details.

@Shaoting-Feng
Contributor

@YaoJiayi Hi, what do you think about this PR? Currently one check is failing, but I can't see the details.

Please ignore this check. It is only experimental.

@Shaoting-Feng Shaoting-Feng requested a review from YaoJiayi on July 7, 2025 17:59
@llc-kc llc-kc force-pushed the llc_test branch 2 times, most recently from 875ea2e to 62960a5 on July 8, 2025 07:29
@llc-kc
Contributor Author

llc-kc commented Jul 8, 2025

The code conflict is fixed in the latest commit.

Signed-off-by: liluchang <liluchang@kingsoft.com>
