
[optimize] boost local_disk_backend submit_put_task performance#912

Closed
llc-kc wants to merge 1 commit into LMCache:dev from llc-kc:llc_test

Conversation

@llc-kc
Contributor

@llc-kc llc-kc commented Jun 26, 2025

The cache-store execution time mainly consists of three parts: alloc, CPU-GPU copy, and submit_put_task.
The existing local_disk_backend submit_put_task does a lot of work that typically takes more than 1 ms, which reduces throughput significantly. In the vLLM serving benchmark, the CPU + local-disk offload run is typically longer than the CPU-only offload run.
This PR uses a thread to do the put task so that submit_put_task returns immediately, which results in a similar vLLM serving benchmark duration for CPU + local-disk offload and CPU-only offload.
Note that submit_put_task no longer returns a Future; the Future was never actually used, so returning None is fine.
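The approach described above can be sketched as follows. This is a minimal, self-contained illustration with hypothetical names (DiskBackendSketch, written), not the actual PR code: submit_put_task only enqueues into an OrderedDict and returns None, while a background thread drains the queue in FIFO order and performs the (here simulated) disk write.

```python
import threading
from collections import OrderedDict


class DiskBackendSketch:
    """Minimal sketch of a thread-based put path (hypothetical, not the real LMCache code)."""

    def __init__(self) -> None:
        self.put_tasks: "OrderedDict[str, bytes]" = OrderedDict()
        self.lock = threading.Lock()
        self.has_work = threading.Event()
        self.closed = False
        self.written: dict = {}  # stands in for the on-disk store
        self.worker = threading.Thread(target=self._put_task_worker, daemon=True)
        self.worker.start()

    def submit_put_task(self, key: str, obj: bytes) -> None:
        # Enqueue and return immediately; no Future is returned
        # because callers never used it.
        with self.lock:
            self.put_tasks[key] = obj
        self.has_work.set()

    def _put_task_worker(self) -> None:
        while not self.closed:
            self.has_work.wait(timeout=0.1)
            while True:
                with self.lock:
                    if not self.put_tasks:
                        self.has_work.clear()
                        break
                    key, obj = self.put_tasks.popitem(last=False)  # FIFO
                # Stand-in for the actual disk write.
                self.written[key] = obj

    def close(self) -> None:
        self.closed = True
        self.has_work.set()
        self.worker.join()
```

The caller's critical path now costs only a lock acquire and a dict insert, which is the effect the put_time numbers below show.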

@llc-kc llc-kc force-pushed the llc_test branch 5 times, most recently from 85e71b2 to 49f3715 on June 26, 2025 13:16
@YaoJiayi
Collaborator

@llc-kc Can you show some numbers?

@llc-kc
Contributor Author

llc-kc commented Jun 27, 2025

@YaoJiayi here are the benchmark method and results:

Benchmark method

Launch vLLM serving:

rm -r /DATA/disk1/lmcache_tmp
mkdir /DATA/disk1/lmcache_tmp

export LMCACHE_CONFIG_FILE=lmcache-config.yaml

vllm serve Qwen/Qwen3-32B-FP8 \
    -tp 8 \
    --disable-log-requests \
    --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both","kv_connector_extra_config": {"discard_partial_chunks": true}}'

lmcache-config.yaml

local_cpu: True
max_local_cpu_size: 20
local_disk: /DATA/disk1/lmcache_tmp/
max_local_disk_size: 500
save_unfull_chunk: false

Run the benchmark:

python3 -m sglang.bench_serving \
    --backend vllm \
    --model Qwen/Qwen3-32B-FP8 \
    --dataset-name random \
    --dataset-path /root/ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 1024 \
    --random-input 2048 \
    --random-output 256 \
    --random-range-ratio 1.0 \
    --request-rate 256 \
    --max-concurrency 256 \
    --base-url http://localhost:8000

Results with no offload

Backend:                                 vllm
Traffic request rate:                    256.0
Max request concurrency:                 256
Successful requests:                     1024
Benchmark duration (s):                  83.59
Total input tokens:                      2097152
Total generated tokens:                  262144
Total generated tokens (retokenized):    262124
Request throughput (req/s):              12.25
Input token throughput (tok/s):          25089.91
Output token throughput (tok/s):         3136.24
Total token throughput (tok/s):          28226.14
Concurrency:                             252.69
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   20626.27
Median E2E Latency (ms):                 20731.77
---------------Time to First Token----------------
Mean TTFT (ms):                          2326.08
Median TTFT (ms):                        712.27
P99 TTFT (ms):                           13910.68
---------------Inter-Token Latency----------------
Mean ITL (ms):                           71.82
Median ITL (ms):                         25.77
P95 ITL (ms):                            239.69
P99 ITL (ms):                            241.73
Max ITL (ms):                            721.82

Results with CPU offload only

Backend:                                 vllm
Traffic request rate:                    256.0
Max request concurrency:                 256
Successful requests:                     1024
Benchmark duration (s):                  91.29
Total input tokens:                      2097152
Total generated tokens:                  262144
Total generated tokens (retokenized):    262131
Request throughput (req/s):              11.22
Input token throughput (tok/s):          22971.18
Output token throughput (tok/s):         2871.40
Total token throughput (tok/s):          25842.58
Concurrency:                             252.93
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   22550.03
Median E2E Latency (ms):                 22410.29
---------------Time to First Token----------------
Mean TTFT (ms):                          2436.57
Median TTFT (ms):                        780.83
P99 TTFT (ms):                           14953.89
---------------Inter-Token Latency----------------
Mean ITL (ms):                           78.92
Median ITL (ms):                         26.36
P95 ITL (ms):                            263.63
P99 ITL (ms):                            268.47
Max ITL (ms):                            938.36

LMCache INFO: Store 1792 tokens takes: 2.2906 ms, throughput: 23.8751 GB/s; alloc_time: 0.1726 ms, offload_time: 2.2708 ms, put_time: 0.0198 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 1792 tokens takes: 2.3308 ms, throughput: 23.4632 GB/s; alloc_time: 0.1748 ms, offload_time: 2.3106 ms, put_time: 0.0202 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 1792 tokens takes: 3.0179 ms, throughput: 18.1208 GB/s; alloc_time: 0.1885 ms, offload_time: 2.9986 ms, put_time: 0.0193 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 2048 tokens takes: 2.0685 ms, throughput: 30.2145 GB/s; alloc_time: 0.1996 ms, offload_time: 2.0426 ms, put_time: 0.0259 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 1792 tokens takes: 3.0898 ms, throughput: 17.6996 GB/s; alloc_time: 0.1728 ms, offload_time: 3.0655 ms, put_time: 0.0243 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 1792 tokens takes: 2.6011 ms, throughput: 21.0245 GB/s; alloc_time: 0.1802 ms, offload_time: 2.5816 ms, put_time: 0.0195 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 2048 tokens takes: 2.0933 ms, throughput: 29.8571 GB/s; alloc_time: 0.1755 ms, offload_time: 2.0728 ms, put_time: 0.0205 ms (cache_engine.py:225:lmcache.v1.cache_engine)

LMCache INFO: Store 256 tokens takes: 0.5753 ms, throughput: 13.5796 GB/s; alloc_time: 0.1689 ms, offload_time: 0.5664 ms, put_time: 0.0089 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 256 tokens takes: 0.5625 ms, throughput: 13.8884 GB/s; alloc_time: 0.1702 ms, offload_time: 0.5531 ms, put_time: 0.0094 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 256 tokens takes: 0.5566 ms, throughput: 14.0352 GB/s; alloc_time: 0.1755 ms, offload_time: 0.5473 ms, put_time: 0.0093 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 256 tokens takes: 0.6428 ms, throughput: 12.1534 GB/s; alloc_time: 0.1714 ms, offload_time: 0.6340 ms, put_time: 0.0088 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 256 tokens takes: 0.7112 ms, throughput: 10.9849 GB/s; alloc_time: 0.1769 ms, offload_time: 0.7018 ms, put_time: 0.0094 ms (cache_engine.py:225:lmcache.v1.cache_engine)

Results with CPU + disk offload, without the optimization

Backend:                                 vllm
Traffic request rate:                    256.0
Max request concurrency:                 256
Successful requests:                     1024
Benchmark duration (s):                  123.62
Total input tokens:                      2097152
Total generated tokens:                  262144
Total generated tokens (retokenized):    262132
Request throughput (req/s):              8.28
Input token throughput (tok/s):          16965.12
Output token throughput (tok/s):         2120.64
Total token throughput (tok/s):          19085.76
Concurrency:                             253.66
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   30621.50
Median E2E Latency (ms):                 31144.50
---------------Time to First Token----------------
Mean TTFT (ms):                          3640.03
Median TTFT (ms):                        1092.86
P99 TTFT (ms):                           22376.24
---------------Inter-Token Latency----------------
Mean ITL (ms):                           105.88
Median ITL (ms):                         26.52
P95 ITL (ms):                            386.40
P99 ITL (ms):                            576.30
Max ITL (ms):                            1945.37

LMCache INFO: Store 1792 tokens takes: 4.7224 ms, throughput: 11.5805 GB/s; alloc_time: 0.8652 ms, offload_time: 3.6815 ms, put_time: 1.0409 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 1792 tokens takes: 7.6667 ms, throughput: 7.1331 GB/s; alloc_time: 0.7455 ms, offload_time: 3.2387 ms, put_time: 4.4280 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 768 tokens takes: 3.3960 ms, throughput: 6.9016 GB/s; alloc_time: 0.6998 ms, offload_time: 2.2951 ms, put_time: 1.1009 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 1792 tokens takes: 8.4138 ms, throughput: 6.4997 GB/s; alloc_time: 0.5617 ms, offload_time: 6.8491 ms, put_time: 1.5648 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 1792 tokens takes: 7.1763 ms, throughput: 7.6206 GB/s; alloc_time: 3.1958 ms, offload_time: 4.1645 ms, put_time: 3.0118 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 1792 tokens takes: 5.3449 ms, throughput: 10.2317 GB/s; alloc_time: 2.1824 ms, offload_time: 4.2883 ms, put_time: 1.0566 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 768 tokens takes: 7.9802 ms, throughput: 2.9370 GB/s; alloc_time: 0.3123 ms, offload_time: 6.9432 ms, put_time: 1.0370 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 256 tokens takes: 2.3368 ms, throughput: 3.3432 GB/s; alloc_time: 1.0244 ms, offload_time: 1.9469 ms, put_time: 0.3899 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 256 tokens takes: 2.8265 ms, throughput: 2.7640 GB/s; alloc_time: 1.3651 ms, offload_time: 2.3192 ms, put_time: 0.5074 ms (cache_engine.py:225:lmcache.v1.cache_engine)

Results with CPU + disk offload, with the optimization

Backend:                                 vllm
Traffic request rate:                    256.0
Max request concurrency:                 256
Successful requests:                     1024
Benchmark duration (s):                  92.42
Total input tokens:                      2097152
Total generated tokens:                  262144
Total generated tokens (retokenized):    262130
Request throughput (req/s):              11.08
Input token throughput (tok/s):          22691.72
Output token throughput (tok/s):         2836.47
Total token throughput (tok/s):          25528.19
Concurrency:                             252.97
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   22831.25
Median E2E Latency (ms):                 22689.21
---------------Time to First Token----------------
Mean TTFT (ms):                          2469.73
Median TTFT (ms):                        787.25
P99 TTFT (ms):                           15292.39
---------------Inter-Token Latency----------------
Mean ITL (ms):                           79.90
Median ITL (ms):                         26.43
P95 ITL (ms):                            267.05
P99 ITL (ms):                            285.64
Max ITL (ms):                            801.35

LMCache INFO: Store 2048 tokens takes: 3.6619 ms, throughput: 17.0675 GB/s; alloc_time: 0.2059 ms, offload_time: 3.6161 ms, put_time: 0.0458 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 2048 tokens takes: 3.5421 ms, throughput: 17.6447 GB/s; alloc_time: 0.2087 ms, offload_time: 3.4940 ms, put_time: 0.0481 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 2048 tokens takes: 3.6094 ms, throughput: 17.3157 GB/s; alloc_time: 0.2188 ms, offload_time: 3.5627 ms, put_time: 0.0468 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 2048 tokens takes: 4.0662 ms, throughput: 15.3705 GB/s; alloc_time: 0.2238 ms, offload_time: 4.0157 ms, put_time: 0.0506 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 2048 tokens takes: 4.8939 ms, throughput: 12.7711 GB/s; alloc_time: 0.1776 ms, offload_time: 4.8492 ms, put_time: 0.0447 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 2048 tokens takes: 4.9371 ms, throughput: 12.6593 GB/s; alloc_time: 0.1785 ms, offload_time: 4.8899 ms, put_time: 0.0471 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 256 tokens takes: 0.9619 ms, throughput: 8.1224 GB/s; alloc_time: 0.1864 ms, offload_time: 0.9429 ms, put_time: 0.0189 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 256 tokens takes: 0.9669 ms, throughput: 8.0800 GB/s; alloc_time: 0.2019 ms, offload_time: 0.9474 ms, put_time: 0.0195 ms (cache_engine.py:225:lmcache.v1.cache_engine)
LMCache INFO: Store 256 tokens takes: 0.9654 ms, throughput: 8.0925 GB/s; alloc_time: 0.2115 ms, offload_time: 0.9440 ms, put_time: 0.0214 ms (cache_engine.py:225:lmcache.v1.cache_engine)

@llc-kc
Contributor Author

llc-kc commented Jun 27, 2025

Therefore, this PR effectively optimizes disk-offload performance by reducing the put time.

Collaborator

@YaoJiayi YaoJiayi left a comment


Good work! Just left some comments/questions:)


self.loop = loop
self.put_tasks: List[CacheEngineKey] = []
self.put_tasks: OrderedDict[CacheEngineKey, MemoryObj] = OrderedDict()
Collaborator


Why ordered dict here?

Contributor Author


First, we need a dict to store both the key and its MemoryObj; second, an OrderedDict lets us process entries in put order.
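The ordering point can be shown in isolation. With an OrderedDict, popitem(last=False) yields entries in insertion order, so a worker draining the dict writes caches in the order they were put (the key/value names below are illustrative, not the PR's types):

```python
from collections import OrderedDict

put_tasks = OrderedDict()
put_tasks["key_a"] = "obj_a"
put_tasks["key_b"] = "obj_b"
put_tasks["key_c"] = "obj_c"

# Re-putting an existing key updates the value but keeps its
# original position, so the oldest put is still drained first.
put_tasks["key_a"] = "obj_a2"

drain_order = []
while put_tasks:
    key, obj = put_tasks.popitem(last=False)  # FIFO: oldest put first
    drain_order.append(key)

print(drain_order)  # ['key_a', 'key_b', 'key_c']
```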

)

def _put_task_worker(self):
while not self.closed:
Collaborator


A question: why is threading faster than asyncio here?

Collaborator


Shouldn't asyncio be lighter than threads?

Contributor Author

@llc-kc llc-kc Jul 1, 2025


The reasons this optimization works are: first, there is some processing before async_save_bytes_to_disk; second, asyncio.run_coroutine_threadsafe may add extra overhead. Your comments remind me of another solution: move all the work in submit_put_task into the async function. But that solution's performance is not good; its benchmark duration is similar to no optimization.
The code for the other solution:

    async def async_put_task(
        self,
        key: CacheEngineKey,
        memory_obj: MemoryObj,
    ) -> None:
        # Update cache recency
        evict_keys, put_status = self.evictor.update_on_put(
            self.dict, memory_obj.get_physical_size()
        )
        if put_status == PutStatus.ILLEGAL:
            return None
        # evict caches
        for evict_key in evict_keys:
            self.remove(evict_key)
        if self.lookup_server is not None:
            self.lookup_server.batched_remove(evict_keys)

        memory_obj.ref_count_up()

        self.disk_lock.acquire()
        self.put_tasks.append(key)
        self.disk_lock.release()

        await self.async_save_bytes_to_disk(key, memory_obj)

    def submit_put_task(
        self,
        key: CacheEngineKey,
        memory_obj: MemoryObj,
    ) -> Optional[Future]:
        assert memory_obj.tensor is not None

        future = asyncio.run_coroutine_threadsafe(
            self.async_put_task(key, memory_obj), self.loop
        )
        return future
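The suspected submit-side overhead of asyncio.run_coroutine_threadsafe versus a plain thread-safe queue handoff can be probed with a rough micro-benchmark. This sketch is not from the PR; the numbers are machine-dependent and only indicative:

```python
import asyncio
import queue
import threading
import timeit

# Event loop running in a background thread, as in the asyncio variant.
loop = asyncio.new_event_loop()
loop_thread = threading.Thread(target=loop.run_forever, daemon=True)
loop_thread.start()


async def noop():
    pass


q: "queue.Queue[int]" = queue.Queue()


def submit_via_asyncio() -> None:
    # Cost on the caller side: create a coroutine and schedule it
    # onto the loop thread (does not wait for it to run).
    asyncio.run_coroutine_threadsafe(noop(), loop)


def submit_via_queue() -> None:
    # Cost on the caller side: a lock acquire and a deque append.
    q.put(1)


n = 1000
asyncio_s = timeit.timeit(submit_via_asyncio, number=n)
queue_s = timeit.timeit(submit_via_queue, number=n)
print(f"run_coroutine_threadsafe: {asyncio_s / n * 1e6:.1f} us/call")
print(f"queue.put:                {queue_s / n * 1e6:.1f} us/call")

loop.call_soon_threadsafe(loop.stop)
```

On the machines I have tried, the queue handoff is considerably cheaper per call, which is consistent with the thread-based approach winning in the benchmark above.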

@llc-kc llc-kc force-pushed the llc_test branch 3 times, most recently from 36af5c0 to 64fdeac on July 2, 2025 08:14
@llc-kc
Contributor Author

llc-kc commented Jul 4, 2025

@YaoJiayi Hi, what do you think about this PR? Currently one check is failing, but I can't see the details.

@Shaoting-Feng
Contributor

@YaoJiayi Hi, what do you think about this PR? Currently one check is failing, but I can't see the details.

Please ignore this check. It is only experimental.

@Shaoting-Feng Shaoting-Feng requested a review from YaoJiayi on July 7, 2025 17:59
@llc-kc llc-kc force-pushed the llc_test branch 2 times, most recently from 875ea2e to 62960a5 on July 8, 2025 07:29
@llc-kc
Contributor Author

llc-kc commented Jul 8, 2025

The code conflict is fixed in the latest commit.

Signed-off-by: liluchang <liluchang@kingsoft.com>
