[optimize] boost local_disk_backend submit_put_task performance#912
llc-kc wants to merge 1 commit into LMCache:dev
Conversation
Force-pushed from 85e71b2 to 49f3715.
@llc-kc Can you show some numbers?
@YaoJiayi here are the benchmark method and results:

Benchmark method:
- launch vllm serving: lmcache-config.yaml
- run the benchmark

Results:
- no offload
- CPU offload only
- CPU + DISK offload, no optimization
- CPU + DISK offload, with optimization
Therefore, this PR effectively optimizes the performance of disk offloading by reducing the put time.
YaoJiayi left a comment:
Good work! Just left some comments/questions :)
```diff
  self.loop = loop
- self.put_tasks: List[CacheEngineKey] = []
+ self.put_tasks: OrderedDict[CacheEngineKey, MemoryObj] = OrderedDict()
```
First, we need a dict to store both the key and the MemoryObj; second, it lets us store entries in put order.
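For illustration, here is a minimal, self-contained sketch of what the OrderedDict buys over the old List[CacheEngineKey]; the str/bytes stand-ins and the popitem(last=False) draining pattern are assumptions for the example, not the PR's exact code:

```python
from collections import OrderedDict

# Hypothetical stand-ins for CacheEngineKey / MemoryObj.
put_tasks: OrderedDict[str, bytes] = OrderedDict()

# Producer side: record both the key and the object, in put order.
put_tasks["key-1"] = b"obj-1"
put_tasks["key-2"] = b"obj-2"

# A lookup can now find the in-flight object by key; the old
# List[CacheEngineKey] could only answer "is this key pending?".
assert put_tasks["key-1"] == b"obj-1"

# Consumer side: popitem(last=False) drains in FIFO (put) order.
key, obj = put_tasks.popitem(last=False)
assert key == "key-1"
```

A plain dict also preserves insertion order in modern Python, but OrderedDict makes the FIFO intent explicit and provides popitem(last=False) for draining from the front.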
```diff
+ def _put_task_worker(self):
+     while not self.closed:
```
A question: why is threading faster than asyncio?
Shouldn't asyncio be lighter than threads?
The reasons why this optimization works are: first, there is some processing before async_save_bytes_to_disk; second, asyncio.run_coroutine_threadsafe may have extra overhead. Your comments remind me of another solution: just move all the work in submit_put_task into the async function. But that solution's performance is not good; its benchmark duration is similar to no optimization.
The other solution's code:
```python
async def async_put_task(
    self,
    key: CacheEngineKey,
    memory_obj: MemoryObj,
) -> None:
    # Update cache recency
    evict_keys, put_status = self.evictor.update_on_put(
        self.dict, memory_obj.get_physical_size()
    )
    if put_status == PutStatus.ILLEGAL:
        return None
    # Evict caches
    for evict_key in evict_keys:
        self.remove(evict_key)
    if self.lookup_server is not None:
        self.lookup_server.batched_remove(evict_keys)
    memory_obj.ref_count_up()
    self.disk_lock.acquire()
    self.put_tasks.append(key)
    self.disk_lock.release()
    await self.async_save_bytes_to_disk(key, memory_obj)

def submit_put_task(
    self,
    key: CacheEngineKey,
    memory_obj: MemoryObj,
) -> Optional[Future]:
    assert memory_obj.tensor is not None
    future = asyncio.run_coroutine_threadsafe(
        self.async_put_task(key, memory_obj), self.loop
    )
    return future
```
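To make the overhead question concrete, below is a hedged micro-benchmark sketch (not from the PR) comparing the cross-thread dispatch cost of asyncio.run_coroutine_threadsafe against a plain queue.Queue handoff to a dedicated thread; absolute numbers depend on the machine, but the relative gap illustrates why a thread-based worker can win on the submit path:

```python
import asyncio
import queue
import threading
import time

N = 10_000

# --- asyncio path: dispatch a no-op coroutine from another thread ---
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

async def noop() -> None:
    pass

start = time.perf_counter()
futures = [asyncio.run_coroutine_threadsafe(noop(), loop) for _ in range(N)]
for f in futures:
    f.result()  # wait, so we measure the full round trip
asyncio_s = time.perf_counter() - start

# --- thread path: hand the task to a worker via a queue ---
q: "queue.Queue[object]" = queue.Queue()
done = threading.Event()

def worker() -> None:
    for _ in range(N):
        q.get()
    done.set()

threading.Thread(target=worker, daemon=True).start()
start = time.perf_counter()
for i in range(N):
    q.put(i)
done.wait()
queue_s = time.perf_counter() - start

print(f"run_coroutine_threadsafe: {asyncio_s:.3f}s, queue handoff: {queue_s:.3f}s")
```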
Force-pushed from 36af5c0 to 64fdeac.
@YaoJiayi Hi, what do you think about this PR? Currently, a check is failing, but I can't see the details.
Please ignore this check. It is only experimental.
Force-pushed from 875ea2e to 62960a5.
The code conflict is fixed in the latest commit.
Signed-off-by: liluchang <liluchang@kingsoft.com>
The cache store execution time mainly consists of three parts: alloc, CPU-GPU copy, and submit_put_task.
The existing local_disk_backend submit_put_task does a lot of work that typically takes more than 1 ms to execute, which significantly reduces throughput. In the vLLM serving benchmark, the CPU + local disk offload duration is typically longer than CPU offload only.
This PR uses a thread to do the put task, letting submit_put_task return immediately. As a result, the vLLM serving benchmark duration with CPU + local disk offload is similar to CPU offload only.
Note that submit_put_task no longer returns a Future, but the return value is not actually used anywhere, so returning None is fine.
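For reference, here is a minimal sketch of the resulting pattern. The names put_tasks, disk_lock, and _put_task_worker follow the diff hunks above, but the condition-variable wakeup, the stand-in str/bytes types, and the method bodies are illustrative assumptions, not the PR's exact code:

```python
import threading
from collections import OrderedDict

class LocalDiskBackendSketch:
    def __init__(self) -> None:
        self.put_tasks: "OrderedDict[str, bytes]" = OrderedDict()
        self.disk_lock = threading.Lock()
        self.has_tasks = threading.Condition(self.disk_lock)
        self.closed = False
        # Dedicated worker thread, so submit_put_task never blocks the caller.
        threading.Thread(target=self._put_task_worker, daemon=True).start()

    def submit_put_task(self, key: str, memory_obj: bytes) -> None:
        # Enqueue under the lock and return immediately; no Future is returned.
        with self.has_tasks:
            self.put_tasks[key] = memory_obj
            self.has_tasks.notify()

    def _put_task_worker(self) -> None:
        while not self.closed:
            with self.has_tasks:
                while not self.put_tasks and not self.closed:
                    self.has_tasks.wait()
                if self.closed:
                    return
                # Drain in put (FIFO) order.
                key, memory_obj = self.put_tasks.popitem(last=False)
            # Slow disk I/O happens here, off the caller's hot path.
            self._save_bytes_to_disk(key, memory_obj)

    def _save_bytes_to_disk(self, key: str, memory_obj: bytes) -> None:
        # Placeholder for the real disk write.
        pass
```

Usage-wise, callers simply invoke submit_put_task(key, obj) and continue; the disk write completes asynchronously on the worker thread.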