[RFC] Parallel get to improve TTFT a bit #863
Conversation
👍 I/O concurrency is extremely valuable for maximizing the throughput of any storage, because it hides roundtrip delays, and I agree it is notably missing from LMCache at the moment. For filesystem I/O, aiofiles already uses a threadpool internally (used by LocalDiskBackend and FSConnector), so if we could make sure the engine (both layerwise and blockwise) uses async, that would be great. For network I/O, asyncio is typically considered most efficient; however, it has a hidden throughput limit: a single thread running an async event loop eventually caps out at ~3GB/s due to the overhead of actually copying data to/from the socket. In some cases this can be sufficient, because there are multiple separate LMCache instances and each one adds another ~3GB/s to the total throughput, but when each GPU requires more KV throughput, a single thread will limit the ability to saturate the GPU. For which backend did you see TTFT improvement?
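The roundtrip-hiding benefit described above can be illustrated with a minimal, self-contained sketch; `fake_get` and the 50 ms delay are hypothetical stand-ins for a storage-backend roundtrip, not LMCache code:

```python
import asyncio
import time


async def fake_get(key):
    # Simulate one storage roundtrip of ~50 ms.
    await asyncio.sleep(0.05)
    return key


async def sequential(keys):
    # Each roundtrip waits for the previous one to finish.
    return [await fake_get(k) for k in keys]


async def concurrent(keys):
    # All roundtrips are in flight at once; delays overlap.
    return await asyncio.gather(*(fake_get(k) for k in keys))


def timed(coro):
    t0 = time.monotonic()
    result = asyncio.run(coro)
    return result, time.monotonic() - t0


keys = list(range(8))
_, t_seq = timed(sequential(keys))
_, t_con = timed(concurrent(keys))
print(f"sequential: {t_seq:.2f}s, concurrent: {t_con:.2f}s")
# Roundtrips overlap: roughly 0.4s sequential vs. roughly 0.05s concurrent.
```

The concurrent variant hides seven of the eight roundtrips, which is the effect the parallel get in this PR is after.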
@guymguym I tested
@da-x Could you fix the code quality check first?
For remote connectors, we’d love to have something like
force-pushed from 9667428 to 704e558
@guymguym @xiaguan
```python
get_thread_pool(
    self.config.parallel_blocking_get_thread_count
).submit(self.storage_manager.get, key)
```
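The PR snippet above calls a `get_thread_pool` helper whose implementation is not shown here. A minimal sketch of how such a lazily-initialized, shared-per-size pool might look (the names and behavior are assumptions, not the PR's actual code):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One shared pool per requested size, created on first use.
_pools = {}
_pools_lock = threading.Lock()


def get_thread_pool(num_threads):
    # Lazily create the pool so repeated calls reuse the same executor
    # instead of spawning new threads on every get.
    with _pools_lock:
        pool = _pools.get(num_threads)
        if pool is None:
            pool = ThreadPoolExecutor(max_workers=num_threads)
            _pools[num_threads] = pool
        return pool
```

Caching the executor matters: constructing a fresh `ThreadPoolExecutor` per request would pay thread-startup cost on the latency-critical get path.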
@da-x I wonder if it's needed to separate this threadpool from the default executor that asyncio/aiofiles already initializes lazily? See these docs:
> awaitable loop.run_in_executor(executor, func, *args)
>
> The executor argument should be an concurrent.futures.Executor instance. The default executor is used if executor is None. The default executor can be set by loop.set_default_executor(), otherwise, a concurrent.futures.ThreadPoolExecutor will be lazy-initialized and used by run_in_executor() if needed.
>
> class concurrent.futures.ThreadPoolExecutor(max_workers=None, thread_name_prefix='', initializer=None, initargs=())
>
> Changed in version 3.13: Default value of max_workers is changed to min(32, (os.process_cpu_count() or 1) + 4).
We might still want to add a general config option to override the default executor's number of workers...
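For reference, overriding the default executor that `run_in_executor(None, ...)` uses is a small amount of code with `loop.set_default_executor`. A minimal sketch; `blocking_read` is a hypothetical stand-in for a blocking disk read:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def blocking_read():
    # Hypothetical stand-in for a blocking filesystem read.
    return "data"


async def main(max_workers):
    loop = asyncio.get_running_loop()
    if max_workers is not None:
        # Replace asyncio's lazily-created default executor with a
        # pool of the configured size.
        loop.set_default_executor(ThreadPoolExecutor(max_workers=max_workers))
    # executor=None -> use the (possibly overridden) default executor.
    return await loop.run_in_executor(None, blocking_read)


print(asyncio.run(main(16)))  # prints "data"
```

This would let one config knob size the pool used by both aiofiles and any other `run_in_executor(None, ...)` callers.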
> @da-x I wonder if it's needed to separate this threadpool from the default executor that asyncio/aiofiles already initialized lazy? see these docs -
- The threads created for the threadpool are not connected to the current default loop. They have no loop, unless they decide to create it or bind it specifically. See the example program.
- If 'blocking get' would use an existing loop to which we have a handle, say `self.loop`, then we would have an issue here, but I think `self.loop` is only being used for put and prefetch. You can check this too.
- To avoid adding another config option, maybe we can allow a `"default"` value for using the Python default instead of passing a number, such that the type is `string | number | None`?
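The proposed `string | number | None` config type could be normalized in one place before any executor is built. A hypothetical sketch; `resolve_thread_count` is illustrative, not part of the PR:

```python
from typing import Optional, Union


def resolve_thread_count(value: Union[str, int, None]) -> Optional[int]:
    # None or "default" -> fall back to asyncio's lazily-created default
    # executor (i.e. pass executor=None to run_in_executor).
    # A positive integer -> size of a dedicated pool.
    if value is None or value == "default":
        return None
    if isinstance(value, int) and value > 0:
        return value
    raise ValueError(f"invalid thread count: {value!r}")
```

This keeps validation out of the hot path and rejects nonsense values (0, negatives, unknown strings) at config-load time.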
```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def threaded_f():
    loop = asyncio.get_running_loop()
    # This is a different loop than in `main`.
    print("loop in threaded_f", id(loop))


def has_no_async_loop():
    # NOTE: calling asyncio.get_running_loop() here would raise
    # "RuntimeError: no running event loop".
    asyncio.run(threaded_f())


async def main():
    loop = asyncio.get_running_loop()
    print("loop in main", id(loop))
    futures = []
    tp = ThreadPoolExecutor(max_workers=1)
    for x in range(1):
        futures.append(tp.submit(has_no_async_loop))
    _results = [f.result() for f in futures]


if __name__ == '__main__':
    asyncio.run(main())
```
@da-x Could you resolve the conflict?
Hi @da-x, I tested this PR with
There is a new

@chunxiaozheng what backend were you using?
force-pushed from c2a841e to 8b6b0f8
We saw a 10% improvement in TTFT by allowing `blocking_get` to be served from a threadpool. This adds the ability to enable this threadpool and to specify its size in threads. Signed-off-by: Dan Aloni <dan@kernelim.com>
force-pushed from 8b6b0f8 to a70cc96
This pull request has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.
This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it! |
We saw a 10% improvement in TTFT by allowing `blocking_get` to be served from a threadpool. This adds the ability to enable this threadpool and to specify its size in threads.
I'm not expecting this to be merged as-is... I am putting this here more to open a discussion.
Some points:
- There is `LayerwiseLMCacheEngine`, which uses the async APIs for fetching KV cache, so adding a thread pool may seem somewhat redundant. However, after some experimentation I see it is either not fully stable or does not improve latency over the original `LMCacheEngine`. Even once `LayerwiseLMCacheEngine` works well, I think this change still has value, because it would allow a more robust comparison of the performance difference between the two engines.
- The config option is named `parallel_blocking_get_thread_count`. Please suggest.
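To make the discussion concrete, here is a sketch of what serving blocking gets from a threadpool might look like; `FakeStorageManager` and `parallel_blocking_get` are illustrative stand-ins, not LMCache's actual classes:

```python
from concurrent.futures import ThreadPoolExecutor


class FakeStorageManager:
    # Stand-in for a storage manager whose .get blocks on I/O.
    def __init__(self, data):
        self.data = data

    def get(self, key):
        return self.data[key]


def parallel_blocking_get(storage_manager, keys, thread_count):
    # Submit one blocking get per key, then collect results in key order.
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        futures = [pool.submit(storage_manager.get, k) for k in keys]
        return [f.result() for f in futures]


sm = FakeStorageManager({"a": 1, "b": 2, "c": 3})
print(parallel_blocking_get(sm, ["a", "b", "c"], thread_count=2))
# [1, 2, 3]
```

With `thread_count` blocking gets in flight at once, roundtrip delays overlap, which is where the TTFT improvement would come from.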