[model-gateway] Optimize WASM Runtime with Instance Pooling and Component Caching#15515
slin1237 merged 15 commits into sgl-project:main from
Conversation
Code Review
This pull request introduces significant performance optimizations to the WASM runtime by implementing instance pooling and component caching. These are excellent improvements that will reduce per-request overhead and latency. The code is well-structured and the changes are clearly explained. I have a couple of suggestions to further enhance the implementation: one is a minor optimization to avoid an unnecessary data clone during caching, and the other is a recommendation for a more robust cache eviction strategy to handle high-load scenarios more gracefully. Overall, this is a very valuable contribution.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
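The reviewer's two suggestions — returning cached data without a deep clone, and evicting entries under load instead of growing unboundedly — can be sketched with std-only types. `ComponentCache` and its `String`-keyed `Vec<u8>` payloads are hypothetical stand-ins: the gateway's real cache holds compiled wasmtime components and uses the `lru` crate rather than this hand-rolled eviction.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical sketch: entries live behind `Arc`, so a cache hit clones a
// pointer rather than the component bytes, and inserts evict the
// least-recently-used key once the cache reaches capacity.
struct ComponentCache {
    capacity: usize,
    entries: HashMap<String, Arc<Vec<u8>>>,
    // Keys ordered from least to most recently used.
    order: Vec<String>,
}

impl ComponentCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, entries: HashMap::new(), order: Vec::new() }
    }

    fn get(&mut self, key: &str) -> Option<Arc<Vec<u8>>> {
        let hit = self.entries.get(key).cloned(); // clones the Arc, not the data
        if hit.is_some() {
            self.touch(key);
        }
        hit
    }

    fn insert(&mut self, key: String, component: Vec<u8>) {
        if !self.entries.contains_key(&key) && self.entries.len() == self.capacity {
            // Evict the least-recently-used entry instead of growing.
            let lru = self.order.remove(0);
            self.entries.remove(&lru);
        }
        self.entries.insert(key.clone(), Arc::new(component));
        self.touch(&key);
    }

    fn touch(&mut self, key: &str) {
        self.order.retain(|k| k.as_str() != key);
        self.order.push(key.to_string());
    }
}
```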
slin1237 left a comment
The unit tests should be moved into the WASM unit tests and given a defined threshold,
and all the prints should be cleaned up.
Otherwise, code LGTM.
Yeah, will do it right away.
Added a test case in the
Motivation
I identified a significant per-request overhead in the current WASM middleware implementation within sgl-model-gateway, which acts as a bottleneck for high-throughput serving. The two primary performance issues addressed in this PR are:

1. Re-compiling the WASM component for every request.
2. Allocating a fresh wasmtime::Store and linear memory (via mmap) for every single request.

These operations add milliseconds of latency to every request. This PR introduces Instance Pooling to reuse memory slots and LRU Component Caching to skip redundant compilation, ensuring middleware execution remains near-zero cost.
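The pooling idea can be illustrated with a std-only sketch. `SlotPool` and its buffer-based slots are hypothetical stand-ins: the actual change relies on wasmtime's pooling allocator and real linear memories, not plain `Vec<u8>` buffers.

```rust
// Hypothetical illustration of instance pooling: a fixed set of memory
// slots is allocated once up front and handed out/returned per request,
// so the hot path never pays for a fresh allocation.
struct SlotPool {
    free: Vec<Vec<u8>>,
}

impl SlotPool {
    /// Pre-allocate `slots` buffers of `slot_size` bytes each.
    fn new(slots: usize, slot_size: usize) -> Self {
        Self { free: (0..slots).map(|_| vec![0u8; slot_size]).collect() }
    }

    /// Take a pre-allocated slot; `None` means the pool is exhausted.
    fn acquire(&mut self) -> Option<Vec<u8>> {
        self.free.pop()
    }

    /// Return a slot for reuse, zeroing it so the next request starts clean.
    fn release(&mut self, mut slot: Vec<u8>) {
        slot.fill(0);
        self.free.push(slot);
    }
}
```

A bounded pool also gives natural backpressure: when every slot is checked out, `acquire` returns `None` rather than allocating more memory.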
Modifications
I updated sgl-model-gateway/src/wasm/runtime.rs to implement the following optimizations:

Instance Pooling:
- Integrated wasmtime::PoolingAllocationConfig into the worker loop.
- Configured the limits (max_memory_size, max_component_instance_size) for the new pooling strategy.

Smart Component Caching (LRU):
- Replaced the unbounded HashMap strategy with a Least Recently Used (LRU) cache (using the lru crate).

Benchmarking and Profiling
I performed a local micro-benchmark simulating 1000 sequential requests to measure the full impact of the Instance Pooling + Caching strategy.
Benchmark Configuration:
Local Results:
The baseline demonstrates the severe cost of re-compiling modules and re-allocating memory on every request (~370µs/op). The optimized pipeline reduces this to a negligible ~6µs/op by leveraging the pre-warmed cache and pre-allocated memory pool.
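A harness in the same spirit as this micro-benchmark can be written with std-only timing; the shape below is a sketch, with `run_request` standing in for a middleware invocation rather than the gateway's actual call path.

```rust
use std::time::Instant;

// Hypothetical micro-benchmark shape: time `iters` sequential "requests"
// against `run_request` and report the mean latency in microseconds per op.
fn bench<F: FnMut()>(iters: u32, mut run_request: F) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        run_request();
    }
    start.elapsed().as_secs_f64() * 1e6 / iters as f64
}
```

Running `bench(1000, ...)` once against a cold pipeline and once against a pre-warmed cache and pool is enough to reproduce the before/after comparison described above.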
Checklist