[model-gateway] Optimize WASM Runtime with Instance Pooling and Component Caching#15515

Merged
slin1237 merged 15 commits into sgl-project:main from ppraneth:bug-rou
Dec 20, 2025

@ppraneth ppraneth commented Dec 20, 2025

Motivation

I identified a significant per-request overhead in the current WASM middleware implementation within sgl-model-gateway, which acts as a bottleneck for high-throughput serving.

The two primary performance issues addressed in this PR are:

  1. Memory Allocation Overhead: The runtime currently allocates a new wasmtime::Store and linear memory (via mmap) for every single request.
  2. Compilation Overhead: The WASM component is re-compiled from raw bytes (JIT) on every request inside the worker loop.

Together, these operations add hundreds of microseconds of latency to every request (about 370 µs in the local benchmark below). This PR introduces Instance Pooling to reuse pre-allocated memory slots and LRU Component Caching to skip redundant compilation, cutting the measured middleware overhead to a few microseconds per request.
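As a rough illustration of the pooling idea only (this is not the code in this PR, which relies on wasmtime's pooling allocator), the sketch below pre-allocates a fixed number of buffers and reuses them across requests. `SlotPool`, `handle_request`, and the checksum "work" are hypothetical names invented for the example. The 20 slots and the single 64 KiB page mirror the figures quoted in this description.

```rust
// Hypothetical sketch: a fixed-size pool of reusable buffers standing in for
// pre-allocated WASM memory slots. None of these names come from the PR.
struct SlotPool {
    free: Vec<Vec<u8>>, // buffers allocated once, then handed out on demand
    slot_size: usize,
}

impl SlotPool {
    /// Allocate every slot up front so the request path never allocates.
    fn new(slots: usize, slot_size: usize) -> Self {
        let free = (0..slots).map(|_| vec![0u8; slot_size]).collect();
        SlotPool { free, slot_size }
    }

    /// Take a slot, or `None` when the pool is exhausted.
    fn acquire(&mut self) -> Option<Vec<u8>> {
        self.free.pop()
    }

    /// Zero the slot so one request never sees another's data, then return it.
    fn release(&mut self, mut slot: Vec<u8>) {
        debug_assert_eq!(slot.len(), self.slot_size);
        slot.fill(0);
        self.free.push(slot);
    }

    fn available(&self) -> usize {
        self.free.len()
    }
}

/// Simulated request: borrow a slot, use it as scratch space, give it back.
fn handle_request(pool: &mut SlotPool, payload: &[u8]) -> Option<usize> {
    let mut slot = pool.acquire()?;
    let n = payload.len().min(slot.len());
    slot[..n].copy_from_slice(&payload[..n]);
    let checksum: usize = slot[..n].iter().map(|&b| b as usize).sum();
    pool.release(slot);
    Some(checksum)
}

fn main() {
    // 20 slots of one 64 KiB WASM page each.
    let mut pool = SlotPool::new(20, 64 * 1024);
    for i in 0..1000u32 {
        assert!(handle_request(&mut pool, &i.to_le_bytes()).is_some());
    }
    println!("slots available after 1000 requests: {}", pool.available());
}
```

Because every request returns its slot, the pool never shrinks and no allocation happens after `SlotPool::new`. A real pooling allocator also has to handle exhaustion and per-instance limits, which this sketch only hints at with the `Option` return.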

Modifications

I updated sgl-model-gateway/src/wasm/runtime.rs to implement the following optimizations:

  1. Instance Pooling:

    • Integrated wasmtime::PoolingAllocationConfig into the worker loop.
    • The system now pre-allocates memory slots (configured to 20 per worker thread) to avoid expensive OS memory allocation calls during request processing.
    • Aligned memory limits (max_memory_size, max_component_instance_size) with the new pooling strategy.
  2. Smart Component Caching (LRU):

    • Replaced the naive HashMap strategy with a Least Recently Used (LRU) Cache (using the lru crate).
    • This prevents "cache stampedes" (where clearing a full cache causes a sudden latency spike) by gracefully evicting only the oldest unused modules when the limit is reached.
    • Optimized memory ownership to avoid unnecessary cloning of large WASM binaries during cache insertion.
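To make the eviction behaviour concrete, here is a minimal standard-library sketch of an LRU cache. It is an assumption-laden stand-in, not the PR's implementation: the PR uses the lru crate, while `LruCache`, `Compiled`, `get_or_compile`, and the string keys below are hypothetical. Wrapping the value in `Arc` shows one way to hand out a cached entry without copying it.

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::Arc;

/// Stand-in for a compiled component; the real type would come from wasmtime.
struct Compiled {
    code_len: usize,
}

/// Minimal LRU cache keyed by a hypothetical module identifier.
struct LruCache {
    capacity: usize,
    entries: HashMap<String, Arc<Compiled>>,
    order: VecDeque<String>, // front = least recently used
}

impl LruCache {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0);
        LruCache { capacity, entries: HashMap::new(), order: VecDeque::new() }
    }

    /// Move `key` to the most-recently-used position (O(n), fine for a sketch).
    fn touch(&mut self, key: &str) {
        if let Some(pos) = self.order.iter().position(|k| k.as_str() == key) {
            let k = self.order.remove(pos).unwrap();
            self.order.push_back(k);
        }
    }

    /// Return the cached entry, "compiling" only on a miss.
    /// `compiles` counts how many times compilation actually ran.
    fn get_or_compile(&mut self, key: &str, bytes: &[u8], compiles: &mut usize) -> Arc<Compiled> {
        if let Some(hit) = self.entries.get(key).cloned() {
            self.touch(key);
            return hit; // shared handle, no copy of the compiled artifact
        }
        *compiles += 1;
        let compiled = Arc::new(Compiled { code_len: bytes.len() });
        if self.entries.len() == self.capacity {
            // Evict only the single least recently used entry.
            if let Some(oldest) = self.order.pop_front() {
                self.entries.remove(&oldest);
            }
        }
        self.entries.insert(key.to_string(), Arc::clone(&compiled));
        self.order.push_back(key.to_string());
        compiled
    }

    fn contains(&self, key: &str) -> bool {
        self.entries.contains_key(key)
    }
}

fn main() {
    let mut cache = LruCache::new(2);
    let mut compiles = 0;
    for key in ["a", "b", "a", "c", "a", "b"] {
        let c = cache.get_or_compile(key, b"\0asm", &mut compiles);
        println!("{key}: {} bytes, compiles so far: {compiles}", c.code_len);
    }
}
```

With a capacity of 2, the hot key "a" is never evicted in this access pattern. Clearing the whole map when it fills up would instead force "a" to be recompiled as well.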

Benchmarking and Profiling

I performed a local micro-benchmark simulating 1000 sequential requests to measure the full impact of the Instance Pooling + Caching strategy.

Benchmark Configuration:

  • Scenario: Full request pipeline simulation (Compilation check + Instantiation).
  • Module: Simple WASM module requiring 1 Memory Page.
  • Iterations: 1000.

Local Results:

| Metric | Standard (Baseline) | Pooled + Cached (Optimized) | Improvement |
|---|---|---|---|
| Total Time | 370.95 ms | 6.27 ms | 59.17x speedup |
| Avg Latency | 370.95 µs | 6.27 µs | 98% reduction |

The baseline demonstrates the severe cost of re-compiling modules and re-allocating memory on every request (~370µs/op). The optimized pipeline reduces this to a negligible ~6µs/op by leveraging the pre-warmed cache and pre-allocated memory pool.
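The reported figures can be sanity-checked from the rounded totals alone. The helper names below are made up for this check, and the inputs are the two total times and the iteration count given above.

```rust
/// Average per-request latency in microseconds, given a total in milliseconds.
fn avg_latency_us(total_ms: f64, iterations: u32) -> f64 {
    total_ms * 1000.0 / iterations as f64
}

/// How many times faster the optimized run is than the baseline.
fn speedup(baseline: f64, optimized: f64) -> f64 {
    baseline / optimized
}

/// Relative latency reduction, as a percentage of the baseline.
fn reduction_percent(baseline: f64, optimized: f64) -> f64 {
    (1.0 - optimized / baseline) * 100.0
}

fn main() {
    let (baseline_ms, optimized_ms, iterations) = (370.95, 6.27, 1000);
    println!("baseline avg:  {:.2} µs", avg_latency_us(baseline_ms, iterations));
    println!("optimized avg: {:.2} µs", avg_latency_us(optimized_ms, iterations));
    println!("speedup:       {:.2}x", speedup(baseline_ms, optimized_ms));
    println!("reduction:     {:.2}%", reduction_percent(baseline_ms, optimized_ms));
}
```

From the rounded totals this works out to roughly 59.16x and a 98.31% reduction. The small gap to the reported 59.17x is presumably because that figure was computed from unrounded timings.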

@ppraneth ppraneth requested a review from slin1237 as a code owner December 20, 2025 06:12

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces significant performance optimizations to the WASM runtime by implementing instance pooling and component caching. These are excellent improvements that will reduce per-request overhead and latency. The code is well-structured and the changes are clearly explained. I have a couple of suggestions to further enhance the implementation: one is a minor optimization to avoid an unnecessary data clone during caching, and the other is a recommendation for a more robust cache eviction strategy to handle high-load scenarios more gracefully. Overall, this is a very valuable contribution.

Comment thread sgl-model-gateway/src/wasm/runtime.rs Outdated
Comment thread sgl-model-gateway/src/wasm/runtime.rs Outdated
ppraneth and others added 4 commits December 20, 2025 21:05
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

@slin1237 slin1237 left a comment

The unit tests should be moved into the wasm unit tests and should define a threshold.
Also clean up all the prints.
Otherwise, code LGTM.

@ppraneth

> The unit tests should be moved into the wasm unit tests and should define a threshold.
> Also clean up all the prints.
> Otherwise, code LGTM.

Yea will do it right away

@ppraneth ppraneth requested a review from slin1237 December 20, 2025 19:41

ppraneth commented Dec 20, 2025

Added a test case in sgl-model-gateway/src/wasm/runtime.rs.
@slin1237 can you check again and tell me if any changes are needed?

@slin1237 slin1237 merged commit 537ef18 into sgl-project:main Dec 20, 2025
60 checks passed
jiaming1130 pushed a commit to zhuyijie88/sglang that referenced this pull request Dec 25, 2025
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026