feat: provider resource limiter #2117
Merged
- Introduced a new ResourceLimiter to manage concurrent executions based on method type and memory usage.
- Added configuration options for heavy and normal request buckets, including max concurrent executions, memory per call, and queue sizes.
- Integrated resource limiting into the RPCProviderServer to prevent OOM errors from high-CU requests.
- Added Prometheus metrics for monitoring request rejections, queue sizes, and memory usage.
- Enhanced setup to allow enabling/disabling the resource limiter via configuration flags.

- Introduced a new test file for ResourceLimiter, covering various scenarios including disabled state, bucket selection based on compute units, concurrency limits, queue timeouts, context cancellation, and memory limits.
- Implemented tests for execution errors and metrics tracking to ensure accurate monitoring of request handling.
- Added benchmarks for performance evaluation under different load conditions.
- Enhanced test coverage for mixed request types and memory reservation release.
- Verified error messages for queue full and max concurrent scenarios.

- Added debug logging for queue depth upon request enqueuing and rejection due to full queue.
- Improved memory monitoring logs to include current queue depths for heavy and normal buckets.
- These enhancements facilitate better tracking and debugging of resource management behavior in the system.

- Modified the call to ServeRPCRequests by adding an additional nil parameter to improve compatibility with recent changes in the RPC provider server implementation.
- This adjustment ensures that the function signature aligns with the latest updates, maintaining the integrity of the test suite.

- Updated goroutine closures in the ResourceLimiter tests to use the loop variable directly, ensuring proper indexing for concurrent requests.
- Simplified the memory threshold check in the ResourceLimiter implementation for clarity and efficiency.
Test Results
3 089 tests (+23), 3 087 ✅ (+22), 26m 59s ⏱️ (+3m 21s)
For more details on these failures, see this check.
Results for commit 8e45a1a. ± Comparison against base commit 1bf05bc.
…nto provider_resource_limiter
- Updated goroutine closures in the ResourceLimiter tests to pass the loop index as a parameter, ensuring accurate indexing for concurrent requests.
- This change prevents data races and ensures that results are stored correctly in the results slices during concurrent execution.
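The pattern this commit describes can be illustrated with a minimal sketch. The helper name `runConcurrent` and the work function are hypothetical and are not taken from the PR's test file; the point is only that the loop index is handed to each goroutine as an argument, so every goroutine writes to its own slot of the results slice.

```go
package main

import (
	"fmt"
	"sync"
)

// runConcurrent (hypothetical helper) launches n goroutines and stores each
// result at its own index. The loop index is passed as an argument, so each
// goroutine works on its own copy regardless of the Go version's
// loop-variable semantics.
func runConcurrent(n int, work func(i int) int) []int {
	results := make([]int, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(idx int) {
			defer wg.Done()
			results[idx] = work(idx) // each goroutine writes a distinct element
		}(i)
	}
	wg.Wait()
	return results
}

func main() {
	squares := runConcurrent(5, func(i int) int { return i * i })
	fmt.Println(squares) // prints [0 1 4 9 16]
}
```

Because each goroutine touches a different element and the caller reads only after `wg.Wait()`, no extra locking is needed around the slice.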
- Updated the Acquire method in ResourceLimiter to include a nil check for the ResourceLimiter instance, ensuring robustness when the instance is not initialized.
- This change prevents potential nil pointer dereference errors and maintains the intended functionality when the resource limiter is disabled.
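As a rough illustration of a nil-safe method, the sketch below uses a hypothetical `limiter` type; the real ResourceLimiter's fields and the exact signature of Acquire are not shown in this PR page, so everything here is an assumption. In Go, a method with a pointer receiver can be called on a nil pointer, which lets the method itself decide to fall through to direct execution.

```go
package main

import (
	"context"
	"fmt"
)

// limiter is a hypothetical stand-in for the PR's ResourceLimiter.
type limiter struct {
	enabled bool
	slots   chan struct{} // buffered channel used as a counting semaphore
}

// Acquire runs fn directly when the limiter is nil or disabled; otherwise it
// waits for a free slot or for the context to end. The signature is assumed.
func (l *limiter) Acquire(ctx context.Context, fn func() error) error {
	if l == nil || !l.enabled {
		return fn() // no limiter configured: execute without limiting
	}
	select {
	case l.slots <- struct{}{}:
		defer func() { <-l.slots }()
		return fn()
	case <-ctx.Done():
		return ctx.Err()
	}
}

// runOnNil calls Acquire on a nil *limiter and reports whether fn ran cleanly.
func runOnNil() bool {
	var l *limiter // never initialized
	ran := false
	err := l.Acquire(context.Background(), func() error {
		ran = true
		return nil
	})
	return ran && err == nil
}

func main() {
	fmt.Println(runOnNil()) // prints true
}
```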
…ency and queue size
- Updated the NewResourceLimiter function to accept additional parameters for heavy and normal bucket configurations, including maximum concurrent requests and queue sizes.
- Enhanced the SetupEndpoint method to retrieve these new configuration values from viper, ensuring flexibility in resource management settings.
- This change improves the adaptability of the resource limiter to varying load conditions and usage patterns.

…ation
- Added a new struct for resource limiter options to encapsulate configuration parameters such as memory threshold, CU threshold, and concurrency settings.
- Updated the RPCProvider to utilize the new resource limiter options, enhancing the flexibility of resource management.
- Modified the SetupEndpoint method to retrieve resource limiter settings from the new struct, streamlining the initialization process.

- Enhanced the Relay method in RPCProviderServer to ensure proper session cleanup when a request is rejected by the resource limiter.
- Added a call to OnSessionFailure to unlock the session, rollback CU deltas, and prevent session leaks.
- Improved error logging for cleanup failures to aid in debugging and monitoring.

- Updated the ResourceLimiter to accept an endpoint name during initialization, allowing for differentiated metrics per endpoint.
- Modified the NewResourceLimiter and related functions to incorporate the endpoint name, improving metric tracking and clarity.
- Adjusted tests to reflect the new parameter, ensuring comprehensive coverage of the updated functionality.

- Added a check in the ResourceLimiter's processQueue method to skip requests that have already been canceled or timed out.
- Implemented logging for skipped requests to aid in debugging and monitoring of request handling.
- This change improves the robustness of the resource limiter by preventing the execution of invalid requests.
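The skip-on-cancel check might look roughly like the sketch below. The `queuedRequest` type and `drainQueue` function are hypothetical; the PR's actual processQueue is not shown here and presumably executes requests rather than just sorting them. The sketch only demonstrates testing `ctx.Err()` before doing any work on a queued entry.

```go
package main

import (
	"context"
	"fmt"
)

// queuedRequest is a hypothetical queue entry: the caller's context plus a
// name used for logging.
type queuedRequest struct {
	ctx  context.Context
	name string
}

// drainQueue walks the queue in order and separates requests that are still
// live from those whose context was already canceled or timed out.
func drainQueue(queue []queuedRequest) (run, skipped []string) {
	for _, req := range queue {
		if req.ctx.Err() != nil {
			skipped = append(skipped, req.name) // caller gave up; do not execute
			continue
		}
		run = append(run, req.name)
	}
	return run, skipped
}

// demo builds a three-entry queue in which the middle request is canceled.
func demo() (run, skipped []string) {
	live := context.Background()
	canceled, cancel := context.WithCancel(context.Background())
	cancel()
	return drainQueue([]queuedRequest{
		{ctx: live, name: "a"},
		{ctx: canceled, name: "b"},
		{ctx: live, name: "c"},
	})
}

func main() {
	run, skipped := demo()
	fmt.Println(run, skipped) // prints [a c] [b]
}
```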
AnnaR-prog
approved these changes
Nov 27, 2025
Provider Resource Limiter
Summary
Implements a semaphore-based resource limiter for RPC providers to prevent Out-of-Memory (OOM) crashes caused by concurrent high-resource requests. The limiter uses a two-tier bucket system ("heavy" vs "normal") with different concurrency limits and queuing strategies.
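A two-tier semaphore of this kind can be sketched with buffered channels. This is a simplified illustration under stated assumptions, not the PR's implementation: the type and function names are hypothetical, the "CU at or above the threshold counts as heavy" rule is assumed, and the sketch rejects immediately when a bucket is full instead of queuing.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

var errBusy = errors.New("bucket at max concurrency")

// bucket is a counting semaphore built on a buffered channel.
type bucket struct{ slots chan struct{} }

func newBucket(maxConcurrent int) *bucket {
	return &bucket{slots: make(chan struct{}, maxConcurrent)}
}

// tryAcquire takes a slot without blocking, or reports errBusy.
func (b *bucket) tryAcquire() error {
	select {
	case b.slots <- struct{}{}:
		return nil
	default:
		return errBusy
	}
}

// release frees a slot previously taken with tryAcquire.
func (b *bucket) release() { <-b.slots }

// twoTier holds the two buckets and the CU threshold (all names assumed).
type twoTier struct {
	heavy, normal *bucket
	cuThreshold   uint64
}

// pick routes debug_*/trace_* methods and high-CU calls to the heavy bucket.
// Treating cu >= cuThreshold as heavy is an assumption of this sketch.
func (t *twoTier) pick(method string, cu uint64) *bucket {
	if strings.HasPrefix(method, "debug_") ||
		strings.HasPrefix(method, "trace_") ||
		cu >= t.cuThreshold {
		return t.heavy
	}
	return t.normal
}

func main() {
	lim := &twoTier{heavy: newBucket(2), normal: newBucket(100), cuThreshold: 100}
	for i := 0; i < 3; i++ {
		// The third heavy request finds both slots taken.
		fmt.Println(i, lim.pick("debug_traceTransaction", 10).tryAcquire())
	}
}
```

Keeping the buckets separate means a burst of heavy requests can exhaust only the heavy slots, leaving normal traffic unaffected.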
Problem
Provider servers experience OOM crashes when handling multiple concurrent resource-intensive requests (e.g., debug_traceTransaction, high-CU eth_call).
Solution
Method Classification
- debug_* and trace_* methods → "heavy"
Resource Limits
Memory Protection
Changes
New Files
- protocol/rpcprovider/resource_limiter.go (548 lines) - Core implementation
- protocol/rpcprovider/resource_limiter_test.go (731 lines) - 14 tests + 3 benchmarks
Modified Files
- protocol/rpcprovider/rpcprovider.go - Added 6 CLI flags and initialization
- protocol/rpcprovider/rpcprovider_server.go - Integrated limiter into relay execution
Configuration
lavap rpcprovider config.yml \
  --enable-resource-limiter=true \
  --resource-limiter-memory-gb=8 \
  --resource-limiter-cu-threshold=100 \
  --heavy-max-concurrent=2 \
  --heavy-queue-size=5 \
  --normal-max-concurrent=100 \
  --from mykey
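The six limiter flags could be gathered into a single options value along the lines of the sketch below. It uses the standard library `flag` package purely for illustration, whereas the commit notes say the provider reads its settings through viper. The struct name, field names, field types, and zero defaults are all assumptions; only the flag names come from the command above.

```go
package main

import (
	"flag"
	"fmt"
)

// limiterOptions is a hypothetical options struct; field types are assumed.
type limiterOptions struct {
	Enabled             bool
	MemoryGB            uint64
	CUThreshold         uint64
	HeavyMaxConcurrent  int
	HeavyQueueSize      int
	NormalMaxConcurrent int
}

// parseLimiterOptions parses only the limiter flags. Defaults are zero values
// here because the real defaults are not stated in the PR description.
func parseLimiterOptions(args []string) (limiterOptions, error) {
	var o limiterOptions
	fs := flag.NewFlagSet("rpcprovider", flag.ContinueOnError)
	fs.BoolVar(&o.Enabled, "enable-resource-limiter", false, "enable the resource limiter")
	fs.Uint64Var(&o.MemoryGB, "resource-limiter-memory-gb", 0, "memory threshold in GB")
	fs.Uint64Var(&o.CUThreshold, "resource-limiter-cu-threshold", 0, "CU threshold for the heavy bucket")
	fs.IntVar(&o.HeavyMaxConcurrent, "heavy-max-concurrent", 0, "max concurrent heavy requests")
	fs.IntVar(&o.HeavyQueueSize, "heavy-queue-size", 0, "queue size for heavy requests")
	fs.IntVar(&o.NormalMaxConcurrent, "normal-max-concurrent", 0, "max concurrent normal requests")
	err := fs.Parse(args)
	return o, err
}

func main() {
	opts, err := parseLimiterOptions([]string{
		"--enable-resource-limiter=true",
		"--resource-limiter-memory-gb=8",
		"--resource-limiter-cu-threshold=100",
		"--heavy-max-concurrent=2",
		"--heavy-queue-size=5",
		"--normal-max-concurrent=100",
	})
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%+v\n", opts)
}
```

Passing one struct rather than six positional parameters keeps the constructor signature stable if more limits are added later.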
Metrics
- lava_provider_resource_limiter_rejections_total{bucket, reason}
- lava_provider_resource_limiter_queued_total{bucket}
- lava_provider_resource_limiter_timeouts_total{bucket}
- lava_provider_resource_limiter_in_flight{bucket}
- lava_provider_resource_limiter_memory_bytes
- lava_provider_resource_limiter_queue_wait_seconds{bucket}