feat: introducing new WRS algorithm#2224
Conversation
d504dc6 to
7bef701
Compare
45b6f32 to
ada3215
Compare
027eb50 to
c04c913
Compare
b35cc46 to
8d10d7d
Compare
…e normalization, metrics, and test tooling This squashes the full `wrs_algo_feature` history into a single change-set on top of `origin/main`. - Replace legacy tier-based provider selection with a weighted-random selection (WRS) path, including: - stake-based weighting, normalization tweaks, and guardrails against invalid/NaN weights - block-availability / requested-block gating and related tuning - deterministic RNG hooks and stronger concurrency/edge-case test coverage - Add Phase-2 adaptive normalization for latency/sync parameters (P10–P90 + global T-Digest), plus config/tuning changes. - Expand consumer/provider selection observability: - richer selection stats and logging (incl. NaN/Inf sanitization) - missing Prometheus metric registrations + latest-provider-block metric fixes - selection-via-header support and related endpoint/header plumbing - Improve rpcprovider test-mode behavior for deterministic latest_block/head-gap and availability failure simulation; preserve test LatestBlock behavior. - Add/extend scripts and local tooling for WRS analysis and reproducible test scenarios: - WRS analysis helpers + e2e scripts/test data - `wrs_tests/` local framework (stake/latency/availability/sync) and analyzers - Misc hardening/cleanup: - avoid nil ParsedInput panic in parser flow - log timestamp precision tweaks, gofmt, and assorted test/config updates - remove comments - refactor: remove provider optimizer exploration feature - wrs_tests: add local WRS test framework and static stake support - feat(analyze): add log analysis script for provider selection metrics - feat(setup): add init script for static and backup provider configuration - refactor(provideroptimizer): add context parameter to provider selection methods - consumer: add probe-update-weight CLI flag - refactor(weighted_selector): enhance total stake calculation for consistent normalization - Enable nanosecond precision in log timestamps - feat(wrs): Fix Test 3 sync testing with deterministic baseline approach - provideroptimizer: remove block-availability penalty - feat: Add support for provider selection via header in session management - fix: simplify consumer startup message in init script - fix: lava_consumer_latest_provider_block metric showing 0 - fix: Register missing metrics (latestBlockMetric, qosMetric, providerReputationMetric) with Prometheus - feat: Enable QoS optimizer improvements in init script - feat(provider_setup): enhance ETH RPC configuration and validation - refactor(provider_optimizer): update provider selection metrics logging - feat(metrics): improve validation and handling of adaptive bounds - test: Optimize Test 3 startup by using tendermintrpc-only provider config - feat: add Test 3 init script and metrics analysis tool - feat: enhance WRS logging and metrics with parameter contributions - feat(metrics): enhance provider selection logging and validation - feat(metrics): add sanitizeFloat function to handle NaN/Inf values - feat(metrics): enhance provider selection metrics and tracking - chore: upgrade T-Digest library to caio/go-tdigest v5.0.0 - chore: add tdigest dependency for Phase 2 adaptive normalization - feat: enable Phase 2 adaptive normalization (P10-P90 with global T-Digest) - chore: lower availability threshold from 90% to 80% - feat: implement square root scaling for stake normalization - feat: implement Phase 2 P10-P90 adaptive normalization for sync parameter - refactor: improve adaptive normalization - constants, bounds, and logging - feat: implement Phase 2 P10-P90 hybrid adaptive normalization for latency - e2e optimizer: per-provider test inputs + preserve test LatestBlock under DR - feat(metrics): enhance provider selection metrics tracking - rpcprovider test_mode: availability probability with grpc failure - rpcprovider test_mode: deterministic head/gap latest_block + delay controls - chore: gofmt - feat(provideroptimizer): add selection statistics for provider selection - test(provideroptimizer): add weighted selector edge-case + concurrency tests - fixes after rebase - test(provideroptimizer): stabilize availability selection test - test(provideroptimizer): cover NaN/Inf weight configs - fix(provideroptimizer): harden weighted selector weight validation - fix: prevent divide-by-zero panic in WeightedSelector initialization - tune(provideroptimizer): lower default MinSelectionChance - fix(provideroptimizer): align selector normalization with clamp - fix(provideroptimizer): correct requestedBlock Poisson gating - feat: Implement block availability calculation for provider selection - Updated tests - Updated default weights - Removed dead code - Enhance tests with deterministic behavior and error handling improvements - Refactor provider optimizer to support deterministic seed for testing - Enhance WeightedSelector with customizable random number generation - Refactor provider optimizer initialization to remove unused parameter - Refactor provider selection mechanism to remove tier-based logic and enhance weighted selection - test: Enhance provider optimizer tests for statistical validation - test: Refine provider optimizer tests for deterministic behavior - test: Enhance provider optimizer tests for extreme latency scenarios - test: Update provider optimizer tests for weighted selection logic - refactor: Remove legacy tier-based selection and streamline provider optimization - feat: Implement weighted provider selection system - protocol/parser: avoid nil ParsedInput panic in generic parser flow - chore: Update lava_consumer_static_with_backup.yml for provider configuration Co-authored-by: Cursor <cursoragent@cursor.com>
8d10d7d to
d3d9d64
Compare
Review Summary by QodoImplement Weighted Random Selection (WRS) algorithm for provider optimization
WalkthroughsDescription• Replaced tier-based provider selection mechanism with a new Weighted Random Selection (WRS) algorithm for improved QoS-aware provider selection • Implemented WeightedSelector component that calculates composite scores based on availability, latency, sync, and stake metrics with configurable weights • Added AdaptiveMaxCalculator using T-Digest with exponential decay for adaptive P10-P90 normalization of latency and sync metrics • Introduced ProviderStakeCache to maintain provider stake amounts for selection algorithms • Integrated selection statistics tracking throughout the consumer session flow with metrics reporting via SetProviderSelected() and StartSelectionStatsUpdater() • Updated RPC consumer and smart router to use weighted selector configuration with new flags for availability, latency, sync, and stake weights • Added test mode support to RPC provider with artificial delay, availability probability, and latest block configuration • Refactored all provider optimizer tests to validate weighted selection probabilities instead of tier membership • Updated consumer session manager and end-to-end tests to use new method signatures supporting forced provider selection via headers • Removed legacy tier-based selection logic, tier configuration variables, and exploration mechanism • Added comprehensive test coverage for weighted selector, adaptive calculator, and selection statistics Diagramflowchart LR
A["Provider QoS Metrics<br/>availability, latency,<br/>sync, stake"] -->|"normalize with<br/>AdaptiveMaxCalculator"| B["Normalized Scores"]
B -->|"calculate composite<br/>score"| C["WeightedSelector"]
C -->|"weighted random<br/>selection"| D["Selected Provider"]
D -->|"track stats"| E["Selection Statistics"]
E -->|"update metrics"| F["Consumer Metrics"]
G["ProviderStakeCache"] -->|"provide stake<br/>amounts"| C
File Changes1. protocol/provideroptimizer/provider_optimizer_test.go
|
Code Review by Qodo
1. threadSafeRand panics on errors
|
| if n <= 0 { | ||
| panic("invalid argument to Intn") | ||
| } | ||
| maxVal := big.NewInt(int64(n)) | ||
| result, err := cryptorand.Int(cryptorand.Reader, maxVal) | ||
| if err != nil { | ||
| panic("crypto/rand failed: " + err.Error()) | ||
| } |
There was a problem hiding this comment.
1. threadsaferand panics on errors 📘 Rule violation ⛯ Reliability
The new crypto/rand implementation calls panic() on invalid bounds and on crypto/rand failures, which can crash the process instead of handling errors gracefully. The panic message concatenates err.Error(), which may expose internal error details if it reaches user-visible output.
Agent Prompt
## Issue description
`utils/rand/rand.go` uses `panic()` for invalid bounds and `crypto/rand` failures, which can crash the process and may leak internal error details via panic messages.
## Issue Context
Compliance requires robust error handling (no uncontrolled crashes where possible) and secure error handling (don’t expose internal details to user-facing outputs).
## Fix Focus Areas
- utils/rand/rand.go[42-49]
- utils/rand/rand.go[60-62]
- utils/rand/rand.go[72-74]
- utils/rand/rand.go[84-86]
- utils/rand/rand.go[96-98]
- utils/rand/rand.go[106-112]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
| // IsStaticProvider returns true when the given provider address belongs to a | ||
| // static provider in the current pairing (including backup providers). | ||
| // | ||
| // This is used by higher-level flows (e.g. WS subscriptions) to decide whether | ||
| // to skip reply signature verification, matching the behavior of regular RPC | ||
| // calls for static providers. | ||
| func (csm *ConsumerSessionManager) IsStaticProvider(providerAddr string) bool { | ||
| if csm == nil || providerAddr == "" { | ||
| return false | ||
| } | ||
|
|
||
| csm.lock.RLock() | ||
| defer csm.lock.RUnlock() | ||
|
|
||
| if cswp, ok := csm.pairing[providerAddr]; ok && cswp != nil { | ||
| cswp.Lock.RLock() | ||
| defer cswp.Lock.RUnlock() | ||
| return cswp.StaticProvider | ||
| } | ||
|
|
||
| if cswp, ok := csm.backupProviders[providerAddr]; ok && cswp != nil { | ||
| cswp.Lock.RLock() | ||
| defer cswp.Lock.RUnlock() | ||
| return cswp.StaticProvider | ||
| } | ||
|
|
||
| return false |
There was a problem hiding this comment.
2. Static purge misclassified 🐞 Bug ✓ Correctness
IsStaticProvider ignores pairingPurge, so a static provider still serving an active WS subscription after epoch handover can be treated as non-static and have reply signature verification enabled, likely terminating the subscription stream.
Agent Prompt
### Issue description
`ConsumerSessionManager.IsStaticProvider` is used by WS subscription verification to decide whether to skip reply signature verification for static providers. During epoch handover, providers can be moved into `pairingPurge` while still actively serving subscriptions, but `IsStaticProvider` does not check `pairingPurge`, causing static providers to be misclassified and signature verification to be incorrectly enforced.
### Issue Context
- Purged providers may remain in use (subscriptions) and are explicitly excluded from immediate connection purging.
- WS subscription message verification uses `IsStaticProvider` to decide whether to call `VerifyRelayReply`.
### Fix Focus Areas
- protocol/lavasession/consumer_session_manager.go[99-126]
- protocol/lavasession/consumer_session_manager.go[203-214]
- protocol/lavasession/consumer_session_manager.go[324-356]
- protocol/chainlib/consumer_ws_subscription_manager.go[560-581]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
| func (csm *ConsumerSessionManager) PeriodicProbeProviders(ctx context.Context, interval time.Duration) { | ||
| ticker := time.NewTicker(interval) | ||
| defer ticker.Stop() // Ensure ticker is stopped to prevent goroutine/channel leak | ||
|
|
||
| for { | ||
| select { |
There was a problem hiding this comment.
3. Ticker never stopped 🐞 Bug ⛯ Reliability
PeriodicProbeProviders creates a time.Ticker but no longer stops it; when ctx is canceled, the function returns without releasing ticker resources.
Agent Prompt
### Issue description
`PeriodicProbeProviders` creates a `time.Ticker` but does not stop it on exit. When `ctx.Done()` triggers, the function returns without calling `ticker.Stop()`, leaking underlying timer resources.
### Issue Context
This function can be started/stopped across component lifecycles and tests; not stopping tickers is a common source of resource leaks.
### Fix Focus Areas
- protocol/lavasession/consumer_session_manager.go[359-372]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
Description
Closes: #XXXX
Author Checklist
All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.
I have...
!in the type prefix if API or client breaking changemainbranchReviewers Checklist
All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.
I have...