feat(rpcsmartrouter): add clock injection debug server for QoS integration tests #2259
…to rpcsmartrouter
Issue: NowFunc is read without synchronization while the HTTP handler writes it from another goroutine. This is a data race.
Suggestion: Use atomic.Value or protect reads/writes with the existing adaptiveLock:
func (po *ProviderOptimizer) now() time.Time {
	po.adaptiveLock.RLock()
	fn := po.NowFunc
	po.adaptiveLock.RUnlock()
	if fn != nil {
		return fn()
	}
	return time.Now()
}
Fixed — replaced float64 currentOffsetSeconds with var currentOffsetNano atomic.Int64, using Swap() in the POST handler and Load() in the GET handler.
Issue: currentOffsetSeconds is a float64 shared between two HTTP handlers with no synchronization.
Suggestion: Use atomic operations or a mutex. Since it's a float64:
var currentOffset atomic.Int64 // store as nanoseconds or use sync.Mutex
Commit bd02c40 addressed this comment by introducing atomic.Int64 for currentOffsetNano and using atomic Swap/Load to coordinate the debug handlers, ensuring the offset is synchronized safely instead of sharing a float64 without protection.
Fixed — replaced http.ListenAndServe with http.Server + graceful shutdown.
Added context.WithCancel at the top of Start() with defer cancel(), so when the function returns on interrupt, ctx.Done() fires and a watcher goroutine calls srv.Shutdown(). http.ErrServerClosed is excluded from error logging since it's expected on clean shutdown.
Issue: The debug HTTP server goroutine cannot be stopped when the service shuts down.
Suggestion:
srv := &http.Server{Addr: addr, Handler: debugMux}
go func() {
	<-ctx.Done()
	srv.Shutdown(context.Background())
}()
go func() {
	if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		utils.LavaFormatError("Debug HTTP server stopped", err)
	}
}()
Commit bd02c40 addressed this comment by creating a cancellable child context for RPCConsumer Start and adding a goroutine that calls srv.Shutdown when that context is done, ensuring the debug HTTP server stops when the service shuts down.
Same fix applied to rpcsmartrouter.go — added context.WithCancel + defer cancel() at the top of Start() and replaced http.ListenAndServe with http.Server + graceful shutdown, mirroring the fix in rpcconsumer.go.
…artrouter Replace bare http.ListenAndServe goroutine with http.Server + context-aware shutdown. Add context.WithCancel + defer cancel() at the top of Start() so that on os.Interrupt the watcher goroutine calls srv.Shutdown(), draining in-flight requests cleanly. http.ErrServerClosed is excluded from error logging as it is expected on clean shutdown.
Remove redundant `tc := tc` copies in table-driven tests to satisfy golangci-lint copyloopvar on Go 1.22+. Updated files:
- protocol/rpcconsumer/debug_server_test.go
- protocol/rpcsmartrouter/debug_server_test.go
- protocol/provideroptimizer/clock_injection_test.go
PR: Clock Injection Debug Server for QoS Integration Tests
What this does
Closes MAG-1545
Adds a debug HTTP server to `rpcsmartrouter` (and `rpcconsumer`) that lets integration tests shift the internal QoS clock forward without waiting for real time to pass.
Use case: QoS scores in `ProviderOptimizer` decay over time. To test score decay and provider re-selection in a CI environment, you would normally have to wait hours. This change lets a test POST a time offset, and the optimizer immediately behaves as if that much time has elapsed.
Validator / Provider impact
- No impact when `--debug-address` is not set (default). The HTTP server only starts when the flag is explicitly provided.
- `NowFunc = nil` by default means `now()` is a direct `time.Now()` call: no overhead.
- Set `miscellaneous.routers.additionalFlags: ["--debug-address", ":9999"]` and enable `devMode` in `values/core/values.yml` (see deploy section).

Author Checklist
- `feat!` in the type prefix if API or client breaking change: not applicable, no breaking changes
- `main` branch
- `protocol/provideroptimizer/clock_injection_test.go` (5 tests: `NowFuncDefault`, `NowFuncOverride`, `NowFuncOffset`, `NowFuncReset`, `ClockInjectionScoreDecay`) + live integration test on -smart-router
- `NowFunc`, `now()`, and debug server block

Changes
`protocol/provideroptimizer/provider_optimizer.go`
- `NowFunc func() time.Time` field to `ProviderOptimizer` struct.
- `now()` helper that calls `NowFunc` when set, otherwise `time.Now()`.
- Former `time.Now()` call sites now call `po.now()`.
- `NowFunc = nil` (default): zero production impact, identical behavior to before.

`protocol/rpcsmartrouter/rpcsmartrouter.go`
- `--debug-address` flag (e.g. `--debug-address :9999`). Off by default; no-op if not set.
- `http.Server` + watcher goroutine that calls `srv.Shutdown()` when `ctx` is cancelled; `http.ErrServerClosed` excluded from error logging.
- `context.WithCancel` + `defer cancel()` at the top of `Start()` so the debug server is stopped cleanly on `os.Interrupt`.
- `POST /debug/time-warp`: sets `NowFunc` on every optimizer instance.
- `GET /debug/time`: returns `real_time`, `effective_time`, `offset_seconds` so callers can verify the clock actually moved.
- `additionalFlags`.
- `--skip-policy-verification` and `--skip-relay-signing` as accepted no-ops (chart 4.0.0 passes these; old binary panicked on unknown flag).
- `--skip-websocket-verification` and `--set-relay-retry-limit`: both were already registered further down, causing `panic: flag redefined` on startup.

`protocol/rpcconsumer/rpcconsumer.go`
- `--debug-address`, `POST /debug/time-warp`, and `GET /debug/time` added (mirrors rpcsmartrouter).
- `context.WithCancel` + `defer cancel()` in `Start()`, `http.Server` replacing bare `http.ListenAndServe`.

`protocol/provideroptimizer/clock_injection_test.go` (new)
- `TestProviderOptimizer_NowFuncDefault`: nil `NowFunc` returns real `time.Now()`
- `TestProviderOptimizer_NowFuncOverride`: set `NowFunc` returns exactly the override value
- `TestProviderOptimizer_NowFuncOffset`: +1h offset pattern matches expected window
- `TestProviderOptimizer_NowFuncReset`: clearing `NowFunc` restores real time
- `TestProviderOptimizer_ClockInjectionScoreDecay`: full shift/fail/reset cycle runs without error

API
Shift the clock:
Response:
{"offset_seconds":3600,"applied_to_chains":true}
Verify the clock moved:
Response:
{"real_time":"2026-04-01T13:46:29Z","effective_time":"2026-04-01T14:46:29Z","offset_seconds":3600}
- `offset_seconds > 0`: shift clock forward across all chains.
- `offset_seconds = 0`: reset clock back to real `time.Now()`.
- `GET /debug/time`: `effective_time` is always `real_time + offset`. Use it to confirm the shift was applied.
- The offset is applied to `time.Now()` at each call, so it is not cumulative.

Tested output (live on -smart-router, Apr 1 2026)
How to deploy (integration test environment)
See `DEBUG_SERVER_DEPLOY.md` for the full step-by-step with troubleshooting.
Short version:
- `GOOS=linux GOARCH=amd64 LAVA_BUILD_OPTIONS="static" LAVA_BINARY=lavap gmake build`
- `scp ./build/lavap <USERNAME>@<server>:/tmp/lavap && echo "DONE"` (wait for DONE)
- `rm -f` (plain `cp` fails with "Text file busy")
- `lavap version` only prints semver tag `6.1.0`, which is not reliable; use `strings`

Production safety
- `--debug-address` is empty string by default: the HTTP server never starts in production.
- `NowFunc` is `nil` by default: `now()` falls through to `time.Now()` with zero overhead.

How to use in tests
How to enable / disable the flag
Local dev (no Kubernetes) — enable:
Disable: restart without the flag.
Helm / Kubernetes — enable:
Disable: