feat(rpcsmartrouter): add clock injection debug server for QoS integration tests #2259

Merged
nimrod-teich merged 15 commits into main from feat/clock-injection-debug-server on Apr 9, 2026

VicSheCodes commented Apr 1, 2026

Contributor

PR: Clock Injection Debug Server for QoS Integration Tests

What this does

Closes MAG-1545

Adds a debug HTTP server to rpcsmartrouter (and rpcconsumer) that lets integration
tests shift the internal QoS clock forward without waiting for real time to pass.

Use case: QoS scores in ProviderOptimizer decay over time. To test score decay
and provider re-selection in a CI environment you would normally have to wait hours.
This change lets a test POST a time offset and the optimizer immediately behaves as if
that much time has elapsed.


Validator / Provider impact

  • Validators: no impact — no changes to consensus, state machine, or on-chain logic.
  • Providers: no impact — no changes to provider-side relay handling.
  • Consumers / Smart router: no impact when --debug-address is not set (the default).
    The HTTP server starts only when the flag is explicitly provided. NowFunc is nil by
    default, so now() falls through to time.Now() with negligible overhead.
  • Config changes: none required for production. Integration test environments must add
    miscellaneous.routers.additionalFlags: ["--debug-address", ":9999"] and enable devMode
    in values/core/values.yml (see deploy section).

Author Checklist

  • read the contribution guide
  • included the correct type prefix in the PR title: feat
  • confirmed ! in the type prefix if API or client breaking change — not applicable, no breaking changes
  • targeted the main branch
  • provided a link to the relevant issue or specification — MAG-1545
  • reviewed "Files changed" and left comments if necessary
  • included the necessary unit and integration tests — protocol/provideroptimizer/clock_injection_test.go (5 tests: NowFuncDefault, NowFuncOverride, NowFuncOffset, NowFuncReset, ClockInjectionScoreDecay) + live integration test on -smart-router
  • updated the relevant documentation or specification, including comments for Go code — doc comments added to NowFunc, now(), and debug server block
  • confirmed all CI checks have passed — pending CI run

Changes

protocol/provideroptimizer/provider_optimizer.go

  • Added NowFunc func() time.Time field to ProviderOptimizer struct.
  • Added private now() helper that calls NowFunc when set, otherwise time.Now().
  • All score update paths that previously called time.Now() now call po.now().
  • NowFunc = nil (default) — zero production impact, identical behavior to before.

protocol/rpcsmartrouter/rpcsmartrouter.go

  • Added --debug-address flag (e.g. --debug-address :9999). Off by default; no-op if not set.
  • When the flag is set, starts a lightweight HTTP server with graceful shutdown: http.Server + watcher goroutine that calls srv.Shutdown() when ctx is cancelled; http.ErrServerClosed excluded from error logging.
  • Added context.WithCancel + defer cancel() at the top of Start() so the debug server is stopped cleanly on os.Interrupt.
  • Exposes POST /debug/time-warp — sets NowFunc on every optimizer instance.
  • Exposes GET /debug/time — returns real_time, effective_time, offset_seconds so callers can verify the clock actually moved.
  • Flag bound to viper so it works when passed via Helm additionalFlags.
  • Added --skip-policy-verification and --skip-relay-signing as accepted no-ops
    (chart 4.0.0 passes these; old binary panicked on unknown flag).
  • Removed duplicate registrations of --skip-websocket-verification and
    --set-relay-retry-limit — both were already registered further down, causing
    panic: flag redefined on startup.

protocol/rpcconsumer/rpcconsumer.go

  • Same --debug-address, POST /debug/time-warp, and GET /debug/time added (mirrors rpcsmartrouter).
  • Same graceful shutdown fix: context.WithCancel + defer cancel() in Start(), http.Server replacing bare http.ListenAndServe.

protocol/provideroptimizer/clock_injection_test.go (new)

  • TestProviderOptimizer_NowFuncDefault — nil NowFunc returns real time.Now()
  • TestProviderOptimizer_NowFuncOverride — set NowFunc returns exactly the override value
  • TestProviderOptimizer_NowFuncOffset — +1h offset pattern matches expected window
  • TestProviderOptimizer_NowFuncReset — clearing NowFunc restores real time
  • TestProviderOptimizer_ClockInjectionScoreDecay — full shift/fail/reset cycle runs without error

API

Shift the clock:

POST /debug/time-warp
Content-Type: application/json

{"offset_seconds": 3600}

Response:

{"offset_seconds":3600,"applied_to_chains":true}

Verify the clock moved:

GET /debug/time

Response:

{"real_time":"2026-04-01T13:46:29Z","effective_time":"2026-04-01T14:46:29Z","offset_seconds":3600}

  • offset_seconds > 0: shifts the clock forward across all chains.
  • offset_seconds = 0: resets the clock to real time.Now().
  • GET /debug/time: effective_time is always real_time + offset_seconds. Use it to confirm the shift was applied.
  • The offset is applied relative to real time.Now() at each call; offsets are not cumulative.

Tested output (live on -smart-router, Apr 1 2026)

# before shift
curl -s http://localhost:9999/debug/time
{"real_time":"2026-04-01T13:46:29Z","effective_time":"2026-04-01T13:46:29Z","offset_seconds":0}

# shift +1h
curl -s -X POST http://localhost:9999/debug/time-warp -H "Content-Type: application/json" -d '{"offset_seconds":3600}'
{"offset_seconds":3600,"applied_to_chains":true}

# verify
curl -s http://localhost:9999/debug/time
{"real_time":"2026-04-01T13:47:19Z","effective_time":"2026-04-01T14:47:19Z","offset_seconds":3600}

# reset
curl -s -X POST http://localhost:9999/debug/time-warp -H "Content-Type: application/json" -d '{"offset_seconds":0}'
{"offset_seconds":0,"applied_to_chains":true}

curl -s http://localhost:9999/debug/time
{"real_time":"2026-04-01T13:48:31Z","effective_time":"2026-04-01T13:48:31Z","offset_seconds":0}

How to deploy (integration test environment)

See DEBUG_SERVER_DEPLOY.md for the full step-by-step with troubleshooting.

Short version:

  1. Build: GOOS=linux GOARCH=amd64 LAVA_BUILD_OPTIONS="static" LAVA_BINARY=lavap gmake build
  2. Copy: scp ./build/lavap <USERNAME>@<server>:/tmp/lavap && echo "DONE" — wait for DONE
  3. Place (must use rm -f — plain cp fails with "Text file busy"):
    rm -f /root/lavap && cp /tmp/lavap /root/lavap && chmod +x /root/lavap
  4. Verify binary (note: lavap version only prints semver tag 6.1.0 — not reliable; use strings):
    strings /root/lavap | grep "19dad9ee"   # must return the commit hash
  5. Set values and deploy:
    miscellaneous:
      devMode:
        enabled: true
      routers:
        additionalFlags:
          - --debug-address
          - :9999
    helm upgrade smart-router ... --values values/core/values.yml --values values/simulator/values_sim.yml
  6. If binary was placed AFTER helm upgrade — delete pods to pick up the new binary:
    kubectl delete pods -n lava-infra <router-pods>
  7. Verify started:
    kubectl logs -n lava-infra <router-pod> | grep "Debug HTTP server started"

Production safety

  • --debug-address is empty string by default — HTTP server never starts in production.
  • NowFunc is nil by default — now() falls through to time.Now() with zero overhead.
  • No existing tests, interfaces, or API surfaces are changed.

How to use in tests

# 1. send relays to build up QoS scores

# 2. shift clock +1h — scores decay instantly as if an hour passed
curl -s -X POST https://debug.<USERNAME>.magmadevs.com/debug/time-warp \
  -H "Content-Type: application/json" \
  -d '{"offset_seconds": 3600}' | python3 -m json.tool

# 3. verify the clock moved
curl -s https://debug.<USERNAME>.magmadevs.com/debug/time | python3 -m json.tool
# effective_time must be 1h ahead of real_time

# 4. assert provider re-selection / score decay behaviour

# 5. reset so the next test starts with a clean clock
curl -s -X POST https://debug.<USERNAME>.magmadevs.com/debug/time-warp \
  -H "Content-Type: application/json" \
  -d '{"offset_seconds": 0}' | python3 -m json.tool

How to enable / disable the flag

Local dev (no Kubernetes) — enable:

lavap rpcsmartrouter config.yml --debug-address :9999

Disable: restart without the flag.


Helm / Kubernetes — enable:

yq eval -i '.miscellaneous.routers.additionalFlags = ["--debug-address", ":9999"]' \
  ~/smart-router-standalone/values/core/values.yml
# then helm upgrade (see deploy section)

Disable:

yq eval -i 'del(.miscellaneous.routers.additionalFlags)' \
  ~/smart-router-standalone/values/core/values.yml
# then helm upgrade — pods will restart without --debug-address

codecov Bot commented Apr 1, 2026

Codecov Report

❌ Patch coverage is 51.33333% with 73 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| protocol/rpcsmartrouter/rpcsmartrouter.go | 33.33% | 40 Missing and 4 partials ⚠️ |
| protocol/rpcconsumer/rpcconsumer.go | 54.68% | 27 Missing and 2 partials ⚠️ |

| Flag | Coverage Δ |
|---|---|
| consensus | 8.74% <ø> (ø) |
| protocol | 34.19% <51.33%> (+0.07%) ⬆️ |

| Files with missing lines | Coverage Δ |
|---|---|
| protocol/common/cobra_common.go | 0.00% <ø> (ø) |
| protocol/provideroptimizer/provider_optimizer.go | 55.71% <100.00%> (+2.39%) ⬆️ |
| protocol/rpcconsumer/rpcconsumer.go | 6.57% <54.68%> (+6.57%) ⬆️ |
| protocol/rpcsmartrouter/rpcsmartrouter.go | 6.82% <33.33%> (+1.79%) ⬆️ |

... and 2 files with indirect coverage changes

github-actions Bot commented Apr 1, 2026

Test Results

0 tests  ±0   0 ✅ ±0   0s ⏱️ ±0s
0 suites ±0   0 💤 ±0 
7 files   ±0   0 ❌ ±0 

Results for commit 988df04. ± Comparison against base commit b9c7b53.

♻️ This comment has been updated with latest results.

Collaborator

Issue: NowFunc is read without synchronization while the HTTP handler writes it from another goroutine. This is a data race.

Suggestion: Use atomic.Value or protect reads/writes with the existing adaptiveLock:

    func (po *ProviderOptimizer) now() time.Time {
        po.adaptiveLock.RLock()
        fn := po.NowFunc
        po.adaptiveLock.RUnlock()
        if fn != nil {
            return fn()
        }
        return time.Now()
    }

Contributor Author

Fixed — replaced float64 currentOffsetSeconds with var currentOffsetNano atomic.Int64, using Swap() in the POST handler and Load() in the GET handler.

Collaborator

Issue: currentOffsetSeconds is a float64 shared between two HTTP handlers with no synchronization.
Suggestion: Use atomic operations or a mutex. Go's sync/atomic has no float64 type, so one option is to store the offset as integer nanoseconds:

    var currentOffset atomic.Int64 // store as nanoseconds or use sync.Mutex

Commit bd02c40 addressed this comment by introducing atomic.Int64 for currentOffsetNano and using atomic Swap/Load to coordinate the debug handlers, ensuring the offset is synchronized safely instead of sharing a float64 without protection.

Contributor Author

Fixed — replaced http.ListenAndServe with http.Server + graceful shutdown.
Added context.WithCancel at the top of Start() with defer cancel(), so when the function returns on interrupt, ctx.Done() fires and a watcher goroutine calls srv.Shutdown(). http.ErrServerClosed is excluded from error logging since it's expected on clean shutdown.

Collaborator

Issue: The debug HTTP server goroutine cannot be stopped when the service shuts down.
Suggestion:

    srv := &http.Server{Addr: addr, Handler: debugMux}
    go func() {
        <-ctx.Done()
        srv.Shutdown(context.Background())
    }()
    go func() {
        if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
            utils.LavaFormatError("Debug HTTP server stopped", err)
        }
    }()

Commit bd02c40 addressed this comment by creating a cancellable child context for RPCConsumer Start and adding a goroutine that calls srv.Shutdown when that context is done, ensuring the debug HTTP server stops when the service shuts down.

Contributor Author

Same fix applied to rpcsmartrouter.go — added context.WithCancel + defer cancel() at the top of Start() and replaced http.ListenAndServe with http.Server + graceful shutdown, mirroring the fix in rpcconsumer.go.

…artrouter

Replace bare http.ListenAndServe goroutine with http.Server + context-aware
shutdown. Add context.WithCancel + defer cancel() at the top of Start() so
that on os.Interrupt the watcher goroutine calls srv.Shutdown(), draining
in-flight requests cleanly. http.ErrServerClosed is excluded from error
logging as it is expected on clean shutdown.
avitenzer previously approved these changes Apr 6, 2026
@pull-request-size pull-request-size Bot added size/XL and removed size/L labels Apr 9, 2026
VicSheCodes and others added 2 commits April 9, 2026 13:42
Remove redundant `tc := tc` copies in table-driven tests to satisfy
golangci-lint copyloopvar on Go 1.22+.

Updated files:
- protocol/rpcconsumer/debug_server_test.go
- protocol/rpcsmartrouter/debug_server_test.go
- protocol/provideroptimizer/clock_injection_test.go
@nimrod-teich nimrod-teich merged commit 5c22939 into main Apr 9, 2026
30 checks passed
@nimrod-teich nimrod-teich deleted the feat/clock-injection-debug-server branch April 9, 2026 12:42