fix(legacy): stop flaky legacy-sync — nil-guard Server.Stop + ephemeral 942 port (#1032)#1034
Conversation
|
🤖 Claude Code Review Status: Complete Current Review: No issues found. This is a solid fix for the intermittent CI failures. Summary: The PR correctly addresses two independent root causes of flaky test failures:
Strengths:
|
ordishs
left a comment
There was a problem hiding this comment.
Approve. Verified the root-cause analysis against the code:
s.serveris assigned only at Server.go:286, so a failedInit(GetOutboundIP or newServer error) leaves it nil.- ServiceManager appends the wrapper before calling
Init(service_manager.go:152 vs 156), soWait()callsStop()on the failed service in reverse order. TheStartgoroutine is scheduled after the Init check, so it never runs —Stopis genuinely the only method invoked on the failed service. - daemon.go:298-302 logs + ForceShutdown but does not return; execution falls through to
sm.Wait()at line 371 — exactly the described crash path. :0is safe: teranodeLegacyListen is only used for ListenAddresses (line 93), never as a dial target; 942 dials out via ConnectPeers=[sv].
Clean, minimal diff with a faithful regression test and an honest test plan (CI box correctly left unchecked). One non-blocking follow-up: the Wait() loop calls Stop() on every failed-Init service, so other services may carry the same latent nil-deref pattern — worth a future sweep. Recommend confirming the legacy-sync CI job is green on the branch before merge.
|
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-06-04 12:19 UTC |



Summary
Fixes the intermittent
legacy-syncCI failures onTestLegacyTxBroadcast942_TeranodeRPCToSVMempool. Two independent root causes, two fixes:Nil guard in
legacy.Server.Stop(services/legacy/Server.go). WhennewServerfails (e.g.no valid listen address),Server.Initleaves the inners.servernil. The daemon error path (ServiceManager.Wait→Server.Stop) then derefs the nil*serveratpeer_server.go:2914→ unrecovered SIGSEGV in a daemon goroutine → crashes the whole test binary, reporting(unknown)and discarding all other results in the package. The guard returns nil when there's no inner server, downgrading a process-killing panic to a clean per-test failure. This is the load-bearing fix: any future startup error on this service hit the same crash path.Ephemeral listen port in the 942 test (
test/e2e/daemon/ready/legacy_tx_broadcast_942_test.go). The test pinned the legacy listener to fixed0.0.0.0:48444; under CI load that bind intermittently failed (TIME_WAIT/teardown of the preceding containers), producing the startup error above. Switched to0.0.0.0:0(OS-assigned). Safe because 942 only dials out (ConnectPeers=[sv]) and SV never connects back, so a fixed inbound port is unnecessary — verified the tx-broadcast assertion rides the established outbound connection.Confirmed via
ServiceManagerthatStopis the only method invoked on a service whoseIniterrored, so the guard fully closes the crash path. ProductioninitListenersretry/backoff was deliberately left out (YAGNI — would mask genuinely-occupied ports).Test plan
TestStop_NilInnerServer— fails (recovered nil-deref panic) before the guard, passes after.go test -race ./services/legacy/green.go vet ./services/legacy/ ./test/e2e/daemon/ready/clean.legacy-syncCI job green (the authoritative signal for the e2e port change; Docker-gated locally).Fixes #1032