Flaky unit tests under full-suite CI load: validator DuplicateOutpoint + netsync ChunkFailureCancelsSiblings #1024

@oskarszoon

Summary

Two unit tests fail intermittently in the test CI job under full-suite load, but pass deterministically in isolation (including -race -count=20 locally). Both assert on batch/iteration counts that appear sensitive to batcher flush timing, which changed recently in #1017 (per-batcher fixed-cadence flushing / SetTickInterval, commit bbb70638b).

These are flaky, not deterministically broken — a re-run of the same commit passed green.

Failing tests

  1. services/validator: TestValidateTransactionBatch_DuplicateOutpointCreatesConflicting
    Validator_test.go:395: Not equal: expected: 36, actual: 4

  2. services/legacy/netsync: TestSyncManager_createUtxos_ChunkFailureCancelsSiblings
    handle_block_test.go:1384: "4" is not less than or equal to "1"
    "mergeCtx short-circuit should suppress sibling iterations after a chunk fails; observed 4 post-trigger call(s)."

Where observed

CI test job on PR #1023 — run 26839805772 (DONE 10305 tests, 45 skipped, 2 failures). PR #1023 does not touch services/validator or services/legacy/netsync, and a re-run of the identical commit passed — so the failure is not attributable to that PR.

Reproduction attempts (local)

Both pass in isolation, single run and stressed:

go test ./services/validator/ -run '^TestValidateTransactionBatch_DuplicateOutpointCreatesConflicting$' -count=20        # ok
go test ./services/legacy/netsync/ -run '^TestSyncManager_createUtxos_ChunkFailureCancelsSiblings$' -count=20 -race       # ok

The flake only surfaces under the CI runner's concurrent full-suite load, which is consistent with timing/scheduling sensitivity rather than a logic bug.

Suspected cause

Both assertions count emitted/observed items:

  • validator expects 36 conflicting registrations but sees 4, which suggests a batch flushed early (fewer items grouped), so most conflicts were not observed together.
  • netsync expects ≤1 post-trigger sibling iteration but sees 4, which suggests the short-circuit raced an in-flight batch.

#1017 changed batcher flushing to a fixed cadence (SetTickInterval). A timing-driven flush boundary would plausibly change how many items land per batch under load, perturbing both count assertions. Worth confirming whether these tests pin the batcher tick / use a deterministic flush trigger rather than relying on wall-clock cadence.
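
To make the suspected mechanism concrete, here is a minimal Go sketch of how a fixed tick can split the same workload differently depending on how fast items arrive. It is a simplified model, not the project's batcher: batchSizes and arrivalsEvery are hypothetical helpers, and the 10ms tick and the arrival gaps are made-up numbers chosen only for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// batchSizes models a fixed-cadence batcher (hypothetical, not the real one):
// every item arriving inside the same tick window is flushed together.
// arrivals must be sorted in ascending order.
func batchSizes(arrivals []time.Duration, tick time.Duration) []int {
	var sizes []int
	window := time.Duration(-1) // index of the tick window of the current batch
	for _, at := range arrivals {
		w := at / tick
		if w != window {
			// A tick boundary was crossed, so a new batch starts here.
			sizes = append(sizes, 0)
			window = w
		}
		sizes[len(sizes)-1]++
	}
	return sizes
}

// arrivalsEvery returns n arrival offsets spaced gap apart, starting at zero.
func arrivalsEvery(n int, gap time.Duration) []time.Duration {
	out := make([]time.Duration, n)
	for i := range out {
		out[i] = time.Duration(i) * gap
	}
	return out
}

func main() {
	tick := 10 * time.Millisecond // assumed tick, for illustration only

	// Idle machine: all 36 items are enqueued within a single tick.
	fmt.Println(batchSizes(arrivalsEvery(36, 100*time.Microsecond), tick))

	// Loaded runner: the same 36 items arrive more slowly and straddle ticks.
	fmt.Println(batchSizes(arrivalsEvery(36, 2500*time.Microsecond), tick))
}
```

In this model the first call yields a single batch of 36, while the second yields nine batches of 4. An assertion that only holds when all items share one batch would pass in the first case and fail in the second, even though the logic under test is unchanged.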

Suggested fix direction

Make the two tests deterministic with respect to batch flushing: drive flushes explicitly (for example a batch size of 1, a manual flush, or an injected clock) instead of relying on the timer cadence, so the results do not depend on CI load. This is test flakiness, not a release blocker.
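
As a rough sketch of the manual-flush option, the Go snippet below uses a stand-in batcher that never flushes on a timer. manualBatcher, Put, Flush and collectBatchSizes are hypothetical names invented for this example and are not the project's batcher API; the real tests would need whatever hook the batcher actually exposes.

```go
package main

import "fmt"

// manualBatcher is a hypothetical stand-in for the real batcher: it has no
// timer and hands items to onFlush only when Flush is called explicitly.
type manualBatcher struct {
	pending []string
	onFlush func(batch []string)
}

// Put queues an item without triggering a flush.
func (b *manualBatcher) Put(item string) {
	b.pending = append(b.pending, item)
}

// Flush delivers everything queued so far as one batch.
func (b *manualBatcher) Flush() {
	if len(b.pending) == 0 {
		return
	}
	batch := b.pending
	b.pending = nil
	b.onFlush(batch)
}

// collectBatchSizes enqueues n items, flushes once, and returns the size of
// every batch that was delivered.
func collectBatchSizes(n int) []int {
	var sizes []int
	b := &manualBatcher{onFlush: func(batch []string) {
		sizes = append(sizes, len(batch))
	}}
	for i := 0; i < n; i++ {
		b.Put(fmt.Sprintf("tx-%d", i))
	}
	b.Flush() // the test decides the flush boundary, not the wall clock
	return sizes
}

func main() {
	fmt.Println(collectBatchSizes(36))
}
```

Because the test controls the only flush, all 36 items land in one batch regardless of scheduler load, so a count assertion on that batch no longer depends on timing.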
