libnetwork/networkdb: add property-based tests by corhere · Pull Request #50345 · moby/moby

corhere · 2025-07-07T17:54:10Z

- What I did
I added property-based tests to NetworkDB to find any lingering issues before users encounter them. Testing revealed that the algorithm used to select a random sample of nodes to gossip with does not uphold the expected properties. I replaced the random-sample implementation with one that passes the tests.

- How I did it
I selected pgregory.net/rapid as the property-based testing library to adopt. I wrote a test which asserts that a cluster of NetworkDB instances always converges to the expected state after a series of joins, leaves and table updates. I also wrote a test for the mRandomNodes function to try to narrow down the source of failures with the convergence test. I rewrote its implementation, leveraging the math/rand/v2 package for unbiased uniform random shuffle, permutation and value-in-range algorithms.

As the convergence test is slow, I build-tagged it so it does not run in CI. The mRandomNodes test only takes a couple seconds to complete the default number of iterations so is left untagged.

- How to verify it
go test -tags slowtests

- Human readable description for the release notes

- A picture of a cute animal (not mandatory but encouraged)

Copilot

Pull Request Overview

This PR introduces property-based testing to the NetworkDB module and replaces the previous random-sample implementation with a uniform approach using math/rand/v2.

Vendor updates: add pgregory.net/rapid and github.com/montanaflynn/stats
Test enhancements: add networkdb_property_test.go, augment existing helpers, and add TestMRandomNodes
Core changes: seed and store a rand.Rand in NetworkDB, remove randomOffset, and rewrite mRandomNodes; add a debugging helper DebugDumpTable

Reviewed Changes

Copilot reviewed 8 out of 66 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
vendor.sum	Added checksums for new testing and stats libraries
vendor.mod	Required `rapid` and `stats` for property tests
libnetwork/networkdb/networkdb_test.go	Refactored helpers to use `TestingT` interface and dynamic timeout
libnetwork/networkdb/networkdb_property_test.go	New slow-test for convergence using `rapid` state machine
libnetwork/networkdb/networkdb.go	Initialized per-instance RNG and mutex
libnetwork/networkdb/debug.go	Added `DebugDumpTable` for introspection
libnetwork/networkdb/cluster_test.go	Added `TestMRandomNodes` to validate unbiased sampling
libnetwork/networkdb/cluster.go	Removed `randomOffset`, integrated `rng` for node selection

Comments suppressed due to low confidence (3)

libnetwork/networkdb/networkdb_property_test.go:37

Cannot range over integer numNetworks; replace with a traditional loop such as for i := 0; i < numNetworks; i++.

	for i := range numNetworks {

libnetwork/networkdb/cluster_test.go:46

min is undefined; define a helper function min(a, b int) int or use an inline comparison to compute the minimum of two ints.

			if len(sample) != min(m, len(nodes)-1) {

libnetwork/networkdb/cluster.go:9

The imported alias rnd is unused in this file; remove the import to clean up dependencies.

	rnd "math/rand"

thaJeztah

.

thaJeztah · 2025-07-08T10:41:52Z

vendor.mod

 	google.golang.org/grpc v1.72.2
 	google.golang.org/protobuf v1.36.6
 	gotest.tools/v3 v3.5.2
+	pgregory.net/rapid v1.2.0


For others; this module has no dependencies, so doesn't impact our dependency-tree, other than what's used here; https://github.com/flyingmutant/rapid/blob/v1.2.0/go.mod

thaJeztah · 2025-07-08T10:42:55Z

vendor.mod

 	github.com/moby/sys/user v0.4.0
 	github.com/moby/sys/userns v0.1.0
 	github.com/moby/term v0.5.2
+	github.com/montanaflynn/stats v0.7.1


Same for this one; https://github.com/montanaflynn/stats/blob/v0.7.1/go.mod#L1-L3

A well tested and comprehensive Golang statistics library package with no dependencies.

thaJeztah · 2025-07-08T11:33:27Z

We need to look at this failure; wondering if there's a missing check whether the image was built successfully or smthing

=== RUN   TestNoNewPrivileges
=== RUN   TestNoNewPrivileges/CapabilityRequested=true
    capabilities_linux_test.go:81: assertion failed: error is not nil: Error response from daemon: No such image: captest:latest
=== RUN   TestNoNewPrivileges/CapabilityRequested=false
    capabilities_linux_test.go:81: assertion failed: error is not nil: Error response from daemon: No such image: captest:latest
--- FAIL: TestNoNewPrivileges (1.74s)
    --- FAIL: TestNoNewPrivileges/CapabilityRequested=true (0.00s)
    --- FAIL: TestNoNewPrivileges/CapabilityRequested=false (0.00s)
FAIL
coverage: [no statements]

=== Failed
=== FAIL: amd64.integration.capabilities TestNoNewPrivileges/CapabilityRequested=true (0.00s)
    capabilities_linux_test.go:81: assertion failed: error is not nil: Error response from daemon: No such image: captest:latest
    --- FAIL: TestNoNewPrivileges/CapabilityRequested=true (0.00s)

=== FAIL: amd64.integration.capabilities TestNoNewPrivileges/CapabilityRequested=false (0.00s)
    capabilities_linux_test.go:81: assertion failed: error is not nil: Error response from daemon: No such image: captest:latest
    --- FAIL: TestNoNewPrivileges/CapabilityRequested=false (0.00s)

=== FAIL: amd64.integration.capabilities TestNoNewPrivileges (1.74s)

robmry

Nice!

If the test in the second commit fails without the changes in the third commit - they should probably be swapped?

The convergence test was a bit of a slog to get through, while trying to understand the testing package at the same time. Perhaps just me, but a few more comments probably wouldn't go amiss.

corhere · 2025-07-10T17:03:46Z

If the test in the second commit fails without the changes in the third commit - they should probably be swapped?

My rationale for committing the test first was to make it more straightforward to test both the old and new implementations. The second commit justifies the existence for the third commit. It lets the test itself be tested, in the vein of TDD: write the test, verify it fails, fix the implementation, verify the test passes. I think the benefits outweigh the downside of maybe having to skip over the commit when bisecting, particularly since when bisecting you generally will be picking a targeted subset of test cases to run, not entire suites.

robmry · 2025-07-10T17:14:43Z

My rationale for committing the test first was to make it more straightforward to test both the old and new implementations. The second commit justifies the existence for the third commit. It lets the test itself be tested, in the vein of TDD: write the test, verify it fails, fix the implementation, verify the test passes. I think the benefits outweigh the downside of maybe having to skip over the commit when bisecting, particularly since when bisecting you generally will be picking a targeted subset of test cases to run, not entire suites.

I think the way to deal with that is for the test to "xfail" (maybe skip instead of fail) when it's added. Then the commit that fixes the issue should also change the expected test result.

That deals with the bisecting issue you mention - and makes it very clear what the fix fixes. (I don't know which part of the test fails at the moment.)

corhere · 2025-07-10T17:57:54Z

I think the way to deal with that is for the test to "xfail" (maybe skip instead of fail) when it's added. Then the commit that fixes the issue should also change the expected test result.

I'll see what I can do. Go's testing package does not have any facility to flag a test as expected-to-fail, and even if it did it would not be appropriate. The test will sometimes pass with the bad implementation, because randomness and the implementation is just of low quality, not completely unfit for purpose. I could call Skip() inside the test when encountering a failure condition, but rapid treats it as a signal that the generated test case is invalid. It will continue with another generated test case, and if enough skips happen it will fail the test anyway. I guess I could provide my own implementation of rapid.TB which wires Fail() to Skip()...

(I don't know which part of the test fails at the moment.)

I documented that in the commit message.

libnetwork/networkdb: improve quality of randomness

The property test for the mRandomNodes function revealed that it may not
always pick out a sample of m nodes even when the number of nodes to
pick from (excluding the local node) is >= m. Rewrite it using a random
shuffle or permutation so that it always picks a uniformly-distributed
sample of the requested size whenever the population is large enough.

Signed-off-by: Cory Snider <csnider@mirantis.com>

robmry · 2025-07-10T18:30:27Z

I documented that in the commit message.

You documented that it can fail, but not where ... I doubt test TestMRandomNodes/EmptySlice fails, and there are about five calls to t.Errorf in the main test. Perhaps only the normal-distribution part fails, or also the all-permutations check, or all of it.

I guess I could provide my own implementation of rapid.TB which wires Fail() to Skip()...

It's probably not worthwhile, that's why I suggested just swapping the commits. Not a blocker I guess, just not really good practice and it had a simple fix.

For my comment about more comments in the convergence test - I was thinking of things like this (just to save a bit of head-scratching for anyone not familiar with the package, which is probably most people for now - and, as convergence is checked after the calls rather than as the "state machine" is updated, it's probably a bit unconventional even for those who are familiar) ...

// Call methods of fsm in a random order.
// Consistency is not checked by these methods, the rest of this function checks convergence afterwards.
t.Repeat(rapid.StateMachineActions(fsm))

(I meant to ask about method networkDBFSM.Check - is it just there as a comment?)

corhere · 2025-07-10T19:59:16Z

You documented that it can fail, but not where

I thought it would be self-evident from my description of the failure mode that it's the assertion of the sample size that failed.

moby/libnetwork/networkdb/cluster_test.go

Lines 49 to 51 in 256a1f6

    
           if len(sample) != min(m, len(nodes)-1) { 
        
           	t.Errorf("got sample size %d, want %d", len(sample), min(m, len(nodes)-1)) 
        
           }

(I meant to ask about method networkDBFSM.Check - is it just there as a comment?)

It's to satisfy the rapid.StateMachine interface. I have added a comment to that effect, as well.

libnetwork/networkdb/cluster_test.go

Add a property-based test which asserts that a cluster of NetworkDB nodes always eventually converges to a consistent state. As this test takes a long time to run it is build-tagged to be excluded from CI. Signed-off-by: Cory Snider <csnider@mirantis.com>

TestNetworkDBAlwaysConverges will occasionally find a failure where one entry is missing on one node even after waiting a full five minutes. One possible explanation is that the selection of nodes to gossip with is biased in some way. Test that the mRandomNodes function picks a uniformly distributed sample of node IDs of sufficient length. The new test reveals that mRandomNodes may sometimes pick out a sample of fewer than m nodes even when the number of nodes to pick from (excluding the local node) is >= m. Put the test behind an xfail tag so it is opt-in to run, without interfering with CI or bisecting. Signed-off-by: Cory Snider <csnider@mirantis.com>

The property test for the mRandomNodes function revealed that it may sometimes pick out a sample of fewer than m nodes even when the number of nodes to pick from (excluding the local node) is >= m. Rewrite it using a random shuffle or permutation so that it always picks a uniformly-distributed sample of the requested size whenever the population is large enough. Signed-off-by: Cory Snider <csnider@mirantis.com>

vvoland

LGTM, I really like the property-based testing and it's always nice to have some 👍🏻

robmry

LGTM

corhere added this to the 29.0.0 milestone Jul 7, 2025

corhere requested review from akerouanton, Copilot, robmry, thaJeztah and vvoland July 7, 2025 17:54

corhere added area/networking Networking area/testing area/swarm area/networking/d/overlay Networking labels Jul 7, 2025

Copilot AI reviewed Jul 7, 2025

View reviewed changes

corhere mentioned this pull request Jul 7, 2025

libnetwork/networkdb: fix broadcast queue deadlocks #50344

Merged

corhere force-pushed the libn/networkdb-property-testing branch 3 times, most recently from 77f33d3 to e94053c Compare July 7, 2025 21:52

thaJeztah reviewed Jul 8, 2025

View reviewed changes

corhere added this to 🔦 Maintainer spotlight Jul 8, 2025

github-project-automation bot moved this to New in 🔦 Maintainer spotlight Jul 8, 2025

robmry reviewed Jul 10, 2025

View reviewed changes

thompson-shaun moved this from New to Complete in 🔦 Maintainer spotlight Jul 10, 2025

corhere force-pushed the libn/networkdb-property-testing branch from e94053c to 256a1f6 Compare July 10, 2025 19:55

vvoland reviewed Jul 11, 2025

View reviewed changes

libnetwork/networkdb/cluster_test.go Outdated Show resolved Hide resolved

corhere force-pushed the libn/networkdb-property-testing branch from 256a1f6 to 7c0ff04 Compare July 14, 2025 23:34

corhere added 2 commits July 15, 2025 18:07

corhere force-pushed the libn/networkdb-property-testing branch from 7c0ff04 to ac5f464 Compare July 15, 2025 22:08

vvoland approved these changes Jul 16, 2025

View reviewed changes

robmry approved these changes Jul 16, 2025

View reviewed changes

robmry merged commit 55f0bd8 into moby:master Jul 16, 2025
163 checks passed

corhere deleted the libn/networkdb-property-testing branch July 16, 2025 18:29

corhere added the process/cherry-pick/25.0 label Jul 23, 2025

corhere mentioned this pull request Jul 25, 2025

[25.0] libnetwork/networkdb: backport all the fixes #50511

Merged

Conversation

corhere commented Jul 7, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

thaJeztah left a comment

Choose a reason for hiding this comment

Uh oh!

thaJeztah Jul 8, 2025

Choose a reason for hiding this comment

Uh oh!

thaJeztah Jul 8, 2025

Choose a reason for hiding this comment

Uh oh!

thaJeztah commented Jul 8, 2025

Uh oh!

robmry left a comment

Choose a reason for hiding this comment

Uh oh!

corhere commented Jul 10, 2025

Uh oh!

robmry commented Jul 10, 2025

Uh oh!

corhere commented Jul 10, 2025

Uh oh!

robmry commented Jul 10, 2025

Uh oh!

corhere commented Jul 10, 2025

Uh oh!

Uh oh!

vvoland left a comment

Choose a reason for hiding this comment

Uh oh!

robmry left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants