Skip to content

libnetwork/networkdb: add property-based tests#50345

Merged
robmry merged 3 commits intomoby:masterfrom
corhere:libn/networkdb-property-testing
Jul 16, 2025
Merged

libnetwork/networkdb: add property-based tests#50345
robmry merged 3 commits intomoby:masterfrom
corhere:libn/networkdb-property-testing

Conversation

@corhere
Copy link
Contributor

@corhere corhere commented Jul 7, 2025

- What I did
I added property-based tests to NetworkDB to find any lingering issues before users encounter them. Testing revealed that the algorithm used to select a random sample of nodes to gossip with does not uphold the expected properties. I replaced the random-sample implementation with one that passes the tests.

- How I did it
I selected pgregory.net/rapid as the property-based testing library to adopt. I wrote a test which asserts that a cluster of NetworkDB instances always converges to the expected state after a series of joins, leaves and table updates. I also wrote a test for the mRandomNodes function to try to narrow down the source of failures with the convergence test. I rewrote its implementation, leveraging the math/rand/v2 package for unbiased uniform random shuffle, permutation and value-in-range algorithms.

As the convergence test is slow, I build-tagged it so it does not run in CI. The mRandomNodes test only takes a couple seconds to complete the default number of iterations so is left untagged.

- How to verify it
go test -tags slowtests

- Human readable description for the release notes

- A picture of a cute animal (not mandatory but encouraged)

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces property-based testing to the NetworkDB module and replaces the previous random-sample implementation with a uniform approach using math/rand/v2.

  • Vendor updates: add pgregory.net/rapid and github.com/montanaflynn/stats
  • Test enhancements: add networkdb_property_test.go, augment existing helpers, and add TestMRandomNodes
  • Core changes: seed and store a rand.Rand in NetworkDB, remove randomOffset, and rewrite mRandomNodes; add a debugging helper DebugDumpTable

Reviewed Changes

Copilot reviewed 8 out of 66 changed files in this pull request and generated no comments.

Show a summary per file
File Description
vendor.sum Added checksums for new testing and stats libraries
vendor.mod Required rapid and stats for property tests
libnetwork/networkdb/networkdb_test.go Refactored helpers to use TestingT interface and dynamic timeout
libnetwork/networkdb/networkdb_property_test.go New slow-test for convergence using rapid state machine
libnetwork/networkdb/networkdb.go Initialized per-instance RNG and mutex
libnetwork/networkdb/debug.go Added DebugDumpTable for introspection
libnetwork/networkdb/cluster_test.go Added TestMRandomNodes to validate unbiased sampling
libnetwork/networkdb/cluster.go Removed randomOffset, integrated rng for node selection
Comments suppressed due to low confidence (3)

libnetwork/networkdb/networkdb_property_test.go:37

  • Cannot range over integer numNetworks; replace with a traditional loop such as for i := 0; i < numNetworks; i++.
	for i := range numNetworks {

libnetwork/networkdb/cluster_test.go:46

  • min is undefined; define a helper function min(a, b int) int or use an inline comparison to compute the minimum of two ints.
			if len(sample) != min(m, len(nodes)-1) {

libnetwork/networkdb/cluster.go:9

  • The imported alias rnd is unused in this file; remove the import to clean up dependencies.
	rnd "math/rand"

@corhere corhere force-pushed the libn/networkdb-property-testing branch 3 times, most recently from 77f33d3 to e94053c Compare July 7, 2025 21:52
Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

google.golang.org/grpc v1.72.2
google.golang.org/protobuf v1.36.6
gotest.tools/v3 v3.5.2
pgregory.net/rapid v1.2.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For others; this module has no dependencies, so doesn't impact our dependency-tree, other than what's used here; https://github.com/flyingmutant/rapid/blob/v1.2.0/go.mod

github.com/moby/sys/user v0.4.0
github.com/moby/sys/userns v0.1.0
github.com/moby/term v0.5.2
github.com/montanaflynn/stats v0.7.1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for this one; https://github.com/montanaflynn/stats/blob/v0.7.1/go.mod#L1-L3

A well tested and comprehensive Golang statistics library package with no dependencies.

@thaJeztah
Copy link
Member

We need to look at this failure; wondering if there's a missing check whether the image was built successfully or smthing

=== RUN   TestNoNewPrivileges
=== RUN   TestNoNewPrivileges/CapabilityRequested=true
    capabilities_linux_test.go:81: assertion failed: error is not nil: Error response from daemon: No such image: captest:latest
=== RUN   TestNoNewPrivileges/CapabilityRequested=false
    capabilities_linux_test.go:81: assertion failed: error is not nil: Error response from daemon: No such image: captest:latest
--- FAIL: TestNoNewPrivileges (1.74s)
    --- FAIL: TestNoNewPrivileges/CapabilityRequested=true (0.00s)
    --- FAIL: TestNoNewPrivileges/CapabilityRequested=false (0.00s)
FAIL
coverage: [no statements]

=== Failed
=== FAIL: amd64.integration.capabilities TestNoNewPrivileges/CapabilityRequested=true (0.00s)
    capabilities_linux_test.go:81: assertion failed: error is not nil: Error response from daemon: No such image: captest:latest
    --- FAIL: TestNoNewPrivileges/CapabilityRequested=true (0.00s)

=== FAIL: amd64.integration.capabilities TestNoNewPrivileges/CapabilityRequested=false (0.00s)
    capabilities_linux_test.go:81: assertion failed: error is not nil: Error response from daemon: No such image: captest:latest
    --- FAIL: TestNoNewPrivileges/CapabilityRequested=false (0.00s)

=== FAIL: amd64.integration.capabilities TestNoNewPrivileges (1.74s)

Copy link
Contributor

@robmry robmry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

If the test in the second commit fails without the changes in the third commit - they should probably be swapped?

The convergence test was a bit of a slog to get through, while trying to understand the testing package at the same time. Perhaps just me, but a few more comments probably wouldn't go amiss.

@corhere
Copy link
Contributor Author

corhere commented Jul 10, 2025

If the test in the second commit fails without the changes in the third commit - they should probably be swapped?

My rationale for committing the test first was to make it more straightforward to test both the old and new implementations. The second commit justifies the existence for the third commit. It lets the test itself be tested, in the vein of TDD: write the test, verify it fails, fix the implementation, verify the test passes. I think the benefits outweigh the downside of maybe having to skip over the commit when bisecting, particularly since when bisecting you generally will be picking a targeted subset of test cases to run, not entire suites.

@robmry
Copy link
Contributor

robmry commented Jul 10, 2025

My rationale for committing the test first was to make it more straightforward to test both the old and new implementations. The second commit justifies the existence for the third commit. It lets the test itself be tested, in the vein of TDD: write the test, verify it fails, fix the implementation, verify the test passes. I think the benefits outweigh the downside of maybe having to skip over the commit when bisecting, particularly since when bisecting you generally will be picking a targeted subset of test cases to run, not entire suites.

I think the way to deal with that is for the test to "xfail" (maybe skip instead of fail) when it's added. Then the commit that fixes the issue should also change the expected test result.

That deals with the bisecting issue you mention - and makes it very clear what the fix fixes. (I don't know which part of the test fails at the moment.)

@corhere
Copy link
Contributor Author

corhere commented Jul 10, 2025

I think the way to deal with that is for the test to "xfail" (maybe skip instead of fail) when it's added. Then the commit that fixes the issue should also change the expected test result.

I'll see what I can do. Go's testing package does not have any facility to flag a test as expected-to-fail, and even if it did it would not be appropriate. The test will sometimes pass with the bad implementation, because randomness and the implementation is just of low quality, not completely unfit for purpose. I could call Skip() inside the test when encountering a failure condition, but rapid treats it as a signal that the generated test case is invalid. It will continue with another generated test case, and if enough skips happen it will fail the test anyway. I guess I could provide my own implementation of rapid.TB which wires Fail() to Skip()...

(I don't know which part of the test fails at the moment.)

I documented that in the commit message.

libnetwork/networkdb: improve quality of randomness

The property test for the mRandomNodes function revealed that it may not
always pick out a sample of m nodes even when the number of nodes to
pick from (excluding the local node) is >= m. Rewrite it using a random
shuffle or permutation so that it always picks a uniformly-distributed
sample of the requested size whenever the population is large enough.

Signed-off-by: Cory Snider <csnider@mirantis.com>

@robmry
Copy link
Contributor

robmry commented Jul 10, 2025

I documented that in the commit message.

You documented that it can fail, but not where ... I doubt test TestMRandomNodes/EmptySlice fails, and there are about five calls to t.Errorf in the main test. Perhaps only the normal-distribution part fails, or also the all-permutations check, or all of it.

I guess I could provide my own implementation of rapid.TB which wires Fail() to Skip()...

It's probably not worthwhile, that's why I suggested just swapping the commits. Not a blocker I guess, just not really good practice and it had a simple fix.

For my comment about more comments in the convergence test - I was thinking of things like this (just to save a bit of head-scratching for anyone not familiar with the package, which is probably most people for now - and, as convergence is checked after the calls rather than as the "state machine" is updated, it's probably a bit unconventional even for those who are familiar) ...

// Call methods of fsm in a random order.
// Consistency is not checked by these methods, the rest of this function checks convergence afterwards.
t.Repeat(rapid.StateMachineActions(fsm))

(I meant to ask about method networkDBFSM.Check - is it just there as a comment?)

@thompson-shaun thompson-shaun moved this from New to Complete in 🔦 Maintainer spotlight Jul 10, 2025
@corhere corhere force-pushed the libn/networkdb-property-testing branch from e94053c to 256a1f6 Compare July 10, 2025 19:55
@corhere
Copy link
Contributor Author

corhere commented Jul 10, 2025

You documented that it can fail, but not where

I thought it would be self-evident from my description of the failure mode that it's the assertion of the sample size that failed.

if len(sample) != min(m, len(nodes)-1) {
t.Errorf("got sample size %d, want %d", len(sample), min(m, len(nodes)-1))
}

(I meant to ask about method networkDBFSM.Check - is it just there as a comment?)

It's to satisfy the rapid.StateMachine interface. I have added a comment to that effect, as well.

@corhere corhere force-pushed the libn/networkdb-property-testing branch from 256a1f6 to 7c0ff04 Compare July 14, 2025 23:34
Add a property-based test which asserts that a cluster of NetworkDB
nodes always eventually converges to a consistent state. As this test
takes a long time to run it is build-tagged to be excluded from CI.

Signed-off-by: Cory Snider <csnider@mirantis.com>
corhere added 2 commits July 15, 2025 18:07
TestNetworkDBAlwaysConverges will occasionally find a failure where one
entry is missing on one node even after waiting a full five minutes. One
possible explanation is that the selection of nodes to gossip with is
biased in some way. Test that the mRandomNodes function picks a
uniformly distributed sample of node IDs of sufficient length.

The new test reveals that mRandomNodes may sometimes pick out a sample
of fewer than m nodes even when the number of nodes to pick from
(excluding the local node) is >= m. Put the test behind an xfail tag so
it is opt-in to run, without interfering with CI or bisecting.

Signed-off-by: Cory Snider <csnider@mirantis.com>
The property test for the mRandomNodes function revealed that it may
sometimes pick out a sample of fewer than m nodes even when the number
of nodes to pick from (excluding the local node) is >= m. Rewrite it
using a random shuffle or permutation so that it always picks a
uniformly-distributed sample of the requested size whenever the
population is large enough.

Signed-off-by: Cory Snider <csnider@mirantis.com>
@corhere corhere force-pushed the libn/networkdb-property-testing branch from 7c0ff04 to ac5f464 Compare July 15, 2025 22:08
Copy link
Contributor

@vvoland vvoland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I really like the property-based testing and it's always nice to have some 👍🏻

Copy link
Contributor

@robmry robmry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@robmry robmry merged commit 55f0bd8 into moby:master Jul 16, 2025
163 checks passed
@corhere corhere deleted the libn/networkdb-property-testing branch July 16, 2025 18:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

6 participants