Skip to content

Conversation

@fjahr
Copy link
Contributor

@fjahr fjahr commented May 12, 2022

Fixes #25031

The feature_index_prune.py test seems to be racy because connections are reestablished after restarts and the blocks are synced via the sync_blocks function. The sync_blocks function has a sanity check at the beginning to check that all nodes in the set have at least one established connection and that is not always the case.

As a solution nodes are not connected via the -connect parameter on start but instead via the connect_nodes helper.

Copy link
Contributor

@theStack theStack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK

Is there an explanation why this race is only (or at least, much more likely, it seems) happening on Windows? If someone knows how to reproduce the fixed failure behaviour on a Linux/BSD system, would be nice to know.

@fjahr
Copy link
Contributor Author

fjahr commented May 12, 2022

Concept ACK

Is there an explanation why this race is only (or at least, much more likely, it seems) happening on Windows? If someone knows how to reproduce the fixed failure behaviour on a Linux/BSD system, would be nice to know.

Agree, it would be nice to be sure :) My assumption is that the Windows build is just slower than everything else and that's why we have only seen it there so far. Looking at https://cirrus-ci.com/github/bitcoin/bitcoin it always seems the be the job that takes the longest time.

@DrahtBot DrahtBot added the Tests label May 13, 2022
Copy link
Member

@maflcko maflcko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why should addconnection be less racy than connect?

Why not use connect_nodes which has the wait_until built into? #25031 (comment)

@maflcko
Copy link
Member

maflcko commented May 13, 2022

It should be possible to reproduce by adding a sleep in the -connect thread?

@fjahr
Copy link
Contributor Author

fjahr commented May 14, 2022

I was able to reproduce the issue as well with this, thanks @MarcoFalke !

why should addconnection be less racy than connect?

From my understanding of the code addconnection only returns after the connection has been established (or failed). It doesn't seem to use threading and since it's only available for testing I thought something like this should be it's primary use-case. I confirmed it works with your hint mentioned above.

Why not use connect_nodes which has the wait_until built into? #25031 (comment)

Right, I can replace addconnection here with connect_nodes, I didn't remember connect_nodes has the wait built into it. It looks a little less efficient (calling more RPCs under the hood) but saves us another line for the ip+port in the test, so I am indifferent between the two. I just didn't like the one-liner in that link because it's harder to read and would require a comment to explain that this is a clunky way to protect against a potential race condition.

@fjahr fjahr force-pushed the 202205-index-prune-fix branch from f42e25b to cfb807c Compare May 14, 2022 15:30
@fjahr fjahr changed the title test: Use addconnection for robustness in index prune test test: Fix race condition in index prune test May 14, 2022
Nodes are restarted and reconnected as part of the test. Afterwards
`sync_blocks` is called immediately on the nodes. `sync_blocks`
first checks that all the included nodes have at least one
connection. Since adding a connection is usually happening in a
thread, sometimes nodes could run into this check before the
connection was fully established so that it would fail the entire
test.

This fix uses the `connect_nodes` helper to make the connection the
nodes. `connect_nodes` has a wait for the connection built into it.
@fjahr fjahr force-pushed the 202205-index-prune-fix branch from cfb807c to 4faa550 Compare May 14, 2022 15:33
@fjahr
Copy link
Contributor Author

fjahr commented May 14, 2022

Updated to use connect_nodes helper instead of addconnection RPC.

@maflcko maflcko merged commit b74a6dd into bitcoin:master May 15, 2022
@maflcko
Copy link
Member

maflcko commented May 15, 2022

If addconnection is somehow better than the stuff in connect_nodes, it could make sense to use addconnection in the connect_nodes impl?

@fjahr
Copy link
Contributor Author

fjahr commented May 15, 2022

If addconnection is somehow better than the stuff in connect_nodes, it could make sense to use addconnection in the connect_nodes impl?

Yeah, I was thinking about that as well. I will give it a try and benchmark.

sidhujag pushed a commit to syscoin/syscoin that referenced this pull request May 28, 2022
4faa550 test: Fix race condition in index pruning test (Fabian Jahr)

Pull request description:

  Fixes bitcoin#25031

  The `feature_index_prune.py` test seems to be racy because connections are reestablished after restarts and the blocks are synced via the `sync_blocks` function. The `sync_blocks` function has a sanity check at the beginning to check that all nodes in the set have at least one established connection and that is not always the case.

  As a solution nodes are not connected via the `-connect` parameter on start but instead via the `connect_nodes` helper.

Top commit has no ACKs.

Tree-SHA512: f88377715f455f1620725fe8ebd6b486fa0209660b193bf68d1ce1452e2086ac5d169d8ca4c2b61443566232e96fb9c6386ee482bc546cce38078d72e7c3c29f
@bitcoin bitcoin locked and limited conversation to collaborators May 15, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Intermittent win64 CI failure in feature_index_prune.py

4 participants