test/librbd: fix race condition in ConcurrentOperations test #66729
Open
The ConcurrentOperations test had a race condition: the create_snap2 and create_snap3 threads were started before image1 finished its snap_create and aio_close operations.

Since image1 holds the exclusive lock, when create_snap2 and create_snap3 try to create snapshots, they must either:

1. Send remote requests to image1 (the lock owner), or
2. Wait to acquire the lock after image1 releases it

However, image1 is busy completing its own snap_create and then executing aio_close, so it cannot process remote requests properly. This causes the remote requests to time out or fail, so snap_create returns a non-zero error code and triggers the ceph_assert(r == 0) failures.

The fix ensures image1 fully completes (including aio_close and lock release) before the create_snap2 and create_snap3 threads are started. This allows image2 or image3 to acquire the lock cleanly instead of trying to coordinate with a closing image.

Fixes: https://tracker.ceph.com/issues/70691
Signed-off-by: Kefu Chai <k.chai@proxmox.com>
Pull request overview
This PR fixes a race condition in the ConcurrentOperations test where snapshot creation threads were started prematurely, causing test failures.
Key Changes:
- Reordered synchronization to ensure image1 fully completes (including async close and lock release) before starting the create_snap2 and create_snap3 threads
- Moved the `quiesce_complete` and `create_snap1.join()` calls earlier in the test flow
- Moved the close completion checks for image1 so that they complete before new snapshot creation threads are spawned
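The reordering described above can be illustrated with a toy model. This is a minimal sketch, not the test's actual code and not librbd's API: `toy_snap_create` and `count_failed_snap_creates` are hypothetical names, a `std::timed_mutex` stands in for the exclusive lock, and a lock-acquisition timeout stands in for a remote request that the busy lock owner never services. As a simplification, the owner in the racy ordering stays busy for the whole time the other threads are waiting.

```cpp
#include <chrono>
#include <future>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical stand-in for snap_create on image2/image3: returns 0 if the
// toy lock is acquired within the timeout, -1 otherwise.
int toy_snap_create(std::timed_mutex& lock, std::chrono::milliseconds timeout) {
  std::unique_lock<std::timed_mutex> guard(lock, timeout);
  return guard.owns_lock() ? 0 : -1;
}

// Runs one "owner" (image1) and two snapshot threads, and returns how many
// of the two toy snap creates failed.
// wait_for_owner_first == true models the fixed ordering: the owner finishes
// and releases the lock before the other threads are started.
int count_failed_snap_creates(bool wait_for_owner_first) {
  std::timed_mutex lock;
  std::promise<void> owner_has_lock;
  std::promise<void> owner_may_finish;
  std::future<void> has_lock = owner_has_lock.get_future();
  std::future<void> may_finish = owner_may_finish.get_future();

  // image1: holds the lock while its own snap_create and close are in flight.
  std::thread owner([&] {
    std::lock_guard<std::timed_mutex> guard(lock);
    owner_has_lock.set_value();
    may_finish.wait();
  });
  has_lock.wait();

  if (wait_for_owner_first) {
    // Fixed ordering: let image1 complete and release the lock first.
    owner_may_finish.set_value();
    owner.join();
  }

  const std::chrono::milliseconds timeout(50);
  auto snap2 = std::async(std::launch::async, toy_snap_create,
                          std::ref(lock), timeout);
  auto snap3 = std::async(std::launch::async, toy_snap_create,
                          std::ref(lock), timeout);
  int failures = (snap2.get() != 0 ? 1 : 0) + (snap3.get() != 0 ? 1 : 0);

  if (!wait_for_owner_first) {
    // Racy ordering: the owner only finishes after the others gave up.
    owner_may_finish.set_value();
    owner.join();
  }
  return failures;
}
```

In this model, `count_failed_snap_creates(true)` reports no failures because the lock is free when the snapshot threads start, while `count_failed_snap_creates(false)` reports both attempts failing, which corresponds to the non-zero return codes that tripped `ceph_assert(r == 0)`.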
"make check (arm64)" failed; see https://jenkins.ceph.com/job/ceph-pull-requests-arm64/84863/. This should be fixed by #66733.
jenkins test make check arm64