
test/librbd: fix race condition in ConcurrentOperations test #66729

Open
tchaikov wants to merge 1 commit into ceph:main from tchaikov:wip-rbd-test-racing

@tchaikov tchaikov commented Dec 24, 2025

The ConcurrentOperations test had a race condition where threads create_snap2 and create_snap3 were started before image1 finished its snap_create and aio_close operations.

Since image1 holds the exclusive lock, when create_snap2 and create_snap3 try to create snapshots, they must either:

  1. Send remote requests to image1 (the lock owner), or
  2. Wait to acquire the lock after image1 releases it

However, image1 is busy completing its own snap_create and then executing aio_close, so it cannot process remote requests properly. The remote requests therefore time out or fail, snap_create returns a non-zero error code, and the ceph_assert(r == 0) checks in the test fail.

The fix ensures image1 fully completes (including aio_close and lock release) before the create_snap2 and create_snap3 threads are started. This allows image2 or image3 to acquire the lock cleanly instead of trying to coordinate with a closing image.

Fixes: https://tracker.ceph.com/issues/70691


Signed-off-by: Kefu Chai <k.chai@proxmox.com>
@tchaikov tchaikov requested a review from a team as a code owner December 24, 2025 03:10
@tchaikov tchaikov requested review from Copilot and removed request for a team December 24, 2025 03:21
Copilot AI left a comment

Pull request overview

This PR fixes a race condition in the ConcurrentOperations test where snapshot creation threads were started prematurely, causing test failures.

Key Changes:

  • Reordered synchronization to ensure image1 fully completes (including async close and lock release) before starting create_snap2 and create_snap3 threads
  • Moved the quiesce_complete and create_snap1.join() calls earlier in the test flow
  • Moved the close completion checks for image1 to complete before spawning new snapshot creation threads

@tchaikov tchaikov requested a review from idryomov December 24, 2025 04:00
tchaikov commented Dec 24, 2025

"make check (arm64)" failed:

Collecting cram@ git+https://github.com/ceph/cram.git@0.7-error-dir
  Cloning https://github.com/ceph/cram.git (to revision 0.7-error-dir) to /tmp/pip-install-6ahrv7ar/cram_362e6273bb734d1bbe9992da167a4124
  Running command git clone --filter=blob:none --quiet https://github.com/ceph/cram.git /tmp/pip-install-6ahrv7ar/cram_362e6273bb734d1bbe9992da167a4124
  Running command git checkout -b 0.7-error-dir --track origin/0.7-error-dir
  Switched to a new branch '0.7-error-dir'
  Branch '0.7-error-dir' set up to track remote branch '0.7-error-dir' from 'origin'.
  Resolved https://github.com/ceph/cram.git to commit aca162f6aeb40d0790998a3e59f13b9301f0f673
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Using legacy 'setup.py install' for cram, since package 'wheel' is not installed.
Installing collected packages: cram
  Running setup.py install for cram: started
  error: subprocess-exited-with-error
  
  × Running setup.py install for cram did not run successfully.
  │ exit code: 1
  ╰─> See above for output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  Running setup.py install for cram: finished with status 'error'
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> cram

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

see https://jenkins.ceph.com/job/ceph-pull-requests-arm64/84863/

should be fixed by #66733

@tchaikov

jenkins test make check arm64

cbodley commented Feb 17, 2026

Quoting a discussion from Slack:

Casey Bodley: Kefu raised a fix for ConcurrentOperations in #66729
Ilya Dryomov: I saw that PR, but it's more of a workaround than the fix -- it limits concurrency in a test that is named ConcurrentOperations
