
mgr/vol: handling the failed non-atomic operation #59676

Merged
vshankar merged 2 commits into ceph:main from neesingh-rh:wip-67330 on Sep 29, 2025

Conversation

@neesingh-rh
Contributor

@neesingh-rh neesingh-rh commented Sep 9, 2024

The order of operations for subvolume removal is reversed: the metadata is updated first, followed by a rename of the uuid directory. If the metadata update fails, the remove operation fails, thereby keeping the subvolume metadata consistent with the uuid path.

Fixes: https://tracker.ceph.com/issues/67330
Signed-off-by: Neeraj Pratap Singh neesingh@redhat.com
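The reordering described in this PR can be sketched roughly as follows. This is an illustrative Python toy, not the actual mgr/volumes code; all class and function names here are made up for the example:

```python
# Illustrative sketch of the reordered removal path; FakeMeta and
# remove_subvolume are invented names, not the real mgr/volumes API.

class FakeMeta:
    """Toy stand-in for a subvolume's .meta config store."""
    def __init__(self, path):
        self.path = path
        self.path_removed = False

    def remove_path(self, fail=False):
        # Simulates the metadata update, which can fail (e.g. OSD full).
        if fail:
            raise OSError("metadata update failed")
        self.path_removed = True

def remove_subvolume(meta, renamed, fail_meta=False):
    """Update the metadata first; rename the uuid dir only on success.

    If the metadata update raises, the rename never happens, so the
    uuid path on disk stays consistent with the metadata.
    """
    saved_path = meta.path            # remember the path before dropping it
    meta.remove_path(fail=fail_meta)  # step 1: metadata update (may fail)
    renamed.append(saved_path)        # step 2: rename of the uuid dir (simulated)
```

With fail_meta=True the rename list stays empty, mirroring how a failed metadata update now aborts the removal before the on-disk state is touched.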

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@neesingh-rh
Contributor Author

jenkins test api

@vshankar vshankar requested a review from a team September 12, 2024 11:59
Contributor

@vshankar vshankar left a comment


Changes look fine. Possible to add a test real quick @neesingh-rh ?

@github-actions github-actions bot added the tests label Sep 23, 2024
@rishabh-d-dave
Contributor

jenkins test api

@rishabh-d-dave
Contributor

jenkins test make check

Contributor

@rishabh-d-dave rishabh-d-dave left a comment


Will review it once more in some time.

BTW, have you run the test added on this PR locally?

@neesingh-rh neesingh-rh force-pushed the wip-67330 branch 2 times, most recently from 7aec81b to 53fd9dc on October 21, 2024 18:13
@kotreshhr
Contributor

kotreshhr commented Oct 24, 2024

@neesingh-rh Also please change the commit title of 020e7c6 to something more meaningful - mgr/volumes - Handling the failed non atomic operation doesn't sound complete.
Something like mgr/volumes - Fix subvolume removal in osd full scenario ?

Also in the commit message mention about subvolume removal with --retain-snapshot option. The patch is touching only that case and not the generic subvolume removal when there are no snapshots.

@rishabh-d-dave
Contributor

Will review it once more in some time.

BTW, have you run the test added on this PR locally?

@neesingh-rh ping

@neesingh-rh
Contributor Author

Will review it once more in some time.
BTW, have you run the test added on this PR locally?

@neesingh-rh ping

Yes.

@neesingh-rh
Contributor Author

@neesingh-rh Also please change the commit title of 020e7c6 to something more meaningful - mgr/volumes - Handling the failed non atomic operation doesn't sound complete. Something like mgr/volumes - Fix subvolume removal in osd full scenario ?

Also in the commit message mention about subvolume removal with --retain-snapshot option. The patch is touching only that case and not the generic subvolume removal when there are no snapshots.

Done, PTAL

Contributor

@rishabh-d-dave rishabh-d-dave left a comment


@neesingh-rh I see that PR has gone through some changes since I reviewed it last. Please ping me when changes requested from me have been incorporated and also when PR goes through changes substantial enough.

Current set of changes looks better than the last one; I'll do an in-depth review shortly.

@rishabh-d-dave rishabh-d-dave self-requested a review November 11, 2024 11:34
@rishabh-d-dave
Contributor

ceph API tests failed for unrelated reasons.

@rishabh-d-dave
Contributor

jenkins test api

@neesingh-rh
Contributor Author

Rebased!

@vshankar
Contributor

vshankar commented Sep 2, 2025

@neesingh-rh - I think there are failures in fs:volumes which seem to be related to this PR. See: https://pulpito.ceph.com/vshankar-2025-09-01_09:54:24-fs:volumes-wip-vshankar-testing-20250901.061836-debug-testing-default-smithi/

I did check, we are touching the subvolume remove part only in the case when "--retains-snapshots" is true and the test which is failing in the above run is not related to "--retain-snapshots" in any way. I don't think it is related.

OK. I will have a look at the test failures since they are new failures.

@kotreshhr
Contributor

@neesingh-rh - I think there are failures in fs:volumes which seem to be related to this PR. See: https://pulpito.ceph.com/vshankar-2025-09-01_09:54:24-fs:volumes-wip-vshankar-testing-20250901.061836-debug-testing-default-smithi/

I did check, we are touching the subvolume remove part only in the case when "--retains-snapshots" is true and the test which is failing in the above run is not related to "--retain-snapshots" in any way. I don't think it is related.

OK. I will have a look at the test failures since they are new failures.

I think this is related to this PR. Since the def remove is changed, it's affecting auto cleanup of subvolumes on failure. @neesingh-rh PTAL.

@vshankar
Contributor

vshankar commented Sep 4, 2025

@neesingh-rh - I think there are failures in fs:volumes which seem to be related to this PR. See: https://pulpito.ceph.com/vshankar-2025-09-01_09:54:24-fs:volumes-wip-vshankar-testing-20250901.061836-debug-testing-default-smithi/

I did check, we are touching the subvolume remove part only in the case when "--retains-snapshots" is true and the test which is failing in the above run is not related to "--retain-snapshots" in any way. I don't think it is related.

OK. I will have a look at the test failures since they are new failures.

I think this is related to this PR. Since the def remove is changed, it's affecting auto cleanup of subvolumes on failure. @neesingh-rh PTAL.

It definitely is -- I reran fs:volumes with and without the change and the issue reproduces.

Contributor

@vshankar vshankar left a comment


@neesingh-rh - PTAL at the failed test.

neeraj pratap singh added 2 commits September 17, 2025 19:36
The order of operations for subvolume removal with the retain-snapshot
option set to true is reversed. The metadata is updated first,
followed by a rename of the uuid directory. If the metadata
update fails, the remove operation fails, thereby keeping the
subvolume metadata consistent with the uuid path.

Fixes: https://tracker.ceph.com/issues/67330
Signed-off-by: Neeraj Pratap Singh <neesingh@redhat.com>
… full

Fixes: https://tracker.ceph.com/issues/67330
Signed-off-by: Neeraj Pratap Singh <neesingh@redhat.com>

Conflicts:
	qa/suites/fs/full/tasks/mgr-osd-full.yaml
Contributor

@kotreshhr kotreshhr left a comment


Looks good to me. Hopefully there are no new surprises!

@vshankar
Contributor

@neesingh-rh - could you give a gist about what the issue was with the last revision that caused the test to fail and what has been done to fix it?

@neesingh-rh
Contributor Author

@neesingh-rh - could you give a gist about what the issue was with the last revision that caused the test to fail and what has been done to fix it?

The tests were failing because the self.path value was being saved at the wrong place, which caused the auto cleanup to fail. It should be saved just before the path is deleted from .meta, so that the correct path info is available.
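The ordering issue described here can be shown with a toy example. The dict-based helpers below are hypothetical stand-ins, not the actual subvolume code; they only illustrate why the path must be captured before it is deleted from the metadata:

```python
# Toy illustration of the save-before-delete ordering; both functions
# and the dict-based "meta" are hypothetical, not the real .meta store.

def remove_wrong(meta):
    """Buggy order: the path is read after it was dropped from .meta."""
    meta.pop("path")         # metadata update happens first
    return meta.get("path")  # too late: the value is already gone (None)

def remove_fixed(meta):
    """Fixed order: save the path just before deleting it from .meta."""
    saved_path = meta["path"]  # capture while still present
    meta.pop("path")
    return saved_path          # correct value for the later rename/cleanup
```

In the buggy ordering the later cleanup step sees no path at all, which matches the auto-cleanup failures seen in the test run.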

@vshankar
Contributor

This PR is under test in https://tracker.ceph.com/issues/73162.

@vshankar
Contributor

@neesingh-rh - I am seeing failures in fs:volumes: https://pulpito.ceph.com/vshankar-2025-09-26_04:42:20-fs:volumes-wip-vshankar-testing-20250925.124305-debug-testing-default-smithi/8521305/

In teuthology.log, the error happens when removing a subvolume

2025-09-26T05:09:50.758 DEBUG:teuthology.orchestra.run.smithi063:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph fs subvolume rm --force cephfs subvol_clone77252
...
...
2025-09-26T05:09:51.552 INFO:teuthology.orchestra.run.smithi063.stderr:Error EAGAIN: subvol_clone77252 clone in-progress -- please cancel the clone and retry
2025-09-26T05:09:51.555 DEBUG:teuthology.orchestra.run:got remote process result: 11

I want to be doubly sure that this is not due to this change, especially since it's the subvolume part that this change modifies and the test failure is in that code path too.

@kotreshhr
Contributor

@neesingh-rh - I am seeing failures in fs:volumes: https://pulpito.ceph.com/vshankar-2025-09-26_04:42:20-fs:volumes-wip-vshankar-testing-20250925.124305-debug-testing-default-smithi/8521305/

In teuthology.log, the error happens when removing a subvolume

2025-09-26T05:09:50.758 DEBUG:teuthology.orchestra.run.smithi063:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph fs subvolume rm --force cephfs subvol_clone77252
...
...
2025-09-26T05:09:51.552 INFO:teuthology.orchestra.run.smithi063.stderr:Error EAGAIN: subvol_clone77252 clone in-progress -- please cancel the clone and retry
2025-09-26T05:09:51.555 DEBUG:teuthology.orchestra.run:got remote process result: 11

I want to be doubly sure that this is not due to this change, especially since it's the subvolume part that this change modifies and the test failure is in that code path too.

@vshankar The first failure is the For 4 clone "ceph status" output has 2 progress bars, it should have only 1 progress bar. error, which raised the RuntimeError. The custom tearDown function written for the testcase to clean up the cloned subvolumes doesn't consider whether the clone is in progress and attempts to directly delete the clone with the --force option, hence the error?

2025-09-26T05:09:51.558 INFO:tasks.cephfs_test_runner:Test that one progress bar is printed in output of "ceph status" output ... ERROR
2025-09-26T05:09:51.564 INFO:tasks.cephfs_test_runner:ERROR
2025-09-26T05:09:51.565 INFO:tasks.cephfs_test_runner:
2025-09-26T05:09:51.565 INFO:tasks.cephfs_test_runner:======================================================================
2025-09-26T05:09:51.565 INFO:tasks.cephfs_test_runner:ERROR: test_clones_equal_to_cloner_threads (tasks.cephfs.volumes.test_clone_stats.TestCloneProgressReporter)
2025-09-26T05:09:51.565 INFO:tasks.cephfs_test_runner:Test that one progress bar is printed in output of "ceph status" output
2025-09-26T05:09:51.565 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2025-09-26T05:09:51.565 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2025-09-26T05:09:51.565 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_64801b7661aeb93421857657eb627d609a5602ba/qa/tasks/cephfs/volumes/test_clone_stats.py", line 532, in test_clones_equal_to_cloner_threads
2025-09-26T05:09:51.565 INFO:tasks.cephfs_test_runner:    raise RuntimeError('For 4 clone "ceph status" output has 2 '
2025-09-26T05:09:51.566 INFO:tasks.cephfs_test_runner:RuntimeError: For 4 clone "ceph status" output has 2 progress bars, it should have only 1 progress bar.
2025-09-26T05:09:51.566 INFO:tasks.cephfs_test_runner:pev -
2025-09-26T05:09:51.566 INFO:tasks.cephfs_test_runner:{'mgr-vol-ongoing-clones': {'message': '2 ongoing clones - average progress is 3.946% (1s)\n      [=...........................] (remaining: 34s)', 'progress': 0.03945999965071678, 'add_to_ceph_s': True}, 'mgr-vol-total-clones': {'message': 'Total 3 clones - average progress is 2.63% (0s)\n      [............................] ', 'progress': 0.02630000002682209, 'add_to_ceph_s': True}}
2025-09-26T05:09:51.566 INFO:tasks.cephfs_test_runner:
2025-09-26T05:09:51.566 INFO:tasks.cephfs_test_runner:======================================================================
2025-09-26T05:09:51.566 INFO:tasks.cephfs_test_runner:ERROR: test_clones_equal_to_cloner_threads (tasks.cephfs.volumes.test_clone_stats.TestCloneProgressReporter)
2025-09-26T05:09:51.566 INFO:tasks.cephfs_test_runner:Test that one progress bar is printed in output of "ceph status" output
2025-09-26T05:09:51.566 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2025-09-26T05:09:51.566 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2025-09-26T05:09:51.566 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_64801b7661aeb93421857657eb627d609a5602ba/qa/tasks/cephfs/volumes/test_clone_stats.py", line 49, in tearDown
2025-09-26T05:09:51.566 INFO:tasks.cephfs_test_runner:    self.run_ceph_cmd(f'fs subvolume rm --force {v} {sv}')
2025-09-26T05:09:51.566 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_64801b7661aeb93421857657eb627d609a5602ba/qa/tasks/ceph_test_case.py", line 30, in run_ceph_cmd
2025-09-26T05:09:51.566 INFO:tasks.cephfs_test_runner:    return self.mon_manager.run_cluster_cmd(**kwargs)
2025-09-26T05:09:51.567 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_64801b7661aeb93421857657eb627d609a5602ba/qa/tasks/ceph_manager.py", line 1693, in run_cluster_cmd
2025-09-26T05:09:51.567 INFO:tasks.cephfs_test_runner:    return self.controller.run(**kwargs)
2025-09-26T05:09:51.567 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_teuthology_423b1b31eec63893e3e862ecded0d45f0f3dcd4d/teuthology/orchestra/remote.py", line 575, in run
2025-09-26T05:09:51.567 INFO:tasks.cephfs_test_runner:    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
2025-09-26T05:09:51.567 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_teuthology_423b1b31eec63893e3e862ecded0d45f0f3dcd4d/teuthology/orchestra/run.py", line 461, in run
2025-09-26T05:09:51.567 INFO:tasks.cephfs_test_runner:    r.wait()
2025-09-26T05:09:51.567 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_teuthology_423b1b31eec63893e3e862ecded0d45f0f3dcd4d/teuthology/orchestra/run.py", line 161, in wait
2025-09-26T05:09:51.567 INFO:tasks.cephfs_test_runner:    self._raise_for_status()
2025-09-26T05:09:51.567 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_teuthology_423b1b31eec63893e3e862ecded0d45f0f3dcd4d/teuthology/orchestra/run.py", line 181, in _raise_for_status
2025-09-26T05:09:51.567 INFO:tasks.cephfs_test_runner:    raise CommandFailedError(
2025-09-26T05:09:51.567 INFO:tasks.cephfs_test_runner:teuthology.exceptions.CommandFailedError: Command failed on smithi063 with status 11: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph fs subvolume rm --force cephfs subvol_clone77252'
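A clone-aware cleanup along the lines kotreshhr describes might cancel in-progress clones before the forced removal. This is only a sketch; cleanup_subvolumes and the run_cmd/clone_status helpers below are stand-ins for the qa test harness, not the actual code in test_clone_stats.py:

```python
# Sketch of a clone-aware teardown; run_cmd and clone_status are
# hypothetical stand-ins for the test-harness helpers.

def cleanup_subvolumes(run_cmd, clone_status, vol, subvols):
    for sv in subvols:
        # A clone that is still in progress cannot be removed, even
        # with --force (EAGAIN), so cancel it first.
        if clone_status(vol, sv) == "in-progress":
            run_cmd(f"fs clone cancel {vol} {sv}")
        run_cmd(f"fs subvolume rm --force {vol} {sv}")
```

The point is only the ordering: check the clone state, cancel if needed, then remove, so the forced rm never races an in-progress clone.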

@vshankar
Contributor

@vshankar The first failure is the For 4 clone "ceph status" output has 2 progress bars, it should have only 1 progress bar. error, which raised the RuntimeError. The custom tearDown function written for the testcase to clean up the cloned subvolumes doesn't consider whether the clone is in progress and attempts to directly delete the clone with the --force option, hence the error?

You are right. Thx for checking. I will link up a tracker for the above issue.
