
mgr/vol: handling the failed non-atomic operation #59676

Merged
vshankar merged 2 commits into ceph:main from neesingh-rh:wip-67330 on Sep 29, 2025

Conversation

@neesingh-rh
Contributor

@neesingh-rh neesingh-rh commented Sep 9, 2024

The order of operations for subvolume removal is reversed: the metadata is updated first, followed by a rename of the uuid directory. If the metadata update fails, the remove operation fails, thereby keeping the subvolume metadata consistent with the uuid path.

Fixes: https://tracker.ceph.com/issues/67330
Signed-off-by: Neeraj Pratap Singh neesingh@redhat.com
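The reordering described in this PR can be sketched roughly as follows. This is an illustrative Python toy, not the actual mgr/volumes code; all class and function names here are made up for the example:

```python
# Illustrative sketch of the reordered removal path; FakeMeta and
# remove_subvolume are invented names, not the real mgr/volumes API.

class FakeMeta:
    """Toy stand-in for a subvolume's .meta config store."""
    def __init__(self, path):
        self.path = path
        self.path_removed = False

    def remove_path(self, fail=False):
        # Simulates the metadata update, which can fail (e.g. OSD full).
        if fail:
            raise OSError("metadata update failed")
        self.path_removed = True

def remove_subvolume(meta, renamed, fail_meta=False):
    """Update the metadata first; rename the uuid dir only on success.

    If the metadata update raises, the rename never happens, so the
    uuid path on disk stays consistent with the metadata.
    """
    saved_path = meta.path            # remember the path before dropping it
    meta.remove_path(fail=fail_meta)  # step 1: metadata update (may fail)
    renamed.append(saved_path)        # step 2: rename of the uuid dir (simulated)
```

With fail_meta=True the rename list stays empty, mirroring how a failed metadata update now aborts the removal before the on-disk state is touched.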

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@neesingh-rh
Contributor Author

jenkins test api

@vshankar vshankar requested a review from a team September 12, 2024 11:59
Contributor

@vshankar vshankar left a comment


Changes look fine. Possible to add a test real quick @neesingh-rh ?

@github-actions github-actions bot added the tests label Sep 23, 2024
@rishabh-d-dave
Contributor

jenkins test api

@rishabh-d-dave
Contributor

jenkins test make check

Contributor

@rishabh-d-dave rishabh-d-dave left a comment


Will review it once more in some time.

BTW, have you run the test added on this PR locally?

@neesingh-rh neesingh-rh force-pushed the wip-67330 branch 2 times, most recently from 7aec81b to 53fd9dc on October 21, 2024 18:13
@kotreshhr
Contributor

kotreshhr commented Oct 24, 2024

@neesingh-rh Also please change the commit title of 020e7c6 to something more meaningful - mgr/volumes - Handling the failed non atomic operation doesn't sound complete.
Something like mgr/volumes - Fix subvolume removal in osd full scenario ?

Also in the commit message mention about subvolume removal with --retain-snapshot option. The patch is touching only that case and not the generic subvolume removal when there are no snapshots.

@rishabh-d-dave
Contributor

Will review it once more in some time.

BTW, have you run the test added on this PR locally?

@neesingh-rh ping

@neesingh-rh
Contributor Author

Will review it once more in some time.
BTW, have you run the test added on this PR locally?

@neesingh-rh ping

Yes.

@neesingh-rh
Contributor Author

@neesingh-rh Also please change the commit title of 020e7c6 to something more meaningful - mgr/volumes - Handling the failed non atomic operation doesn't sound complete. Something like mgr/volumes - Fix subvolume removal in osd full scenario ?

Also in the commit message mention about subvolume removal with --retain-snapshot option. The patch is touching only that case and not the generic subvolume removal when there are no snapshots.

Done, PTAL

Contributor

@rishabh-d-dave rishabh-d-dave left a comment


@neesingh-rh I see that PR has gone through some changes since I reviewed it last. Please ping me when changes requested from me have been incorporated and also when PR goes through changes substantial enough.

Current set of changes looks better than the last one; I'll do an in-depth review shortly.

@rishabh-d-dave rishabh-d-dave self-requested a review November 11, 2024 11:34
@rishabh-d-dave
Contributor

ceph API tests failed for unrelated reasons.

@rishabh-d-dave
Contributor

jenkins test api

@neesingh-rh
Contributor Author

Rebased!

@vshankar
Contributor

vshankar commented Sep 2, 2025

@neesingh-rh - I think there are failures in fs:volumes which seem to be related to this PR. See: https://pulpito.ceph.com/vshankar-2025-09-01_09:54:24-fs:volumes-wip-vshankar-testing-20250901.061836-debug-testing-default-smithi/

I did check, we are touching the subvolume remove part only in the case when "--retains-snapshots" is true and the test which is failing in the above run is not related to "--retain-snapshots" in any way. I don't think it is related.

OK. I will have a look at the test failures since they are new failures.

@kotreshhr
Contributor

@neesingh-rh - I think there are failures in fs:volumes which seem to be related to this PR. See: https://pulpito.ceph.com/vshankar-2025-09-01_09:54:24-fs:volumes-wip-vshankar-testing-20250901.061836-debug-testing-default-smithi/

I did check, we are touching the subvolume remove part only in the case when "--retains-snapshots" is true and the test which is failing in the above run is not related to "--retain-snapshots" in any way. I don't think it is related.

OK. I will have a look at the test failures since they are new failures.

I think this is related to this PR. Since the def remove is changed, it's affecting auto cleanup of subvolumes on failure. @neesingh-rh PTAL.

@vshankar
Contributor

vshankar commented Sep 4, 2025

@neesingh-rh - I think there are failures in fs:volumes which seem to be related to this PR. See: https://pulpito.ceph.com/vshankar-2025-09-01_09:54:24-fs:volumes-wip-vshankar-testing-20250901.061836-debug-testing-default-smithi/

I did check, we are touching the subvolume remove part only in the case when "--retains-snapshots" is true and the test which is failing in the above run is not related to "--retain-snapshots" in any way. I don't think it is related.

OK. I will have a look at the test failures since they are new failures.

I think this is related to this PR. Since the def remove is changed, it's affecting auto cleanup of subvolumes on failure. @neesingh-rh PTAL.

It definitely is -- I reran fs:volumes with and without the change and the issue reproduces.

Contributor

@vshankar vshankar left a comment


@neesingh-rh - PTAL at the failed test.

neeraj pratap singh added 2 commits September 17, 2025 19:36
The order of operations for subvolume removal with the retain-snapshot
option set to true is reversed. The metadata is updated first,
followed by a rename of the uuid directory. If the metadata
update fails, the remove operation fails, thereby keeping the
subvolume metadata consistent with the uuid path.

Fixes: https://tracker.ceph.com/issues/67330
Signed-off-by: Neeraj Pratap Singh <neesingh@redhat.com>
… full

Fixes: https://tracker.ceph.com/issues/67330
Signed-off-by: Neeraj Pratap Singh <neesingh@redhat.com>

Conflicts:
	qa/suites/fs/full/tasks/mgr-osd-full.yaml
Contributor

@kotreshhr kotreshhr left a comment


Looks good to me. Hopefully there are no new surprises!

@vshankar
Contributor

@neesingh-rh - could you give a gist about what the issue was with the last revision that caused the test to fail and what has been done to fix it?

@neesingh-rh
Contributor Author

@neesingh-rh - could you give a gist about what the issue was with the last revision that caused the test to fail and what has been done to fix it?

The tests were failing because the self.path value was being saved at the wrong place, which caused the auto cleanup to fail. It should be saved just before the path is deleted from .meta, so that the correct path info is available.
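The ordering issue described here can be shown with a toy example. The dict-based helpers below are hypothetical stand-ins, not the actual subvolume code; they only illustrate why the path must be captured before it is deleted from the metadata:

```python
# Toy illustration of the save-before-delete ordering; both functions
# and the dict-based "meta" are hypothetical, not the real .meta store.

def remove_wrong(meta):
    """Buggy order: the path is read after it was dropped from .meta."""
    meta.pop("path")         # metadata update happens first
    return meta.get("path")  # too late: the value is already gone (None)

def remove_fixed(meta):
    """Fixed order: save the path just before deleting it from .meta."""
    saved_path = meta["path"]  # capture while still present
    meta.pop("path")
    return saved_path          # correct value for the later rename/cleanup
```

In the buggy ordering the later cleanup step sees no path at all, which matches the auto-cleanup failures seen in the test run.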

@vshankar
Contributor

This PR is under test in https://tracker.ceph.com/issues/73162.

@vshankar
Contributor

@neesingh-rh - I am seeing failures in fs:volumes: https://pulpito.ceph.com/vshankar-2025-09-26_04:42:20-fs:volumes-wip-vshankar-testing-20250925.124305-debug-testing-default-smithi/8521305/

In teuthology.log, the error happens when removing a subvolume

2025-09-26T05:09:50.758 DEBUG:teuthology.orchestra.run.smithi063:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph fs subvolume rm --force cephfs subvol_clone77252
...
...
2025-09-26T05:09:51.552 INFO:teuthology.orchestra.run.smithi063.stderr:Error EAGAIN: subvol_clone77252 clone in-progress -- please cancel the clone and retry
2025-09-26T05:09:51.555 DEBUG:teuthology.orchestra.run:got remote process result: 11

I want to be doubly sure that this is not due to this change, especially since it's the subvolume part that this change modifies and the test failure is in that code path too.

@kotreshhr
Contributor

@neesingh-rh - I am seeing failures in fs:volumes: https://pulpito.ceph.com/vshankar-2025-09-26_04:42:20-fs:volumes-wip-vshankar-testing-20250925.124305-debug-testing-default-smithi/8521305/

In teuthology.log, the error happens when removing a subvolume

2025-09-26T05:09:50.758 DEBUG:teuthology.orchestra.run.smithi063:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph fs subvolume rm --force cephfs subvol_clone77252
...
...
2025-09-26T05:09:51.552 INFO:teuthology.orchestra.run.smithi063.stderr:Error EAGAIN: subvol_clone77252 clone in-progress -- please cancel the clone and retry
2025-09-26T05:09:51.555 DEBUG:teuthology.orchestra.run:got remote process result: 11

I want to be doubly sure that this is not due to this change, especially since it's the subvolume part that this change modifies and the test failure is in that code path too.

@vshankar The first failure is the For 4 clone "ceph status" output has 2 progress bars, it should have only 1 progress bar. error, which raised the RuntimeError. The custom tearDown function written for the testcase to clean up the cloned subvolumes doesn't consider whether the clone is in progress and attempts to directly delete the clone with the --force option, hence the error?

2025-09-26T05:09:51.558 INFO:tasks.cephfs_test_runner:Test that one progress bar is printed in output of "ceph status" output ... ERROR
2025-09-26T05:09:51.564 INFO:tasks.cephfs_test_runner:ERROR
2025-09-26T05:09:51.565 INFO:tasks.cephfs_test_runner:
2025-09-26T05:09:51.565 INFO:tasks.cephfs_test_runner:======================================================================
2025-09-26T05:09:51.565 INFO:tasks.cephfs_test_runner:ERROR: test_clones_equal_to_cloner_threads (tasks.cephfs.volumes.test_clone_stats.TestCloneProgressReporter)
2025-09-26T05:09:51.565 INFO:tasks.cephfs_test_runner:Test that one progress bar is printed in output of "ceph status" output
2025-09-26T05:09:51.565 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2025-09-26T05:09:51.565 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2025-09-26T05:09:51.565 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_64801b7661aeb93421857657eb627d609a5602ba/qa/tasks/cephfs/volumes/test_clone_stats.py", line 532, in test_clones_equal_to_cloner_threads
2025-09-26T05:09:51.565 INFO:tasks.cephfs_test_runner:    raise RuntimeError('For 4 clone "ceph status" output has 2 '
2025-09-26T05:09:51.566 INFO:tasks.cephfs_test_runner:RuntimeError: For 4 clone "ceph status" output has 2 progress bars, it should have only 1 progress bar.
2025-09-26T05:09:51.566 INFO:tasks.cephfs_test_runner:pev -
2025-09-26T05:09:51.566 INFO:tasks.cephfs_test_runner:{'mgr-vol-ongoing-clones': {'message': '2 ongoing clones - average progress is 3.946% (1s)\n      [=...........................] (remaining: 34s)', 'progress': 0.03945999965071678, 'add_to_ceph_s': True}, 'mgr-vol-total-clones': {'message': 'Total 3 clones - average progress is 2.63% (0s)\n      [............................] ', 'progress': 0.02630000002682209, 'add_to_ceph_s': True}}
2025-09-26T05:09:51.566 INFO:tasks.cephfs_test_runner:
2025-09-26T05:09:51.566 INFO:tasks.cephfs_test_runner:======================================================================
2025-09-26T05:09:51.566 INFO:tasks.cephfs_test_runner:ERROR: test_clones_equal_to_cloner_threads (tasks.cephfs.volumes.test_clone_stats.TestCloneProgressReporter)
2025-09-26T05:09:51.566 INFO:tasks.cephfs_test_runner:Test that one progress bar is printed in output of "ceph status" output
2025-09-26T05:09:51.566 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2025-09-26T05:09:51.566 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2025-09-26T05:09:51.566 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_64801b7661aeb93421857657eb627d609a5602ba/qa/tasks/cephfs/volumes/test_clone_stats.py", line 49, in tearDown
2025-09-26T05:09:51.566 INFO:tasks.cephfs_test_runner:    self.run_ceph_cmd(f'fs subvolume rm --force {v} {sv}')
2025-09-26T05:09:51.566 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_64801b7661aeb93421857657eb627d609a5602ba/qa/tasks/ceph_test_case.py", line 30, in run_ceph_cmd
2025-09-26T05:09:51.566 INFO:tasks.cephfs_test_runner:    return self.mon_manager.run_cluster_cmd(**kwargs)
2025-09-26T05:09:51.567 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_64801b7661aeb93421857657eb627d609a5602ba/qa/tasks/ceph_manager.py", line 1693, in run_cluster_cmd
2025-09-26T05:09:51.567 INFO:tasks.cephfs_test_runner:    return self.controller.run(**kwargs)
2025-09-26T05:09:51.567 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_teuthology_423b1b31eec63893e3e862ecded0d45f0f3dcd4d/teuthology/orchestra/remote.py", line 575, in run
2025-09-26T05:09:51.567 INFO:tasks.cephfs_test_runner:    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
2025-09-26T05:09:51.567 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_teuthology_423b1b31eec63893e3e862ecded0d45f0f3dcd4d/teuthology/orchestra/run.py", line 461, in run
2025-09-26T05:09:51.567 INFO:tasks.cephfs_test_runner:    r.wait()
2025-09-26T05:09:51.567 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_teuthology_423b1b31eec63893e3e862ecded0d45f0f3dcd4d/teuthology/orchestra/run.py", line 161, in wait
2025-09-26T05:09:51.567 INFO:tasks.cephfs_test_runner:    self._raise_for_status()
2025-09-26T05:09:51.567 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_teuthology_423b1b31eec63893e3e862ecded0d45f0f3dcd4d/teuthology/orchestra/run.py", line 181, in _raise_for_status
2025-09-26T05:09:51.567 INFO:tasks.cephfs_test_runner:    raise CommandFailedError(
2025-09-26T05:09:51.567 INFO:tasks.cephfs_test_runner:teuthology.exceptions.CommandFailedError: Command failed on smithi063 with status 11: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph fs subvolume rm --force cephfs subvol_clone77252'
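A clone-aware cleanup along the lines kotreshhr describes might cancel in-progress clones before the forced removal. This is only a sketch; cleanup_subvolumes and the run_cmd/clone_status helpers below are stand-ins for the qa test harness, not the actual code in test_clone_stats.py:

```python
# Sketch of a clone-aware teardown; run_cmd and clone_status are
# hypothetical stand-ins for the test-harness helpers.

def cleanup_subvolumes(run_cmd, clone_status, vol, subvols):
    for sv in subvols:
        # A clone that is still in progress cannot be removed, even
        # with --force (EAGAIN), so cancel it first.
        if clone_status(vol, sv) == "in-progress":
            run_cmd(f"fs clone cancel {vol} {sv}")
        run_cmd(f"fs subvolume rm --force {vol} {sv}")
```

The point is only the ordering: check the clone state, cancel if needed, then remove, so the forced rm never races an in-progress clone.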

@vshankar
Contributor

@vshankar The first failure is the For 4 clone "ceph status" output has 2 progress bars, it should have only 1 progress bar. error, which raised the RuntimeError. The custom tearDown function written for the testcase to clean up the cloned subvolumes doesn't consider whether the clone is in progress and attempts to directly delete the clone with the --force option, hence the error?

You are right. Thx for checking. I will link up a tracker for the above issue.
