Skip to content

squid: qa/cephfs: backport mgr/vol improvements for smooth backports in future#61416

Draft
rishabh-d-dave wants to merge 21 commits intoceph:squidfrom
rishabh-d-dave:wip-69561-squid-improvements
Draft

squid: qa/cephfs: backport mgr/vol improvements for smooth backports in future#61416
rishabh-d-dave wants to merge 21 commits intoceph:squidfrom
rishabh-d-dave:wip-69561-squid-improvements

Conversation

@rishabh-d-dave
Copy link
Contributor

@rishabh-d-dave rishabh-d-dave commented Jan 16, 2025

Fixes: https://tracker.ceph.com/issues/69560

Depends on #61415

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 8c536f7)
List volumes returns a list of of 1-member dictionaries where each
member is "{'name': <volname>}". This format is useful printing output
of the command "ceph fs volume ls" but for internal use this format is
hinderance since before using it, this list needs to be resolved  from
list of dictionaries to list of strings.

Return a list of strings and move the code for converting it to list of
dictionaries to method "list_fs_volumes()" (which is method run by "ceph
fs volume ls" command).

This triggers change in async_job.py where we need volnames. The code
here converts list of dicts to list of string. This needs to removed.

Second, don't create unnecessary temporary variable "fs_map" in
"list_volumes()" since it is not required. To get fs names we can
directly iterate.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 9f355b6)
Output of command "ceph fs clone status" will now show the progress made
by that specific clone. The output will show cloning completed in terms
of percentage and amount of bytes that have been cloned and number of
files that have been cloned.

Fixes: https://tracker.ceph.com/issues/61904
Signed-off-by: Rishabh Dave <ridave@redhat.com>

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit d7bc828)
Print a progress bar for ongoing clone job in output of "ceph status".

When multiple clones are ongoing, show 1 progress bar in output of
"ceph status" shows average of progress made by each clone.

When number of clone job is more than number of clone threads, print 2
progress bars in output of "ceph status"  command; one for ongoing clone
jobs and other for ongoing+pending clone jobs.

Fixes: https://tracker.ceph.com/issues/61904
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 65b789e)
1. Let caller check for multiple states. It might happen that clone
   finishes while it is being cancelled, in such cases user might want
    to check for both.
2. Add a helper method to check if clone is in pending state and add a
   separate method to check if clone is in cancelled state.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 10949bf)
Add a helper method that accepts command arguments (along with rest of
paramters accepted by the method run_shell()) and return the stdout of
the command.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 9f60848)
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit db0e736)
TestVolumesHelper._do_subvolume_io() is a helper method that allows
users to generate data for testing. mgr/vol code that reports progress
made by clone jobs depends on the value set for xattr rbytes. It takes
a bit of a time for rbytes to be set.

And, therefore, all tests in TestCloneProgressReporter needs to wait for
subvolume's rbytes xattr's value to be set to the actual amount of data
present on the subvolume before proceeding to actually testing.

So that this can be achieved make _do_subvolume_io() return size of the
data it has generated.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 92aecab)
Clone progress is shown to user through "ceph fs clone status" output
and through "ceph status" output. Test both these features.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit e0c85b8)
Update docs and add release notes about the progress report that is
printed in output of "ceph fs clone status" command and progress bars
that is/are printed in output of "ceph status" command.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 645cc6e)
Test name is test_subvolume_snapshot_info_if_clone_pending_for_no_group,
located in class TestSubvolumeSnapshotClones in test_volumes.py

5 seconds can (sometimes) be insufficient as value of the config option
"snapshot_clone_delay" in this. Increase it to avoid unnecessary race
conditions which leads to irrelevant failures.

Following is an example where 5 seconds was insufficient as waiting
period since instead it took 8 seconds -

2024-07-28T18:16:10.088 DEBUG:teuthology.orchestra.run.smithi064:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph config set mgr mgr/volumes/snapshot_clone_no_wait False
...
2024-07-28T18:16:18.694 DEBUG:teuthology.orchestra.run.smithi064:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph fs subvolume snapshot info cephfs subvol79370 subvol_snap40980

This issue was seen during testing of PR to which this commit belongs.

This commit has been separated from the commit that adds tests for clone
progress reporting so that it's easy to document need for this code
patch and also track it.

This commit is not being moved to a different PR and been kept on the
same PR since it can't be reproduced otherwise. This also ensures that
commit is backported to older release along with code that caused this
issue, causing no one to need to find this commit while backporting
effort.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit a6b95a5)
snapshot is retained despite of deletion (using --retain-snapshots
option of "subvolume rm" command).

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 9e34499)
Currently timeout is set to 5. But hardcoding this is unnecessary since
the class already defines a attribute for this purpose. Use that
instead.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit ccd5878)
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 7239588)
Also add log messages for in these helper methods to allow tracking when
and why more threads were spawned.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit a982495)
In class RTimer in mgr_util.py, "self.finished.set()" is run even though
the event self.finished was set just now. If it wasn't set, the while
loop the precedes it would've never finished running. Therefore, remove
this redundant line of code.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 21cf769)
When an exception occurs in class RTimer of mgr_util.py, only the
exception message is logged but not only this is insufficient for
debugging but also it is hard to spot in logs. This should not be the
case, especially for an occurring exception. Therefore, add code to log
traceback and exception name as well along with exception's message.

Log entry before this patch -

2024-09-27T00:22:38.656+0530 7f05c7e006c0  0 [volumes ERROR mgr_util] task exception: dummy exception for testing

Log entry with this patch -

2024-09-27T00:40:26.509+0530 7f61d64006c0  0 [volumes ERROR mgr_util] exception encountered in RTimer instance "<RTimer(Thread-7, started daemon 140058183075520)>":
Traceback (most recent call last):
  File "/home/rishabh/repos/ceph/minor3/src/pybind/mgr/mgr_util.py", line 91, in run
    self.function(*self.args, **self.kwargs)
  File "/home/rishabh/repos/ceph/minor3/src/pybind/mgr/volumes/fs/stats_util.py", line 232, in _update_progress_bars
    raise RuntimeError('dummy exception for testing')
RuntimeError: dummy exception for testing

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit b96d714)
Orignally, when the feature was in development, IDs for clone progress
bars were set to randomly generated UUID strings. But, eventually, it
was decided to assign fixed strings to them. Unlike UUIDs, these strings
stay the same even when progress bars are destroyed and re-created.
Therefore, instead of re-assigning the same strings every time
initiate_reporting() is called, move them to __init__() so that both the
IDs are defined only once.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 3482ebc)
It is a good practice and a common convention to call base class's
__init__() method at the beginning of derived class's __init__(). It
ensures proper initialization of base class's attributes and methods.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit ad9ff38)
Add some comment to explains the purpose and details of data structures
that are used for queuing of async jobs since right now it's not obvious
from initialization.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 8f7c5f0)
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit ce5d62d)
@rishabh-d-dave
Copy link
Contributor Author

jenkins test pull_request_target

@kotreshhr
Copy link
Contributor

This PR is under test in https://tracker.ceph.com/issues/71055.

@github-actions
Copy link

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@joscollin
Copy link
Member

@rishabh-d-dave Ping for rebase.

@rishabh-d-dave
Copy link
Contributor Author

@rishabh-d-dave Ping for rebase.

This backport depends on #61415, #61415 needs to merge first. I've rebased that PR.

Copy link
Member

@batrick batrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please rebase

@batrick batrick marked this pull request as draft March 18, 2026 14:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants