
mds/QuiesceDb: Manager, Agent, a-sock i-face, mgr/volumes CLI #54485

Merged
batrick merged 24 commits into ceph:main from leonid-s-usov:quiesce-db
Mar 13, 2024

Conversation

@leonid-s-usov
Contributor

@leonid-s-usov leonid-s-usov commented Nov 13, 2023

Quiesce DB is one of the components of the "Consistent Snapshots" epic.
The solution is overviewed in a slide deck available for viewing by @redhat users.

Related redmine tickets:

  • Feature #63665: mds: QuiesceDb to manage subvolume quiesce state
  • Feature #63666: mds: QuiesceAgent to execute quiesce operations on an MDS rank
  • Feature #63668: pybind/mgr/volumes: add quiesce protocol API
  • Tasks #63707: mds: AdminSocket command to control the QuiesceDbManager

This PR focuses on the replicated quiesce database maintained by the MDS rank cluster. One of the major goals was to design the component so that it can be easily tested outside of the MDS infrastructure, which is why the communication layer has been abstracted into just two communication callbacks that the infrastructure needs to implement.

Most of the component code is delivered in a single coherent commit, along with the unit tests. Other commits will be dedicated to integration with the MDS infrastructure and other changes that can't be attributed to the core quiesce db code or its tests.

The quiesce db component is composed of the following major parts/actors:

  • QuiesceDbManager is the main actor, implementing both the leader and the replica roles. Normally, there is one manager instance per MDS rank, although, given the decoupling of the infrastructure and the manager, any number of instances can run on a single node, which is how the tests work.
  • The manager interfaces with two main APIs: the infrastructure, which provides communication and cluster configuration (actor 2), and the quiesce db client, which is responsible for quiescing the roots (actor 3).
    • QuiesceClusterMembership is how the manager is configured to be part of a (virtual) cluster. This structure delivers information about the peers and the leader, and provides two communication APIs: send_listing_to, for db replication from the leader to the replicas, and send_ack, for reporting quiesce success from the agents.
    • The client interface consists of a QuiesceMap notify callback and a dedicated manager method for submitting asynchronous acks that report the agent's (rank's) quiesce progress.
  • Last but not least is the manager's user-facing API. The quiesce db user API implements the CLI behavior described in the slide deck mentioned above. The full scope of capabilities is encapsulated in a single QuiesceDbRequest structure, which should help with the implementation of other components that will have to expose the functionality to the administrator of the volumes plugin.
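To illustrate the decoupling described above: below is a minimal, hypothetical C++ sketch (the types, names, and signatures here are simplified stand-ins, not the actual Ceph declarations of QuiesceDbManager or QuiesceClusterMembership) of how a manager that only reaches the outside world through the two membership callbacks can be wired together fully in-process, which is what makes the out-of-MDS unit testing possible:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the PR's types; names mirror the description
// above but the real interfaces are richer.
using RankId = int;

struct DbListing {  // stand-in for the replicated quiesce db payload
  int version = 0;
  std::map<std::string, std::string> roots;  // root -> quiesce state
};

struct ClusterMembership {  // plays the role of QuiesceClusterMembership
  RankId me = 0;
  RankId leader = 0;
  std::set<RankId> peers;
  // The only two outbound channels the manager knows about:
  std::function<void(RankId, const DbListing&)> send_listing_to;  // leader -> replica
  std::function<void(RankId, int /*acked version*/)> send_ack;    // agent -> leader
};

class Manager {  // plays the role of QuiesceDbManager
 public:
  explicit Manager(ClusterMembership m) : membership(std::move(m)) {}

  // Leader role: apply an update locally, then replicate it to every peer.
  void leader_update(const std::string& root, const std::string& state) {
    db.roots[root] = state;
    ++db.version;
    for (RankId p : membership.peers)
      if (p != membership.me) membership.send_listing_to(p, db);
  }

  // Replica role: accept a newer listing and ack it back to the leader.
  void receive_listing(const DbListing& listing) {
    if (listing.version > db.version) {
      db = listing;
      membership.send_ack(membership.leader, db.version);
    }
  }

  DbListing db;

 private:
  ClusterMembership membership;
};
```

Under this scheme, production code would wire the two callbacks to the MDS messenger, while tests can wire them directly to other in-process manager instances and run any number of "ranks" on a single node.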

@leonid-s-usov leonid-s-usov marked this pull request as draft November 13, 2023 19:00
@leonid-s-usov leonid-s-usov self-assigned this Nov 13, 2023
@leonid-s-usov leonid-s-usov force-pushed the quiesce-db branch 2 times, most recently from bd2ae3d to f14f0d2 on November 14, 2023 07:04
@leonid-s-usov
Contributor Author

jenkins test make check

1 similar comment
@leonid-s-usov
Contributor Author

jenkins test make check

@leonid-s-usov leonid-s-usov force-pushed the quiesce-db branch 3 times, most recently from feeeaa9 to c59bdc7 on November 20, 2023 18:14
@kotreshhr kotreshhr self-requested a review November 21, 2023 13:45
@leonid-s-usov leonid-s-usov force-pushed the quiesce-db branch 4 times, most recently from f1a70e3 to c6e5769 on November 30, 2023 14:22
@leonid-s-usov leonid-s-usov changed the title mds/QuiesceDb: implementation and primary unit tests of the new MDS component mds/QuiesceDb: Manager, Agent and single-rank integration Nov 30, 2023
@leonid-s-usov leonid-s-usov force-pushed the quiesce-db branch 5 times, most recently from 7c095a7 to d972227 on December 1, 2023 14:28
Member

@batrick batrick left a comment


I have more code review to do but this is a nice checkpoint. I really appreciate your efforts to make this unit testable. Well done!

@leonid-s-usov leonid-s-usov force-pushed the quiesce-db branch 4 times, most recently from dcef60c to ecd1b9b on December 3, 2023 19:25
* create an instance of the QuiesceDbManager in the rank
* update membership with a new mdsmap
* add an admin socket command for sending requests to the manager

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Fixes: https://tracker.ceph.com/issues/63708
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
With these dedicated structs we can fully defer to QuiesceDbEncoding
when encoding/decoding quiesce db messages

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
@leonid-s-usov
Contributor Author

@vshankar here's the latest re-run of the fs suite on this PR after the latest changes addressing the review comments. Those included utilizing the Monitor map for calculating and broadcasting the common cluster state.

https://pulpito.ceph.com/leonidus-2024-03-04_07:05:03-fs-wip-lusov-qdb-distro-default-smithi/

Member

@batrick batrick left a comment


Looks good to qa/merge.

auto const& leader = fs.mds_map.get_quiesce_db_cluster_leader();
auto const& members = fs.mds_map.get_quiesce_db_cluster_members();
ceph_assert(leader == MDS_GID_NONE || members.contains(leader));
ceph_assert(std::ranges::all_of(members, [&infos = fs.mds_map.mds_info](auto m){return infos.contains(m);}));
Member


This last assert may backfire on you. If you ever change it so members may exist outside the MDSMap, then any monitor that decodes (update_from_paxos) the change (without being upgraded) will fail this sanity check.

However, I think the likelihood of that is low and we have in the past indicated that a monitor upgrade should disable the fsmap sanity checks for this type of problem (which we don't like doing).

Contributor Author


Well, I'll make sure to only update the logic once all monitors are on the new version.

@leonid-s-usov
Contributor Author

Thanks Patrick! 🎉
@vshankar now we just need a green light from the fs suite. Please let me know if you find any red flags in that run above.

I was investigating this Resize test failure, and I couldn't figure out why one of the MDSes started lagging. It looks like the mds_lock is being held, but I doubt it could be the result of the quiesce db, as it's empty in all these tests.

It's in the ceph-mds.b.log:

2024-03-04T07:42:28.809+0000 7f5889600640  1 mds.b Updating MDS map to version 294 from mon.2
2024-03-04T07:42:28.809+0000 7f5889600640 10 mds.b      my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2,11=minor log segments}
2024-03-04T07:42:28.809+0000 7f5889600640 10 mds.b  mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2,11=minor log segments}
2024-03-04T07:42:28.809+0000 7f5889600640 10 mds.b my gid is 7109
2024-03-04T07:42:28.809+0000 7f5889600640 10 mds.b map says I am mds.2.289 state up:stopping
2024-03-04T07:42:28.809+0000 7f5889600640 10 mds.b msgr says I am [v2:172.21.15.177:6832/2547507192,v1:172.21.15.177:6834/2547507192]
2024-03-04T07:42:28.809+0000 7f5889600640 10 mds.b handle_mds_map: handling map as rank 2
2024-03-04T07:42:28.809+0000 7f5889600640  1 mds.2.289 handle_mds_map I am now mds.2.289
2024-03-04T07:42:28.809+0000 7f5889600640  1 mds.2.289 handle_mds_map state change up:active --> up:stopping
2024-03-04T07:42:28.809+0000 7f5889600640  5 mds.beacon.b set_want_state: up:active -> up:stopping
2024-03-04T07:42:28.809+0000 7f5889600640  2 mds.2.289 Stopping...
2024-03-04T07:42:28.809+0000 7f5889600640  5 mds.2.cache shutdown_start
....
2024-03-04T07:42:47.800+0000 7f5886e00640  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s
2024-03-04T07:42:47.800+0000 7f5886e00640  0 mds.beacon.b Skipping beacon heartbeat to monitors (last acked 4.00088s ago); MDS internal heartbeat is not healthy!
2024-03-04T07:42:47.800+0000 7f5886e00640 20 mds.beacon.b sender thread waiting interval 0.5s
2024-03-04T07:42:48.301+0000 7f5886e00640  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s

@leonid-s-usov
Contributor Author

jenkins test make check

@vshankar
Contributor

vshankar commented Mar 5, 2024

Thanks Patrick! 🎉 @vshankar now we just need a green light from the fs suite. [...] I was investigating this Resize test failure [...] it's in the ceph-mds.b.log: [...]

I don't see these failures in the main branch run; I ran the fs suite yesterday and it did not have these failures. How reproducible are these?

With the change we can now avoid having to join it during the membership update, preventing potential deadlocks

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
@leonid-s-usov
Contributor Author

leonid-s-usov commented Mar 7, 2024

@vshankar I see that the issue was fixed with the latest version. Here are two dedicated filtered runs

And the latest full fs suite run is still going as of writing this.

@leonid-s-usov
Contributor Author

jenkins test dashboard

@leonid-s-usov
Contributor Author

jenkins test dashboard cephadm

@vshankar
Contributor

vshankar commented Mar 8, 2024

@vshankar I see that the issue was fixed with the latest version. Here are two dedicated filtered runs [...]

Looks good. Have you checked whether these two failures are related to this change?

@leonid-s-usov
Contributor Author

leonid-s-usov commented Mar 9, 2024

@vshankar I looked at the logs but the cause of the failure wasn't obvious to me. I'm still looking for the cause, and planning to inspect the MDS logs.

In the meantime, I scheduled a set of 20 runs of the same test, twice:

  1. with this PR's build
  2. with the build this PR is based on (-S 4a1c26b52121803d1bd0f8c1c06eb856f2add307)

Both sets have a 100% failure rate, making this PR an unlikely cause of the failure.

@vshankar
Contributor

@vshankar I looked at the logs but the cause of the failure wasn't obvious to me. [...] Both sets have a 100% failure rate, making this PR an unlikely cause of the failure.

The untar_snap_rm.sh failure is a kclient bug. See - https://tracker.ceph.com/issues/64679#note-4.

@vshankar
Contributor

Both sets have a 100% failure rate, making this PR an unlikely cause of the failure. [...]

The untar_snap_rm.sh failure is a kclient bug. See - https://tracker.ceph.com/issues/64679#note-4.

And the kernel_untar_build failure is related to the above too. We would want to rerun the failed jobs once the issue is fixed. Otherwise LGTM 👍

@batrick
Member

batrick commented Mar 12, 2024

jenkins test make check arm64

@batrick
Member

batrick commented Mar 12, 2024

jenkins test dashboard

@batrick
Member

batrick commented Mar 12, 2024

jenkins test dashboard cephadm

Contributor

@vshankar vshankar left a comment


Fantastic work!
