
mds/QuiesceDb: Manager, Agent, a-sock i-face, mgr/volumes CLI #54485

Merged
batrick merged 24 commits into ceph:main from leonid-s-usov:quiesce-db
Mar 13, 2024

Conversation

@leonid-s-usov
Contributor

@leonid-s-usov leonid-s-usov commented Nov 13, 2023

Quiesce DB is one of the components of the "Consistent Snapshots" epic.
The solution is overviewed in a slide deck available for viewing by @redhat users.

Related redmine tickets:

  • Feature #63665: mds: QuiesceDb to manage subvolume quiesce state
  • Feature #63666: mds: QuiesceAgent to execute quiesce operations on an MDS rank
  • Feature #63668: pybind/mgr/volumes: add quiesce protocol API
  • Tasks #63707: mds: AdminSocket command to control the QuiesceDbManager

This PR focuses on the replicated quiesce database maintained by the MDS rank cluster. One of the major goals was to design the component so that it can be easily tested outside of the MDS infrastructure, which is why the communication layer has been abstracted into just two communication callbacks that the infrastructure needs to implement.

Most of the component code is delivered in a single coherent commit, along with the unit tests. Other commits will be dedicated to integration with the MDS infrastructure and other changes that can't be attributed to the core quiesce db code or its tests.

The quiesce db component is composed of the following major parts/actors:

  • QuiesceDbManager is the main actor, implementing both the leader and the replica roles. Normally, there is one manager instance per MDS rank, although, given the decoupling of the infrastructure and the manager, any number of instances can run on a single node, which is how the tests work.
  • The manager interfaces with two main APIs: the infrastructure, which provides communication and cluster configuration (actor 2), and the quiesce db client, which is responsible for quiescing the roots (actor 3).
    • QuiesceClusterMembership is how the manager is configured to be part of a (virtual) cluster. This structure delivers information about the peers and the leader, and provides two communication APIs: send_listing_to, for db replication from the leader to the replicas, and send_ack, for reporting quiesce success from the agents.
    • The client interface consists of a QuiesceMap notify callback and a dedicated manager method for submitting asynchronous acks that report the agent's (rank's) quiesce progress.
  • Last but not least is the manager's user-facing API. The quiesce db user API implements the CLI behavior described in the slide deck mentioned above. The full scope of capabilities is encapsulated in a single QuiesceDbRequest structure, which should help with the implementation of other components that will have to expose the functionality to the administrator of the volumes plugin.
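To illustrate the decoupling described above: below is a minimal, hypothetical C++ sketch (the types, names, and signatures here are simplified stand-ins, not the actual Ceph declarations of QuiesceDbManager or QuiesceClusterMembership) of how a manager that only reaches the outside world through the two membership callbacks can be wired together fully in-process, which is what makes the out-of-MDS unit testing possible:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the PR's types; names mirror the description
// above but the real interfaces are richer.
using RankId = int;

struct DbListing {  // stand-in for the replicated quiesce db payload
  int version = 0;
  std::map<std::string, std::string> roots;  // root -> quiesce state
};

struct ClusterMembership {  // plays the role of QuiesceClusterMembership
  RankId me = 0;
  RankId leader = 0;
  std::set<RankId> peers;
  // The only two outbound channels the manager knows about:
  std::function<void(RankId, const DbListing&)> send_listing_to;  // leader -> replica
  std::function<void(RankId, int /*acked version*/)> send_ack;    // agent -> leader
};

class Manager {  // plays the role of QuiesceDbManager
 public:
  explicit Manager(ClusterMembership m) : membership(std::move(m)) {}

  // Leader role: apply an update locally, then replicate it to every peer.
  void leader_update(const std::string& root, const std::string& state) {
    db.roots[root] = state;
    ++db.version;
    for (RankId p : membership.peers)
      if (p != membership.me) membership.send_listing_to(p, db);
  }

  // Replica role: accept a newer listing and ack it back to the leader.
  void receive_listing(const DbListing& listing) {
    if (listing.version > db.version) {
      db = listing;
      membership.send_ack(membership.leader, db.version);
    }
  }

  DbListing db;

 private:
  ClusterMembership membership;
};
```

Under this scheme, production code would wire the two callbacks to the MDS messenger, while tests can wire them directly to other in-process manager instances and run any number of "ranks" on a single node.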

@leonid-s-usov leonid-s-usov marked this pull request as draft November 13, 2023 19:00
@leonid-s-usov leonid-s-usov self-assigned this Nov 13, 2023
@leonid-s-usov leonid-s-usov force-pushed the quiesce-db branch 2 times, most recently from bd2ae3d to f14f0d2 on November 14, 2023 07:04
@leonid-s-usov
Contributor Author

jenkins test make check

1 similar comment
@leonid-s-usov
Contributor Author

jenkins test make check

@leonid-s-usov leonid-s-usov force-pushed the quiesce-db branch 3 times, most recently from feeeaa9 to c59bdc7 on November 20, 2023 18:14
@kotreshhr kotreshhr self-requested a review November 21, 2023 13:45
@leonid-s-usov leonid-s-usov force-pushed the quiesce-db branch 4 times, most recently from f1a70e3 to c6e5769 on November 30, 2023 14:22
@leonid-s-usov leonid-s-usov changed the title mds/QuiesceDb: implementation and primary unit tests of the new MDS component mds/QuiesceDb: Manager, Agent and single-rank integration Nov 30, 2023
@leonid-s-usov leonid-s-usov force-pushed the quiesce-db branch 5 times, most recently from 7c095a7 to d972227 on December 1, 2023 14:28
Member

@batrick batrick left a comment


I have more code review to do but this is a nice checkpoint. I really appreciate your efforts to make this unit testable. Well done!

@leonid-s-usov leonid-s-usov force-pushed the quiesce-db branch 4 times, most recently from dcef60c to ecd1b9b on December 3, 2023 19:25
* create an instance of the QuiesceDbManager in the rank
* update membership with a new mdsmap
* add an admin socket command for sending requests to the manager

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Fixes: https://tracker.ceph.com/issues/63708
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
With these dedicated structs we can fully defer to QuiesceDbEncoding
when encoding/decoding quiesce db messages

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
@leonid-s-usov
Contributor Author

@vshankar here's the latest re-run of the fs suite on this PR after the latest changes addressing the review comments. Those included utilizing the Monitor map for calculating and broadcasting the common cluster state.

https://pulpito.ceph.com/leonidus-2024-03-04_07:05:03-fs-wip-lusov-qdb-distro-default-smithi/

Member

@batrick batrick left a comment


Looks good to qa/merge.

auto const& leader = fs.mds_map.get_quiesce_db_cluster_leader();
auto const& members = fs.mds_map.get_quiesce_db_cluster_members();
ceph_assert(leader == MDS_GID_NONE || members.contains(leader));
ceph_assert(std::ranges::all_of(members, [&infos = fs.mds_map.mds_info](auto m){return infos.contains(m);}));
Member


This last assert may backfire on you. If you ever change it so members may exist outside the MDSMap, then any monitor that decodes (update_from_paxos) the change (without being upgraded) will fail this sanity check.

However, I think the likelihood of that is low and we have in the past indicated that a monitor upgrade should disable the fsmap sanity checks for this type of problem (which we don't like doing).

Contributor Author


Well, I'll make sure to only update the logic once all monitors are on the new version.

@leonid-s-usov
Contributor Author

Thanks Patrick! 🎉
@vshankar now we just need a green light from the fs suite. Please let me know if you find any red flags in that run above.

I was investigating this Resize test failure, and I couldn't figure out why one of the MDSes started lagging. It looks like the mds_lock is being held, but I doubt it could be the result of the quiesce db, as it's empty in all these tests.

It's in the ceph-mds.b.log:

2024-03-04T07:42:28.809+0000 7f5889600640  1 mds.b Updating MDS map to version 294 from mon.2
2024-03-04T07:42:28.809+0000 7f5889600640 10 mds.b      my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2,11=minor log segments}
2024-03-04T07:42:28.809+0000 7f5889600640 10 mds.b  mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2,11=minor log segments}
2024-03-04T07:42:28.809+0000 7f5889600640 10 mds.b my gid is 7109
2024-03-04T07:42:28.809+0000 7f5889600640 10 mds.b map says I am mds.2.289 state up:stopping
2024-03-04T07:42:28.809+0000 7f5889600640 10 mds.b msgr says I am [v2:172.21.15.177:6832/2547507192,v1:172.21.15.177:6834/2547507192]
2024-03-04T07:42:28.809+0000 7f5889600640 10 mds.b handle_mds_map: handling map as rank 2
2024-03-04T07:42:28.809+0000 7f5889600640  1 mds.2.289 handle_mds_map I am now mds.2.289
2024-03-04T07:42:28.809+0000 7f5889600640  1 mds.2.289 handle_mds_map state change up:active --> up:stopping
2024-03-04T07:42:28.809+0000 7f5889600640  5 mds.beacon.b set_want_state: up:active -> up:stopping
2024-03-04T07:42:28.809+0000 7f5889600640  2 mds.2.289 Stopping...
2024-03-04T07:42:28.809+0000 7f5889600640  5 mds.2.cache shutdown_start
....
2024-03-04T07:42:47.800+0000 7f5886e00640  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s
2024-03-04T07:42:47.800+0000 7f5886e00640  0 mds.beacon.b Skipping beacon heartbeat to monitors (last acked 4.00088s ago); MDS internal heartbeat is not healthy!
2024-03-04T07:42:47.800+0000 7f5886e00640 20 mds.beacon.b sender thread waiting interval 0.5s
2024-03-04T07:42:48.301+0000 7f5886e00640  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s

@leonid-s-usov
Contributor Author

jenkins test make check

@vshankar
Contributor

vshankar commented Mar 5, 2024

Thanks Patrick! 🎉 @vshankar now we just need a green light from the fs suite. [...] I was investigating this Resize test failure [...] it's in the ceph-mds.b.log: [...]

I don't see these failures in the main branch run; I ran the fs suite yesterday and it did not have these failures. How reproducible are these?

With the change we can now avoid having to join it during the membership update, preventing potential deadlocks

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
@leonid-s-usov
Contributor Author

leonid-s-usov commented Mar 7, 2024

@vshankar I see that the issue was fixed with the latest version. Here are two dedicated filtered runs

And the latest full fs suite run is still going as of writing this.

@leonid-s-usov
Contributor Author

jenkins test dashboard

@leonid-s-usov
Contributor Author

jenkins test dashboard cephadm

@vshankar
Contributor

vshankar commented Mar 8, 2024

@vshankar I see that the issue was fixed with the latest version. Here are two dedicated filtered runs [...]

Looks good. Have you checked whether these two failures are related to this change?

@leonid-s-usov
Contributor Author

leonid-s-usov commented Mar 9, 2024

@vshankar I looked at the logs but the cause of the failure wasn't obvious to me. I'm still looking for the cause, and planning to inspect the MDS logs.

In the meantime, I scheduled a set of 20 runs of the same test, twice:

  1. with this PR's build
  2. with the build this PR is based on (-S 4a1c26b52121803d1bd0f8c1c06eb856f2add307)

Both sets have a 100% failure rate, making this PR an unlikely cause of the failure.

@vshankar
Contributor

@vshankar I looked at the logs but the cause of the failure wasn't obvious to me. [...] Both sets have a 100% failure rate, making this PR an unlikely cause of the failure.

The untar_snap_rm.sh failure is a kclient bug. See - https://tracker.ceph.com/issues/64679#note-4.

@vshankar
Contributor

Both sets have a 100% failure rate, making this PR an unlikely cause of the failure. [...]

The untar_snap_rm.sh failure is a kclient bug. See - https://tracker.ceph.com/issues/64679#note-4.

And the kernel_untar_build failure is related to the above too. We would want to rerun the failed jobs once the issue is fixed. Otherwise LGTM 👍

@batrick
Member

batrick commented Mar 12, 2024

jenkins test make check arm64

@batrick
Member

batrick commented Mar 12, 2024

jenkins test dashboard

@batrick
Member

batrick commented Mar 12, 2024

jenkins test dashboard cephadm

Contributor

@vshankar vshankar left a comment


Fantastic work!
