mon/MgrMonitor: plug PAXOS for batched MgrMap/OSDMap by batrick · Pull Request #50404 · ceph/ceph

batrick · 2023-03-06T18:23:41Z

Fixes: https://tracker.ceph.com/issues/58923

batrick · 2023-03-09T23:22:52Z

jenkins test api

ajarr

PR looks good.

ajarr · 2023-03-10T03:32:33Z

I also ran the local reproducer for https://tracker.ceph.com/issues/55711 mentioned in https://gist.github.com/rzarzynski/25ac59c8422e9ad0b1710a765a77f19a?permalink_comment_id=4172486#gistcomment-4172486 for ~2 hours. I didn't hit any issues.

idryomov

It would be nice to spell out the motivation in the "mon/MgrMonitor: plug PAXOS for batched MgrMap/OSDMap" commit message. Currently, all it contains is a link to a tracker ticket, which itself links to a comment in another PR.

batrick · 2023-03-10T19:00:21Z

> It would be nice to spell out the motivation in the "mon/MgrMonitor: plug PAXOS for batched MgrMap/OSDMap" commit message. Currently, all it contains is a link to a tracker ticket, which itself links to a comment in another PR.

done

mon: fix semantic error prepare_update return

The return value is used to indicate whether the pending state should be committed. There is no concept of "handled message" here (unlike preprocess_query).

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
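The distinction between the two return values can be illustrated with a toy model. This is a minimal sketch, not Ceph source: `ToyService`, its integer "state", and the rejection rule are assumptions made up for illustration. Only the names `preprocess_query` and `prepare_update` come from the commit message.

```cpp
#include <cassert>

// Toy model only; not Ceph source. Member names are illustrative.
struct ToyService {
  int committed = 0;   // last committed state
  int pending = 0;     // staged, not yet committed state
  int proposals = 0;   // number of commits performed

  // Returns true when the message is fully handled from committed state.
  bool preprocess_query(int requested) const {
    return requested == committed;
  }

  // Returns true only when pending state changed and should be committed.
  bool prepare_update(int requested) {
    if (requested < 0) {
      return false;  // message was dealt with (rejected), nothing to commit
    }
    pending = requested;
    return true;
  }

  void dispatch(int requested) {
    if (preprocess_query(requested)) return;  // handled on the read-only path
    if (!prepare_update(requested)) return;   // no pending change to commit
    assert(pending == requested);
    committed = pending;
    ++proposals;
  }
};
```

In this sketch, returning `true` from `prepare_update` for a rejected request would trigger a pointless commit, which is the kind of semantic mix-up the commit message describes.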

mgr: force propose whenever the active changes

The race fixed by 23c3f76 exists wherever we drop the active mgr. Resolve this by forcing immediate proposal (circumventing any delays) whenever the active is dropped.

Fixes: 23c3f76
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
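The idea of circumventing the proposal delay can be sketched as follows. This is a hypothetical model rather than the actual MgrMonitor code: `ToyMgrMonitor`, `kProposeInterval`, and the tick-driven timer are assumptions chosen to show the behavior, not names from the Ceph tree.

```cpp
#include <cassert>

// Toy model only; not Ceph source.
struct ToyMgrMonitor {
  static constexpr double kProposeInterval = 1.0;  // hypothetical delay (seconds)

  bool has_pending = false;  // is there a staged map change?
  double wait_left = 0.0;    // time left before the staged change is proposed
  int epoch = 1;             // bumped on every committed proposal

  // Stage a map change; dropping the active mgr skips the delay entirely.
  void stage_change(bool drops_active) {
    if (!has_pending) {
      has_pending = true;
      wait_left = kProposeInterval;
    }
    if (drops_active) {
      wait_left = 0.0;  // force immediate proposal
    }
    maybe_propose();
  }

  // Advance the delay timer by dt seconds.
  void tick(double dt) {
    assert(dt >= 0.0);
    if (!has_pending) return;
    wait_left -= dt;
    maybe_propose();
  }

 private:
  void maybe_propose() {
    if (has_pending && wait_left <= 0.0) {
      has_pending = false;
      ++epoch;
    }
  }
};
```

An ordinary change only becomes visible after the interval elapses, while a change that drops the active is committed in the same call that stages it.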

mon/MgrMonitor: plug PAXOS for batched MgrMap/OSDMap

Plugging PAXOS has the effect of batching map updates into a single PAXOS transaction. Since we're updating the OSDMap several times as well as the MgrMap, plug PAXOS for efficiency. This also has the nice effect of reducing any delay between the active mgr getting dropped and the blocklisting of its clients.

This doesn't resolve any race condition, as the two maps are never processed in one unit: the former active manager may process the OSDMap blocklists before learning it is dropped from the MgrMap.

Fixes: https://tracker.ceph.com/issues/58923
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
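The batching effect of plugging can be shown with a small stand-in. This is a sketch under stated assumptions: `ToyPaxos`, `drop_active_mgr`, and the string "updates" are hypothetical and only model the plug/unplug idea, not the monitor's real transaction handling.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model only; not Ceph source.
class ToyPaxos {
 public:
  void plug() {
    assert(!plugged_);
    plugged_ = true;
  }
  void unplug() {
    plugged_ = false;
    flush();  // everything queued while plugged goes out together
  }
  void propose(const std::string& update) {
    queued_.push_back(update);
    if (!plugged_) flush();  // unplugged: each update is its own transaction
  }
  const std::vector<std::vector<std::string>>& transactions() const {
    return txns_;
  }

 private:
  void flush() {
    if (queued_.empty()) return;
    txns_.push_back(queued_);
    queued_.clear();
  }
  bool plugged_ = false;
  std::vector<std::string> queued_;
  std::vector<std::vector<std::string>> txns_;
};

// Dropping the active mgr: several OSDMap updates plus one MgrMap update.
inline void drop_active_mgr(ToyPaxos& paxos, bool batch) {
  if (batch) paxos.plug();
  paxos.propose("osdmap: blocklist client A");
  paxos.propose("osdmap: blocklist client B");
  paxos.propose("mgrmap: drop active");
  if (batch) paxos.unplug();
}
```

Without plugging, the three updates in this model become three transactions; with plugging they become one. As the commit message notes, a consumer still applies the two maps separately, so batching narrows the delay but does not remove the race.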

idryomov · 2023-03-23T13:41:16Z

jenkins test api

idryomov · 2023-03-23T13:41:22Z

jenkins test windows

batrick · 2023-03-28T15:12:19Z

jenkins test make check arm64

ljflores · 2023-03-31T15:33:07Z

@batrick I reviewed this in the rados suite. Overall it looks good, but I did open a new tracker for a crash in the monitor (https://tracker.ceph.com/issues/59271). Can you take a look in case it happens to be related to your changes? It is a stretch cluster bug, so I wasn't sure. But if you think it is unrelated, feel free to merge.

Also to note, the test failed once, but passed in the rerun.

Rados suite review: https://pulpito.ceph.com/?branch=wip-yuri4-testing-2023-03-25-0714

Failures:
1. https://tracker.ceph.com/issues/58946
2. https://tracker.ceph.com/issues/59196
3. https://tracker.ceph.com/issues/59271 -- new tracker
4. https://tracker.ceph.com/issues/58560
5. https://tracker.ceph.com/issues/51964
6. https://tracker.ceph.com/issues/58560
7. https://tracker.ceph.com/issues/59192

Details:
1. cephadm: KeyError: 'osdspec_affinity' - Ceph - Mgr - Dashboard
2. ceph_test_lazy_omap_stats segfault while waiting for active+clean - Ceph - RADOS
3. mon: FAILED ceph_assert(osdmon()->is_writeable()) - Ceph - RADOS
4. rook: failed to pull kubelet image - Ceph - Orchestrator
5. qa: test_cephfs_mirror_restart_sync_on_blocklist failure - Ceph - CephFS
6. test_envlibrados_for_rocksdb.sh failed to subscribe to repo - Ceph - RADOS
7. cls/test_cls_sdk.sh: Health check failed: 1 pool(s) do not have an application enabled (POOL_APP_NOT_ENABLED) - Ceph - RADOS

yuriw · 2023-03-31T15:36:40Z

ref: https://trello.com/c/epwSlEHP

batrick · 2023-04-03T12:39:50Z

jenkins test make check arm64

batrick · 2023-04-03T12:59:46Z

> @batrick I reviewed this in the rados suite. Overall it looks good, but I did open a new tracker for a crash in the monitor (https://tracker.ceph.com/issues/59271). Can you take a look in case it happens to be related to your changes? It is a stretch cluster bug, so I wasn't sure. But if you think it is unrelated, feel free to merge.

I don't see how it could be related. It's a rather baffling bug since trigger_degraded_stretch_mode can only be called in that path if osdmon()->is_writeable() is true. In any case, the MgrMonitor did not propose anything for several minutes so it's unlikely this PR could be related.

batrick added bug-fix needs-review mgr labels Mar 6, 2023

batrick requested a review from ajarr March 6, 2023 18:23

batrick requested a review from a team as a code owner March 6, 2023 18:23

github-actions bot added core mon labels Mar 6, 2023

batrick force-pushed the i58923 branch 2 times, most recently from 1f5f4ab to 8e2c673 Compare March 7, 2023 01:58

ajarr reviewed Mar 7, 2023

ajarr approved these changes Mar 7, 2023

ajarr self-requested a review March 7, 2023 20:19

batrick force-pushed the i58923 branch from 8e2c673 to 531006c Compare March 8, 2023 19:59

batrick mentioned this pull request Mar 8, 2023

mon: fix a race between mgr fail and MgrMonitor::prepare_beacon() #46318

Merged

batrick force-pushed the i58923 branch 2 times, most recently from d8c64d2 to 60d3ef5 Compare March 9, 2023 18:12

ajarr reviewed Mar 9, 2023

ajarr reviewed Mar 9, 2023

batrick force-pushed the i58923 branch from 60d3ef5 to a8c6f2e Compare March 9, 2023 21:08

ajarr requested review from gregsfortytwo and rzarzynski March 9, 2023 23:33

ajarr approved these changes Mar 10, 2023

batrick force-pushed the i58923 branch from a8c6f2e to 4f0233e Compare March 10, 2023 01:50

idryomov reviewed Mar 10, 2023

batrick force-pushed the i58923 branch 2 times, most recently from aefdfea to 5d77090 Compare March 10, 2023 19:00

idryomov reviewed Mar 13, 2023

batrick force-pushed the i58923 branch 2 times, most recently from 3e310a5 to df3139a Compare March 13, 2023 15:25

github-actions bot added the cephfs Ceph File System label Mar 13, 2023

batrick force-pushed the i58923 branch from df3139a to 76ddaba Compare March 13, 2023 16:15

batrick added 3 commits March 13, 2023 12:31

mon: fix semantic error prepare_update return

8caa1fd

mgr: force propose whenever the active changes

30b20d3

batrick force-pushed the i58923 branch from 76ddaba to 2e057bb Compare March 13, 2023 16:31

idryomov approved these changes Mar 13, 2023

batrick added needs-qa and removed needs-review labels Mar 23, 2023

yuriw added the wip-yuri4-testing label Mar 24, 2023

batrick added the wip-pdonnell-testing3 label Mar 28, 2023

batrick removed the wip-pdonnell-testing3 label Mar 30, 2023

yuriw added TESTED and removed wip-yuri4-testing labels Mar 31, 2023

idryomov merged commit 436cc67 into ceph:main Apr 3, 2023

batrick deleted the i58923 branch April 3, 2023 15:54

This was referenced Apr 14, 2023

quincy: MgrMonitor: batch commit OSDMap and MgrMap mutations #50979

Merged

reef: MgrMonitor: batch commit OSDMap and MgrMap mutations #50978

Merged

pacific: MgrMonitor: batch commit OSDMap and MgrMap mutations #50980

Merged