mon/MgrMonitor: plug PAXOS for batched MgrMap/OSDMap#50404
jenkins test api
I also ran the local reproducer for https://tracker.ceph.com/issues/55711 mentioned in https://gist.github.com/rzarzynski/25ac59c8422e9ad0b1710a765a77f19a?permalink_comment_id=4172486#gistcomment-4172486 for ~2 hours. I didn't hit any issues.
idryomov
left a comment
It would be nice to spell out the motivation in the "mon/MgrMonitor: plug PAXOS for batched MgrMap/OSDMap" commit message. Currently it contains only a link to a tracker ticket, which in turn links to a comment in another PR.
Done.
The return value is used to indicate whether the pending state should be committed. There is no concept of "handled message" here (unlike preprocess_query).

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
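The distinction between the two return values can be illustrated with a minimal sketch. Everything here (`ToyService`, `preprocess`, `prepare`, `commit`, `dispatch`) is a hypothetical stand-in invented for illustration and is not the monitor's actual code; it only assumes the two-stage shape the commit message describes, where the read-only stage reports "handled" and the update stage reports "commit the pending state".

```cpp
#include <string>

// Hypothetical toy service; names and fields are illustrative, not Ceph's.
struct ToyService {
  std::string active = "mgr.a";          // last committed state
  std::string pending_active = "mgr.a";  // state staged for the next commit
  int commits = 0;                       // number of commits performed

  // Read-only stage: true means the message was fully handled here.
  bool preprocess(const std::string& from) const { return from == active; }

  // Update stage: true means the pending state changed and should be
  // committed. It says nothing about whether the message was "handled".
  bool prepare(const std::string& from) {
    if (from == pending_active) return false;  // nothing new to commit
    pending_active = from;
    return true;
  }

  // Make the pending state the committed state.
  void commit() {
    active = pending_active;
    ++commits;
  }

  // Route a message through both stages.
  void dispatch(const std::string& from) {
    if (preprocess(from)) return;  // handled without any state change
    if (prepare(from)) commit();   // commit only when asked to
  }
};
```

In this sketch, a message from the current active never reaches `prepare`, and a message that changes the pending state triggers exactly one commit.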
The race fixed by 23c3f76 exists wherever we drop the active mgr. Resolve this by forcing an immediate proposal (circumventing any delays) whenever the active is dropped.

Fixes: 23c3f76
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Plugging PAXOS has the effect of batching map updates into a single PAXOS transaction. Since we're updating the OSDMap several times and the MgrMap, plug PAXOS for efficiency. This also has the nice effect of reducing any delay between the active mgr getting dropped and the blocklisting of its clients.

This doesn't resolve any race condition as the two maps are never processed in one unit. So the former active manager may process the OSDMap blocklists before learning it is dropped from the MgrMap.

Fixes: https://tracker.ceph.com/issues/58923
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
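The batching effect described in this commit message can be sketched with a toy model. `ToyPaxos`, `queue_update`, and `drop_active` are hypothetical names made up for this illustration; the sketch assumes only that a plugged consensus layer holds queued updates until it is unplugged, and it is not the monitor's real implementation.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a consensus layer; not Ceph's Paxos class.
class ToyPaxos {
public:
  // While plugged, queued updates are held back instead of being proposed.
  void plug() { plugged_ = true; }

  // Unplugging flushes everything queued so far as one transaction.
  void unplug() {
    assert(plugged_);
    plugged_ = false;
    flush();
  }

  // Queue one map update; propose it right away unless plugged.
  void queue_update(std::string update) {
    pending_.push_back(std::move(update));
    if (!plugged_) flush();
  }

  // Each inner vector is one committed transaction.
  const std::vector<std::vector<std::string>>& committed() const {
    return committed_;
  }

private:
  void flush() {
    if (pending_.empty()) return;
    committed_.push_back(std::move(pending_));
    pending_.clear();  // leave the moved-from vector in a known empty state
  }

  bool plugged_ = false;
  std::vector<std::string> pending_;
  std::vector<std::vector<std::string>> committed_;
};

// Drop the active mgr: blocklist its clients (OSDMap updates), then record
// the drop (MgrMap update). With batch=true all updates share one transaction.
inline void drop_active(ToyPaxos& paxos,
                        const std::vector<std::string>& clients,
                        bool batch) {
  if (batch) paxos.plug();
  for (const auto& c : clients) paxos.queue_update("osdmap: blocklist " + c);
  paxos.queue_update("mgrmap: drop active");
  if (batch) paxos.unplug();
}
```

With two clients, the unbatched path commits three separate transactions while the batched path commits one transaction holding all three updates. The toy model covers only the commit side; it does not model how a former active mgr consumes the two maps, which is where the race mentioned above remains.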
jenkins test api
jenkins test windows
jenkins test make check arm64
@batrick I reviewed this in the rados suite. Overall it looks good, but I did open a new tracker for a crash in the monitor (https://tracker.ceph.com/issues/59271). Can you take a look in case it happens to be related to your changes? It is a stretch cluster bug, so I wasn't sure. But if you think it is unrelated, feel free to merge. Also to note, the test failed once, but passed in the rerun.
Rados suite review: https://pulpito.ceph.com/?branch=wip-yuri4-testing-2023-03-25-0714
jenkins test make check arm64
I don't see how it could be related. It's a rather baffling bug since |
Fixes: https://tracker.ceph.com/issues/58923