Skip to content

mon/MgrMonitor: plug PAXOS for batched MgrMap/OSDMap#50404

Merged
idryomov merged 3 commits intoceph:mainfrom
batrick:i58923
Apr 3, 2023
Merged

mon/MgrMonitor: plug PAXOS for batched MgrMap/OSDMap#50404
idryomov merged 3 commits intoceph:mainfrom
batrick:i58923

Conversation

@batrick
Copy link
Member

@batrick batrick commented Mar 6, 2023

Fixes: https://tracker.ceph.com/issues/58923

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@batrick batrick requested a review from ajarr March 6, 2023 18:23
@batrick batrick requested a review from a team as a code owner March 6, 2023 18:23
@batrick batrick force-pushed the i58923 branch 2 times, most recently from 1f5f4ab to 8e2c673 Compare March 7, 2023 01:58
@batrick
Copy link
Member Author

batrick commented Mar 9, 2023

jenkins test api

Copy link
Contributor

@ajarr ajarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR looks good.

@ajarr
Copy link
Contributor

ajarr commented Mar 10, 2023

I also ran the local reproducer for https://tracker.ceph.com/issues/55711 mentioned in https://gist.github.com/rzarzynski/25ac59c8422e9ad0b1710a765a77f19a?permalink_comment_id=4172486#gistcomment-4172486 for ~2 hours. I didn't hit any issues.

Copy link
Contributor

@idryomov idryomov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to spell out the motivation in "mon/MgrMonitor: plug PAXOS for batched MgrMap/OSDMap" commit message. Currently all there is is a link to a tracker ticket which itself links a comment in another PR.

@batrick batrick force-pushed the i58923 branch 2 times, most recently from aefdfea to 5d77090 Compare March 10, 2023 19:00
@batrick
Copy link
Member Author

batrick commented Mar 10, 2023

It would be nice to spell out the motivation in "mon/MgrMonitor: plug PAXOS for batched MgrMap/OSDMap" commit message. Currently all there is is a link to a tracker ticket which itself links a comment in another PR.

done

@batrick batrick force-pushed the i58923 branch 2 times, most recently from 3e310a5 to df3139a Compare March 13, 2023 15:25
@github-actions github-actions bot added the cephfs Ceph File System label Mar 13, 2023
batrick added 3 commits March 13, 2023 12:31
The return value is used to indicate whether the pending state should be
committed. There is no concept of "handled message" here (unlike
preprocess_query).

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
This race fixed by 23c3f7 exists wherever we drop the active mgr.
Resolve this by forcing immediate proposal (circumventing any delays)
whenever the active is dropped.

Fixes: 23c3f76
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Plugging PAXOS has the effect of batching map updates into a single
PAXOS transaction. Since we're updating the OSDMap several times and the
MgrMap, plug PAXOS for efficiency. This also has the nice effect of
reducing any delay between the active mgr getting dropped and the
blocklisting of its clients. This doesn't resolve any race condition as
the two maps are never processed in one unit. So the former active
manager may process the OSDMap blocklists before learning it is dropped
from the MgrMap.

Fixes: https://tracker.ceph.com/issues/58923
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
@idryomov
Copy link
Contributor

jenkins test api

@idryomov
Copy link
Contributor

jenkins test windows

@batrick
Copy link
Member Author

batrick commented Mar 28, 2023

jenkins test make check arm64

@ljflores
Copy link
Member

ljflores commented Mar 31, 2023

@batrick I reviewed this in the rados suite. Overall it looks good, but I did open a new tracker for a crash in the monitor (https://tracker.ceph.com/issues/59271). Can you take a look in case it happens to be related to your changes? It is a stretch cluster bug, so I wasn't sure. But if you think it is unrelated, feel free to merge.

Also to note, the test failed once, but passed in the rerun.

Rados suite review: https://pulpito.ceph.com/?branch=wip-yuri4-testing-2023-03-25-0714

Failures:
1. https://tracker.ceph.com/issues/58946
2. https://tracker.ceph.com/issues/59196
3. https://tracker.ceph.com/issues/59271 -- new tracker
4. https://tracker.ceph.com/issues/58560
5. https://tracker.ceph.com/issues/51964
6. https://tracker.ceph.com/issues/58560
7. https://tracker.ceph.com/issues/59192

Details:
1. cephadm: KeyError: 'osdspec_affinity' - Ceph - Mgr - Dashboard
2. ceph_test_lazy_omap_stats segfault while waiting for active+clean - Ceph - RADOS
3. mon: FAILED ceph_assert(osdmon()->is_writeable()) - Ceph - RADOS
4. rook: failed to pull kubelet image - Ceph - Orchestrator
5. qa: test_cephfs_mirror_restart_sync_on_blocklist failure - Ceph - CephFS
6. test_envlibrados_for_rocksdb.sh failed to subscribe to repo - Ceph - RADOS
7. cls/test_cls_sdk.sh: Health check failed: 1 pool(s) do not have an application enabled (POOL_APP_NOT_ENABLED) - Ceph - RADOS

@yuriw
Copy link
Contributor

yuriw commented Mar 31, 2023

ref: https://trello.com/c/epwSlEHP

@batrick
Copy link
Member Author

batrick commented Apr 3, 2023

jenkins test make check arm64

@batrick
Copy link
Member Author

batrick commented Apr 3, 2023

@batrick I reviewed this in the rados suite. Overall it looks good, but I did open a new tracker for a crash in the monitor (https://tracker.ceph.com/issues/59271). Can you take a look in case it happens to be related to your changes? It is a stretch cluster bug, so I wasn't sure. But if you think it is unrelated, feel free to merge.

I don't see how it could be related. It's a rather baffling bug since trigger_degraded_stretch_mode can only be called in that path if osdmon()->is_writeable() is true. In any case, the MgrMonitor did not propose anything for several minutes so it's unlikely this PR could be related.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants