rbd-mirror: clean up stale pool replayers and callouts better#57082

Merged
idryomov merged 3 commits into ceph:main from idryomov:wip-65487
May 6, 2024
@idryomov
Contributor

The code in Mirror::update_pool_replayers() responsible for shutting
down and removing stale pool replayers kicks in only in case the peer
is removed, but not if the peer changes.  However, the code responsible
for (re)starting pool replayers in the same method _does_ create and
start a new pool replayer in that case.  As a result, we can end up
with nearly identical pool replayers running at the same time, hogging
OS resources and confusing instance_id tracking logic and mirror status
reporting at the very least.

The root cause is that PeerSpec is matched normally (i.e. based on all
fields) when it comes to m_pool_replayers, and based only on UUID when
it comes to pool_peers.  This was missed in commit 5463e1a
("rbd-mirror: extract optional peer mon_host/key values from MON").

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
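
The mismatch described in this commit message can be illustrated with a small self-contained model. This is only a sketch under stated assumptions: the `PeerSpec` fields, the `PoolPeer` alias, and the free function `update_pool_replayers()` below are simplified, hypothetical stand-ins and not the actual rbd-mirror code. The model removes stale replayers using a pluggable comparison and then starts replayers for every configured peer, keyed on all fields.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical, simplified peer description (not the real rbd-mirror type).
struct PeerSpec {
  std::string uuid;
  std::string cluster_name;
  std::string mon_host;  // may change while the UUID stays the same
  std::string key;

  // "Normal" matching: every field participates.
  bool operator==(const PeerSpec& rhs) const {
    return std::tie(uuid, cluster_name, mon_host, key) ==
           std::tie(rhs.uuid, rhs.cluster_name, rhs.mon_host, rhs.key);
  }
  bool operator<(const PeerSpec& rhs) const {
    return std::tie(uuid, cluster_name, mon_host, key) <
           std::tie(rhs.uuid, rhs.cluster_name, rhs.mon_host, rhs.key);
  }
};

using PoolPeer = std::pair<int64_t, PeerSpec>;  // (pool id, peer)

// Stale check that looks only at the UUID: a changed peer still "matches".
bool is_stale_by_uuid(const PoolPeer& running,
                      const std::vector<PoolPeer>& configured) {
  for (const auto& c : configured) {
    if (c.first == running.first && c.second.uuid == running.second.uuid) {
      return false;
    }
  }
  return true;
}

// Stale check that compares all fields, consistent with the replayer key.
bool is_stale_by_all_fields(const PoolPeer& running,
                            const std::vector<PoolPeer>& configured) {
  for (const auto& c : configured) {
    if (c == running) {
      return false;
    }
  }
  return true;
}

// One update pass: drop stale replayers, then start the missing ones.
template <typename StaleFn>
std::set<PoolPeer> update_pool_replayers(std::set<PoolPeer> replayers,
                                         const std::vector<PoolPeer>& configured,
                                         StaleFn is_stale) {
  for (auto it = replayers.begin(); it != replayers.end();) {
    if (is_stale(*it, configured)) {
      it = replayers.erase(it);
    } else {
      ++it;
    }
  }
  for (const auto& c : configured) {
    replayers.insert(c);  // keyed on all fields, so a changed peer is "new"
  }
  return replayers;
}
```

In this model, when a peer keeps its UUID but changes `mon_host`, the UUID-only check leaves the old replayer in place while the insert step adds a second one. Using the same all-fields comparison on both sides is one way to avoid that. Whether the actual fix takes this exact form is not stated in the commit message.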
@idryomov idryomov marked this pull request as ready for review April 25, 2024 08:34
@idryomov idryomov requested a review from a team as a code owner April 25, 2024 08:34
@idryomov
Contributor Author

jenkins test windows

@idryomov
Contributor Author

idryomov commented May 5, 2024

  • Nir reported that with this PR (backported to 18.2.3) he no longer sees the intermittent hang during deployment
  • Added an integration test

If a pool replayer is removed in an error state (e.g. after failing to
connect to the remote cluster), its callout should be removed as well.
Otherwise, the error would persist causing "daemon health: ERROR"
status to be reported even after a new pool replayer is created and
started successfully.

Fixes: https://tracker.ceph.com/issues/65487
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
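
The effect of a leftover callout on reported health can be sketched with a toy registry. Everything here is an assumption made for illustration: `CalloutRegistry`, `PoolReplayer`, `remove_replayer()`, and the `Level` enum are hypothetical names and do not reflect the rbd-mirror service daemon interface.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

enum class Level { Ok, Warning, Error };

// Hypothetical registry of health callouts; overall health is the worst level.
class CalloutRegistry {
 public:
  uint64_t add(Level level, std::string description) {
    uint64_t id = ++m_next_id;
    m_callouts.emplace(id, Callout{level, std::move(description)});
    return id;
  }

  void remove(uint64_t id) { m_callouts.erase(id); }

  Level health() const {
    Level worst = Level::Ok;
    for (const auto& entry : m_callouts) {
      if (worst < entry.second.level) {
        worst = entry.second.level;
      }
    }
    return worst;
  }

 private:
  struct Callout {
    Level level;
    std::string description;
  };

  uint64_t m_next_id = 0;
  std::map<uint64_t, Callout> m_callouts;
};

// Hypothetical replayer handle; callout_id 0 means "no callout registered".
struct PoolReplayer {
  uint64_t callout_id = 0;
};

// Tear down a replayer, optionally dropping the callout it registered.
void remove_replayer(CalloutRegistry& registry, PoolReplayer& replayer,
                     bool clean_up_callout) {
  if (clean_up_callout && replayer.callout_id != 0) {
    registry.remove(replayer.callout_id);
    replayer.callout_id = 0;
  }
}
```

In this model, removing a failed replayer without dropping its callout leaves `health()` at `Level::Error` no matter how healthy the replacement replayer is. Dropping the callout together with the replayer lets the status recover.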

wait_for_replay_complete() doesn't wait for image status to get
updated.  This didn't matter previously because these tests are run on
two different pools and nothing else was following.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
@idryomov
Contributor Author

idryomov commented May 6, 2024

  • Added an integration test

Had to tighten preceding tests to fix a sporadic failure on a pre-condition assert in the new integration test.
