monclient: try to resend the mon commands to the same monitor if avai… by NitzanMordhai · Pull Request #57718 · ceph/ceph

NitzanMordhai · 2024-05-27T08:30:51Z

…lable

When we have a socket failure or connection issue, we may send a mon command
and never check if it completed. If we resend the command to another monitor,
the resent command may complete before the first sent command. This can cause
users to send the command twice, which can lead to issues in automated
environments. For example:

We have 2 monitors: mon.a and mon.b

1. Send command to delete pool - monclient targets mon.a
2. A socket failure occurs, and mon.a has a delay in response
3. Monclient hunts for another monitor to resend the delete pool command and finds mon.b
4. Mon.b removes the pool and sends an acknowledgment
5. The user script now sends a create pool command, but mon.a now sends the acknowledgment for the pool delete from step 1

We end up without a pool, as mon.a deleted it.
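
The timeline above can be replayed with a toy model. This is only a sketch, not Ceph code: `Cluster`, `Op`, `apply` and `pool_survives` are hypothetical names, both monitors are assumed to act on one shared cluster state, and the "retry on the same monitor" case assumes the stalled delete is applied exactly once before the script sees its acknowledgment.

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy model (hypothetical): every monitor applies commands to one shared state.
struct Cluster {
  std::set<std::string> pools;
};

enum class Op { CreatePool, DeletePool };

// Apply one command; returns true when the cluster state actually changed.
bool apply(Cluster& c, Op op, const std::string& pool) {
  assert(!pool.empty());
  if (op == Op::CreatePool) {
    return c.pools.insert(pool).second;
  }
  return c.pools.erase(pool) > 0;
}

// Replays the timeline and reports whether pool "p" exists at the end.
bool pool_survives(bool resend_to_other_mon) {
  Cluster c;
  c.pools.insert("p");

  // Steps 1-2: the delete is sent to mon.a, which stalls before applying it.
  bool stale_delete_on_mon_a = true;

  if (resend_to_other_mon) {
    // Steps 3-4: the command is resent to mon.b, which deletes and acks.
    apply(c, Op::DeletePool, "p");
  } else {
    // Assumed alternative: the retry goes back to mon.a, so the one pending
    // delete is applied once and nothing stale is left behind.
    apply(c, Op::DeletePool, "p");
    stale_delete_on_mon_a = false;
  }

  // Step 5: the script sees the ack and recreates the pool.
  apply(c, Op::CreatePool, "p");

  // mon.a finally processes the delete it received in step 1.
  if (stale_delete_on_mon_a) {
    apply(c, Op::DeletePool, "p");
  }
  return c.pools.count("p") > 0;
}
```

Under these assumptions, `pool_survives(true)` reports the pool as gone, matching the scenario described, while `pool_survives(false)` leaves the recreated pool in place.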

The mon_client_hunt_on_resent configuration was added to control the behavior of
retrying commands on monitor connection failures.
By default, this option is enabled to prevent situations where a command is retried
on the same monitor, potentially missing better monitor candidates.
Clients experiencing specific conditions that require retrying on the same monitor
can disable this feature by setting the configuration to false.
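
A minimal sketch of the decision this option controls, under stated assumptions: `pick_resend_target` and `MonNames` are hypothetical and are not the MonClient API, the flag is passed in as a plain bool rather than read from configuration, and falling back to hunting when the original monitor is unknown or has left the monmap is an assumption made for illustration.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical stand-in for the monitor map: just the set of monitor names.
using MonNames = std::set<std::string>;

// Returns the name of the monitor to retry on, or an empty string to mean
// "hunt for any available monitor".
std::string pick_resend_target(bool hunt_on_resend,
                               const std::string& sent_name,
                               const MonNames& monmap) {
  if (hunt_on_resend) {
    return "";          // default behaviour: any monitor may take the resend
  }
  if (sent_name.empty()) {
    return "";          // the command was never sent, so there is no monitor to prefer
  }
  if (monmap.count(sent_name) == 0) {
    return "";          // assumed fallback: the original monitor left the monmap
  }
  return sent_name;     // retry on the monitor that first received the command
}
```

With the flag set to false and `mon.a` still in the map, the sketch returns `mon.a`; in every other case it returns an empty string and the caller would hunt as before.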

Fixes: https://tracker.ceph.com/issues/63789
Signed-off-by: Nitzan Mordechai <nmordec@redhat.com>

ljflores · 2024-06-03T18:21:46Z

@NitzanMordhai looks like this is in draft form, but we saw this on bug scrub so please assign a reviewer whenever it's ready!

bill-scales · 2024-06-11T09:16:58Z

src/mon/MonClient.cc

+          cmd->sent_name = monmap.get_name(active_con->get_con()->get_peer_addr());
+        } else if (active_con && cmd->sent_name.length() &&
+                   cmd->sent_name != monmap.get_name(active_con->get_con()->get_peer_addr()) &&
+                   monmap.contains(cmd->sent_name)) {


Could simplify to

(active_con && cmd->sent_name != monmap.get_name(active_con->get_con()->get_peer_addr()))

NitzanMordhai · 2024-06-16

But we can have a situation where this mon has disconnected and is no longer part of the monmap; we will try to reopen a session for it.

bill-scales · 2024-06-17

You have if (!monmap.contains(cmd->sent_name)) above, so technically you don't need the opposite predicate monmap.contains(cmd->sent_name) in the else statement.
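
The redundancy argument can be checked exhaustively. The sketch below assumes the branch order implied by the review comment (a leading `if (!monmap.contains(...))` followed by the quoted `else if`); the actual surrounding code is not shown in this thread, and the function names and the four booleans are hypothetical stand-ins for the real expressions.

```cpp
#include <cassert>

enum class Branch { NotInMap, DifferentMon, None };

// Chain that repeats the contains() check inside the else-if, as in the diff.
Branch with_check(bool active, bool has_name, bool differs, bool in_map) {
  if (!in_map) {
    return Branch::NotInMap;
  } else if (active && has_name && differs && in_map) {
    return Branch::DifferentMon;
  }
  return Branch::None;
}

// Same chain with the repeated contains() check dropped.
Branch without_check(bool active, bool has_name, bool differs, bool in_map) {
  if (!in_map) {
    return Branch::NotInMap;
  } else if (active && has_name && differs) {
    return Branch::DifferentMon;
  }
  return Branch::None;
}

// Compares both chains over all 16 combinations of the four inputs.
bool chains_agree() {
  for (int bits = 0; bits < 16; ++bits) {
    const bool active = (bits & 1) != 0;
    const bool has_name = (bits & 2) != 0;
    const bool differs = (bits & 4) != 0;
    const bool in_map = (bits & 8) != 0;
    if (with_check(active, has_name, differs, in_map) !=
        without_check(active, has_name, differs, in_map)) {
      return false;
    }
  }
  return true;
}
```

Because the else-if is only reached when `in_map` is true, both chains pick the same branch for every input. The sketch says nothing about the separate proposal to drop the `sent_name.length()` check.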

src/mon/MonClient.cc

src/common/options/global.yaml.in

mon_client_hunt_on_resend defaults to true; we want to disable it and let that test resend the commands to the same monitor that failed.

Fixes: https://tracker.ceph.com/issues/63789
Signed-off-by: Nitzan Mordechai <nmordec@redhat.com>

github-actions · 2024-08-16T09:02:39Z

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

github-actions · 2024-09-15T10:01:40Z

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

rzarzynski · 2024-12-09T19:10:19Z

jenkins retest this please

rzarzynski · 2024-12-09T20:37:47Z

Hi @bill-scales, would you mind re-reviewing?

bill-scales

Happy to approve, just the compile error needs fixing

src/mon/MonClient.cc

ljflores · 2024-12-16T19:44:34Z

jenkins retest this please

ljflores · 2025-01-31T18:38:06Z

Hi all, apologies for the delay in QA approval. There was a problem with another PR in the batch, which took some time to work out. But we are retesting and will have an update soon.

ljflores · 2025-02-05T23:48:48Z

Rados approved: https://tracker.ceph.com/projects/rados/wiki/MAIN#httpstrackercephcomissues69329

github-actions bot added core mon labels May 27, 2024

NitzanMordhai force-pushed the wip-nitzan-monclient-try-resent-mon-command-to-same-mon branch from 19f3edc to ee8e66a June 4, 2024 05:11

github-actions bot added the common label Jun 4, 2024

NitzanMordhai marked this pull request as ready for review June 4, 2024 05:11

NitzanMordhai requested a review from a team as a code owner June 4, 2024 05:11

NitzanMordhai requested a review from rzarzynski June 4, 2024 05:14

rzarzynski requested a review from bill-scales June 10, 2024 19:17

bill-scales reviewed Jun 11, 2024

NitzanMordhai added 2 commits June 16, 2024 12:58

NitzanMordhai force-pushed the wip-nitzan-monclient-try-resent-mon-command-to-same-mon branch from ee8e66a to 08112e6 June 16, 2024 13:02

github-actions bot added the tests label Jun 16, 2024

github-actions bot added the stale label Aug 16, 2024

github-actions bot closed this Sep 15, 2024

rzarzynski reopened this Dec 9, 2024

github-actions bot removed the stale label Dec 9, 2024

bill-scales requested changes Dec 11, 2024

Update src/mon/MonClient.cc

4e2874a

Co-authored-by: Bill Scales <156200352+bill-scales@users.noreply.github.com> Signed-off-by: NitzanMordhai <97529641+NitzanMordhai@users.noreply.github.com>

NitzanMordhai requested a review from bill-scales December 11, 2024 12:27

NitzanMordhai added the needs-qa label Dec 11, 2024

bill-scales approved these changes Dec 11, 2024

SrinivasaBharath added the wip-bharath8-testing label Dec 21, 2024

SrinivasaBharath merged commit aeb7fb9 into ceph:main Feb 6, 2025

SrinivasaBharath removed the wip-bharath8-testing label Apr 22, 2025
