monclient: try to resend the mon commands to the same monitor if avai…#57718

Merged
SrinivasaBharath merged 3 commits into ceph:main from NitzanMordhai:wip-nitzan-monclient-try-resent-mon-command-to-same-mon
Feb 6, 2025

Conversation

@NitzanMordhai
Contributor

@NitzanMordhai NitzanMordhai commented May 27, 2024

…lable

When we have a socket failure or connection issue, we may send a mon command
and never check if it completed. If we resend the command to another monitor,
the resent command may complete before the first sent command. This can cause
users to send the command twice, which can lead to issues in automated
environments. For example:

We have 2 monitors: mon.a and mon.b

  1. Send command to delete pool - monclient targets mon.a
  2. A socket failure occurs, and mon.a has a delay in response
  3. Monclient hunts for another monitor to resend the delete pool command
    and finds mon.b
  4. Mon.b removes the pool and sends an acknowledgment
  5. The user script now sends a create pool command, but mon.a now sends the
    acknowledgment for the pool delete from step 1

We end up without a pool, as mon.a deleted it.
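The ordering problem in the steps above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the "cluster" is a plain set of pool names, commands take effect in the order they are processed, and the names `Command`, `apply_in_order`, and the pool name `p` are hypothetical and not part of Ceph.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy model only: the "cluster" is a set of pool names, and each command
// takes effect in the order in which it happens to be processed.
enum class Op { DeletePool, CreatePool };

struct Command {
  Op op;
  std::string pool;
};

std::set<std::string> apply_in_order(std::set<std::string> pools,
                                     const std::vector<Command>& order) {
  for (const Command& c : order) {
    if (c.op == Op::DeletePool) {
      pools.erase(c.pool);
    } else {
      pools.insert(c.pool);
    }
  }
  return pools;
}

// The resend went to mon.b; the delayed original from mon.a lands last.
bool pool_exists_after_hunting_resend() {
  const std::vector<Command> order = {
      {Op::DeletePool, "p"},  // resent delete, handled via mon.b
      {Op::CreatePool, "p"},  // user script recreates the pool
      {Op::DeletePool, "p"},  // delayed original delete from mon.a
  };
  return apply_in_order({"p"}, order).count("p") == 1;
}

// The command is only ever processed once, by the original monitor.
bool pool_exists_after_same_monitor_resend() {
  const std::vector<Command> order = {
      {Op::DeletePool, "p"},
      {Op::CreatePool, "p"},
  };
  return apply_in_order({"p"}, order).count("p") == 1;
}

// Checks the two outcomes described in the scenario.
void check_toy_model() {
  assert(!pool_exists_after_hunting_resend());
  assert(pool_exists_after_same_monitor_resend());
}
```

In this model the pool is lost only when the same delete is processed twice around the create, which is the situation the change tries to avoid.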

The mon_client_hunt_on_resend configuration option was added to control how
commands are retried after a monitor connection failure.
It is enabled by default, so that a retried command is not pinned to the same
monitor and the client does not miss better monitor candidates.
Clients that need commands to be retried on the same monitor can disable the
option by setting it to false.
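A minimal sketch of the decision this option controls, written as a standalone function. It is not the MonClient code: `pick_resend_target` and its parameters are hypothetical, the monmap is reduced to a list of monitor names, and the fallback when the original monitor has left the monmap is an assumption based on the review discussion.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: returns the name of the monitor a resent command
// should target.
//   hunt_on_resend : value of the option (true by default)
//   sent_name      : monitor the command was first sent to ("" if none)
//   active_mon     : monitor the client is currently connected to
//   monmap_names   : monitor names currently in the monmap
std::string pick_resend_target(bool hunt_on_resend,
                               const std::string& sent_name,
                               const std::string& active_mon,
                               const std::vector<std::string>& monmap_names) {
  assert(!active_mon.empty());  // sketch assumes an active connection exists

  // Default behaviour: use whichever monitor the client hunted down.
  if (hunt_on_resend || sent_name.empty()) {
    return active_mon;
  }

  // Option disabled: prefer the original monitor if it is still known.
  const bool still_in_monmap =
      std::find(monmap_names.begin(), monmap_names.end(), sent_name) !=
      monmap_names.end();
  return still_in_monmap ? sent_name : active_mon;
}
```

For example, with the option disabled, a command first sent to `a` while the client is now connected to `b` would be retried on `a` as long as `a` is still in the monmap.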

Fixes: https://tracker.ceph.com/issues/63789
Signed-off-by: Nitzan Mordechai nmordec@redhat.com

@ljflores
Member

ljflores commented Jun 3, 2024

@NitzanMordhai looks like this is in draft form, but we saw this on bug scrub so please assign a reviewer whenever it's ready!

@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-monclient-try-resent-mon-command-to-same-mon branch from 19f3edc to ee8e66a Compare June 4, 2024 05:11
@github-actions github-actions bot added the common label Jun 4, 2024
@NitzanMordhai NitzanMordhai marked this pull request as ready for review June 4, 2024 05:11
@NitzanMordhai NitzanMordhai requested a review from a team as a code owner June 4, 2024 05:11
@NitzanMordhai NitzanMordhai requested a review from rzarzynski June 4, 2024 05:14
@rzarzynski rzarzynski requested a review from bill-scales June 10, 2024 19:17
    cmd->sent_name = monmap.get_name(active_con->get_con()->get_peer_addr());
  } else if (active_con && cmd->sent_name.length() &&
             cmd->sent_name != monmap.get_name(active_con->get_con()->get_peer_addr()) &&
             monmap.contains(cmd->sent_name)) {
Contributor

Could simplify to

(active_con && cmd->sent_name != monmap.get_name(active_con->get_con()->get_peer_addr()))

Contributor Author

But we can have a situation where this mon has disconnected and is no longer part of the monmap; in that case we would try to reopen a session for it.

Contributor

You have if (!monmap.contains(cmd->sent_name)) above, so technically you don't need the opposite predicate monmap.contains(cmd->sent_name) in the else statement.
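The redundancy argument can be checked exhaustively with a small sketch. It assumes the preceding branch is exactly a `!contains` test, as the comment states, and reduces the other conditions to booleans; `branch_full`, `branch_simplified`, and `branches_agree` are hypothetical names, and the real code has more context than is visible here.

```cpp
// Each function returns which branch of the if / else-if chain is taken:
// 0 = first branch, 1 = else-if branch, 2 = neither.

// Else-if condition repeats the `contains` check.
constexpr int branch_full(bool contains, bool active, bool differs) {
  if (!contains) {
    return 0;
  } else if (active && differs && contains) {
    return 1;
  }
  return 2;
}

// Else-if condition without the repeated check.
constexpr int branch_simplified(bool contains, bool active, bool differs) {
  if (!contains) {
    return 0;
  } else if (active && differs) {
    return 1;
  }
  return 2;
}

// Compares both versions over all 8 input combinations.
constexpr bool branches_agree() {
  for (int mask = 0; mask < 8; ++mask) {
    const bool contains = (mask & 1) != 0;
    const bool active = (mask & 2) != 0;
    const bool differs = (mask & 4) != 0;
    if (branch_full(contains, active, differs) !=
        branch_simplified(contains, active, differs)) {
      return false;
    }
  }
  return true;
}

static_assert(branches_agree(),
              "repeating `contains` in the else-if changes nothing");
```

Because the else-if is only reached when `contains` is already true, both versions select the same branch for every input.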

mon_client_hunt_on_resend defaults to true; we want to disable it
so that the test resends the commands to the same monitor that
failed.

Fixes: https://tracker.ceph.com/issues/63789
Signed-off-by: Nitzan Mordechai <nmordec@redhat.com>
@NitzanMordhai NitzanMordhai force-pushed the wip-nitzan-monclient-try-resent-mon-command-to-same-mon branch from ee8e66a to 08112e6 Compare June 16, 2024 13:02
@github-actions github-actions bot added the tests label Jun 16, 2024
@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Aug 16, 2024
@github-actions

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Sep 15, 2024
@rzarzynski rzarzynski reopened this Dec 9, 2024
@rzarzynski
Contributor

jenkins retest this please

@rzarzynski
Contributor

Hi @bill-scales, would you mind re-reviewing?

@github-actions github-actions bot removed the stale label Dec 9, 2024
Contributor

@bill-scales bill-scales left a comment

Happy to approve, just the compile error needs fixing

Co-authored-by: Bill Scales <156200352+bill-scales@users.noreply.github.com>
Signed-off-by: NitzanMordhai <97529641+NitzanMordhai@users.noreply.github.com>
@ljflores
Member

jenkins retest this please

@ljflores
Member

Hi all, apologies for the delay in QA approval. There was a problem with another PR in the batch, which took some time to work out. But we are retesting and will have an update soon.

@ljflores
Member

ljflores commented Feb 5, 2025
