Skip to content

crimson/osd/pg_recovery: backoff if the recovery/backfill is deferred#56806

Merged
Matan-B merged 2 commits intoceph:mainfrom
xxhdx1985126:wip-65399
Apr 30, 2024
Merged

crimson/osd/pg_recovery: backoff if the recovery/backfill is deferred#56806
Matan-B merged 2 commits intoceph:mainfrom
xxhdx1985126:wip-65399

Conversation

@xxhdx1985126
Copy link
Copy Markdown
Member

Fixes: https://tracker.ceph.com/issues/65399
Signed-off-by: Xuehan Xu xuxuehan@qianxin.com

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@xxhdx1985126 xxhdx1985126 requested a review from a team as a code owner April 10, 2024 06:56
@xxhdx1985126
Copy link
Copy Markdown
Member Author

jenkins test windows

@Matan-B
Copy link
Copy Markdown
Contributor

Matan-B commented Apr 10, 2024

How is this scenario possible?
Were you able to reproduce it locally?

@xxhdx1985126
Copy link
Copy Markdown
Member Author

xxhdx1985126 commented Apr 10, 2024

How is this scenario possible?

The following op sequence should lead to this scenario:

  1. PG P's recovery is deferred and its pg state turns to PG_STATE_RECOVERY_WAIT;
  2. P's last recovery op triggerred by the last round of start_recovery_ops is finished;
  3. the next round of start_recovery_ops starts and hit the assert assert(pg->is_recovering())

Were you able to reproduce it locally?

Yes, this came up during our teuthology tests

@Matan-B
Copy link
Copy Markdown
Contributor

Matan-B commented Apr 10, 2024

The following op sequence should lead to this scenario:

  1. PG P's recovery is deferred and its pg state turns to PG_STATE_RECOVERY_WAIT;
  2. P's last recovery op triggerred by the last round of start_recovery_ops is finished;
  3. the next round of start_recovery_ops starts and hit the assert assert(pg->is_recovering())

Why "the next round of start_recovery_ops" started again? Is that expected?

@xxhdx1985126
Copy link
Copy Markdown
Member Author

The following op sequence should lead to this scenario:

  1. PG P's recovery is deferred and its pg state turns to PG_STATE_RECOVERY_WAIT;
  2. P's last recovery op triggerred by the last round of start_recovery_ops is finished;
  3. the next round of start_recovery_ops starts and hit the assert assert(pg->is_recovering())

Why "the next round of start_recovery_ops" started again? Is that expected?

Yes, it is expected. The recovery of objects is done in batches, with the previous batch recovered, the next batch starts automatically.

So, when a recovery/backfill preemption happens, we need to stop the current recovery.

Copy link
Copy Markdown
Contributor

@Matan-B Matan-B left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, probably worth another pair of eyes.

@Matan-B
Copy link
Copy Markdown
Contributor

Matan-B commented Apr 17, 2024

@xxhdx1985126
Copy link
Copy Markdown
Member Author

@Matan-B The previous version of this PR can lead to infinite recursive call of "Operation::with_throttle_while", I've just added a commit to avoid this, please take a look again, thanks:-)

@Matan-B
Copy link
Copy Markdown
Contributor

Matan-B commented Apr 21, 2024

@Matan-B The previous version of this PR can lead to infinite recursive call of "Operation::with_throttle_while", I've just added a commit to avoid this, please take a look again, thanks:-)

Only the first commit is tested and is ready to merge. I have separated the second commit into #57014.
We can:

  1. Drop the second commit and merge this PR (only 61c2768)
  2. Close crimson/osd/osd_operation: fix with_throttle_while #57014 and retest this PR with both commits (after reviewing again).

@xxhdx1985126, Let me know what do you prefer. I'm leaning toward the former.

@xxhdx1985126
Copy link
Copy Markdown
Member Author

xxhdx1985126 commented Apr 22, 2024

@Matan-B The previous version of this PR can lead to infinite recursive call of "Operation::with_throttle_while", I've just added a commit to avoid this, please take a look again, thanks:-)

Only the first commit is tested and is ready to merge. I have separated the second commit into #57014. We can:

  1. Drop the second commit and merge this PR (only 61c2768)
  2. Close crimson/osd/osd_operation: fix with_throttle_while #57014 and retest this PR with both commits (after reviewing again).

@xxhdx1985126, Let me know what do you prefer. I'm leaning toward the former.

I think it might be better to merge the two commits together, because the infinite-recursion regression will be brought into the main branch otherwise. What do you think?

@Matan-B Matan-B removed the TESTED label Apr 22, 2024
@Matan-B
Copy link
Copy Markdown
Contributor

Matan-B commented Apr 22, 2024

jenkins test api

@Matan-B
Copy link
Copy Markdown
Contributor

Matan-B commented Apr 22, 2024

jenkins test make check

xxhdx1985126 and others added 2 commits April 29, 2024 17:27
"with_throttle"

This is to avoid infinite recursive calling to "with_throttle_while"

Signed-off-by: Xuehan Xu <xxhdx1985126@gmail.com>
@Matan-B
Copy link
Copy Markdown
Contributor

Matan-B commented Apr 30, 2024

jenkins test make check

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants