Skip to content

osd/scrub: add configuration parameters to control delay duration#59590

Merged
ronen-fr merged 5 commits intoceph:mainfrom
ronen-fr:wip-rf-delay-conf
Sep 8, 2024
Merged

osd/scrub: add configuration parameters to control delay duration#59590
ronen-fr merged 5 commits intoceph:mainfrom
ronen-fr:wip-rf-delay-conf

Conversation

@ronen-fr
Copy link
Contributor

@ronen-fr ronen-fr commented Sep 4, 2024

to apply to a scrub target following a scrub failure

Specific configuration parameters are added to control the duration
of the delay to apply to the 'not-before' attribute of the failed scrub
target following a scrub failure. Some failure causes now have
their own delay values, while others share a common duration.

@ronen-fr ronen-fr requested a review from a team September 4, 2024 08:35
@ronen-fr ronen-fr marked this pull request as ready for review September 4, 2024 09:15
@ronen-fr ronen-fr requested a review from a team as a code owner September 4, 2024 09:15
- name: osd_scrub_retry_delay
type: int
level: advanced
desc: Period (in seconds) before retrying a specific PG following a scrub failure
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This to me reads as though the option is independently applied to each PG. Suggest

desc: Period (in seconds) before retrying a PG that has failed a prior scrub.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

level: advanced
desc: Period (in seconds) before retrying a specific PG following a scrub failure
long_desc: Minimum delay after a failed attempt to scrub a PG. See the
'see also' for the configuration options for some specific delay reasons
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: suggest

'see also' for delay options for specific failure reasons.

I also wonder if osd_scrub_retry_delay overrides the below, if it applies to only cases that aren't among the below, or what. In other words, I'd like to see more about the relationship between this first option and the below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see my attempt in clarifying (in the 'long descr')

desc: Period (in seconds) before retrying to scrub a PG at a specific level
after detecting a no-scrub or no-deep-scrub flag
long_desc: Minimum delay after a failed attempt to scrub a PG at a level
(shallow or deep) that is disabled by cluster or pool no-scrub or no-deep-scrub
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might take out the mentions of level here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anthonyeleven: We now have two scheduled 'targets' per each PG: one for its next shallow scrub, and one for
the next deep one. I am trying to convey the fact that a specific one of these targets is to be postponed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't that require two options? osd_shallow_scrub_retry_after_noscrub and osd_deep_scrub_retry_after_noscrub?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The delay is one - but it is applied to the relevant target - the relevant level. The one that was scheduled to execute - and aborted because of the operator setting the flag.
I have pushed a new version, with the rest of the fixes you have suggested. And in the description of
the default conf - I've expanded a bit about the two levels.

to apply to a scrub target following a scrub failure

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
shortening the delay times following various scrub events.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
allowing the configuration of lower delay times (compared
to 'pg_state', now denoting PGs that are not active or
not clean) for PGs that failed to be scrubbed due to performing
snap-trimming.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
allowing setting specific delay times for scrubs that were aborted
due to the interval being changed. The specified delay should be
lower than the default delay used for the other types of
mid-scrub aborts.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
@ronen-fr
Copy link
Contributor Author

ronen-fr commented Sep 7, 2024

'make check' failure caused by test environment issues. Retrying.

@ronen-fr
Copy link
Contributor Author

ronen-fr commented Sep 7, 2024

jenkins test make check

@ronen-fr
Copy link
Contributor Author

ronen-fr commented Sep 8, 2024

Merging based on multiple Teuthology tests. Both as this branch, and as wip-rf-delay-conf-w-standalo

@ronen-fr ronen-fr merged commit 8b5058c into ceph:main Sep 8, 2024
ronen-fr added a commit to ronen-fr/ceph that referenced this pull request Oct 8, 2024
That test does no longer match the actual requirements and
implementation of scrubbing.
It was already deactivated in
ceph#59590. Here - it is
fully removed, mainly for the sake of backporting.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
ronen-fr added a commit to ronen-fr/ceph that referenced this pull request Oct 10, 2024
That test does no longer match the actual requirements and
implementation of scrubbing.
It was already deactivated in
ceph#59590. Here - it is
fully removed, mainly for the sake of backporting.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
ronen-fr added a commit to ronen-fr/ceph that referenced this pull request Oct 13, 2024
That test does no longer match the actual requirements and
implementation of scrubbing.
It was already deactivated in
ceph#59590. Here - it is
fully removed, mainly for the sake of backporting.

Fixes (original): https://tracker.ceph.com/issues/50245
Fixes (Squid backport): https://tracker.ceph.com/issues/68403

(cherry picked from commit 0c4028a)
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
piyushagarwal1411 pushed a commit to piyushagarwal1411/ceph that referenced this pull request Nov 26, 2024
That test does no longer match the actual requirements and
implementation of scrubbing.
It was already deactivated in
ceph#59590. Here - it is
fully removed, mainly for the sake of backporting.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
CJary pushed a commit to CJary/ceph that referenced this pull request Jan 17, 2025
That test does no longer match the actual requirements and
implementation of scrubbing.
It was already deactivated in
ceph#59590. Here - it is
fully removed, mainly for the sake of backporting.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants