osd/scrub: add configuration parameters to control delay duration#59590
c321b1e to ad79764
src/common/options/osd.yaml.in
- name: osd_scrub_retry_delay
  type: int
  level: advanced
  desc: Period (in seconds) before retrying a specific PG following a scrub failure
nit: This to me reads as though the option is independently applied to each PG. Suggest
desc: Period (in seconds) before retrying a PG that has failed a prior scrub.
src/common/options/osd.yaml.in
  level: advanced
  desc: Period (in seconds) before retrying a specific PG following a scrub failure
  long_desc: Minimum delay after a failed attempt to scrub a PG. See the
    'see also' for the configuration options for some specific delay reasons
nit: suggest
'see also' for delay options for specific failure reasons.
I also wonder whether osd_scrub_retry_delay overrides the options below, applies only to cases that aren't covered by them, or something else. In other words, I'd like to see more about the relationship between this first option and the ones below.
Please see my attempt at clarifying this (in the 'long descr').
  desc: Period (in seconds) before retrying to scrub a PG at a specific level
    after detecting a no-scrub or no-deep-scrub flag
  long_desc: Minimum delay after a failed attempt to scrub a PG at a level
    (shallow or deep) that is disabled by cluster or pool no-scrub or no-deep-scrub
I might take out the mentions of level here
@anthonyeleven: We now have two scheduled 'targets' per PG: one for its next shallow scrub and one for
its next deep scrub. I am trying to convey that it is a specific one of these targets that is postponed.
Wouldn't that require two options? osd_shallow_scrub_retry_after_noscrub and osd_deep_scrub_retry_after_noscrub?
There is a single delay, but it is applied to the relevant target, i.e. the relevant level: the one that was scheduled to execute and was aborted because the operator set the flag.
I have pushed a new version with the rest of the fixes you suggested. In the description of
the default conf, I've also expanded a bit on the two levels.
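The point made in this thread (one delay value, applied only to whichever of the PG's two targets was aborted) can be sketched as follows. This is a minimal Python illustration, not Ceph code: the names `ScrubTarget`, `PgScrubJob` and `delay_aborted_target`, and the 30-second value in the usage note, are assumptions invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ScrubTarget:
    level: str         # "shallow" or "deep"
    not_before: float  # earliest time (in seconds) at which this target may run


@dataclass
class PgScrubJob:
    """Hypothetical model: each PG carries one shallow and one deep target."""
    shallow: ScrubTarget = field(
        default_factory=lambda: ScrubTarget("shallow", 0.0))
    deep: ScrubTarget = field(
        default_factory=lambda: ScrubTarget("deep", 0.0))

    def target(self, level: str) -> ScrubTarget:
        if level == "shallow":
            return self.shallow
        if level == "deep":
            return self.deep
        raise ValueError(f"unknown scrub level: {level!r}")


def delay_aborted_target(job: PgScrubJob, level: str,
                         now: float, delay_seconds: float) -> ScrubTarget:
    """Postpone only the target of the level that was aborted.

    The other target of the same PG is left untouched. Using max() so that
    'not_before' is never moved earlier is a choice of this sketch.
    """
    target = job.target(level)
    target.not_before = max(target.not_before, now + delay_seconds)
    return target
```

For example, with `job = PgScrubJob()`, calling `delay_aborted_target(job, "deep", now=1000.0, delay_seconds=30.0)` moves `job.deep.not_before` to 1030.0 while `job.shallow.not_before` stays at 0.0. This is why a single option can serve both levels: the value is shared, and only the choice of target differs.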
to apply to a scrub target following a scrub failure Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
shortening the delay times following various scrub events. Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
allowing the configuration of lower delay times (compared to 'pg_state', now denoting PGs that are not active or not clean) for PGs that failed to be scrubbed due to performing snap-trimming. Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
allowing setting specific delay times for scrubs that were aborted due to the interval being changed. The specified delay should be lower than the default delay used for the other types of mid-scrub aborts. Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
ad79764 to d7c7aa7
'make check' failure caused by test environment issues. Retrying.
jenkins test make check
Merging based on multiple Teuthology tests. Both as this branch, and as wip-rf-delay-conf-w-standalo
That test no longer matches the actual requirements and implementation of scrubbing. It was already deactivated in ceph#59590. Here it is fully removed, mainly for the sake of backporting. Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
That test no longer matches the actual requirements and implementation of scrubbing. It was already deactivated in ceph#59590. Here it is fully removed, mainly for the sake of backporting. Fixes (original): https://tracker.ceph.com/issues/50245 Fixes (Squid backport): https://tracker.ceph.com/issues/68403 (cherry picked from commit 0c4028a) Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
to apply to a scrub target following a scrub failure
Specific configuration parameters are added to control the duration
of the delay to apply to the 'not-before' attribute of the failed scrub
target following a scrub failure. Some failure causes now have
their own delay values, while others share a common duration.
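The "some causes have their own delay, others share a common one" arrangement can be pictured as a lookup with a fallback. The sketch below is illustrative only and is not the Ceph implementation: the `DelayCause` members are labels derived from the commit messages above, and all numeric values are placeholders chosen merely to respect the orderings those messages describe (snap-trimming lower than 'pg_state', interval change lower than the common default).

```python
from enum import Enum, auto


class DelayCause(Enum):
    """Hypothetical labels for the failure causes named in the commits."""
    NOSCRUB_FLAG = auto()   # no-scrub / no-deep-scrub flag was set
    PG_STATE = auto()       # PG not active or not clean
    SNAP_TRIMMING = auto()  # PG was busy snap-trimming
    NEW_INTERVAL = auto()   # scrub aborted because the interval changed
    OTHER = auto()          # any cause without a dedicated delay


# Placeholder values in seconds; NOT the real Ceph defaults.
COMMON_DELAY = 30.0
SPECIFIC_DELAYS = {
    DelayCause.NOSCRUB_FLAG: 60.0,
    DelayCause.PG_STATE: 60.0,
    DelayCause.SNAP_TRIMMING: 10.0,
    DelayCause.NEW_INTERVAL: 10.0,
}


def retry_delay(cause: DelayCause) -> float:
    """Return the cause-specific delay, or the common one if none exists."""
    return SPECIFIC_DELAYS.get(cause, COMMON_DELAY)


def next_not_before(now: float, cause: DelayCause) -> float:
    """Compute the new 'not-before' time of the failed scrub target."""
    return now + retry_delay(cause)
```

Here `retry_delay(DelayCause.OTHER)` falls back to the common value, while `next_not_before(100.0, DelayCause.SNAP_TRIMMING)` adds the shorter, cause-specific delay. Keeping the fallback in one place means a newly added failure cause gets a sensible delay even before it is given a dedicated parameter.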