osd: Extracting scrub-related code from the PG & friends#38111
osd: Extracting scrub-related code from the PG & friends#38111athanatos merged 10 commits intoceph:masterfrom
Conversation
|
@neha-ojha @ronen-fr There is a race with setting noscrub or nodeep-scrub, see https://tracker.ceph.com/issues/47767. The PG::abort_scrub() function doesn't account for messages in-flight from scrub replicas. Wondering if you can deal with this better in the state machine version. Coincidentally, the failure above is exactly with the test setting nodeep-scrub. |
|
@ronen-fr I started reviewing this version of the code. I'll be out tomorrow (Tue) back the next day. |
@dzafman, thanks for the comments. Seems to me that the two issues are unrelated:
Now - there is still a theoretical issue if the primary terminated the scrub and started a new one while the message was in-flight. MapsCollectionStatus::mark_arriving_map() will then return 'false', causing map_from_replica() to assert. I would suggest removing that specific assert. I do not think it serves a positive purpose.
|
…espace) Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
into a PrimaryLogScrub - a derivative of PgScrubber. Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
Fixing Crimson compilation. Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
(following code-review comments) main changes:
- incoming scrub events are now validated by the scrubber before being sent to the
SFM:
- messages from a previous intervals are discarded, possibly signalling a
scrub of the current scrub and full scrubber reset;
- abort newly mandated by a configuration change is acted upon;
- stale messages from previous scrub sessions are discarded;
- replica reservations are silently discarded on interval changes
- some modifications to the state diagram
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
da2b688 to
0ac87eb
Compare
|
Something strange is happening to our perf test... |
|
jenkins test classic perf |
athanatos
left a comment
There was a problem hiding this comment.
I'm going to go ahead and merge this version, please send a follow up PR handling the last few comments here.
| // we have not seen this interval change yet. | ||
| // The remote reservations are no longer relevant. | ||
|
|
||
| m_last_dead_interval = current_interval; |
There was a problem hiding this comment.
I think this variable and related logic is entirely unnecessary. When we saw the map reflecting the interval change, we called PrimaryLogPG::on_change which called PGScrubber::clear_pgscrub_state which cleared the reservations.
There was a problem hiding this comment.
turns out on_change() is too late in the process.
Looking now for the earliest location to hook the 'no messages' reservations clearing
| } | ||
|
|
||
| // returns false if the message should be discarded. Handles the notification of interval | ||
| // change, if not done already. called only for active scrub? not sure. |
There was a problem hiding this comment.
Please answer this question.
There was a problem hiding this comment.
modified (separate PR)
|
|
||
| if (epoch_to_verify < current_interval) { | ||
| // the event will not be delivered. If we have already noticed and handled | ||
| // the change of seasons, it will be silently discarded. Otherwise - we |
There was a problem hiding this comment.
We must have already handled the interval change, or get_same_interval_since() wouldn't be updated yet.
1 similar comment
Snap trimming that was postponed as the target PG was scrubbing must be restarted at scrub completion. PR ceph#38111 moved trimming restart to just before the scrub fully terminated. Current PR fixes that. Fixes: https://tracker.ceph.com/issues/52026 Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
Snap trimming that was postponed as the target PG was scrubbing must be restarted at scrub completion. PR ceph#38111 moved trimming restart to just before the scrub fully terminated. The current PR fixes that. Trimming is also restarted in those cases where scrub was queued but aborted immediately. Fixes: https://tracker.ceph.com/issues/52026 Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
Snap trimming that was postponed as the target PG was scrubbing must be restarted at scrub completion. PR ceph#38111 moved trimming restart to just before the scrub fully terminated. The current PR fixes that. Trimming is also restarted in those cases where scrub was queued but aborted immediately. Fixes: https://tracker.ceph.com/issues/52026 Signed-off-by: Ronen Friedman <rfriedma@redhat.com> (cherry picked from commit 948d326)
Snap trimming that was postponed as the target PG was scrubbing must be restarted at scrub completion. PR ceph#38111 moved trimming restart to just before the scrub fully terminated. The current PR fixes that. Trimming is also restarted in those cases where scrub was queued but aborted immediately. Fixes: https://tracker.ceph.com/issues/52026 Signed-off-by: Ronen Friedman <rfriedma@redhat.com> (cherry picked from commit 948d326) Conflicts: src/osd/pg_scrubber.cc Conflict resolved by removing a clear_queued_or_active() call that was dragged in.
Snap trimming that was postponed as the target PG was scrubbing must be restarted at scrub completion. PR ceph#38111 moved trimming restart to just before the scrub fully terminated. The current PR fixes that. Trimming is also restarted in those cases where scrub was queued but aborted immediately. Fixes: https://tracker.ceph.com/issues/52026 Signed-off-by: Ronen Friedman <rfriedma@redhat.com> (cherry picked from commit 948d326) Conflicts: src/osd/pg_scrubber.cc Conflict resolved by removing a clear_queued_or_active() call that was dragged in. (cherry picked from commit b010892) Resolves: rhbz#2068531
Snap trimming that was postponed as the target PG was scrubbing must be restarted at scrub completion. PR ceph#38111 moved trimming restart to just before the scrub fully terminated. The current PR fixes that. Trimming is also restarted in those cases where scrub was queued but aborted immediately. Fixes: https://tracker.ceph.com/issues/52026 Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
Snap trimming that was postponed as the target PG was scrubbing must be restarted at scrub completion. PR ceph#38111 moved trimming restart to just before the scrub fully terminated. The current PR fixes that. Trimming is also restarted in those cases where scrub was queued but aborted immediately. Fixes: https://tracker.ceph.com/issues/52026 Signed-off-by: Ronen Friedman <rfriedma@redhat.com> (cherry picked from commit 948d326) Conflicts: src/osd/pg_scrubber.cc Conflict resolved by removing a clear_queued_or_active() call that was dragged in.
Snap trimming that was postponed as the target PG was scrubbing must be restarted at scrub completion. PR ceph#38111 moved trimming restart to just before the scrub fully terminated. The current PR fixes that. Trimming is also restarted in those cases where scrub was queued but aborted immediately. Fixes: https://tracker.ceph.com/issues/52026 Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
Snap trimming that was postponed as the target PG was scrubbing must be restarted at scrub completion. PR ceph#38111 moved trimming restart to just before the scrub fully terminated. The current PR fixes that. Trimming is also restarted in those cases where scrub was queued but aborted immediately. Fixes: https://tracker.ceph.com/issues/52026 Signed-off-by: Ronen Friedman <rfriedma@redhat.com> (cherry picked from commit 948d326) Conflicts: src/osd/pg_scrubber.cc Conflict resolved by removing a clear_queued_or_active() call that was dragged in. (cherry picked from commit b010892) Resolves: rhbz#2068531
Snap trimming that was postponed as the target PG was scrubbing must be restarted at scrub completion. PR ceph#38111 moved trimming restart to just before the scrub fully terminated. The current PR fixes that. Trimming is also restarted in those cases where scrub was queued but aborted immediately. Fixes: https://tracker.ceph.com/issues/52026 Signed-off-by: Ronen Friedman <rfriedma@redhat.com> (cherry picked from commit 948d326) Conflicts: src/osd/pg_scrubber.cc Conflict resolved by removing a clear_queued_or_active() call that was dragged in.
This PR is a re-structured version of #37317, with the changes logically divided into separate commits for the reviewers' sake.