osd: fix the scrubber behavior on multiple preemption attempts #39145

Merged
neha-ojha merged 2 commits into ceph:master from ronen-fr:wip-ronenf-scrub-48793
Feb 3, 2021
Conversation

@ronen-fr
Contributor

Latest scrub code creates a time window in which a specific scrub
is marked as "preempted", but future preemptions are prohibited.
Write operations handled are then blocked but not restarted on time.

Fixes: https://tracker.ceph.com/issues/48793

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
@github-actions github-actions bot added the core label Jan 28, 2021
@ronen-fr
Contributor Author

jenkins retest this please

@ronen-fr ronen-fr marked this pull request as ready for review January 28, 2021 17:38
void PgScrubber::add_delayed_scheduling()
{
  m_end = m_start; // not blocking any range now
Contributor
I don't understand this one. Perhaps this should be set when we complete a chunk?

Contributor Author
Yes to your question: this is what we would have done had we been able to complete the chunk. But as we were preempted:

There are two paths to preemption:

  1. Locally, which means we are either in state BuildMap or in WaitReplicas:
    we won't be scrubbing anything, but we still hold the object range.
    Setting 'end' to 'start' below (l. 591) frees that block. It is also solved by the other change - we no longer block repeated preemption attempts -
    but that path is slower.

  2. A replica might signal, in its returning message, that it was preempted (and will not be providing a map).
    The primary may still be working on creating its own map.
    We then notice the preemption, and while we
    (unlike the original code) do not retry selecting a range immediately (we wait for all replicas, then spend time requeuing
    via PendingTimer), at least we allow multiple preemptions.
    And - once we get to PendingTimer - there is no reason for us to maintain our hold on the old chunk.
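The failure mode being discussed can be sketched in a few lines. This is a loose, hypothetical model with made-up names - not the actual Ceph code - showing the point of the fix: a preemption request must still be accepted while an earlier one is pending, so the write op behind it is not left blocked.

```cpp
#include <atomic>
#include <cassert>

// Loose sketch of the scrubber's preemption flags. The bug was a window
// in which 'preempted' was already set but further preemption requests
// were rejected; writes that triggered those rejected requests stayed
// blocked and were not requeued on time.
struct PreemptionSketch {
  std::atomic<bool> m_preemptable{true};  // scrub stage allows preemption
  std::atomic<bool> m_preempted{false};   // a preemption is pending

  // Request preemption on behalf of a blocked write op.
  // Idempotent: repeated requests while one is pending still succeed.
  bool do_preempt() {
    if (!m_preemptable)
      return false;      // non-preemptable stage: caller must wait
    m_preempted = true;  // setting it again is harmless
    return true;
  }
};
```

Under this sketch, second and later do_preempt() calls return true instead of leaving their callers blocked.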

Contributor
> I don't understand this one. Perhaps this should be set when we complete a chunk?

m_start = m_end would be to complete a chunk

m_end = m_start when aborting/preempting the current chunk to stop blocking any range [m_start, m_end)
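The two assignments can be illustrated with a toy half-open range (illustrative names only, not the Ceph implementation):

```cpp
#include <cassert>
#include <string>

// Toy model of the blocked object range [m_start, m_end).
// An empty range (m_start == m_end) blocks nothing.
struct ScrubRangeSketch {
  std::string m_start;
  std::string m_end;

  bool blocks(const std::string& obj) const {
    return m_start <= obj && obj < m_end;
  }
  void complete_chunk() { m_start = m_end; }  // advance past the chunk
  void abort_chunk()    { m_end = m_start; }  // preempted: free the range
};
```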


// signal the preemption
preemption_data.do_preempt();
m_end = m_start; // free the range we were scrubbing
Contributor
Is this the only caller of do_preempt()?

Contributor Author
See previous comment.

@ronen-fr ronen-fr force-pushed the wip-ronenf-scrub-48793 branch from a864dc4 to 4cdb293 Compare January 29, 2021 09:52
@ronen-fr ronen-fr requested a review from athanatos January 29, 2021 10:10
@ronen-fr
Contributor Author

https://pulpito.ceph.com/rfriedma-2021-01-28_20:14:16-rados-wip-ronenf-scrub-48793-distro-basic-smithi/
A small test run (filtered with 'thrash'). The one failure does not seem related.

Waiting for longer tests.

// otherwise - write requests arriving while 'already preempted' is set
// but 'preemptable' is not - will not be allowed to continue, and will
// not be requeued on time.
return false;
Contributor
It feels dangerous to ignore the m_start -> m_end blocking range, but I assume that we are guaranteeing that the current scrub chunk will be aborted. Somewhere, m_end = m_start might happen before eventually restarting that chunk, or a smaller chunk at m_start.

Contributor Author
Will verify. Thanks.

Contributor Author
I don't see a scenario in which modifying 'end' in that direction (i.e. to be equal to 'start') causes a problem.
The current chunk will be aborted, and until that happens,
even the backend building of the local map uses its own copy of the start and end markers.
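That decoupling can be pictured with a small sketch (hypothetical names, not the real code): the in-flight map build snapshots its boundaries when the chunk starts, so resetting the primary's 'end' marker cannot affect it.

```cpp
#include <cassert>
#include <string>

// Sketch: the backend scanning a chunk keeps private copies of the
// boundaries taken when the chunk started, so the primary resetting
// its own m_end (on preemption) cannot change an in-flight scan.
struct ChunkScanSketch {
  std::string start, end;  // snapshots, not references
  ChunkScanSketch(const std::string& s, const std::string& e)
      : start(s), end(e) {}
  bool covers(const std::string& obj) const {
    return start <= obj && obj < end;
  }
};
```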

@neha-ojha
Member

neha-ojha commented Feb 3, 2021

https://pulpito.ceph.com/nojha-2021-02-01_21:31:14-rados-wip-39145-distro-basic-smithi/ - no related failures or out of order issues in this run

@neha-ojha
Member

@dzafman @athanatos This PR fixes a bug which causes multiple dead jobs in every rados run. If you are happy with this version, I'd like to merge it.

@neha-ojha neha-ojha merged commit 0dbc0b6 into ceph:master Feb 3, 2021
@dzafman (Contributor) left a comment
LGTM
