
osd/scrub: fix the handling of interval changes#49959

Closed
ronen-fr wants to merge 7 commits into ceph:main from ronen-fr:wip-rf-58496-cln

Conversation

@ronen-fr
Contributor

@ronen-fr ronen-fr commented Feb 1, 2023

Separating and clarifying the handling of interval termination, new interval starting, and configuration changes:

A primary PG now registers with its OSD for scrubbing only when fully activated: on_pg_activate() called from PG::on_activate();

When the interval ends, the PG is removed from the OSD queue by stop_active_scrubs(), called from start_peering_interval().

Configuration changes are still handled by update_scrub_job().

Signed-off-by: Ronen Friedman rfriedma@redhat.com
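The registration lifecycle described above can be sketched roughly as follows. This is a toy model with invented names (`OsdScrubQueue`, `ToyPg`) standing in for the real Ceph types, not actual Ceph code:

```cpp
#include <cassert>
#include <set>

// Toy model of the lifecycle described in the PR text: a primary PG
// registers in the OSD's scrub queue only once fully activated, and is
// removed from the queue when its interval ends.
struct OsdScrubQueue {
  std::set<int> queued_pgs;  // ids of PGs registered for scrubbing
  void register_pg(int pgid) { queued_pgs.insert(pgid); }
  void remove_pg(int pgid) { queued_pgs.erase(pgid); }
};

struct ToyPg {
  int id;
  bool primary;
  // analogous to on_pg_activate(), called from PG::on_activate()
  void on_activate(OsdScrubQueue& q) {
    if (primary) q.register_pg(id);
  }
  // analogous to stop_active_scrubs(), called from start_peering_interval()
  void on_interval_end(OsdScrubQueue& q) { q.remove_pg(id); }
};
```

Only the primary path registers; a replica activating leaves the queue untouched, matching the "only when fully activated" rule above.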

@ronen-fr ronen-fr marked this pull request as ready for review February 1, 2023 14:09
@ronen-fr ronen-fr requested review from a team as code owners February 1, 2023 14:09
@ronen-fr ronen-fr requested review from aclamk and jdurgin February 1, 2023 14:21
@ronen-fr ronen-fr requested a review from neha-ojha February 1, 2023 16:13
Contributor

@athanatos athanatos left a comment

Seems reasonable structurally, see my comments.

This PR's single commit seems to contain several orthogonal changes. In order to review it, I needed to split it into smaller atomic commits. https://github.com/athanatos/ceph/tree/sjust/wip-scrub-interval-pr-49959 has those commits -- please split up this PR in a similar way (commit messages in my branch are likely insufficient or incorrect, I'm not sure what the motivation was for a few of them). In the future, it would really expedite review if you'd split up your PRs in a similar manner.

  // verify that the 'in_q' status matches our "Primariority"
  if (m_scrub_job && is_primary() && !m_scrub_job->in_queues) {
-   dout(1) << __func__ << " !!! primary but not scheduled! " << dendl;
+   dout(10) << __func__ << " !!! primary but not scheduled! " << dendl;
Contributor

This is validating an invariant that should always hold, right? I would think that derr would be more appropriate.

Contributor Author

It was (that's why I used dout(1) - and should probably have used derr).
I am less sure now (that's why I changed it to 10), and as the logic changes in that
other scrub-sched PR, I didn't invest time in making sure.

Contributor Author

For now, I've left this change out of the PR.

// cleared.
replica_handling_done();
m_fsm->process_event(FullReset{});
if (m_fsm->is_active_replica()) {
Contributor

It seems to me that we probably want FullReset{} to be a valid event regardless of the state of the state machine -- why not add noop handlers so that it's universally valid?
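For illustration only, here is a hand-rolled miniature FSM (invented names, not Ceph's boost::statechart machine) showing what "universally valid" would mean: every state accepts the reset event, and states with nothing to clean up treat it as a no-op:

```cpp
#include <cassert>

// Miniature sketch of a universally valid reset event.  The transition
// to NotActive is legal from every state; state-specific cleanup runs
// only where there is something to clean up.
enum class State { NotActive, ActivePrimary, ActiveReplica };
enum class Event { StartScrub, FullReset };

struct MiniScrubFsm {
  State state = State::NotActive;
  int cleanups = 0;  // counts state-specific cleanup work performed

  void process_event(Event e) {
    switch (e) {
      case Event::StartScrub:
        if (state == State::NotActive) state = State::ActivePrimary;
        break;
      case Event::FullReset:
        // Cleanup only in states that hold resources...
        if (state == State::ActivePrimary || state == State::ActiveReplica)
          ++cleanups;
        // ...but the event itself is accepted everywhere, so callers
        // never need to check the FSM state before submitting it.
        state = State::NotActive;
        break;
    }
  }
};
```

With this shape, FullReset from NotActive is simply a harmless no-op rather than an unexpected event.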

Contributor Author

@ronen-fr ronen-fr Feb 2, 2023

The full_reset event is always effective, regardless of the FSM state (as it was meant
as a way to guarantee a reset).

For the replica case - I like the suggestion of adding a new event to the state machine. Performing
this change now.

For the primary case - that would not be enough to replace the 'role' data I have added here (for the
same reason m_active does not cover the whole timespan covered by is_queued_or_active). Adding an
event for this case only reduces clarity (I have tried that now).

'IntervalEnded' is generated whenever an interval ends while the
scrubber is acting as a replica. The event triggers a reset of
the internal scrubber state, and a scrubber FSM reset to 'not active'.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
m_queued_or_active now encodes additionally whether the
scrubber is queued as a replica or a primary.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
and inline its functionality into its sole caller
(PgScrubber::rm_from_osd_scrubbing())

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
@ronen-fr ronen-fr force-pushed the wip-rf-58496-cln branch 6 times, most recently from cec0f5b to d419324 Compare February 2, 2023 20:35
@ronen-fr
Contributor Author

ronen-fr commented Feb 2, 2023

@athanatos thanks for the work invested, especially re the commit splits

Separating and clarifying the handling of interval termination, new
interval starting, and configuration changes:

A primary PG now registers with its OSD for scrubbing only when fully
activated: on_pg_activate() called from PG::on_activate();

When the interval ends, the PG is removed from the OSD queue by
stop_active_scrubs(), called from start_peering_interval().

Configuration changes are still handled by update_scrub_job().

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
resetting m_no_reply from within causes a crash, and prevents
scrub code testing.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
The calls were removed as publish_stats..() expects a
stable PG state, and that state is not synchronized
with the scrubber operation.

Fixes: https://tracker.ceph.com/issues/58496
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
Contributor

@athanatos athanatos left a comment

I think IntervalEnded{} introduces new ways to incorrectly reset the scrubber state. It seems like all cases where we need to reset the scrubber should simplify to submitting a FullReset{} event with the state handlers calling cleanup handlers as needed.

state_clear(PG_STATE_REPAIR);

clear_scrub_reservations();
m_pg->publish_stats_to_osd();
Contributor

The commit message for these removals is:

    osd/scrub: remove some calls to publish_stats_to_osd()
    
    The calls were removed as publish_stats..() expects a
    stable PG state, and that state is not synchronized
    with the scrubber operation.

I don't think this conveys to a future reader of this commit why it's safe to remove them. Either
a) the publish_stats_to_osd() calls are necessary or
b) they aren't.

In the case of a), the fact that they also require a stable PG state merely means that removing them introduces another bug. In the case of b), the fact that they require a stable PG state isn't really relevant.

Presumably, you've audited the code and concluded that they are safe to remove. Please explain that in the commit message.

void ReplicaReservations::send_reject()
{
// stop any pending timeout timer
m_no_reply.reset();
Contributor

The message for this

    osd/scrub: fix replica-reservation timeout cancellation
    
    resetting m_no_reply from within causes a crash, and prevents
    scrub code testing.

doesn't explain how this results in a crash. At first glance, this modification would appear to be a noop as send_reject() is called immediately.

Further inspection shows that ReplicaReservations::handle_no_reply_timeout() also invokes ReplicaReservations::send_reject(). ReplicaReservations::handle_no_reply_timeout() is in turn invoked from ReplicaReservations::no_reply_t::m_abort_callback as populated in no_reply_t(). I suspect that the crash you are seeing is due to m_abort_callback being invoked from the timer thread under the sleep_lock. Invoking m_no_reply.reset() in that context would take the sleep_lock again resulting in an assert failure.

If the above is indeed the cause of the crash, this doesn't appear to be a fix as it fails to clean up m_no_reply in the timeout path.
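The suspected hazard can be shown with a self-contained sketch (invented names; `ToyTimer` is not Ceph's SafeTimer): a timer that holds its lock while running callbacks would deadlock if the callback re-enters cancel(), while releasing the lock before the callback (the safe_callbacks == false style) avoids that:

```cpp
#include <cassert>
#include <functional>
#include <mutex>

// Toy sketch of the locking hazard discussed above.  cancel() plays the
// role of m_no_reply.reset(); fire() simulates timeout expiry invoking
// the registered callback.
struct ToyTimer {
  std::mutex lock;
  std::function<void()> callback;
  bool armed = false;
  bool safe_callbacks;  // true: hold the lock while calling back (hazardous here)

  explicit ToyTimer(bool safe) : safe_callbacks(safe) {}

  void arm(std::function<void()> cb) {
    std::lock_guard<std::mutex> g(lock);
    callback = std::move(cb);
    armed = true;
  }

  void cancel() {
    // Takes the timer lock: deadlocks if the caller already holds it.
    std::lock_guard<std::mutex> g(lock);
    armed = false;
  }

  void fire() {
    std::unique_lock<std::mutex> g(lock);
    auto cb = callback;
    armed = false;
    if (!safe_callbacks)
      g.unlock();  // release before calling out, as a safe_callbacks==false timer does
    cb();          // with safe_callbacks==true, a re-entrant cancel() would deadlock
  }
};
```

With safe_callbacks set to false, a callback that cancels its own timer (the send_reject() pattern) completes without re-acquiring the held lock.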

Contributor Author

@athanatos : yes - that was the failed scenario. Checking what you said in the last sentence.

Contributor

I'm now less sure about this diagnosis. SafeTimer can optionally either hold the lock while calling callbacks or not. There's a boolean param passed to the constructor which governs this behavior -- safe_callbacks. sleep_timer specifically seems to be constructed with safe_callbacks set to false, so locking OSDService::sleep_lock shouldn't be a problem.

Can you link to the tracker id?

"{}: interval changed ({} -> {}). Aborting active scrub.",
__func__, m_interval_start, m_pg->get_same_interval_since())
<< dendl;
scrub_clear_state();
Contributor

This diff doesn't happen to contain PgScrubber::scrub_clear_state(), so I comment here.

// uses process_event(), so must be invoked externally
void PgScrubber::scrub_clear_state()
{
  dout(10) << __func__ << dendl;

  clear_pgscrub_state();
  m_fsm->process_event(FullReset{});
}

scrub_clear_state() copied above still uses a FullReset{}. ActiveReplica does happen to have a FullReset{} handler, but its behavior differs from IntervalEnded handler in that it lacks the call to replica_handling_done(). scrub_clear_state() has a few callers:

  1. PgScrubber::verify_against_abort() -- I find this mechanism confusing, tbh, but it does seem to be callable at any time, including when there are outstanding replica scrubs.
  2. PgScrubber::replica_scrub_op -- This is an error case and seems to also be callable while there is outstanding replica state.
  3. PrimaryLogPG::on_change -- I don't really see a reason for this one now that we call into scrubber in PG::on_new_interval(). This also seems like it means that replicas will get IntervalEnded and then FullReset and primaries will get FullReset twice.
  4. PrimaryLogPG::on_shutdown -- Same observations as above, uses FullReset though the scrubber might be a replica.

Introducing IntervalEnded seems to have created a bunch of opportunities for pg state resets to be inconsistent. Instead, I propose the following changes:

  1. Rename PgScrubber::stop_active_scrubs to PgScrubber::on_new_interval, simplify it to just submit a FullReset
  2. Modify the FullReset handlers for every state to call back into cleanup functions specific to their states (or do it in their destructors?)
  3. Drop the call to PgScrubber::clear_scrub_state() from PrimaryLogPG::on_change()
  4. Change PrimaryLogPG::on_shutdown to invoke a PgScrubber::on_shutdown method which submits a FullReset event.
  5. Update other reset pathways to use FullReset as well.

This way, the state machine states handle any state-specific behavior, and reset becomes simply an event which can be done identically from anywhere.

I think this should remove the need for queued_for_role() as well?
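Item (2) of the proposal above can be sketched with invented names as follows: each state's cleanup lives in its destructor, so one universal reset event tears down whatever the current state holds, no matter where it is submitted from:

```cpp
#include <cassert>
#include <memory>

// Toy sketch (not Ceph code) of per-state cleanup via destructors.
// Counters stand in for real cleanup work such as releasing replica
// reservations or dropping collected scrub maps.
struct ScrubCtx {
  int reservations_released = 0;
  int maps_dropped = 0;
};

struct StateBase { virtual ~StateBase() = default; };

struct ReservingReplicas : StateBase {
  ScrubCtx& ctx;
  explicit ReservingReplicas(ScrubCtx& c) : ctx(c) {}
  ~ReservingReplicas() override { ++ctx.reservations_released; }
};

struct ActiveReplica : StateBase {
  ScrubCtx& ctx;
  explicit ActiveReplica(ScrubCtx& c) : ctx(c) {}
  ~ActiveReplica() override { ++ctx.maps_dropped; }
};

struct Machine {
  ScrubCtx ctx;
  std::unique_ptr<StateBase> state;  // null represents NotActive
  // A single reset path: destroying the current state runs exactly the
  // cleanup that state requires, so callers need no state checks.
  void full_reset() { state.reset(); }
};
```

The point of the design is that full_reset() is identical from every call site; the variation lives entirely in the states themselves.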

Contributor Author

Rename PgScrubber::stop_active_scrubs to PgScrubber::on_new_interval, simplify it to just submit a FullReset

@athanatos the problem with that is that the states of "being reserved by a Primary" (and the longer "a message carrying the request to be reserved by a primary was queued") are not handled within the scrubber state machine (it could have been done differently, at least regarding the first, but it's not something I wish to change now)

Contributor

By "a message carrying the request to be reserved by a primary was queued", do you mean the time the message spends between being queued via the messenger Dispatcher interface and when it gets processed by the PG and scrubber? Why is that important?

Contributor

Or, do you mean the time between when replica_scrub_op is called and when the event queued there is processed?

Contributor

@ronen-fr I think it's worth mapping these resources onto the state machine. I've started working on a branch https://github.com/athanatos/ceph/tree/sjust/wip-scrub-interval-pr-49959-2 which does that. I'm still working on it, so it's likely got errors.

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@ronen-fr ronen-fr marked this pull request as draft March 12, 2023 09:09
@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Aug 18, 2023
@github-actions

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Sep 17, 2023