osd/scrub: fix the handling of interval changes #49959
Seems reasonable structurally, see my comments.
This PR's single commit seems to contain several orthogonal changes. In order to review it, I needed to split it into smaller atomic commits. https://github.com/athanatos/ceph/tree/sjust/wip-scrub-interval-pr-49959 has those commits -- please split up this PR in a similar way (commit messages in my branch are likely insufficient or incorrect, I'm not sure what the motivation was for a few of them). In the future, it would really expedite review if you'd split up your PRs in a similar manner.
src/osd/scrubber/pg_scrubber.cc
```diff
   // verify that the 'in_q' status matches our "Primariority"
   if (m_scrub_job && is_primary() && !m_scrub_job->in_queues) {
-    dout(1) << __func__ << " !!! primary but not scheduled! " << dendl;
+    dout(10) << __func__ << " !!! primary but not scheduled! " << dendl;
```
This is validating an invariant that should always hold, right? I would think that derr would be more appropriate.
It was (which is why I used dout(1), and probably should have used derr).
I am less sure now (which is why I changed it to 10), and as this logic changes in the
other scrub-sched PR, I didn't invest the time to verify.
For now, I have left this change out of the PR.
src/osd/scrubber/pg_scrubber.cc
```cpp
  // cleared.
  replica_handling_done();
  m_fsm->process_event(FullReset{});
  if (m_fsm->is_active_replica()) {
```
It seems to me that we probably want FullReset{} to be a valid event regardless of the state of the state machine -- why not add noop handlers so that it's universally valid?
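A minimal, hand-rolled sketch of this suggestion (not the Ceph boost.statechart code; the state and event names below are illustrative): states that have no reset work get an explicit no-op handler, so FullReset is accepted from anywhere and callers never need to check the machine's state before posting the event.

```cpp
#include <cassert>

// Illustrative sketch only. The point: every state accepts FullReset,
// and states with nothing to undo handle it as an explicit no-op,
// making the event universally valid.
enum class State { Inactive, ActivePrimary, ActiveReplica };

struct FullReset {};  // hypothetical event type

class ScrubFsmSketch {
  State state_ = State::Inactive;

 public:
  State state() const { return state_; }

  void process_event(const FullReset&) {
    switch (state_) {
      case State::ActivePrimary:
      case State::ActiveReplica:
        // state-specific cleanup would run here
        state_ = State::Inactive;
        break;
      case State::Inactive:
        // no-op handler: event accepted, nothing to do
        break;
    }
  }

  void start_replica() { state_ = State::ActiveReplica; }
};
```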
The full_reset event is always effective, regardless of the FSM state (it was meant
as a way to guarantee a reset).
For the replica case: I like the suggestion of adding a new event to the state machine. Performing
this change now.
For the primary case: that would not be enough to replace the 'role' data I have added here (for the
same reason m_active does not cover the whole timespan covered by is_queued_or_active). Adding an
event for this case would only reduce clarity (I have tried that now).
'IntervalEnded' is generated whenever an interval ends while the scrubber is acting as a replica. The event triggers a reset of the internal scrubber state, and a scrubber FSM reset to 'not active'. Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
m_queued_or_active now encodes additionally whether the scrubber is queued as a replica or a primary. Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
and inline its functionality into its sole caller (PgScrubber::rm_from_osd_scrubbing()) Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
@athanatos thanks for the work invested, especially re the commit splits
Separating and clarifying the handling of interval termination, new interval starting, and configuration changes:
A primary PG now registers with its OSD for scrubbing only when fully activated: on_pg_activate() called from PG::on_activate().
When the interval ends, the PG is removed from the OSD queue by stop_active_scrubs(), called from start_peering_interval().
Configuration changes are still handled by update_scrub_job().
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
Resetting m_no_reply from within causes a crash, and prevents scrub code testing. Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
The calls were removed as publish_stats..() expects a stable PG state, and that state is not synchronized with the scrubber operation. Fixes: https://tracker.ceph.com/issues/58496 Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
athanatos left a comment:
I think IntervalEnded{} introduces new ways to incorrectly reset the scrubber state. It seems like all cases where we need to reset the scrubber should simplify to submitting a FullReset{} event with the state handlers calling cleanup handlers as needed.
```cpp
  state_clear(PG_STATE_REPAIR);
  ...
  clear_scrub_reservations();
  m_pg->publish_stats_to_osd();
```
The commit message for these removals is:
osd/scrub: remove some calls to publish_stats_to_osd()
The calls were removed as publish_stats..() expects a
stable PG state, and that state is not synchronized
with the scrubber operation.
I don't think this conveys to a future reader of this commit why it's safe to remove them. Either
a) the publish_stats_to_osd() calls are necessary or
b) they aren't.
In the case of a), the fact that they also require a stable PG state merely means that removing them introduces another bug. In the case of b), the fact that they require a stable PG state isn't really relevant.
Presumably, you've audited the code and concluded that they are safe to remove. Please explain that in the commit message.
```cpp
void ReplicaReservations::send_reject()
{
  // stop any pending timeout timer
  m_no_reply.reset();
```
The commit message for this change:

> osd/scrub: fix replica-reservation timeout cancellation
>
> Resetting m_no_reply from within causes a crash, and prevents scrub code testing.
doesn't explain how this results in a crash. At first glance, this modification would appear to be a noop as send_reject() is called immediately.
Further inspection shows that ReplicaReservations::handle_no_reply_timeout() also invokes ReplicaReservations::send_reject(). ReplicaReservations::handle_no_reply_timeout() is in turn invoked from ReplicaReservations::no_reply_t::m_abort_callback as populated in no_reply_t(). I suspect that the crash you are seeing is due to m_abort_callback being invoked from the timer thread under the sleep_lock. Invoking m_no_reply.reset() in that context would take the sleep_lock again resulting in an assert failure.
If the above is indeed the cause of the crash, this doesn't appear to be a fix as it fails to clean up m_no_reply in the timeout path.
@athanatos: yes, that was the failing scenario. Checking what you said in the last sentence.
I'm now less sure about this diagnosis. SafeTimer can optionally either hold the lock while calling callbacks or not. There's a boolean param passed to the constructor which governs this behavior -- safe_callbacks. sleep_timer specifically seems to be constructed with safe_callbacks set to false, so locking OSDService::sleep_lock shouldn't be a problem.
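To illustrate the distinction being discussed, here is a toy model (not Ceph's SafeTimer; all names here are invented) of how the safe_callbacks choice changes the hazard: when the timer lock is held across the callback, a callback that re-enters the timer to cancel itself would re-take the same non-recursive lock. The sketch uses a flag instead of a real mutex so the self-deadlock can be detected rather than reproduced.

```cpp
#include <functional>

// Toy model only. safe_callbacks=true models holding the timer lock
// while callbacks run; a callback that then calls cancel() (as
// m_no_reply.reset() would, indirectly) re-enters the same lock.
class ToyTimer {
  bool lock_held_ = false;   // stands in for a non-recursive timer lock
  bool safe_callbacks_;

 public:
  explicit ToyTimer(bool safe_callbacks) : safe_callbacks_(safe_callbacks) {}

  // Returns false when taking the lock would self-deadlock.
  bool cancel() {
    if (lock_held_) return false;  // lock already held by this thread
    lock_held_ = true;             // ...cancellation work would go here...
    lock_held_ = false;
    return true;
  }

  // Fire a callback the way the timer thread would.
  bool fire(const std::function<bool(ToyTimer&)>& cb) {
    if (safe_callbacks_) {
      lock_held_ = true;           // lock held across the callback
      bool ok = cb(*this);
      lock_held_ = false;
      return ok;
    }
    return cb(*this);              // lock released before the callback
  }
};
```

With safe_callbacks set to false, as described above for sleep_timer, the re-entrant cancel succeeds; only the lock-held variant would trip over itself.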
Can you link to the tracker id?
```cpp
  "{}: interval changed ({} -> {}). Aborting active scrub.",
  __func__, m_interval_start, m_pg->get_same_interval_since())
  << dendl;
  scrub_clear_state();
```
This diff doesn't happen to contain PgScrubber::scrub_clear_state(), so I comment here.
```cpp
// uses process_event(), so must be invoked externally
void PgScrubber::scrub_clear_state()
{
  dout(10) << __func__ << dendl;
  clear_pgscrub_state();
  m_fsm->process_event(FullReset{});
}
```
scrub_clear_state() copied above still uses a FullReset{}. ActiveReplica does happen to have a FullReset{} handler, but its behavior differs from IntervalEnded handler in that it lacks the call to replica_handling_done(). scrub_clear_state() has a few callers:
- PgScrubber::verify_against_abort() -- I find this mechanism confusing, tbh, but it does seem to be callable at any time, including when there are outstanding replica scrubs.
- PgScrubber::replica_scrub_op -- This is an error case and seems to also be callable while there is outstanding replica state.
- PrimaryLogPG::on_change -- I don't really see a reason for this one now that we call into scrubber in PG::on_new_interval(). This also seems like it means that replicas will get IntervalEnded and then FullReset and primaries will get FullReset twice.
- PrimaryLogPG::on_shutdown -- Same observations as above, uses FullReset though the scrubber might be a replica.
Introducing IntervalEnded seems to have created a bunch of opportunities for pg state resets to be inconsistent. Instead, I propose the following changes:
- Rename PgScrubber::stop_active_scrubs to PgScrubber::on_new_interval, simplify it to just submit a FullReset
- Modify the FullReset handlers for every state to call back into cleanup functions specific to their states (or do it in their destructors?)
- Drop the call to PgScrubber::scrub_clear_state() from PrimaryLogPG::on_change()
- Change PrimaryLogPG::on_shutdown to invoke a PgScrubber::on_shutdown method which submits a FullReset event.
- Update other reset pathways to use FullReset as well.
This way, the state machine states handle any state-specific behavior, and reset becomes simply an event which can be done identically from anywhere.
I think this should remove the need for queued_for_role() as well?
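A rough sketch of this proposal (hand-rolled rather than boost.statechart, with illustrative names): FullReset becomes the single, universally valid reset event, and each state's cleanup runs in its destructor, so a reset behaves identically no matter which state the machine is in.

```cpp
#include <memory>
#include <string>
#include <vector>

// Sketch of the proposal above, not the actual Ceph code. Each state's
// cleanup lives in its destructor; FullReset simply destroys the current
// state (running that cleanup) and transitions to NotActive, and is valid
// from any state, including NotActive itself.
std::vector<std::string> cleanup_log;  // records which cleanups ran

struct State {
  virtual ~State() = default;
};
struct NotActive : State {};
struct ActivePrimary : State {
  ~ActivePrimary() override { cleanup_log.push_back("primary-cleanup"); }
};
struct ActiveReplica : State {
  ~ActiveReplica() override { cleanup_log.push_back("replica-cleanup"); }
};

struct FullReset {};

class ScrubMachineSketch {
  std::unique_ptr<State> state_ = std::make_unique<NotActive>();

 public:
  template <typename S> void enter() { state_ = std::make_unique<S>(); }
  template <typename S> bool in() const {
    return dynamic_cast<S*>(state_.get()) != nullptr;
  }
  void process_event(const FullReset&) {
    state_ = std::make_unique<NotActive>();  // old state's dtor cleans up
  }
};
```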
> Rename PgScrubber::stop_active_scrubs to PgScrubber::on_new_interval, simplify it to just submit a FullReset
@athanatos the problem with that is that the states of "being reserved by a Primary" (and the longer "a message carrying the request to be reserved by a primary was queued") are not handled within the scrubber state machine (it could have been done differently, at least regarding the first, but it's not something I wish to change now).
By "a message carrying the request to be reserved by a primary was queued", do you mean the time the message spends between being queued via the messenger Dispatcher interface and when it gets processed by the PG and scrubber? Why is that important?
Or, do you mean the time between when replica_scrub_op is called and when the event queued there is processed?
@ronen-fr I think it's worth mapping these resources onto the state machine. I've started working on a branch https://github.com/athanatos/ceph/tree/sjust/wip-scrub-interval-pr-49959-2 which does that. I'm still working on it, so it's likely got errors.
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved.
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!