Skip to content

osd/scrub: guarantee only a primary calls publish_stats_to_osd()#50088

Closed
ronen-fr wants to merge 2 commits intoceph:mainfrom
ronen-fr:wip-rf-unpublished-work
Closed

osd/scrub: guarantee only a primary calls publish_stats_to_osd()#50088
ronen-fr wants to merge 2 commits intoceph:mainfrom
ronen-fr:wip-rf-unpublished-work

Conversation

@ronen-fr
Copy link
Contributor

and fix "scrub blocked on object" alarm to prevent both
deadlocks, and calling publish_stats() where not allowed.
Note also the discussion in #49959.

Fixes: https://tracker.ceph.com/issues/58496
Signed-off-by: Ronen Friedman rfriedma@redhat.com

@ronen-fr ronen-fr requested a review from a team as a code owner February 13, 2023 07:56
@github-actions github-actions bot added the core label Feb 13, 2023
@neha-ojha
Copy link
Member

@ronen-fr have you identified a reliable reproducer for https://tracker.ceph.com/issues/58496? It seems to be showing up pretty frequently in teuthology testing and other places.

Copy link
Member

@neha-ojha neha-ojha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ronen-fr The first commit seems like it should fix the issue we saw in https://tracker.ceph.com/issues/58496, but I want to clarify my understanding regarding the commit message. In https://tracker.ceph.com/issues/58496#note-1, publish_stats_to_osd() was being called on the active primary, but at an incorrect time, so it wasn't the case of clear_pgscrub_state() being called on a replica. Does that match your analysis?

@athanatos
Copy link
Contributor

athanatos commented Feb 14, 2023

@neha-ojha I added a more detailed rca to https://tracker.ceph.com/issues/58496. In fact, I don't think the osd was primary for the pg -- though I think it would have been in the next interval.

The commit message

 osd/scrub: do not call publish_stats_to_osd() where not allowed

publish_stats_to_osd() should only be called for an
active primary. As clear_pgscrub_state() can be reached
for a replica, all publishing was moved to
cleanup_on_finish().

Fixes: https://tracker.ceph.com/issues/58496
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>

doesn't explain that we don't have information to publish in those cases anyway. Can we add something like:

osd/scrub: call publish_stats_to_osds only when we have new information as an active primary

This patch removes several unnecessary calls to publish_stats_to_osd in the
clear_pgscrub_state pathway used during interval changes.  We didn't actually
have anything to publish and we can't validly call publish_stats_to_osd when we
aren't an active primary anyway (which we can't be during an interval change).

I'm still reviewing this.

@athanatos
Copy link
Contributor

@ronen-fr The following calls to publish_stats_to_osd() remain:

  • PgScrubber::update_scrub_job
  • PgScrubber::on_init
  • PgScrubber::set_op_parameters
  • PgScrubber::set_scrub_blocked
  • PgScrubber::clear_scrub_blocked (does this one really get called on a replica?)
  • PgScrubber::cleanup_on_finish (twice apparently?)
  • PgScrubber::update_scrub_stats

Can you add a commit adding a one line comment to each of these uses to indicate what stats may need to be updated?

m_warning_issued = true;
m_scrubber.set_scrub_blocked(utime_t{now_c,0});
m_callbk = new LambdaContext([=, this]([[maybe_unused]] int r) {
if (m_was_deleted) {
Copy link
Contributor

@athanatos athanatos Feb 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m_was_deleted would be a member of blocked_range_t. Accessing m_was_deleted in the event that it has been set to false would be a use-after-free error. It would also be a race condition since the pg lock hasn't been taken yet.

Instead, I suggest makin block_range_t a std::shared_ptr with a reference held by PgScrubber and also by this lambda to guarantee liveness. -- won't actually work since we are using block_range_t lifetime to model the blocking duration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think there are some other problems here as well. The interval/epoch check would be sufficient if the callback were only canceled on interval change, but we can create/cancel this many times within the same interval. Need to think about this.

if (!m_callbk) {
return;
}
m_was_deleted = true; // instead of accessing the sleep_timer
Copy link
Contributor

@athanatos athanatos Feb 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like PendingTimer owns the block_range_t instance and react(const Unblocked&) is responsible for invoking cancel_future_alarm(), I think.

~block_range_t should own invoking cancel_future_alarm. Unblocked isn't the only way to leave RangeBlocked, and any other pathway leads to ~blocked_range_t not canceling it. I don't think it's actually a memory leak as such since the callback will eventually be called, but it does mean that the pg fails to release the timer queue resource. I suspect the goal here is to avoid the callback invoking cancel_future_alarm(). Instead, modify the callback to clear the m_callbk member once the pg is locked and the interval check has been done and modify ~blocked_range_t to skip cancel_future_alarm() if !m_callbk.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why this way is better. I don't consider the issue of having the callback - disabled - in the sleep queue as a major problem. It would happen only in the rare cases of blocking on an object, and discarded within a few seconds.

@athanatos
Copy link
Contributor

It seems like the range_blocked_t warning timeout and the no_reply_t timeout discussed in the chain at #49959 (comment) should probably behave identically.

@athanatos
Copy link
Contributor

@ronen-fr I think we should introduce a common mechanism for scheduling future events. https://github.com/athanatos/ceph/tree/sjust/wip-scrub-event-helpers-50088 has a stab at such a mechanism. It builds, but I have done precisely no testing so it's probably buggy.

@ronen-fr
Copy link
Contributor Author

@athanatos

@ronen-fr I think we should introduce a common mechanism for scheduling future events. https://github.com/athanatos/ceph/tree/sjust/wip-scrub-event-helpers-50088 has a stab at such a mechanism. It builds, but I have done precisely no testing so it's probably buggy.

A comment I made there:
I have actually created a similar general mechanism. Some details that are implemented in my version (and in the two separate examples you've noted that implement the logic) and are missing (I think) here are:

a clear state of 'the callback was called (by the timer, without being cancelled or nullified earlier). This is required if that state must be reversed later on (e.g. in the 'blocked waiting for an object" - telling the OSD to decrement the 'waiting' counter);

a way to specify the behavior (a cluster message, log message, nothing) if the PG is no longer there, and if the PG is not in the same interval when triggered;

@ronen-fr
Copy link
Contributor Author

A more general comment: we are getting too much into the realm of 'there could be a different way to do things'. Let's discuss that one offline.

@ronen-fr
Copy link
Contributor Author

It seems like the range_blocked_t warning timeout and the no_reply_t timeout discussed in the chain at #49959 (comment) should probably behave identically.

Yes - I modified the 2nd to match the 1st.

This patch removes several unnecessary calls to publish_stats_to_osd() in
the clear_pgscrub_state pathway used during interval changes. We didn't
actually have anything to publish and we can't validly call publish_stats_to_osd()
when we aren't an active primary anyway (which we can't be during an interval
change).

Fixes: https://tracker.ceph.com/issues/58496
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
Modified the code to prevent deadlocks, and to guarantee unsafe
calls to publish_stats..():

- no locking in the dtor (which is always called under PG lock)

- a separate API to cancel the alarm before it goes off - called
under PG lock and acquiring the sleep-timer's

- the callback now holds a PGRef

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
@github-actions
Copy link

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@ronen-fr ronen-fr marked this pull request as draft March 12, 2023 09:12
Copy link
Contributor

@athanatos athanatos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ronen-fr Is this superseded by other work, or is it still in progress?

@ronen-fr
Copy link
Contributor Author

Closing this one, as mostly superseded by the post-Reef scrub PRs.
Left (for me) to do - make sure there is a fix in Reef for $58496.

@ronen-fr ronen-fr closed this Aug 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants