osd/scrub: guarantee only a primary calls publish_stats_to_osd() #50088
Conversation
@ronen-fr have you identified a reliable reproducer for https://tracker.ceph.com/issues/58496? It seems to be showing up pretty frequently in teuthology testing and other places.
neha-ojha left a comment
@ronen-fr The first commit seems like it should fix the issue we saw in https://tracker.ceph.com/issues/58496, but I want to clarify my understanding regarding the commit message. In https://tracker.ceph.com/issues/58496#note-1, publish_stats_to_osd() was being called on the active primary, but at an incorrect time, so it wasn't the case of clear_pgscrub_state() being called on a replica. Does that match your analysis?
@neha-ojha I added a more detailed RCA to https://tracker.ceph.com/issues/58496. In fact, I don't think the OSD was primary for the PG -- though I think it would have been in the next interval. The commit message doesn't explain that we don't have information to publish in those cases anyway. Can we add something like:

I'm still reviewing this.
@ronen-fr The following calls to publish_stats_to_osd() remain:

Can you add a commit adding a one-line comment to each of these uses to indicate what stats may need to be updated?
```cpp
m_warning_issued = true;
m_scrubber.set_scrub_blocked(utime_t{now_c,0});
m_callbk = new LambdaContext([=, this]([[maybe_unused]] int r) {
  if (m_was_deleted) {
```
m_was_deleted would be a member of blocked_range_t, so accessing it after the blocked_range_t has been destroyed would be a use-after-free error. It would also be a race condition, since the pg lock hasn't been taken yet.

Instead, I suggest making blocked_range_t a std::shared_ptr with a reference held by PgScrubber and also by this lambda, to guarantee liveness. -- Won't actually work, since we are using blocked_range_t's lifetime to model the blocking duration.
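The shared-ownership idea above can be sketched in isolation. This is a minimal standalone model, not Ceph code: `blocked_range_t` here is just a flag holder, and `make_alarm_callback` is a hypothetical helper; the point is only that a lambda capturing a `std::shared_ptr` can never read freed memory, because the capture itself keeps the object alive.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Simplified stand-in for the scrubber's blocked-range state.
struct blocked_range_t {
  bool was_deleted = false;
};

// Build a timer callback that shares ownership of the state, so reading
// the flag later can never be a use-after-free, even if the scrubber has
// already dropped its own reference.
inline std::function<bool()> make_alarm_callback(
    std::shared_ptr<blocked_range_t> state) {
  return [state] { return state->was_deleted; };
}
```

As the follow-up note says, this conflicts with using the object's lifetime to model the blocking duration: the lambda's reference would keep the "blocked" state alive past its intended end.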
Hmm, I think there are some other problems here as well. The interval/epoch check would be sufficient if the callback were only canceled on interval change, but we can create/cancel this many times within the same interval. Need to think about this.
```cpp
if (!m_callbk) {
  return;
}
m_was_deleted = true;  // instead of accessing the sleep_timer
```
It looks like PendingTimer owns the blocked_range_t instance, and react(const Unblocked&) is responsible for invoking cancel_future_alarm(), I think.

~blocked_range_t should own invoking cancel_future_alarm(). Unblocked isn't the only way to leave RangeBlocked, and any other pathway leads to ~blocked_range_t not canceling it. I don't think it's actually a memory leak as such, since the callback will eventually be called, but it does mean that the pg fails to release the timer-queue resource.

I suspect the goal here is to avoid the callback invoking cancel_future_alarm(). Instead, modify the callback to clear the m_callbk member once the pg is locked and the interval check has been done, and modify ~blocked_range_t to skip cancel_future_alarm() if !m_callbk.
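The proposed protocol can be modeled with a toy timer. This is a hedged sketch, not Ceph code: `FakeTimer`, the lock, and the member names only illustrate the handshake being suggested -- the callback nulls `m_callbk` under the (stand-in) pg lock, and the destructor cancels only if the alarm has not fired yet.

```cpp
#include <cassert>
#include <mutex>

// Toy timer: just counts cancellations.
struct FakeTimer {
  int cancelled = 0;
  void cancel_event(void* /*cb*/) { ++cancelled; }
};

// Simplified model of the suggested ownership protocol.
struct blocked_range_t {
  std::mutex& pg_lock;  // stands in for the real PG lock
  FakeTimer& timer;
  void* m_callbk;       // non-null while the alarm is still pending

  blocked_range_t(std::mutex& l, FakeTimer& t, void* cb)
      : pg_lock(l), timer(t), m_callbk(cb) {}

  // Run by the timer: clears m_callbk under the lock, so the destructor
  // knows the alarm has already fired and must not be cancelled.
  void on_alarm() {
    std::lock_guard g{pg_lock};
    m_callbk = nullptr;
  }

  // Runs when leaving RangeBlocked (by any pathway): cancels the alarm
  // only if it is still outstanding.
  ~blocked_range_t() {
    if (m_callbk) {
      timer.cancel_event(m_callbk);
    }
  }
};
```

Putting the cancellation in the destructor covers every exit from RangeBlocked, not just Unblocked, which is the point of the suggestion.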
Not sure why this way is better. I don't consider having the callback -- disabled -- left in the sleep queue to be a major problem. It would happen only in the rare case of blocking on an object, and it would be discarded within a few seconds.
It seems like the range_blocked_t warning timeout and the no_reply_t timeout discussed in the chain at #49959 (comment) should probably behave identically.
@ronen-fr I think we should introduce a common mechanism for scheduling future events. https://github.com/athanatos/ceph/tree/sjust/wip-scrub-event-helpers-50088 has a stab at such a mechanism. It builds, but I have done precisely no testing, so it's probably buggy.
A comment I made there -- such a mechanism would need:

- a clear state of 'the callback was called' (by the timer, without being cancelled or nullified earlier). This is required if that state must be reversed later on (e.g. in the 'blocked waiting for an object' case -- telling the OSD to decrement the 'waiting' counter);
- a way to specify the behavior (a cluster message, a log message, nothing) if the PG is no longer there, or if the PG is not in the same interval when triggered.
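The two requirements listed above can be captured in a small interface sketch. This is hypothetical, not the API in the linked branch: `scrub_alarm_t`, `on_mismatch_t`, and `deliver` are illustrative names for a helper that records whether the callback actually ran and lets the caller choose what happens when the PG is gone or the interval has changed.

```cpp
#include <cassert>
#include <functional>

// What to do when the alarm fires but the PG state no longer matches.
enum class on_mismatch_t { ignore, log, message_osd };

// Hypothetical future-event helper for scrub alarms.
struct scrub_alarm_t {
  std::function<void()> on_fire;  // runs if the PG is alive and in-interval
  on_mismatch_t if_pg_gone = on_mismatch_t::ignore;
  on_mismatch_t if_new_interval = on_mismatch_t::log;
  bool fired = false;  // the 'callback was called' state, so side effects
                       // (e.g. the OSD 'waiting' counter) can be undone later

  // Deliver the alarm given the current PG/interval state; returns the
  // mismatch action the caller should take (ignore on a clean fire).
  on_mismatch_t deliver(bool pg_alive, bool same_interval) {
    if (!pg_alive) return if_pg_gone;
    if (!same_interval) return if_new_interval;
    fired = true;
    on_fire();
    return on_mismatch_t::ignore;
  }
};
```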
A more general comment: we are getting too much into the realm of 'there could be a different way to do things'. Let's discuss that one offline. |
Yes - I modified the 2nd to match the 1st. |
This patch removes several unnecessary calls to publish_stats_to_osd() in the clear_pgscrub_state pathway used during interval changes. We didn't actually have anything to publish, and we can't validly call publish_stats_to_osd() when we aren't an active primary anyway (which we can't be during an interval change).

Fixes: https://tracker.ceph.com/issues/58496
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
Modified the code to prevent deadlocks and unsafe calls to publish_stats..():
- no locking in the dtor (which is always called under the PG lock);
- a separate API to cancel the alarm before it goes off -- called under the PG lock and acquiring the sleep-timer's;
- the callback now holds a PGRef.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
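The last bullet -- the callback holding a PGRef -- can be illustrated with a standalone sketch. This is not Ceph code: `FakePG` and `make_pg_alarm` are stand-ins, with `std::shared_ptr` modeling the intrusive PGRef; the point is that the closure's captured reference keeps the PG alive until the alarm has run.

```cpp
#include <cassert>
#include <memory>

// Stand-in for a PG; PGRef is modeled as a shared_ptr here.
struct FakePG {
  bool alarmed = false;
};
using PGRef = std::shared_ptr<FakePG>;

// The returned closure owns a PGRef for its whole lifetime, so the PG
// cannot be destroyed while the alarm callback is still in flight.
inline auto make_pg_alarm(PGRef pg) {
  return [pg] { pg->alarmed = true; };
}
```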
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
Closing this one, as mostly superseded by the post-Reef scrub PRs. |
and fix the "scrub blocked on object" alarm to prevent both
deadlocks and calls to publish_stats() where not allowed.

Note also the discussion in #49959.

Fixes: https://tracker.ceph.com/issues/58496
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>