osd/scrub: guarantee only a primary calls publish_stats_to_osd() #50088
Conversation
@ronen-fr have you identified a reliable reproducer for https://tracker.ceph.com/issues/58496? It seems to be showing up pretty frequently in teuthology testing and other places.
neha-ojha left a comment
@ronen-fr The first commit seems like it should fix the issue we saw in https://tracker.ceph.com/issues/58496, but I want to clarify my understanding regarding the commit message. In https://tracker.ceph.com/issues/58496#note-1, publish_stats_to_osd() was being called on the active primary, but at an incorrect time, so it wasn't the case of clear_pgscrub_state() being called on a replica. Does that match your analysis?
@neha-ojha I added a more detailed RCA to https://tracker.ceph.com/issues/58496. In fact, I don't think the OSD was primary for the PG -- though I think it would have been in the next interval. The commit message doesn't explain that we don't have information to publish in those cases anyway. Can we add something like:

I'm still reviewing this.
@ronen-fr The following calls to publish_stats_to_osd() remain:

Can you add a commit adding a one-line comment to each of these uses to indicate what stats may need to be updated?
```cpp
m_warning_issued = true;
m_scrubber.set_scrub_blocked(utime_t{now_c,0});
m_callbk = new LambdaContext([=, this]([[maybe_unused]] int r) {
  if (m_was_deleted) {
```
m_was_deleted would be a member of blocked_range_t, so accessing it after the blocked_range_t has been destroyed would be a use-after-free error. It would also be a race condition, since the pg lock hasn't been taken yet.

Instead, I suggest making blocked_range_t a std::shared_ptr with a reference held by PgScrubber and also by this lambda, to guarantee liveness. -- Won't actually work, since we are using blocked_range_t's lifetime to model the blocking duration.
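The shared-ownership idea above can be sketched in isolation. This is a minimal standalone model, not Ceph code: `blocked_range_t` here is just a flag holder, and `make_alarm_callback` is a hypothetical helper; the point is only that a lambda capturing a `std::shared_ptr` can never read freed memory, because the capture itself keeps the object alive.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Simplified stand-in for the scrubber's blocked-range state.
struct blocked_range_t {
  bool was_deleted = false;
};

// Build a timer callback that shares ownership of the state, so reading
// the flag later can never be a use-after-free, even if the scrubber has
// already dropped its own reference.
inline std::function<bool()> make_alarm_callback(
    std::shared_ptr<blocked_range_t> state) {
  return [state] { return state->was_deleted; };
}
```

As the follow-up note says, this conflicts with using the object's lifetime to model the blocking duration: the lambda's reference would keep the "blocked" state alive past its intended end.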
Hmm, I think there are some other problems here as well. The interval/epoch check would be sufficient if the callback were only canceled on interval change, but we can create/cancel this many times within the same interval. Need to think about this.
```cpp
if (!m_callbk) {
  return;
}
m_was_deleted = true;  // instead of accessing the sleep_timer
```
It looks like PendingTimer owns the blocked_range_t instance, and react(const Unblocked&) is responsible for invoking cancel_future_alarm(), I think.

~blocked_range_t should own invoking cancel_future_alarm(). Unblocked isn't the only way to leave RangeBlocked, and any other pathway leads to ~blocked_range_t not canceling it. I don't think it's actually a memory leak as such, since the callback will eventually be called, but it does mean that the pg fails to release the timer-queue resource.

I suspect the goal here is to avoid the callback invoking cancel_future_alarm(). Instead, modify the callback to clear the m_callbk member once the pg is locked and the interval check has been done, and modify ~blocked_range_t to skip cancel_future_alarm() if !m_callbk.
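The proposed protocol can be modeled with a toy timer. This is a hedged sketch, not Ceph code: `FakeTimer`, the lock, and the member names only illustrate the handshake being suggested -- the callback nulls `m_callbk` under the (stand-in) pg lock, and the destructor cancels only if the alarm has not fired yet.

```cpp
#include <cassert>
#include <mutex>

// Toy timer: just counts cancellations.
struct FakeTimer {
  int cancelled = 0;
  void cancel_event(void* /*cb*/) { ++cancelled; }
};

// Simplified model of the suggested ownership protocol.
struct blocked_range_t {
  std::mutex& pg_lock;  // stands in for the real PG lock
  FakeTimer& timer;
  void* m_callbk;       // non-null while the alarm is still pending

  blocked_range_t(std::mutex& l, FakeTimer& t, void* cb)
      : pg_lock(l), timer(t), m_callbk(cb) {}

  // Run by the timer: clears m_callbk under the lock, so the destructor
  // knows the alarm has already fired and must not be cancelled.
  void on_alarm() {
    std::lock_guard g{pg_lock};
    m_callbk = nullptr;
  }

  // Runs when leaving RangeBlocked (by any pathway): cancels the alarm
  // only if it is still outstanding.
  ~blocked_range_t() {
    if (m_callbk) {
      timer.cancel_event(m_callbk);
    }
  }
};
```

Putting the cancellation in the destructor covers every exit from RangeBlocked, not just Unblocked, which is the point of the suggestion.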
Not sure why this way is better. I don't consider having the callback -- disabled -- left in the sleep queue to be a major problem. It would happen only in the rare case of blocking on an object, and it would be discarded within a few seconds.
It seems like the range_blocked_t warning timeout and the no_reply_t timeout discussed in the chain at #49959 (comment) should probably behave identically.
@ronen-fr I think we should introduce a common mechanism for scheduling future events. https://github.com/athanatos/ceph/tree/sjust/wip-scrub-event-helpers-50088 has a stab at such a mechanism. It builds, but I have done precisely no testing, so it's probably buggy.
A comment I made there -- such a mechanism would need:

- a clear state of 'the callback was called' (by the timer, without being cancelled or nullified earlier). This is required if that state must be reversed later on (e.g. in the 'blocked waiting for an object' case -- telling the OSD to decrement the 'waiting' counter);
- a way to specify the behavior (a cluster message, a log message, nothing) if the PG is no longer there, or if the PG is not in the same interval when triggered.
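The two requirements listed above can be captured in a small interface sketch. This is hypothetical, not the API in the linked branch: `scrub_alarm_t`, `on_mismatch_t`, and `deliver` are illustrative names for a helper that records whether the callback actually ran and lets the caller choose what happens when the PG is gone or the interval has changed.

```cpp
#include <cassert>
#include <functional>

// What to do when the alarm fires but the PG state no longer matches.
enum class on_mismatch_t { ignore, log, message_osd };

// Hypothetical future-event helper for scrub alarms.
struct scrub_alarm_t {
  std::function<void()> on_fire;  // runs if the PG is alive and in-interval
  on_mismatch_t if_pg_gone = on_mismatch_t::ignore;
  on_mismatch_t if_new_interval = on_mismatch_t::log;
  bool fired = false;  // the 'callback was called' state, so side effects
                       // (e.g. the OSD 'waiting' counter) can be undone later

  // Deliver the alarm given the current PG/interval state; returns the
  // mismatch action the caller should take (ignore on a clean fire).
  on_mismatch_t deliver(bool pg_alive, bool same_interval) {
    if (!pg_alive) return if_pg_gone;
    if (!same_interval) return if_new_interval;
    fired = true;
    on_fire();
    return on_mismatch_t::ignore;
  }
};
```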
A more general comment: we are getting too much into the realm of 'there could be a different way to do things'. Let's discuss that one offline. |
Yes - I modified the 2nd to match the 1st. |
This patch removes several unnecessary calls to publish_stats_to_osd() in the clear_pgscrub_state pathway used during interval changes. We didn't actually have anything to publish, and we can't validly call publish_stats_to_osd() when we aren't an active primary anyway (which we can't be during an interval change).

Fixes: https://tracker.ceph.com/issues/58496
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
Modified the code to prevent deadlocks and unsafe calls to publish_stats..():
- no locking in the dtor (which is always called under the PG lock);
- a separate API to cancel the alarm before it goes off -- called under the PG lock and acquiring the sleep-timer's;
- the callback now holds a PGRef.

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
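The last bullet -- the callback holding a PGRef -- can be illustrated with a standalone sketch. This is not Ceph code: `FakePG` and `make_pg_alarm` are stand-ins, with `std::shared_ptr` modeling the intrusive PGRef; the point is that the closure's captured reference keeps the PG alive until the alarm has run.

```cpp
#include <cassert>
#include <memory>

// Stand-in for a PG; PGRef is modeled as a shared_ptr here.
struct FakePG {
  bool alarmed = false;
};
using PGRef = std::shared_ptr<FakePG>;

// The returned closure owns a PGRef for its whole lifetime, so the PG
// cannot be destroyed while the alarm callback is still in flight.
inline auto make_pg_alarm(PGRef pg) {
  return [pg] { pg->alarmed = true; };
}
```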
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
Closing this one, as mostly superseded by the post-Reef scrub PRs. |
and fix the "scrub blocked on object" alarm to prevent both
deadlocks and calls to publish_stats() where not allowed.

Note also the discussion in #49959.

Fixes: https://tracker.ceph.com/issues/58496
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>