osd/PeeringState: fix missed recheck_readable from laggy #44499
|
Ping. This fix should be trivial. And I've provided the related logs in the tracker. Please help review this. |
|
Thanks for the PR, it's a good find, and wonderful that you've got a reproducer as well! Just one minor note. It'd be even better if you could add an automated test case - one way is using the standalone bash tests (e.g. https://github.com/ceph/ceph/blob/master/qa/standalone/osd/divergent-priors.sh) |
We should not have duplicated OSD ID in `acting`. So the loop would execute once anyway.

Signed-off-by: 胡玮文 <huww98@outlook.com>
Previously, the first `pg_lease_ack_t` after becoming laggy would not trigger `recheck_readable`. However, every other ack would trigger it. The logic is inverted, causing unnecessarily long laggy PG state.

Fixes: 3bb8a72 (osd: requeue ops when PG is no longer laggy)
Fixes: https://tracker.ceph.com/issues/53806
Signed-off-by: 胡玮文 <huww98@outlook.com>
bb0ae5b to caeca39
|
@jdurgin Sorry for the delay. I missed the notification. I currently don't have enough time to look into the test. I think an automated test would be hard, since the only observable impact of this bug is a slower exit from the laggy state. Timing calculations in a test script can be unstable. |
|
@jdurgin: how about retaking a look? |
|
Hi Sam, as you reviewed #29236, I am adding you as another reviewer. |
athanatos left a comment
Yep, that looks right. Good catch!
|
failures tracked by https://tracker.ceph.com/issues/56652 |
|
@huww98 we found that this PR has caused a regression. We are reverting the Quincy backport (#48104) so we can release 17.2.4. Please see https://tracker.ceph.com/issues/57546 for more details. |
Previously, the first `pg_lease_ack_t` after becoming laggy would not trigger `recheck_readable`. However, every other ack would trigger it. The logic is inverted, causing unnecessarily long laggy PG state.

Fixes: 3bb8a72 (osd: requeue ops when PG is no longer laggy)
Fixes: https://tracker.ceph.com/issues/53806
Signed-off-by: 胡玮文 <huww98@outlook.com>
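To make the symptom described above concrete, here is a toy model of an inverted time check. It is not the Ceph code: `ToyPG`, `on_ack_inverted`, and `on_ack_expected` are hypothetical names, and the exact comparison is an assumption, chosen only because it reproduces the described behaviour (the first ack after the lease expired is skipped, and later acks trigger a recheck).

```cpp
#include <cassert>
#include <chrono>

// Stand-in for a monotonic timestamp (hypothetical, for illustration only).
using mono_ms = std::chrono::milliseconds;

// Toy model of a PG that tracks a read lease. Not Ceph code.
struct ToyPG {
  mono_ms readable_until{0};  // lease expiry
  bool laggy = false;         // true while reads are blocked on an expired lease
  int rechecks = 0;           // how often recheck_readable() ran

  // Leave the laggy state if the (already extended) lease covers `now`.
  void recheck_readable(mono_ms now) {
    ++rechecks;
    if (laggy && now < readable_until) {
      laggy = false;
    }
  }

  // Assumed inverted check: rechecks only when the old lease was still valid,
  // so the first ack after the lease expired is ignored.
  void on_ack_inverted(mono_ms now, mono_ms new_readable_until) {
    const mono_ms old_ru = readable_until;
    readable_until = new_readable_until;
    if (now < old_ru) {
      recheck_readable(now);
    }
  }

  // Assumed expected check: rechecks when the old lease had already expired,
  // which is exactly the case in which the PG may be laggy.
  void on_ack_expected(mono_ms now, mono_ms new_readable_until) {
    const mono_ms old_ru = readable_until;
    readable_until = new_readable_until;
    if (now >= old_ru) {
      recheck_readable(now);
    }
  }
};
```

In this model, a PG whose lease expired at 100 ms and that receives an ack at 150 ms stays laggy under `on_ack_inverted` until a second ack arrives, while `on_ack_expected` clears the laggy flag on the first ack. The sketch only illustrates the delay; it makes no claim about the actual condition in `PeeringState`.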