Skip to content

osd/PeeringState: fix missed recheck_readable from laggy#44499

Merged
tchaikov merged 2 commits intoceph:mainfrom
SMIL-Infra:fix-missed-lease-ack
Jul 27, 2022
Merged

osd/PeeringState: fix missed recheck_readable from laggy#44499
tchaikov merged 2 commits intoceph:mainfrom
SMIL-Infra:fix-missed-lease-ack

Conversation

@huww98
Copy link
Contributor

@huww98 huww98 commented Jan 7, 2022

Previously, the first pg_lease_ack_t after becoming laggy would not trigger recheck_readable. However, every other ack would trigger it. The logic is inverted, causing unnecessarily long laggy PG state.

Fixes: 3bb8a72 (osd: requeue ops when PG is no longer laggy)
Fixes: https://tracker.ceph.com/issues/53806
Signed-off-by: 胡玮文 huww98@outlook.com

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@huww98
Copy link
Contributor Author

huww98 commented Feb 7, 2022

Ping. This fix should be trivial. And I've provided the related logs in the tracker. Please help review this.

@tchaikov tchaikov self-requested a review February 7, 2022 16:45
@jdurgin
Copy link
Member

jdurgin commented Feb 7, 2022

Thanks for the PR, it's a good find, and wonderful that you've got a reproducer as well! Just one minor note.

It'd be even better if you could add an automated test case - one way is using the standalone bash tests (e.g. https://github.com/ceph/ceph/blob/master/qa/standalone/osd/divergent-priors.sh)

We should not have duplicated OSD ID in `acting`. So the loop would
execute once anyway.

Signed-off-by: 胡玮文 <huww98@outlook.com>
Previously, the first `pg_lease_ack_t` after becoming laggy would not
trigger `recheck_readable`. However, every other ack would trigger it.
The logic is inverted, causing unnecessarily long laggy PG state.

Fixes: 3bb8a72 (osd: requeue ops when PG is no longer laggy)
Fixes: https://tracker.ceph.com/issues/53806
Signed-off-by: 胡玮文 <huww98@outlook.com>
@huww98 huww98 force-pushed the fix-missed-lease-ack branch from bb0ae5b to caeca39 Compare February 24, 2022 09:26
@huww98 huww98 requested a review from jdurgin February 24, 2022 09:27
@huww98
Copy link
Contributor Author

huww98 commented Feb 24, 2022

@jdurgin Sorry for the delay. I missed the notification.

I currently don't have enough time to look into the test. I think the automated test can be hard since the observable impact of this bug is only a slower exit from laggy state. Calculating the timing in the test script can be unstable.

@rzarzynski
Copy link
Contributor

rzarzynski commented Apr 13, 2022

@jdurgin: how about retaking a look?
We could separate writing the test and put it into some kind of backlog with tests-we-want-to-have.

@djgalloway djgalloway changed the base branch from master to main May 27, 2022 16:10
@huww98
Copy link
Contributor Author

huww98 commented Jun 29, 2022

Ping @tchaikov @jdurgin . Could you help review this 2 line fix? We would like this to be in the next point release of quincy.

@tchaikov tchaikov requested a review from athanatos June 29, 2022 14:00
@tchaikov
Copy link
Contributor

hi Sam, as you reviewed #29236 , i am adding you as another reviewer.

Copy link
Contributor

@athanatos athanatos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that looks right. Good catch!

@tchaikov
Copy link
Contributor

@ljflores
Copy link
Member

@huww98 we found that this PR has caused a regression. We are reverting the Quincy backport so we can release 17.2.4: #48104

Please see https://tracker.ceph.com/issues/57546 for more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants