test: repeer_on_down_acting_member_coming_back is continuously failing#65433

Merged
SrinivasaBharath merged 1 commit into ceph:main from mohit84:repeer_on_acting on Dec 1, 2025
@mohit84 mohit84 commented Sep 8, 2025

The script does not correctly validate the acting set after pg-upmap is reset, so the test fails.

  1. Replace the blind 2s sleep with a loop that checks
    whether OSD.2 is part of the acting set.
  2. Validate the presence of OSD.2 in the acting set only while
    recovery is in progress.
  3. Update the log message to clarify that it validates the received
    pg_temp change.
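
As a rough illustration of item 1, a polling loop of the following shape could replace a fixed sleep. This is a minimal sketch under stated assumptions, not the code from the patch: `get_acting_set`, `wait_for_osd_in_acting`, and the `FAKE_ACTING` variable are hypothetical names, and the stub stands in for whatever cluster query the real test uses.

```shell
# Hypothetical helper: print the acting set of a PG, one OSD id per line.
# A real test would query the cluster here; this stub reads FAKE_ACTING
# so that the sketch can run without a cluster.
get_acting_set() {
    local pgid="$1"
    printf '%s\n' ${FAKE_ACTING:-}
}

# Poll until the given OSD id appears in the acting set of the PG,
# or give up after `timeout` seconds.
# Returns 0 if the OSD was seen, 1 on timeout.
wait_for_osd_in_acting() {
    local pgid="$1"
    local osd="$2"
    local timeout="${3:-30}"
    local interval="${4:-1}"
    local elapsed=0

    while [ "$elapsed" -lt "$timeout" ]; do
        # -x matches the whole line, so "2" does not match "12" or "20".
        if get_acting_set "$pgid" | grep -qx "$osd"; then
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 1
}
```

A caller would then write something like `wait_for_osd_in_acting 1.0 2 30 1 || return 1` where the fixed `sleep 2` used to be, so the check no longer depends on how quickly the cluster reacts.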

Fixes: https://tracker.ceph.com/issues/70949


Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
@mohit84 mohit84 requested a review from a team as a code owner September 8, 2025 11:56

@bill-scales bill-scales left a comment

The fix looks good: it eliminates the race in this test by coping with the scenario where recovery completes within 2 seconds.
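
To make the race concrete, the decision taken on each poll can be sketched as below. This is an illustrative assumption about the logic, not the merged script: `classify_poll` is a hypothetical helper, and treating any state string containing `recover` or `backfill` as "recovery in progress" is an assumption as well.

```shell
# Classify one poll of the PG.
#   $1        PG state string, e.g. "active+recovering" or "active+clean"
#   $2        OSD id expected in the acting set during recovery
#   $3...     current acting set members
# Prints:
#   skip     recovery is no longer in progress, so the acting-set check is skipped
#   ok       recovery is in progress and the OSD is in the acting set
#   pending  recovery is in progress but the OSD is not in the acting set yet
classify_poll() {
    local state="$1"
    local osd="$2"
    shift 2

    case "$state" in
        *recover*|*backfill*) ;;   # recovery still running: go on to the check
        *) echo skip; return 0 ;;  # recovery already finished: nothing to validate
    esac

    local member
    for member in "$@"; do
        if [ "$member" = "$osd" ]; then
            echo ok
            return 0
        fi
    done
    echo pending
    return 0
}
```

With a fixed 2s sleep, a fast recovery would reach the `skip` case before the check ran, and a check that always insisted on finding the OSD would then fail even though nothing was wrong.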

See #65831 for an alternative fix that uses the backfill_toofull error inject to force the PG to stay stuck in the recovering state, so that the script can verify the acting set before clearing the error inject and allowing recovery to complete.

@ljflores
Member

jenkins test api

@ljflores
Member

jenkins test make check arm64

@SrinivasaBharath SrinivasaBharath merged commit 13f95de into ceph:main Dec 1, 2025
29 of 32 checks passed