qa/standalone: improve reliability of osd-backfill tests #67295
rzarzynski merged 4 commits into ceph:main
Conversation
- Increase data size from 10MB to 100MB per object to extend backfill duration, making in-progress reservations more observable
- Add a polling loop to wait for recovery reservations to start before validation, addressing timing issues on varied CI machines

Fixes: https://tracker.ceph.com/issues/74524
Signed-off-by: Kamoltat (Junior) Sirivadhna <ksirivad@redhat.com>
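The reservation-wait described in that commit could be sketched roughly as follows. Everything here is an assumption for illustration: the function name `wait_for_backfill_start`, the state strings grepped for, and the timeout are not taken from the actual patch.

```shell
# Hypothetical sketch of a "wait for recovery/backfill reservations to start"
# loop; names, states, and timeout are assumptions, not the PR's actual code.
wait_for_backfill_start() {
    local timeout=${1:-60}
    local i states
    for ((i = 0; i < timeout; i++)); do
        # Grab current PG states; with the larger 100MB objects the
        # backfilling window is long enough for a 1s poll to observe it.
        states=$(ceph pg dump pgs 2>/dev/null)
        if echo "$states" | grep -qE 'backfill|recover'; then
            return 0    # a reservation was granted and work is in flight
        fi
        sleep 1
    done
    return 1            # timed out without observing an in-progress reservation
}
```

The point of the loop is simply to stop validating on a fixed schedule and instead gate validation on the cluster actually reaching the state under test.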
force-pushed from 8701a02 to ec96481
So we actually fixed the failures in osd-backfill-prio.sh, which is the first script that gets executed in standalone/osd-backfill. We made progress but hit another failure in osd-backfill-recovery-log:
/a/ksirivad-2026-02-10_18:47:48-rados:standalone-main-distro-default-trial/43379/teuthology.log
Furthermore, there are times where osd-backfill-recovery-log passed, but we fail in /a/ksirivad-2026-02-10_18:47:48-rados:standalone-main-distro-default-trial/43380/teuthology.log
…ry-log

Add a 30-second polling loop after flush_pg_stats to wait for PG log and duplicate entries to be trimmed to their expected sizes before validation. This addresses timing issues where the test was inspecting the objectstore before log trimming operations completed.

The loop polls `ceph pg query` to check both log and dups lengths, breaking when both reach or fall below expected thresholds. This prevents spurious test failures on varied teuthology machines where log trimming happens at different speeds after recovery completes.

Solves intermittent failures where logs showed 50 entries instead of the expected 2, and dups showed 7 instead of the expected 8.

Fixes: https://tracker.ceph.com/issues/74524
Signed-off-by: Kamoltat (Junior) Sirivadhna <ksirivad@redhat.com>
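That trimming wait can be illustrated roughly like this. The field names (`ondisk_log_size`, `dups_len`), the crude grep-based JSON parsing, and the function name are assumptions made for the sketch, not the actual patch:

```shell
# Crude integer-field extractor for JSON on stdin (illustrative only).
json_int() {
    grep -o "\"$1\": *[0-9][0-9]*" | head -1 | grep -o '[0-9]*$'
}

# Hypothetical sketch of the trimming wait; the real loop in the patch
# and the actual `ceph pg query` field names may differ.
wait_for_log_trim() {
    local pgid=$1 loglen=$2 dupslen=$3 timeout=${4:-30}
    local i out cur_log cur_dups
    for ((i = 0; i < timeout; i++)); do
        out=$(ceph pg "$pgid" query)
        cur_log=$(echo "$out" | json_int ondisk_log_size)
        cur_dups=$(echo "$out" | json_int dups_len)
        # Break once both lengths have trimmed down to their expected bounds.
        if [ -n "$cur_log" ] && [ "$cur_log" -le "$loglen" ] &&
           [ -n "$cur_dups" ] && [ "$cur_dups" -le "$dupslen" ]; then
            return 0
        fi
        sleep 1
    done
    return 1
}
```

With this shape, a machine where trimming takes 20 seconds and one where it takes 2 both pass, instead of only the fast one.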
force-pushed from 75a3ba7 to 6a02f6b
For failed osd-backfill-recovery-log.sh tests, we sometimes ran into the following situation:
Problem: In osd-backfill-recovery-logs.sh we were trying to do `ceph osd out` on an empty value when `ceph pg dump pgs` returned nothing.

Solution: In ceph-helpers.sh we created wait_for_pg_data() to wait 30 seconds (by default) for `ceph pg dump pgs` to return some values before proceeding.

Fixes: https://tracker.ceph.com/issues/74524
Signed-off-by: Kamoltat (Junior) Sirivadhna <ksirivad@redhat.com>
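A minimal sketch of what such a helper could look like. The real wait_for_pg_data() lives in qa/standalone/ceph-helpers.sh; this version, including its jq-filter argument and 30-second default, is an assumption based on the commit message:

```shell
# Illustrative sketch of a wait_for_pg_data()-style helper; the actual
# implementation in ceph-helpers.sh may differ in detail.
wait_for_pg_data() {
    local jq_filter=$1
    local timeout=${2:-30}
    local i value
    for ((i = 0; i < timeout; i++)); do
        value=$(ceph pg dump pgs --format=json 2>/dev/null |
                jq -r "$jq_filter" 2>/dev/null)
        if [ -n "$value" ] && [ "$value" != "null" ]; then
            echo "$value"   # caller gets the requested field(s), e.g. up OSD ids
            return 0
        fi
        sleep 1
    done
    return 1                # pg dump never produced data within the timeout
}
```

Returning the extracted value (rather than just success/failure) lets callers consume it directly instead of re-running `ceph pg dump pgs` and racing the same window again.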
Made some progress by patching osd-backfill-recovery-log.sh with a wait for pg dumps: https://pulpito.ceph.com/ksirivad-2026-02-11_19:21:20-rados:standalone-main-distro-default-trial/

Now we just have to figure out why we are failing the test that simulates 1 OSD (the target OSD) receiving 2 backfills at the same time from 2 PGs that share the same target OSD as a replica. The target OSD is almost at capacity, and the expectation is that 1 PG should end up active+clean (successful backfill) while the other should end up backfill_toofull, since it fails to backfill because the OSD is at capacity.
Why osd.0 does not enter backfill_toofull for PG 4.0 and 1.0

In this test scenario, osd.0 is the replica for both:
- PG 4.0 → acting [4,0] (primary 4)
- PG 1.0 → acting [1,0] (primary 1)

Although osd.0 logs `_tentative_full type backfillfull adjust_used 2400KiB` for both PGs, it never rejects the backfill reservations. Instead, it:
1. Receives MBackfillReserve(REQUEST) from the primary.
2. Responds with MBackfillReserve(GRANT) for both PGs.
3. Proceeds with recovery/backfill traffic (MOSDPGPush).
4. Later receives MBackfillReserve(RELEASE) from the primaries.
5. PGs converge to active+clean.

backfill_toofull is triggered on the primary when the replica rejects the reservation or is marked backfillfull/full in the OSDMap. In this run, osd.0 never denies the reservation, so the primary has no reason to transition the PG into backfill_toofull.

Conclusion

This behavior is expected given the current thresholds and reservation logic. To deterministically trigger backfill_toofull, the replica must:
- be marked backfillfull/full in the OSDMap before reservation, or
- explicitly deny MBackfillReserve due to space constraints.

In this test, neither condition occurred, so backfill_toofull was not entered.
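Given that conclusion, one hedged way to force the first condition in a test would be to lower the cluster's backfillfull ratio before triggering the backfill, so the replica is already over the threshold when the reservation arrives. The ratios below are purely illustrative, not values from this PR:

```shell
# Illustrative only: push the replica over the backfillfull threshold before
# the reservation arrives, so the primary transitions to backfill_toofull.
ceph osd set-nearfull-ratio 0.08
ceph osd set-backfillfull-ratio 0.10   # replica usage now exceeds the ratio
# ...then trigger the backfill; the reservation should be denied for space
```

This mirrors the first bullet of the conclusion (replica marked backfillfull in the OSDMap); the second route would be filling the OSD enough that it explicitly denies MBackfillReserve.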
force-pushed from 6bc7dd3 to b90f596
This PR is ready for review; teuthology tests passed 20/20.
Problem: TEST_backfill_test_sametarget fails to keep 1 PG at backfill_toofull and 1 PG at active+clean because not enough time is given between one backfill and the other. TEST_ec_backfill_multi fails similarly, with not enough time between one backfill and the other.

Solution: Give a larger time gap between one backfill request and the next using the `sleep <duration>` bash command.

Fixes: https://tracker.ceph.com/issues/74524
Signed-off-by: Kamoltat (Junior) Sirivadhna <ksirivad@redhat.com>
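The fix amounts to staggering the two triggers instead of issuing them back-to-back. The OSD ids and the sleep duration below are illustrative, not the values the tests actually use:

```shell
# Illustrative spacing of the two backfill triggers onto the shared target.
ceph osd out 4        # PG A starts backfilling onto the near-full target OSD
sleep 15              # let backfill A reserve and consume space on the target
ceph osd out 1        # PG B's backfill now finds the target too full
```

Without the gap, both reservations can land before either backfill has consumed space, so neither PG observes the toofull condition.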
force-pushed from b90f596 to 29fb42e
```sh
        return 1
    fi
    count=$(expr $count + 1)
done
```

```sh
        echo "WARNING: Log trimming timeout after ${TIMEOUT}s - log=$current_log_len (expected <=$loglen), dups=$current_dups_len (expected <=$dupslen)"
        break
    fi
done
```
```sh
ceph osd out $(ceph pg dump pgs --format=json | jq '.pg_stats[0].up[]')
```
```sh
# Wait for PG to be visible and mark out all OSDs for this pool
local pg_up_osds=$(wait_for_pg_data '.pg_stats[0].up[]') || return 1
```
Rados Approved. |