
qa/standalone: improve reliability of osd-backfill tests #67295

Merged
rzarzynski merged 4 commits into ceph:main from kamoltat:wip-ksirivad-fix-74524
Feb 23, 2026

Conversation

@kamoltat
Member

@kamoltat kamoltat commented Feb 10, 2026

Problem:

  • Currently, the osd-backfill-prio test sometimes fails to capture in-progress remote and local reservations; we suspect this inconsistency stems from the lab migration, where the hardware is now faster/slower.
  • osd-backfill-recovery-log sometimes fails to trim PG logs and create the correct PG dups.
  • osd-backfill-space sometimes fails because backfill requests do not have enough time between each other when we backfill 1 OSD with 2 PGs simultaneously; therefore, we did not get the expected result (1 PG active+clean, 1 PG backfill-toofull).

Solution:

  • For osd-backfill-prio, we increase the data size from 10MB to 100MB per object to extend the backfill duration, making in-progress reservations more observable.
  • We add a polling loop to wait for recovery reservations to start before validation, addressing timing issues on varied teuthology test machines.
  • For osd-backfill-recovery-log, we add a 30-second polling loop after flush_pg_stats to wait for the PG log and duplicate entries to be trimmed to their expected sizes before validation. This addresses timing issues where the test inspected the objectstore before log trimming completed. The loop polls `ceph pg query` to check both the log and dups lengths, breaking when both reach or fall below the expected thresholds. This prevents spurious test failures on varied teuthology machines where log trimming happens at different speeds after recovery completes, and solves intermittent failures where the log showed 50 entries instead of the expected 2, and dups showed 7 instead of the expected 8.
  • For osd-backfill-space, we sleep between backfill requests so that the test correctly creates a scenario where 1 PG backfills completely and the other PG fails to backfill due to backfill-toofull.
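
The reservation wait described above boils down to a poll-with-timeout pattern, which can be sketched as follows. `get_reservation_count` is a hypothetical stub standing in for whatever query the test actually issues (e.g. dumping reservation state from an OSD admin socket); it is not the real helper's name.

```shell
#!/usr/bin/env bash
# Sketch of the poll-with-timeout pattern used to wait for in-progress
# reservations before validating them.  get_reservation_count is a
# hypothetical stub, not the query the real test uses.
get_reservation_count() {
    echo 1   # stubbed: pretend one reservation is already in progress
}

wait_for_reservations() {
    local timeout=${1:-30}
    local i
    for ((i = 0; i < timeout; i++)); do
        if [ "$(get_reservation_count)" -gt 0 ]; then
            return 0   # reservations observed; safe to validate
        fi
        sleep 1
    done
    return 1           # timed out without seeing a reservation
}

wait_for_reservations 5 && echo "reservations observed"
```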

Fixes: https://tracker.ceph.com/issues/74524


- Increase data size from 10MB to 100MB per object to extend backfill
  duration, making in-progress reservations more observable
- Add polling loop to wait for recovery reservations to start before
  validation, addressing timing issues on varied CI machines

Fixes: https://tracker.ceph.com/issues/74524

Signed-off-by: Kamoltat (Junior) Sirivadhna <ksirivad@redhat.com>
@kamoltat kamoltat force-pushed the wip-ksirivad-fix-74524 branch from 8701a02 to ec96481 Compare February 10, 2026 18:19
@kamoltat kamoltat marked this pull request as ready for review February 10, 2026 18:20
@kamoltat kamoltat requested a review from a team as a code owner February 10, 2026 18:20
@kamoltat kamoltat added the needs-review, needs-squid-backport (PR needs a squid backport), and needs-tentacle-backport (PRs that need tentacle back-port) labels Feb 10, 2026
@kamoltat
Member Author

kamoltat commented Feb 10, 2026

Running tests: https://pulpito.ceph.com/ksirivad-2026-02-10_18:47:48-rados:standalone-main-distro-default-trial/

So we actually fixed the failures in osd-backfill-prio.sh, which is the first script that gets executed in standalone/osd-backfill. We made progress but hit another failure in:

osd-backfill-recovery-log
TL;DR:

  1. We expect pg_log to be trimmed to length 2 but got length 50 instead.
  2. We expect a dups length of 8 but got 7.

/a/ksirivad-2026-02-10_18:47:48-rados:standalone-main-distro-default-trial/43379/teuthology.log

2026-02-10T19:17:59.393 INFO:tasks.workunit.client.0.trial196.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:85: _common_test:  jq '.pg_log_t.log | length' td/osd-backfill-recovery-log/result.log
2026-02-10T19:17:59.404 INFO:tasks.workunit.client.0.trial196.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:85: _common_test:  LOGLEN=50
2026-02-10T19:17:59.404 INFO:tasks.workunit.client.0.trial196.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:86: _common_test:  '[' 50 '!=' 2 ']'
2026-02-10T19:17:59.404 INFO:tasks.workunit.client.0.trial196.stdout:FAILED: Wrong log length got 50 (expected 2)
2026-02-10T19:17:59.405 INFO:tasks.workunit.client.0.trial196.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:87: _common_test:  echo 'FAILED: Wrong log length got 50 (expected 2)'
2026-02-10T19:17:59.405 INFO:tasks.workunit.client.0.trial196.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:88: _common_test:  expr 0 + 1
2026-02-10T19:17:59.405 INFO:tasks.workunit.client.0.trial196.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:88: _common_test:  ERRORS=1
2026-02-10T19:17:59.406 INFO:tasks.workunit.client.0.trial196.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:90: _common_test:  jq '.pg_log_t.dups | length' td/osd-backfill-recovery-log/result.log
2026-02-10T19:17:59.416 INFO:tasks.workunit.client.0.trial196.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:90: _common_test:  DUPSLEN=7
2026-02-10T19:17:59.417 INFO:tasks.workunit.client.0.trial196.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:91: _common_test:  '[' 7 '!=' 8 ']'
2026-02-10T19:17:59.417 INFO:tasks.workunit.client.0.trial196.stdout:FAILED: Wrong dups length got 7 (expected 8)
2026-02-10T19:17:59.417 INFO:tasks.workunit.client.0.trial196.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:92: _common_test:  echo 'FAILED: Wrong dups length got 7 (expected 8)'
2026-02-10T19:17:59.417 INFO:tasks.workunit.client.0.trial196.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:93: _common_test:  expr 1 + 1
2026-02-10T19:17:59.418 INFO:tasks.workunit.client.0.trial196.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:93: _common_test:  ERRORS=2
2026-02-10T19:17:59.418 INFO:tasks.workunit.client.0.trial196.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:95: _common_test:  grep 'copy_up_to\|copy_after' td/osd-backfill-recovery-log/osd.0.log td/osd-backfill-recovery-log/osd.1.log td/osd-backfill-recovery-log/osd.2.log td/osd-backfill-recovery-log/osd.3.log td/osd-backfill-recovery-log/osd.4.log td/osd-backfill-recovery-log/osd.5.log
2026-02-10T19:17:59.426 INFO:tasks.workunit.client.0.trial196.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:96: _common_test:  rm -f td/osd-backfill-recovery-log/result.log
2026-02-10T19:17:59.427 INFO:tasks.workunit.client.0.trial196.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:97: _common_test:  '[' 2 '!=' 0 ']'
2026-02-10T19:17:59.427 INFO:tasks.workunit.client.0.trial196.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:98: _common_test:  echo TEST FAILED
2026-02-10T19:17:59.427 INFO:tasks.workunit.client.0.trial196.stdout:TEST FAILED
2026-02-10T19:17:59.428 INFO:tasks.workunit.client.0.trial196.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:99: _common_test:  return 1
2026-02-10T19:17:59.429 INFO:tasks.workunit.client.0.trial196.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:34: run:  return 1

Furthermore, there are times when osd-backfill-recovery-log passed but we failed in osd-backfill-space.sh:

/a/ksirivad-2026-02-10_18:47:48-rados:standalone-main-distro-default-trial/43380/teuthology.log

…ry-log

Add a 30-second polling loop after flush_pg_stats to wait for PG log
and duplicate entries to be trimmed to their expected sizes before
validation. This addresses timing issues where the test was inspecting
the objectstore before log trimming operations completed.

The loop polls 'ceph pg query' to check both log and dups lengths,
breaking when both reach or fall below expected thresholds. This
prevents spurious test failures on varied teuthology machines where log
trimming happens at different speed after recovery completes.

Solves intermittent failures where logs showed 50 entries instead of
expected 2, and dups showed 7 instead of expected 8.

Fixes: https://tracker.ceph.com/issues/74524

Signed-off-by: Kamoltat (Junior) Sirivadhna <ksirivad@redhat.com>
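
The trim wait described in this commit message can be sketched like so. `pg_log_len` and `pg_dups_len` are hypothetical stubs standing in for the real `ceph pg query ... | jq '...'` calls (the exact field paths are not shown here); the warning text mirrors the one the loop emits on timeout.

```shell
#!/usr/bin/env bash
# Sketch of the post-flush_pg_stats trim wait: poll until both the PG log
# and dups lengths fall to (or below) their expected sizes, with a capped
# timeout.  pg_log_len/pg_dups_len are hypothetical stubs for the real
# 'ceph pg query' + jq pipeline.
pg_log_len()  { echo 2; }   # stubbed: log already trimmed to 2 entries
pg_dups_len() { echo 8; }   # stubbed: dups already trimmed to 8 entries

wait_for_trim() {
    local loglen=$1 dupslen=$2 timeout=${3:-30}
    local i cur_log cur_dups
    for ((i = 0; i < timeout; i++)); do
        cur_log=$(pg_log_len)
        cur_dups=$(pg_dups_len)
        if [ "$cur_log" -le "$loglen" ] && [ "$cur_dups" -le "$dupslen" ]; then
            return 0   # both thresholds reached; safe to inspect objectstore
        fi
        sleep 1
    done
    echo "WARNING: Log trimming timeout after ${timeout}s - log=$cur_log (expected <=$loglen), dups=$cur_dups (expected <=$dupslen)"
    return 1
}

wait_for_trim 2 8 5 && echo "trimmed"
```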
@kamoltat kamoltat force-pushed the wip-ksirivad-fix-74524 branch from 75a3ba7 to 6a02f6b Compare February 10, 2026 21:10
@kamoltat kamoltat marked this pull request as draft February 11, 2026 15:48
@kamoltat
Member Author

For failed osd-backfill-recovery-log.sh tests, we sometimes ran into a situation where pg dump returns nothing because no PGs have been created yet:

2026-02-10T21:27:51.002 INFO:tasks.workunit.client.0.trial138.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:68: _common_test:  ceph pg dump pgs --format=json
2026-02-10T21:27:51.002 INFO:tasks.workunit.client.0.trial138.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:68: _common_test:  jq '.pg_stats[0].up[]'
2026-02-10T21:27:51.217 INFO:tasks.workunit.client.0.trial138.stderr:dumped pgs
2026-02-10T21:27:51.224 INFO:tasks.workunit.client.0.trial138.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:68: _common_test:  ceph osd out
2026-02-10T21:27:51.390 INFO:tasks.workunit.client.0.trial138.stderr:Invalid command: saw 0 of ids(<string>)..., expected at least 1
2026-02-10T21:27:51.391 INFO:tasks.workunit.client.0.trial138.stderr:osd out <ids>... :  set osd(s) <id> [<id>...] out, or use <any|all> to set all osds out
2026-02-10T21:27:51.391 INFO:tasks.workunit.client.0.trial138.stderr:Error EINVAL: invalid command

Problem:
In osd-backfill-recovery-log.sh we were
trying to do `ceph osd out` on an empty value
when `ceph pg dump pgs` returned nothing.

Solution:

In ceph-helpers.sh we created wait_for_pg_data()
to wait up to 30 seconds (default) for `pg dump pgs`
to return some values before proceeding.

Fixes: https://tracker.ceph.com/issues/74524

Signed-off-by: Kamoltat (Junior) Sirivadhna <ksirivad@redhat.com>
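
The shape of a wait_for_pg_data-style helper can be sketched as below. `pg_dump_filter` is a hypothetical stub for `ceph pg dump pgs --format=json | jq "$filter"`, which can legitimately return nothing while no PGs exist yet; the caller bails out rather than feeding an empty value to `ceph osd out`.

```shell
#!/usr/bin/env bash
# Sketch of waiting for 'pg dump pgs' to return data before using it.
# pg_dump_filter is a hypothetical stub for the real ceph+jq pipeline.
pg_dump_filter() { echo "4 0"; }   # stubbed: pretend the PG is up on osds 4,0

wait_for_pg_data() {
    local filter=$1 timeout=${2:-30}
    local i out
    for ((i = 0; i < timeout; i++)); do
        out=$(pg_dump_filter "$filter")
        if [ -n "$out" ]; then
            echo "$out"
            return 0   # got data; hand it to the caller
        fi
        sleep 1
    done
    return 1           # timed out with no PG data
}

# Guards against calling 'ceph osd out' with an empty argument list:
up_osds=$(wait_for_pg_data '.pg_stats[0].up[]') || exit 1
echo "up OSDs: $up_osds"
```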
@kamoltat
Member Author

kamoltat commented Feb 11, 2026

Made some progress by patching osd-backfill-recovery-log.sh with wait for pg dumps:

https://pulpito.ceph.com/ksirivad-2026-02-11_19:21:20-rados:standalone-main-distro-default-trial/

Now we just have to figure out osd-backfill-space.sh

We are failing the test that simulates 1 OSD (the target OSD) receiving 2 backfills at the same time from 2 PGs that share the same target OSD as a replica. The target OSD is almost at capacity, and the expectation is that 1 PG should be active+clean (successful backfill) while the other should be backfill_toofull, since it fails to backfill due to the OSD being at full capacity.
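
The eventual fix (staggering the two backfill triggers with a sleep) can be sketched as follows. `trigger_backfill` is a hypothetical stub for whatever the test does to start a backfill against the shared target OSD; the real test uses a longer gap.

```shell
#!/usr/bin/env bash
# Sketch of staggering backfill requests so the first PG can consume space
# on the shared target OSD before the second reservation is evaluated.
# trigger_backfill is a hypothetical stub, not the real test helper.
trigger_backfill() { echo "backfill requested for $1"; }

for pg in 1.0 4.0; do
    trigger_backfill "$pg"
    sleep 1   # gap between requests; the real test sleeps longer
done
echo "both backfills requested"
```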

@kamoltat
Member Author

kamoltat commented Feb 12, 2026

Why osd.0 does not enter backfill_toofull for PG 4.0 and 1.0

In this test scenario, osd.0 is the replica for both:

PG 4.0 → acting [4,0] (primary 4)

PG 1.0 → acting [1,0] (primary 1)

Although osd.0 logs _tentative_full type backfillfull adjust_used 2400KiB for both PGs, it never rejects the backfill reservations. Instead, it:

  1. Receives MBackfillReserve(REQUEST) from the primary.
  2. Responds with MBackfillReserve(GRANT) for both PGs.
  3. Proceeds with recovery/backfill traffic (MOSDPGPush).
  4. Later receives MBackfillReserve(RELEASE) from the primaries.
  5. PGs converge to active+clean.

backfill_toofull is triggered on the primary when the replica rejects the reservation or is marked backfillfull/full in the OSDMap. In this run, osd.0 never denies the reservation, so the primary has no reason to transition the PG into backfill_toofull.

The _tentative_full message seen in osd.0's log is a local projected-space assessment, not a hard reservation denial or an ENOSPC condition. Since both reservations were granted, recovery proceeded (effectively serialized), and both PGs completed recovery successfully.

Conclusion

This behavior is expected given the current thresholds and reservation logic. To deterministically trigger backfill_toofull, the replica must:

  1. Be marked backfillfull/full in the OSDMap before reservation, or
  2. Explicitly deny MBackfillReserve due to space constraints.

In this test, neither condition occurred, so backfill_toofull was not entered.

@kamoltat kamoltat force-pushed the wip-ksirivad-fix-74524 branch 3 times, most recently from 6bc7dd3 to b90f596 Compare February 12, 2026 22:19
@kamoltat kamoltat changed the title qa/standalone: improve reliability of osd-backfill-prio test qa/standalone: improve reliability of osd-backfill test Feb 12, 2026
@kamoltat kamoltat changed the title qa/standalone: improve reliability of osd-backfill test qa/standalone: improve reliability of osd-backfill tests Feb 12, 2026
@kamoltat
Member Author

This PR is ready for review; the teuthology tests passed 20/20:
https://pulpito.ceph.com/ksirivad-2026-02-12_19:01:25-rados:standalone-main-distro-default-trial/

@kamoltat kamoltat marked this pull request as ready for review February 12, 2026 22:21
@kamoltat kamoltat self-assigned this Feb 12, 2026
Problem:
In TEST_backfill_test_sametarget
the test fails to keep 1 PG at backfill-toofull
and 1 PG at active+clean because not enough
time is given between one backfill and the other

In TEST_ec_backfill_multi,
similar to test_sametarget above, not enough
time between one backfill and the other

Solution:
Leave a larger time gap between one backfill request
and the next using the `sleep <duration>` bash command

Fixes: https://tracker.ceph.com/issues/74524

Signed-off-by: Kamoltat (Junior) Sirivadhna <ksirivad@redhat.com>
@kamoltat kamoltat force-pushed the wip-ksirivad-fix-74524 branch from b90f596 to 29fb42e Compare February 17, 2026 19:52
return 1
fi
count=$(expr $count + 1)
done

ACK.

echo "WARNING: Log trimming timeout after ${TIMEOUT}s - log=$current_log_len (expected <=$loglen), dups=$current_dups_len (expected <=$dupslen)"
break
fi
done

ACK.

ceph osd out $(ceph pg dump pgs --format=json | jq '.pg_stats[0].up[]')

# Wait for PG to be visible and mark out all OSDs for this pool
local pg_up_osds=$(wait_for_pg_data '.pg_stats[0].up[]') || return 1

ACK.

@sseshasa
Contributor

@rzarzynski rzarzynski merged commit 157336f into ceph:main Feb 23, 2026
13 checks passed

Labels

core · needs-qa · needs-squid-backport (PR needs a squid backport) · needs-tentacle-backport (PRs that need tentacle back-port) · tests
