
qa/standalone: improve reliability of osd-backfill tests #67295

Merged
rzarzynski merged 4 commits into ceph:main from kamoltat:wip-ksirivad-fix-74524
Feb 23, 2026

Conversation

@kamoltat
Member

@kamoltat kamoltat commented Feb 10, 2026

Problem:

  • Currently, the osd-backfill-prio test sometimes fails to capture in-progress remote and local reservations; we suspect this inconsistency stems from the lab migration, where the hardware is now faster/slower.
  • osd-backfill-recovery-log sometimes fails to trim PG logs and create the correct PG dups.
  • osd-backfill-space sometimes fails because backfill requests do not have enough time between each other when we backfill 1 OSD with 2 PGs simultaneously; therefore, we did not get the expected result (1 PG active+clean, 1 PG backfill-toofull).

Solution:

  • For osd-backfill-prio, we increase the data size from 10MB to 100MB per object to extend the backfill duration, making in-progress reservations more observable.
  • We add a polling loop to wait for recovery reservations to start before validation, addressing timing issues on varied teuthology test machines.
  • For osd-backfill-recovery-log, we add a 30-second polling loop after flush_pg_stats to wait for the PG log and duplicate entries to be trimmed to their expected sizes before validation. This addresses timing issues where the test inspected the objectstore before log trimming completed. The loop polls `ceph pg query` to check both the log and dups lengths, breaking when both reach or fall below the expected thresholds. This prevents spurious test failures on varied teuthology machines where log trimming happens at different speeds after recovery completes, and solves intermittent failures where the log showed 50 entries instead of the expected 2, and dups showed 7 instead of the expected 8.
  • For osd-backfill-space, we sleep between backfill requests so that the test correctly creates a scenario where 1 PG backfills completely and the other PG fails to backfill due to backfill-toofull.
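
The reservation wait described above boils down to a poll-with-timeout pattern, which can be sketched as follows. `get_reservation_count` is a hypothetical stub standing in for whatever query the test actually issues (e.g. dumping reservation state from an OSD admin socket); it is not the real helper's name.

```shell
#!/usr/bin/env bash
# Sketch of the poll-with-timeout pattern used to wait for in-progress
# reservations before validating them.  get_reservation_count is a
# hypothetical stub, not the query the real test uses.
get_reservation_count() {
    echo 1   # stubbed: pretend one reservation is already in progress
}

wait_for_reservations() {
    local timeout=${1:-30}
    local i
    for ((i = 0; i < timeout; i++)); do
        if [ "$(get_reservation_count)" -gt 0 ]; then
            return 0   # reservations observed; safe to validate
        fi
        sleep 1
    done
    return 1           # timed out without seeing a reservation
}

wait_for_reservations 5 && echo "reservations observed"
```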

Fixes: https://tracker.ceph.com/issues/74524


- Increase data size from 10MB to 100MB per object to extend backfill
  duration, making in-progress reservations more observable
- Add polling loop to wait for recovery reservations to start before
  validation, addressing timing issues on varied CI machines

Fixes: https://tracker.ceph.com/issues/74524

Signed-off-by: Kamoltat (Junior) Sirivadhna <ksirivad@redhat.com>
@kamoltat kamoltat force-pushed the wip-ksirivad-fix-74524 branch from 8701a02 to ec96481 Compare February 10, 2026 18:19
@kamoltat kamoltat marked this pull request as ready for review February 10, 2026 18:20
@kamoltat kamoltat requested a review from a team as a code owner February 10, 2026 18:20
@kamoltat kamoltat added the needs-review, needs-squid-backport (PR needs a squid backport), and needs-tentacle-backport (PRs that need tentacle back-port) labels Feb 10, 2026
@kamoltat
Member Author

kamoltat commented Feb 10, 2026

Running tests: https://pulpito.ceph.com/ksirivad-2026-02-10_18:47:48-rados:standalone-main-distro-default-trial/

So we actually fixed the failures in osd-backfill-prio.sh, which is the first script that gets executed in standalone/osd-backfill. We made progress but hit another failure in:

osd-backfill-recovery-log
TL;DR:

  1. We expect pg_log to be trimmed to length 2 but got length 50 instead.
  2. We expect a dups length of 8 but got 7.

/a/ksirivad-2026-02-10_18:47:48-rados:standalone-main-distro-default-trial/43379/teuthology.log

2026-02-10T19:17:59.393 INFO:tasks.workunit.client.0.trial196.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:85: _common_test:  jq '.pg_log_t.log | length' td/osd-backfill-recovery-log/result.log
2026-02-10T19:17:59.404 INFO:tasks.workunit.client.0.trial196.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:85: _common_test:  LOGLEN=50
2026-02-10T19:17:59.404 INFO:tasks.workunit.client.0.trial196.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:86: _common_test:  '[' 50 '!=' 2 ']'
2026-02-10T19:17:59.404 INFO:tasks.workunit.client.0.trial196.stdout:FAILED: Wrong log length got 50 (expected 2)
2026-02-10T19:17:59.405 INFO:tasks.workunit.client.0.trial196.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:87: _common_test:  echo 'FAILED: Wrong log length got 50 (expected 2)'
2026-02-10T19:17:59.405 INFO:tasks.workunit.client.0.trial196.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:88: _common_test:  expr 0 + 1
2026-02-10T19:17:59.405 INFO:tasks.workunit.client.0.trial196.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:88: _common_test:  ERRORS=1
2026-02-10T19:17:59.406 INFO:tasks.workunit.client.0.trial196.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:90: _common_test:  jq '.pg_log_t.dups | length' td/osd-backfill-recovery-log/result.log
2026-02-10T19:17:59.416 INFO:tasks.workunit.client.0.trial196.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:90: _common_test:  DUPSLEN=7
2026-02-10T19:17:59.417 INFO:tasks.workunit.client.0.trial196.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:91: _common_test:  '[' 7 '!=' 8 ']'
2026-02-10T19:17:59.417 INFO:tasks.workunit.client.0.trial196.stdout:FAILED: Wrong dups length got 7 (expected 8)
2026-02-10T19:17:59.417 INFO:tasks.workunit.client.0.trial196.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:92: _common_test:  echo 'FAILED: Wrong dups length got 7 (expected 8)'
2026-02-10T19:17:59.417 INFO:tasks.workunit.client.0.trial196.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:93: _common_test:  expr 1 + 1
2026-02-10T19:17:59.418 INFO:tasks.workunit.client.0.trial196.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:93: _common_test:  ERRORS=2
2026-02-10T19:17:59.418 INFO:tasks.workunit.client.0.trial196.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:95: _common_test:  grep 'copy_up_to\|copy_after' td/osd-backfill-recovery-log/osd.0.log td/osd-backfill-recovery-log/osd.1.log td/osd-backfill-recovery-log/osd.2.log td/osd-backfill-recovery-log/osd.3.log td/osd-backfill-recovery-log/osd.4.log td/osd-backfill-recovery-log/osd.5.log
2026-02-10T19:17:59.426 INFO:tasks.workunit.client.0.trial196.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:96: _common_test:  rm -f td/osd-backfill-recovery-log/result.log
2026-02-10T19:17:59.427 INFO:tasks.workunit.client.0.trial196.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:97: _common_test:  '[' 2 '!=' 0 ']'
2026-02-10T19:17:59.427 INFO:tasks.workunit.client.0.trial196.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:98: _common_test:  echo TEST FAILED
2026-02-10T19:17:59.427 INFO:tasks.workunit.client.0.trial196.stdout:TEST FAILED
2026-02-10T19:17:59.428 INFO:tasks.workunit.client.0.trial196.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:99: _common_test:  return 1
2026-02-10T19:17:59.429 INFO:tasks.workunit.client.0.trial196.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:34: run:  return 1

Furthermore, there are times when osd-backfill-recovery-log passed but we failed in osd-backfill-space.sh:

/a/ksirivad-2026-02-10_18:47:48-rados:standalone-main-distro-default-trial/43380/teuthology.log

…ry-log

Add a 30-second polling loop after flush_pg_stats to wait for PG log
and duplicate entries to be trimmed to their expected sizes before
validation. This addresses timing issues where the test was inspecting
the objectstore before log trimming operations completed.

The loop polls 'ceph pg query' to check both log and dups lengths,
breaking when both reach or fall below expected thresholds. This
prevents spurious test failures on varied teuthology machines where log
trimming happens at different speed after recovery completes.

Solves intermittent failures where logs showed 50 entries instead of
expected 2, and dups showed 7 instead of expected 8.

Fixes: https://tracker.ceph.com/issues/74524

Signed-off-by: Kamoltat (Junior) Sirivadhna <ksirivad@redhat.com>
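
The trim wait described in this commit message can be sketched like so. `pg_log_len` and `pg_dups_len` are hypothetical stubs standing in for the real `ceph pg query ... | jq '...'` calls (the exact field paths are not shown here); the warning text mirrors the one the loop emits on timeout.

```shell
#!/usr/bin/env bash
# Sketch of the post-flush_pg_stats trim wait: poll until both the PG log
# and dups lengths fall to (or below) their expected sizes, with a capped
# timeout.  pg_log_len/pg_dups_len are hypothetical stubs for the real
# 'ceph pg query' + jq pipeline.
pg_log_len()  { echo 2; }   # stubbed: log already trimmed to 2 entries
pg_dups_len() { echo 8; }   # stubbed: dups already trimmed to 8 entries

wait_for_trim() {
    local loglen=$1 dupslen=$2 timeout=${3:-30}
    local i cur_log cur_dups
    for ((i = 0; i < timeout; i++)); do
        cur_log=$(pg_log_len)
        cur_dups=$(pg_dups_len)
        if [ "$cur_log" -le "$loglen" ] && [ "$cur_dups" -le "$dupslen" ]; then
            return 0   # both thresholds reached; safe to inspect objectstore
        fi
        sleep 1
    done
    echo "WARNING: Log trimming timeout after ${timeout}s - log=$cur_log (expected <=$loglen), dups=$cur_dups (expected <=$dupslen)"
    return 1
}

wait_for_trim 2 8 5 && echo "trimmed"
```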
@kamoltat kamoltat force-pushed the wip-ksirivad-fix-74524 branch from 75a3ba7 to 6a02f6b Compare February 10, 2026 21:10
@kamoltat kamoltat marked this pull request as draft February 11, 2026 15:48
@kamoltat
Member Author

For failed osd-backfill-recovery-log.sh tests, we sometimes ran into a situation where pg dump returns nothing because no PGs have been created yet:

2026-02-10T21:27:51.002 INFO:tasks.workunit.client.0.trial138.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:68: _common_test:  ceph pg dump pgs --format=json
2026-02-10T21:27:51.002 INFO:tasks.workunit.client.0.trial138.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:68: _common_test:  jq '.pg_stats[0].up[]'
2026-02-10T21:27:51.217 INFO:tasks.workunit.client.0.trial138.stderr:dumped pgs
2026-02-10T21:27:51.224 INFO:tasks.workunit.client.0.trial138.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd-backfill/osd-backfill-recovery-log.sh:68: _common_test:  ceph osd out
2026-02-10T21:27:51.390 INFO:tasks.workunit.client.0.trial138.stderr:Invalid command: saw 0 of ids(<string>)..., expected at least 1
2026-02-10T21:27:51.391 INFO:tasks.workunit.client.0.trial138.stderr:osd out <ids>... :  set osd(s) <id> [<id>...] out, or use <any|all> to set all osds out
2026-02-10T21:27:51.391 INFO:tasks.workunit.client.0.trial138.stderr:Error EINVAL: invalid command

Problem:
In osd-backfill-recovery-log.sh we were
trying to do `ceph osd out` on an empty value
when `ceph pg dump pgs` returned nothing.

Solution:

In ceph-helpers.sh we created wait_for_pg_data()
to wait up to 30 seconds (default) for `pg dump pgs`
to return some values before proceeding.

Fixes: https://tracker.ceph.com/issues/74524

Signed-off-by: Kamoltat (Junior) Sirivadhna <ksirivad@redhat.com>
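
The shape of a wait_for_pg_data-style helper can be sketched as below. `pg_dump_filter` is a hypothetical stub for `ceph pg dump pgs --format=json | jq "$filter"`, which can legitimately return nothing while no PGs exist yet; the caller bails out rather than feeding an empty value to `ceph osd out`.

```shell
#!/usr/bin/env bash
# Sketch of waiting for 'pg dump pgs' to return data before using it.
# pg_dump_filter is a hypothetical stub for the real ceph+jq pipeline.
pg_dump_filter() { echo "4 0"; }   # stubbed: pretend the PG is up on osds 4,0

wait_for_pg_data() {
    local filter=$1 timeout=${2:-30}
    local i out
    for ((i = 0; i < timeout; i++)); do
        out=$(pg_dump_filter "$filter")
        if [ -n "$out" ]; then
            echo "$out"
            return 0   # got data; hand it to the caller
        fi
        sleep 1
    done
    return 1           # timed out with no PG data
}

# Guards against calling 'ceph osd out' with an empty argument list:
up_osds=$(wait_for_pg_data '.pg_stats[0].up[]') || exit 1
echo "up OSDs: $up_osds"
```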
@kamoltat
Member Author

kamoltat commented Feb 11, 2026

Made some progress by patching osd-backfill-recovery-log.sh with wait for pg dumps:

https://pulpito.ceph.com/ksirivad-2026-02-11_19:21:20-rados:standalone-main-distro-default-trial/

Now we just have to figure out osd-backfill-space.sh

We are failing the test that simulates 1 OSD (the target OSD) receiving 2 backfills at the same time from 2 PGs that share the same target OSD as a replica. The target OSD is almost at capacity, and the expectation is that 1 PG should be active+clean (successful backfill) while the other should be backfill_toofull, since it fails to backfill due to the OSD being at full capacity.
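
The eventual fix (staggering the two backfill triggers with a sleep) can be sketched as follows. `trigger_backfill` is a hypothetical stub for whatever the test does to start a backfill against the shared target OSD; the real test uses a longer gap.

```shell
#!/usr/bin/env bash
# Sketch of staggering backfill requests so the first PG can consume space
# on the shared target OSD before the second reservation is evaluated.
# trigger_backfill is a hypothetical stub, not the real test helper.
trigger_backfill() { echo "backfill requested for $1"; }

for pg in 1.0 4.0; do
    trigger_backfill "$pg"
    sleep 1   # gap between requests; the real test sleeps longer
done
echo "both backfills requested"
```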

@kamoltat
Member Author

kamoltat commented Feb 12, 2026

Why osd.0 does not enter backfill_toofull for PG 4.0 and 1.0

In this test scenario, osd.0 is the replica for both:

PG 4.0 → acting [4,0] (primary 4)

PG 1.0 → acting [1,0] (primary 1)

Although osd.0 logs _tentative_full type backfillfull adjust_used 2400KiB for both PGs, it never rejects the backfill reservations. Instead, it:

  1. Receives MBackfillReserve(REQUEST) from the primary.
  2. Responds with MBackfillReserve(GRANT) for both PGs.
  3. Proceeds with recovery/backfill traffic (MOSDPGPush).
  4. Later receives MBackfillReserve(RELEASE) from the primaries.
  5. PGs converge to active+clean.

backfill_toofull is triggered on the primary when the replica rejects the reservation or is marked backfillfull/full in the OSDMap. In this run, osd.0 never denies the reservation, so the primary has no reason to transition the PG into backfill_toofull.

The _tentative_full message seen in osd.0's log is a local projected-space assessment, not a hard reservation denial or an ENOSPC condition. Since both reservations were granted, recovery proceeded (effectively serialized), and both PGs completed recovery successfully.

Conclusion

This behavior is expected given the current thresholds and reservation logic. To deterministically trigger backfill_toofull, the replica must:

  1. Be marked backfillfull/full in the OSDMap before reservation, or
  2. Explicitly deny MBackfillReserve due to space constraints.

In this test, neither condition occurred, so backfill_toofull was not entered.

@kamoltat kamoltat force-pushed the wip-ksirivad-fix-74524 branch 3 times, most recently from 6bc7dd3 to b90f596 Compare February 12, 2026 22:19
@kamoltat kamoltat changed the title qa/standalone: improve reliability of osd-backfill-prio test qa/standalone: improve reliability of osd-backfill test Feb 12, 2026
@kamoltat kamoltat changed the title qa/standalone: improve reliability of osd-backfill test qa/standalone: improve reliability of osd-backfill tests Feb 12, 2026
@kamoltat
Member Author

This PR is ready for review; the teuthology tests passed 20/20:
https://pulpito.ceph.com/ksirivad-2026-02-12_19:01:25-rados:standalone-main-distro-default-trial/

@kamoltat kamoltat marked this pull request as ready for review February 12, 2026 22:21
@kamoltat kamoltat self-assigned this Feb 12, 2026
Problem:
In TEST_backfill_test_sametarget
the test fails to keep 1 PG at backfill-toofull
and 1 PG at active+clean because not enough
time is given between one backfill and the other

In TEST_ec_backfill_multi,
similar to test_sametarget above, not enough
time between one backfill and the other

Solution:
Leave a larger time gap between one backfill request
and the next using the `sleep <duration>` bash command

Fixes: https://tracker.ceph.com/issues/74524

Signed-off-by: Kamoltat (Junior) Sirivadhna <ksirivad@redhat.com>
@kamoltat kamoltat force-pushed the wip-ksirivad-fix-74524 branch from b90f596 to 29fb42e Compare February 17, 2026 19:52
return 1
fi
count=$(expr $count + 1)
done

ACK.

echo "WARNING: Log trimming timeout after ${TIMEOUT}s - log=$current_log_len (expected <=$loglen), dups=$current_dups_len (expected <=$dupslen)"
break
fi
done

ACK.

ceph osd out $(ceph pg dump pgs --format=json | jq '.pg_stats[0].up[]')

# Wait for PG to be visible and mark out all OSDs for this pool
local pg_up_osds=$(wait_for_pg_data '.pg_stats[0].up[]') || return 1

ACK.

@sseshasa
Contributor

@rzarzynski rzarzynski merged commit 157336f into ceph:main Feb 23, 2026
13 checks passed

Labels

core · needs-qa · needs-squid-backport (PR needs a squid backport) · needs-tentacle-backport (PRs that need tentacle back-port) · tests
