Skip to content

crimson: PG backfill is not showing any progress#62760

Merged
Matan-B merged 1 commit intoceph:mainfrom
mohit84:pg_stuck_backfill
Apr 30, 2025
Merged

crimson: PG backfill is not showing any progress#62760
Matan-B merged 1 commit intoceph:mainfrom
mohit84:pg_stuck_backfill

Conversation

@mohit84
Copy link
Contributor

@mohit84 mohit84 commented Apr 10, 2025

There were two main issues due to that backfill not showing any progress

  1. The registry hash was not correctly handled in complete_to function.
    After completion of the tracked objects the object was not erased correctly
    from the registry.
  2. There was a race condition in wait_for_pushes and set_pushed. set_pushed is called
    before wait_for_pushes so it was not able to get a future to complete an operation.

Fixes: https://tracker.ceph.com/issues/70502
Credits: Radoslaw Zarzynski rzarzyns@redhat.com

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands

@mohit84 mohit84 requested a review from a team as a code owner April 10, 2025 08:15
@mohit84 mohit84 requested a review from rzarzynski April 10, 2025 08:16
@mohit84 mohit84 requested a review from Matan-B April 10, 2025 08:16
@mohit84
Copy link
Contributor Author

mohit84 commented Apr 10, 2025

jenkins test make check

@Matan-B Matan-B added this to Crimson Apr 10, 2025
@Matan-B Matan-B moved this to Needs QA in Crimson Apr 10, 2025
@Matan-B Matan-B moved this from Needs QA to Awaits review in Crimson Apr 10, 2025
@Matan-B Matan-B requested a review from xxhdx1985126 April 10, 2025 08:25
Copy link
Contributor

@Matan-B Matan-B left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch! Few comments:

  • Let's separate the two fixes to two individual commits as they are not related.

  • For the first fix, IIUC, the issue is that in op_stage_t::enqueued_push cases, the registry iterator was not incremented. Right?
    I'm not sure that I understand the "registry hash was not correctly handled" description.

  • For 2, I would suggest to audit the commit message since, generally it's safe to await a future even after it has been set. The issue here however seems to be that the shared_promise might have not yet been created.
    What do you think about:

`RecoveryBackend::pushes` map creates each shared_promise in `wait_for_pushes`
call. Meaning, there can be a situation where `set_pushed` is called due to handled
push reply (`handle_push_reply) *before* the shared_promise was even constructed.
With this fix, we ensure that first the shared_promise is created properly and only
then is waited upon.
  • nit: Another "Signed-off-by" should be used rather than "Credits"

  • I'll run this through the suite prior to the changes above since looks like this one could be tested as is.

@Matan-B
Copy link
Contributor

Matan-B commented Apr 14, 2025

@mohit84 mohit84 force-pushed the pg_stuck_backfill branch from cb3495e to 724e36e Compare April 14, 2025 12:31
@mohit84
Copy link
Contributor Author

mohit84 commented Apr 14, 2025

  • op_stage_t::enqueued_push

done

@Matan-B
Copy link
Contributor

Matan-B commented Apr 14, 2025

  • op_stage_t::enqueued_push

done

I can only see a single commit in the PR, the second one seems to miss now.
Again, I'm not sure I understand "were not erased correctly from the registry." - The commit is erasing the iterator in the same way.
The issue that was fixed is that the iterator was not incremented in op_stage_t::enqueued_push cases.

@mohit84
Copy link
Contributor Author

mohit84 commented Apr 14, 2025

  • op_stage_t::enqueued_push

done

I can only see a single commit in the PR, the second one seems to miss now. Again, I'm not sure I understand "were not erased correctly from the registry." - The commit is erasing the iterator in the same way. The issue that was fixed is that the iterator was not incremented in op_stage_t::enqueued_push cases.

Yes you are correct the commit is erasing the entries in same way but earlier iterator was not increased due to that entry
was not removed. Please let me know if you want to change it again.

Copy link
Contributor

@Matan-B Matan-B left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you are correct the commit is erasing the entries in same way but earlier iterator was not increased due to that entry
was not removed. Please let me know if you want to change it again.

This is not reflected in the commit message which states: "objects were not erased correctly" - Can you please amend the commit message accordingly?
Otherwise, lgtm!

@Matan-B Matan-B moved this from Awaits review to Needs QA in Crimson Apr 14, 2025
@mohit84
Copy link
Contributor Author

mohit84 commented Apr 15, 2025

jenkins test make check

@mohit84
Copy link
Contributor Author

mohit84 commented Apr 15, 2025

Yes you are correct the commit is erasing the entries in same way but earlier iterator was not increased due to that entry
was not removed. Please let me know if you want to change it again.

This is not reflected in the commit message which states: "objects were not erased correctly" - Can you please amend the commit message accordingly? Otherwise, lgtm!

done

@Matan-B
Copy link
Contributor

Matan-B commented Apr 15, 2025

jenkins test make check

@Matan-B
Copy link
Contributor

Matan-B commented Apr 16, 2025

I haven't seen this failure before, is it possibly related?
https://pulpito.ceph.com/matan-2025-04-16_06:54:17-crimson-rados-wip-matanb-crimson-only-testing-15-apr-62760-distro-crimson-smithi/8244620
osd.2/3

DEBUG 2025-04-16 07:21:22,345 [shard 2:main] osd -  pg_epoch 32 pg[3.0( v 26'448 (24'440,26'448] local-lis/les=31/32 n=56 ec=18/18 lis/c=31/23 les/c/f=32/24/0 sis=31) [3,2,1]/[3,2] backfill=[1] r=0 lpr=31 pi=[23,31)/1 crt=26'448 lcod 26'447 mlcod 0'0 active+undersized+degraded+remapped+backfilling  ObjectContextLoader::release_state: releasing obc ObjectContext(0x519000307a80, oid=3:053f75bb:::benchmark_data_smithi136_36603_object856:head, refcount=1), exists true oi 3:053f75bb:::benchmark_data_smithi136_36603_object856:head(24'440 client.4244.0:6856  s 65536 uv 440 alloc_hint [65536 8192 53])
DEBUG 2025-04-16 07:21:22,345 [shard 2:main] osd - enqueue_push:operator()
DEBUG 2025-04-16 07:21:22,345 [shard 2:main] osd - pg[3.0( v 26'448 (24'440,26'448] local-lis/les=31/32 n=56 ec=18/18 lis/c=31/23 les/c/f=32/24/0 sis=31) [3,2,1]/[3,2] backfill=[1] r=0 lpr=31 pi=[23,31)/1 crt=26'448 lcod 26'447 mlcod 0'0 active+undersized+degraded+remapped+backfilling  BackfillState::Waiting::react::ObjectPushed: Waiting::react() on ObjectPushed; evt.object=3:053f75bb:::benchmark_data_smithi136_36603_object856:head
DEBUG 2025-04-16 07:21:22,346 [shard 2:main] osd - pg[3.0( v 26'448 (24'440,26'448] local-lis/les=31/32 n=56 ec=18/18 lis/c=31/23 les/c/f=32/24/0 sis=31) [3,2,1]/[3,2] backfill=[1] r=0 lpr=31 pi=[23,31)/1 crt=26'448 lcod 26'447 mlcod 0'0 active+undersized+degraded+remapped+backfilling  BackfillState::ProgressTracker::complete_to: obj=3:053f75bb:::benchmark_data_smithi136_36603_object856:head
ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/20.0.0-1435-gae49aeb6/rpm/el9/BUILD/ceph-20.0.0-1435-gae49aeb6/src/crimson/osd/backfill_state.cc:667: void crimson::osd::BackfillState::ProgressTracker::complete_to(const hobject_t&, const pg_stat_t&, bool): Assertion `soid > new_last_backfill' failed.
Aborting on shard 2.

Also here https://pulpito.ceph.com/matan-2025-04-16_06:54:17-crimson-rados-wip-matanb-crimson-only-testing-15-apr-62760-distro-crimson-smithi/8244618

Copy link
Contributor

@Matan-B Matan-B left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See 82446108, 244618, 8244620 all fail with Assertion soid > new_last_backfill failed.
From https://pulpito.ceph.com/matan-2025-04-16_06:54:17-crimson-rados-wip-matanb-crimson-only-testing-15-apr-62760-distro-crimson-smithi/

@mohit84, can you please take a look? Thanks!

@Matan-B Matan-B moved this from Needs QA to In Progress in Crimson Apr 16, 2025
@Matan-B Matan-B self-requested a review April 27, 2025 07:07
auto &push_promise = it->second;
push_promise.set_value();
pushes.erase(it);
auto &push_promise = it->second;
Copy link
Contributor

@Matan-B Matan-B Apr 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going back to my first comment, as both maybe_push_shards and push_delete are updated and are the only users of set_pushed. Can we assert pushes.contains(shard) here? Is there any other reason to keep the if instead?

lgtm otherwise!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! If we assert it there is no need to also having the if right? we can use pushes.at

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

@mohit84 mohit84 force-pushed the pg_stuck_backfill branch from 0ad6275 to a6d48e9 Compare April 27, 2025 09:44
@mohit84
Copy link
Contributor Author

mohit84 commented Apr 27, 2025

jenkins test make check

@mohit84 mohit84 force-pushed the pg_stuck_backfill branch from a6d48e9 to ca82cba Compare April 27, 2025 10:23
@mohit84
Copy link
Contributor Author

mohit84 commented Apr 27, 2025

jenkins test make check

RecoveryBackend::pushes map creates each shared_promise
in wait_for_pushes call. There can be a situation where
set_pushed is called due to handled push reply (handle_push_reply)
before the shared_promise was even constructed due to backfill progress
is stuck.

Fixes: https://tracker.ceph.com/issues/70502
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
@mohit84 mohit84 force-pushed the pg_stuck_backfill branch from ca82cba to 435f065 Compare April 27, 2025 11:40
@mohit84
Copy link
Contributor Author

mohit84 commented Apr 27, 2025

jenkins test make check

@Matan-B Matan-B moved this from In Progress to Needs QA in Crimson Apr 27, 2025
@mohit84
Copy link
Contributor Author

mohit84 commented Apr 29, 2025

@Matan-B
Copy link
Contributor

Matan-B commented Apr 29, 2025

@Matan-B
Copy link
Contributor

Matan-B commented Apr 29, 2025

jenkins test make check

@Matan-B
Copy link
Contributor

Matan-B commented Apr 29, 2025

jenkins test api

@Matan-B
Copy link
Contributor

Matan-B commented Apr 29, 2025

Hey Mohit, can you please make sure no failure is related here:

https://pulpito.ceph.com/matan-2025-04-29_13:40:47-crimson-rados-wip-matanb-crimson-testing-april-29-distro-crimson-smithi/

Otherwise we can merge!

@Matan-B Matan-B moved this from Needs QA to Tested in Crimson Apr 29, 2025
@Matan-B
Copy link
Contributor

Matan-B commented Apr 30, 2025

jenkins test api

1 similar comment
@Matan-B
Copy link
Contributor

Matan-B commented Apr 30, 2025

jenkins test api

@Matan-B Matan-B merged commit da55c9e into ceph:main Apr 30, 2025
12 checks passed
@Matan-B Matan-B moved this from Tested to Merged in Crimson Apr 30, 2025
@Matan-B Matan-B moved this from Merged Pre Tentacle Freeze to Merged (Main) in Crimson May 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Merged (Main)

Development

Successfully merging this pull request may close these issues.

2 participants