crimson: PG backfill is not showing any progress#62760
Conversation
|
jenkins test make check |
There was a problem hiding this comment.
Great catch! Few comments:
-
Let's separate the two fixes to two individual commits as they are not related.
-
For the first fix, IIUC, the issue is that in
op_stage_t::enqueued_pushcases, the registry iterator was not incremented. Right?
I'm not sure that I understand the "registry hash was not correctly handled" description. -
For 2, I would suggest to audit the commit message since, generally it's safe to await a future even after it has been set. The issue here however seems to be that the shared_promise might have not yet been created.
What do you think about:
`RecoveryBackend::pushes` map creates each shared_promise in `wait_for_pushes`
call. Meaning, there can be a situation where `set_pushed` is called due to handled
push reply (`handle_push_reply) *before* the shared_promise was even constructed.
With this fix, we ensure that first the shared_promise is created properly and only
then is waited upon.
-
nit: Another "Signed-off-by" should be used rather than "Credits"
-
I'll run this through the suite prior to the changes above since looks like this one could be tested as is.
cb3495e to
724e36e
Compare
done |
I can only see a single commit in the PR, the second one seems to miss now. |
Yes you are correct the commit is erasing the entries in same way but earlier iterator was not increased due to that entry |
Matan-B
left a comment
There was a problem hiding this comment.
Yes you are correct the commit is erasing the entries in same way but earlier iterator was not increased due to that entry
was not removed. Please let me know if you want to change it again.
This is not reflected in the commit message which states: "objects were not erased correctly" - Can you please amend the commit message accordingly?
Otherwise, lgtm!
08dd40c to
07679a4
Compare
|
jenkins test make check |
done |
|
jenkins test make check |
|
I haven't seen this failure before, is it possibly related? |
There was a problem hiding this comment.
See 82446108, 244618, 8244620 all fail with Assertion soid > new_last_backfill failed.
From https://pulpito.ceph.com/matan-2025-04-16_06:54:17-crimson-rados-wip-matanb-crimson-only-testing-15-apr-62760-distro-crimson-smithi/
@mohit84, can you please take a look? Thanks!
src/crimson/osd/recovery_backend.h
Outdated
| auto &push_promise = it->second; | ||
| push_promise.set_value(); | ||
| pushes.erase(it); | ||
| auto &push_promise = it->second; |
There was a problem hiding this comment.
Going back to my first comment, as both maybe_push_shards and push_delete are updated and are the only users of set_pushed. Can we assert pushes.contains(shard) here? Is there any other reason to keep the if instead?
lgtm otherwise!
There was a problem hiding this comment.
Thanks! If we assert it there is no need to also having the if right? we can use pushes.at
0ad6275 to
a6d48e9
Compare
|
jenkins test make check |
a6d48e9 to
ca82cba
Compare
|
jenkins test make check |
RecoveryBackend::pushes map creates each shared_promise in wait_for_pushes call. There can be a situation where set_pushed is called due to handled push reply (handle_push_reply) before the shared_promise was even constructed due to backfill progress is stuck. Fixes: https://tracker.ceph.com/issues/70502 Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com> Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
ca82cba to
435f065
Compare
|
jenkins test make check |
|
The suite is finished at the location there are some jobs are failed but those are already known issues |
|
jenkins test make check |
|
jenkins test api |
|
Hey Mohit, can you please make sure no failure is related here: Otherwise we can merge! |
|
jenkins test api |
1 similar comment
|
jenkins test api |
I have checked total 4 jobs are failing and 3 of them are known issue and 1 is a new assert failing but not is related to this. for this i will open a tracker bug |
There were two main issues due to that backfill not showing any progress
After completion of the tracked objects the object was not erased correctly
from the registry.
before wait_for_pushes so it was not able to get a future to complete an operation.
Fixes: https://tracker.ceph.com/issues/70502
Credits: Radoslaw Zarzynski rzarzyns@redhat.com
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an
xbetween the brackets:[x]. Spaces and capitalization matter when checking off items this way.Checklist
Show available Jenkins commands
jenkins test classic perfJenkins Job | Jenkins Job Definitionjenkins test crimson perfJenkins Job | Jenkins Job Definitionjenkins test signedJenkins Job | Jenkins Job Definitionjenkins test make checkJenkins Job | Jenkins Job Definitionjenkins test make check arm64Jenkins Job | Jenkins Job Definitionjenkins test submodulesJenkins Job | Jenkins Job Definitionjenkins test dashboardJenkins Job | Jenkins Job Definitionjenkins test dashboard cephadmJenkins Job | Jenkins Job Definitionjenkins test apiJenkins Job | Jenkins Job Definitionjenkins test docsReadTheDocs | Github Workflow Definitionjenkins test ceph-volume allJenkins Jobs | Jenkins Jobs Definitionjenkins test windowsJenkins Job | Jenkins Job Definitionjenkins test rook e2eJenkins Job | Jenkins Job Definition