osd: Try other PGs when reservation failures occur#38853
osd: Try other PGs when reservation failures occur#38853dzafman merged 2 commits intoceph:masterfrom
Conversation
|
I removed the local reservation pre-check. This makes the extremely rare "local reserve failed" a normal scenario. These log lines shows that things are working as expected. |
ronen-fr
left a comment
There was a problem hiding this comment.
I would recommend not spreading the "local resource locking" handling outside of
PgScrubber::m_local_osd_resource
|
|
||
| [[nodiscard]] bool is_scrub_active() const final { return m_active; } | ||
|
|
||
| [[nodiscard]] bool get_reserve_failed() const final { return m_reserve_failed; } |
There was a problem hiding this comment.
Should be part of the m_local_osd_resource object.
Seems to me that the clearing API shouldn't exist.
There was a problem hiding this comment.
This flag represents a temporary state of the PG. When other scrubs finish, reservations can succeed. This flag is a way to cycle through all possible scrubs to try to find one that can run now. We have to keep trying in order, so after 1 cycle we clear all flags to start over.
|
Not sure I understand the idea: if to mark PGs that failed to reserve their replicas - good. And calling |
Most of the time if local reservation aren't available then osd/OSD.cc:7569 pre-check (doesn't grab any reservations) will fast out of OSD::sched_scrub(). In a very rare race where the last reservation is grabbed by a remote scrub before we try to get a local reservation then reserve_failed set in PG.cc:1364 is detected at osd/OSD.cc:7652 and we can just jump out of the loop. |
I still didn't get what we gain by setting the flag on local reservation failure, but will be happy to take this discussion offline. |
|
jenkins test make check |
1 similar comment
|
jenkins test make check |
| goto out; | ||
| } | ||
| // If this is set now we must have had a local reserve failure, so can't scrub anything right now | ||
| if (pg->get_reserve_failed()) { |
There was a problem hiding this comment.
@ronen-fr This check is why I mark a PG for local reservation failure too.
|
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved |
Fixes: https://tracker.ceph.com/issues/48843 Signed-off-by: David Zafman <dzafman@redhat.com>
Signed-off-by: David Zafman <dzafman@redhat.com>
|
jenkins test make check |
|
https://pulpito.ceph.com/nojha-2021-03-06_16:46:51-rados-wip-zafman-testing-distro-basic-smithi/ This run had an error occur 55(62) times which is being investigated, but blocked many tests. |
Checklist
Show available Jenkins commands
jenkins retest this pleasejenkins test classic perfjenkins test crimson perfjenkins test signedjenkins test make checkjenkins test make check arm64jenkins test submodulesjenkins test dashboardjenkins test apijenkins test docsjenkins render docsjenkins test ceph-volume alljenkins test ceph-volume tox