mgr/rbd_support: recover from rados client blocklisting #49742
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved.
Fixed them.
jenkins test make check
jenkins test make check arm64
Overall, I think there is room for improvement in the tests (probably OK to defer to another PR though):
@idryomov I accidentally requested a re-review. Please ignore.
Current tests appear to be stable with the following fixups:
... requests to be completed. Signed-off-by: Ramana Raja <rraja@redhat.com>
Signed-off-by: Ramana Raja <rraja@redhat.com>
In certain scenarios the OSDs were slow to process RBD requests. This led to the rbd_support module's RBD client not being able to gracefully hand over an RBD exclusive lock to another RBD client. After the condition persisted for some time, the other RBD client forcefully acquired the lock by blocklisting the rbd_support module's RBD client, and consequently blocklisted the module's RADOS client. The rbd_support module stopped working. To recover the module, the entire mgr service had to be restarted, which reloaded other mgr modules. Instead of recovering the rbd_support module from client blocklisting by being disruptive to other mgr modules, recover the module automatically without restarting the mgr service. When the client gets blocklisted, shut down the module's handlers and the blocklisted client, create a new RADOS client for the module, and start the new handlers. Fixes: https://tracker.ceph.com/issues/56724 Signed-off-by: Ramana Raja <rraja@redhat.com>
... after the module's RADOS client is blocklisted. Signed-off-by: Ramana Raja <rraja@redhat.com>
Created a tracker ticket for now: https://tracker.ceph.com/issues/59681
No related failures: https://pulpito.ceph.com/dis-2023-05-08_22:48:48-rbd-wip-dis-testing-distro-default-smithi/
This is with #49975 excluded in the last rerun -- it's causing "Exiting scrub checking -- not all pgs scrubbed." errors. Per @neha-ojha the plan is to introduce a more aggressive QoS profile for teuthology tests.
In certain scenarios the OSDs were slow to process RBD requests.
This led to the rbd_support module's RBD client not being able to
gracefully hand over an RBD exclusive lock to another RBD client.
After the condition persisted for some time, the other RBD client
forcefully acquired the lock by blocklisting the rbd_support module's
RBD client, and consequently blocklisted the module's RADOS client. The
rbd_support module stopped working. To recover the module, the entire
mgr service had to be restarted, which reloaded other mgr modules.
Instead of recovering the rbd_support module from client blocklisting
by being disruptive to other mgr modules, recover the module
automatically without restarting the mgr service. When the client gets
blocklisted, shut down the module's handlers and the blocklisted client,
create a new RADOS client for the module, and start the new handlers.
Fixes: https://tracker.ceph.com/issues/56724
Signed-off-by: Ramana Raja <rraja@redhat.com>
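
The recovery sequence described above (shut down the handlers and the blocklisted client, create a new client, start new handlers) can be sketched roughly as follows. This is only an illustration, not the code in this PR: `FakeClient`, `Handler`, `Module`, `ClientBlocklisted`, and the handler names are hypothetical stand-ins, and no real `rados` or mgr API is used.

```python
import threading


class ClientBlocklisted(Exception):
    """Hypothetical signal that the cluster has blocklisted our client."""


class FakeClient:
    """Stand-in for a RADOS client (assumption: not the real rados API)."""

    def __init__(self, client_id):
        self.client_id = client_id
        self.blocklisted = False
        self.open = True

    def do_request(self):
        # A blocklisted client can no longer talk to the cluster.
        if self.blocklisted:
            raise ClientBlocklisted(self.client_id)
        return "ok"

    def shutdown(self):
        self.open = False


class Handler:
    """Stand-in for one of the module's handlers; it shares the module's client."""

    def __init__(self, name, client):
        self.name = name
        self.client = client
        self.running = True

    def run_once(self):
        return self.client.do_request()

    def shutdown(self):
        self.running = False


class Module:
    # Hypothetical handler names, used only for illustration.
    HANDLER_NAMES = ("handler_a", "handler_b")

    def __init__(self):
        self._lock = threading.Lock()
        self._next_id = 0
        self._start()

    def _start(self):
        # Create a fresh client and fresh handlers bound to it.
        self._next_id += 1
        self.client = FakeClient(self._next_id)
        self.handlers = [Handler(n, self.client) for n in self.HANDLER_NAMES]

    def recover(self):
        # Tear down the old handlers and client, then start over in-process,
        # so nothing outside this module has to be restarted.
        with self._lock:
            for handler in self.handlers:
                handler.shutdown()
            self.client.shutdown()
            self._start()

    def serve_once(self):
        try:
            return [h.run_once() for h in self.handlers]
        except ClientBlocklisted:
            self.recover()
            return [h.run_once() for h in self.handlers]
```

In this sketch, marking `module.client.blocklisted = True` and then calling `module.serve_once()` triggers `recover()`, after which the handlers run against a new client object. How the real module detects blocklisting and sequences handler shutdown is not shown here.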