rgw/cloud: Handle RGWRESTStreamS3PutObj initialization failures #56657
With the recent code added to handle connection errors (commit e200499bb3c5703862b92a4d7fb534d98601f1bf), RGWRESTStreamS3PutObj initialization could fail at times if there were any failed requests to the cloud endpoint within the CONN_STATUS_EXPIRE_SECS period. This fix handles such errors and aborts the transition/sync requests, which can be retried later by the LC/Sync worker threads.

Signed-off-by: Soumya Koduri <skoduri@redhat.com>
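The pattern described above (check the result of initialization, abort on failure, and let a worker retry later) can be sketched roughly as follows. This is an illustration only, not the actual patch: `StreamPutRequest` and `transition_object` are hypothetical stand-ins for RGWRESTStreamS3PutObj and its caller, and the `-EIO` return value is an assumption.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical stand-in for a streaming PUT request (not the real
// RGWRESTStreamS3PutObj). Errors are returned as negative errno values.
struct StreamPutRequest {
  bool endpoint_available;
  bool initialized = false;

  explicit StreamPutRequest(bool avail) : endpoint_available(avail) {}

  // Fails when no endpoint is currently usable (assumed to map to -EIO).
  int init() {
    if (!endpoint_available) {
      return -EIO;
    }
    initialized = true;
    return 0;
  }

  // Sending on a request that was never initialized is a programming error.
  int send() {
    assert(initialized);
    return 0;
  }
};

// One transition attempt. If init() fails, abort early and hand the error
// back so that the caller (for example an LC worker) can retry later.
int transition_object(bool endpoint_available) {
  StreamPutRequest req(endpoint_available);
  int ret = req.init();
  if (ret < 0) {
    return ret;  // abort: do not touch the uninitialized request
  }
  return req.send();
}
```

Calling `transition_object(false)` returns the error instead of reaching `send()`, which is the behaviour the fix aims for under these assumptions.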
@cbodley, following up on #53320 (comment): these changes prevent the crash. But I am wondering whether the failures at https://github.com/ceph/ceph/blob/main/src/rgw/driver/rados/rgw_lc_tier.cc#L1282, which ideally should be ignored, will now make the transition fail repeatedly if the error gets mapped to EIO. Can we make the CONN_STATUS_EXPIRE_SECS check conditional, so that it is skipped for cloud modules that deal with a single endpoint per connection?
looks correct to handle the case where we have no available endpoints
but i wouldn't expect to see any connection errors from the rgw/cloud-transition suite since we're not stopping/restarting any rgws there. i'm guessing that we still want to figure out which http error is getting mapped to EIO to cause this in the first place
(edit: normal/expected http errors shouldn't cause cloud transitions to fail)
I am not very sure, as this issue is not consistently reproducible in teuthology runs. But in my test environment I have seen the crash with op_ret=-125, which I think may be due to ECANCELED errors (version mismatch) while trying to recreate the target bucket.
My doubt is this: suppose the cloud endpoint returns EIO when we try to fetch the HEAD object (to check if the object is already present - https://github.com/ceph/ceph/blob/main/src/rgw/driver/rados/rgw_lc_tier.cc#L1282). Since that check was added as an optimization, we ideally ignore any error from that op and proceed to transition the object. But now the endpoint_status may fail that transition if it is tried within 2 sec, and this could happen repeatedly, right?
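The scenario in this comment can be modelled with a small sketch. It assumes that an EIO from the HEAD check marks the endpoint as failed and that the PUT initialization refuses to use the endpoint until the expiry window has passed. `EndpointStatus` and `try_transition` are hypothetical names, and the 2-second value is taken from the comment above rather than from the RGW source.

```cpp
#include <cassert>
#include <cerrno>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Assumed value, based on the "2 sec" mentioned in the discussion.
constexpr std::chrono::seconds CONN_STATUS_EXPIRE_SECS{2};

// Hypothetical failure tracker for a connection with a single endpoint.
class EndpointStatus {
  bool has_failure_ = false;
  Clock::time_point last_failure_{};

public:
  void record_failure(Clock::time_point now) {
    has_failure_ = true;
    last_failure_ = now;
  }

  // Usable if it never failed, or if the last failure has expired.
  bool usable(Clock::time_point now) const {
    return !has_failure_ || (now - last_failure_) >= CONN_STATUS_EXPIRE_SECS;
  }
};

// One transition attempt: an optional HEAD check, then the PUT.
// head_result is 0 on success or a negative errno.
int try_transition(EndpointStatus& ep, Clock::time_point now, int head_result) {
  assert(head_result <= 0);
  if (head_result == -EIO) {
    ep.record_failure(now);  // assumed: EIO marks the endpoint as failed
  }
  // The HEAD error itself is ignored, but the PUT initialization still
  // consults the endpoint status and fails inside the expiry window.
  if (!ep.usable(now)) {
    return -EIO;
  }
  return 0;
}
```

In this model a failed HEAD causes the PUT in the same attempt to fail, and any retry within the window fails as well. If every retry issues a HEAD that fails again, the window is renewed each time, which is the repeated failure being asked about.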
jenkins test make check arm64
@cbodley, can we merge this PR? The teuthology test and make check arm64 failures seem unrelated to this change.
Thanks!
Fixes: https://tracker.ceph.com/issues/65251