rgw/cloud: Handle RGWRESTStreamS3PutObj initialization failures #56657

Merged
cbodley merged 1 commit into ceph:main from soumyakoduri:wip-skoduri-cloud-trans
Apr 5, 2024

@soumyakoduri soumyakoduri commented Apr 3, 2024

With the recent code added to handle connection errors (commit e200499bb3c5703862b92a4d7fb534d98601f1bf), RGWRESTStreamS3PutObj initialization can fail if any request to the cloud endpoint has failed within the CONN_STATUS_EXPIRE_SECS period.

This fix handles such errors and aborts the transition/sync requests, which can be retried later by the LC/Sync worker threads.

Signed-off-by: Soumya Koduri <skoduri@redhat.com>
Fixes: https://tracker.ceph.com/issues/65251
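
The failure mode and the fix described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual Ceph code: `Conn`, `Endpoint`, `StreamPut`, `pick_endpoint` and `transition_object` are hypothetical names, `kConnStatusExpire` stands in for CONN_STATUS_EXPIRE_SECS with an assumed value of 2 seconds, and the use of -EIO for "no usable endpoint" is an assumption.

```cpp
#include <cerrno>
#include <chrono>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;

// Assumed stand-in for CONN_STATUS_EXPIRE_SECS (the thread mentions 2 sec).
constexpr std::chrono::seconds kConnStatusExpire{2};

// Hypothetical endpoint record: URL plus the time of its last failed request.
struct Endpoint {
  std::string url;
  std::optional<Clock::time_point> last_failure;  // empty: no failure recorded
};

// Hypothetical connection that only hands out endpoints with no recent failure.
class Conn {
 public:
  explicit Conn(std::vector<Endpoint> eps) : endpoints_(std::move(eps)) {}

  void mark_failed(std::size_t idx, Clock::time_point when) {
    endpoints_.at(idx).last_failure = when;
  }

  // Returns 0 and fills *url with the first usable endpoint, or -EIO
  // (assumed error code) if every endpoint failed within the expiry window.
  int pick_endpoint(Clock::time_point now, std::string* url) const {
    for (const auto& ep : endpoints_) {
      if (!ep.last_failure || now - *ep.last_failure >= kConnStatusExpire) {
        *url = ep.url;
        return 0;
      }
    }
    return -EIO;
  }

 private:
  std::vector<Endpoint> endpoints_;
};

// Hypothetical streaming PUT whose init() can fail before any data is sent.
class StreamPut {
 public:
  int init(const Conn& conn, Clock::time_point now) {
    return conn.pick_endpoint(now, &url_);
  }
  // Stub: performs no real I/O, only checks that init() selected an endpoint.
  int send() const { return url_.empty() ? -EINVAL : 0; }

 private:
  std::string url_;
};

// Caller side: check the init() result and abort instead of using a
// half-initialized request. A worker thread can retry the object later.
int transition_object(const Conn& conn, Clock::time_point now) {
  StreamPut put;
  const int ret = put.init(conn, now);
  if (ret < 0) {
    return ret;
  }
  return put.send();
}
```

In this sketch a retry issued after the expiry window has passed succeeds again, which is the behavior the description relies on when it defers to the LC/Sync worker threads.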

@soumyakoduri soumyakoduri requested a review from a team as a code owner April 3, 2024 10:42
@soumyakoduri soumyakoduri requested a review from cbodley April 3, 2024 10:42
@github-actions github-actions bot added the rgw label Apr 3, 2024

soumyakoduri commented Apr 3, 2024

@cbodley, following up on #53320 (comment): these changes prevent the crash. But I am wondering whether failures that should ideally be ignored (https://github.com/ceph/ceph/blob/main/src/rgw/driver/rados/rgw_lc_tier.cc#L1282) will now fail the transition repeatedly if the error gets mapped to EIO.

Can we make the CONN_STATUS_EXPIRE_SECS check conditional, so that it is skipped for cloud modules that deal with a single endpoint per connection?
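
One way the suggested condition could look, as a rough sketch only: `endpoint_usable` and `kConnStatusExpire` are hypothetical names (the latter standing in for CONN_STATUS_EXPIRE_SECS with an assumed value of 2 seconds), and nothing here reflects how Ceph actually implements the check.

```cpp
#include <chrono>
#include <cstddef>
#include <optional>

using Clock = std::chrono::steady_clock;

// Assumed stand-in for CONN_STATUS_EXPIRE_SECS.
constexpr std::chrono::seconds kConnStatusExpire{2};

// Hypothetical helper: decides whether an endpoint may be used right now.
bool endpoint_usable(std::size_t num_endpoints,
                     const std::optional<Clock::time_point>& last_failure,
                     Clock::time_point now) {
  // With a single endpoint there is nothing to fail over to, so the
  // recent-failure check is skipped and the request is always attempted.
  if (num_endpoints <= 1) {
    return true;
  }
  // Otherwise avoid endpoints that failed within the expiry window.
  return !last_failure || now - *last_failure >= kConnStatusExpire;
}
```

The design choice here is that skipping an endpoint only helps when another one can take the request, so a single-endpoint connection gains nothing from the expiry window.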

@cbodley cbodley left a comment

looks correct to handle the case where we have no available endpoints

but i wouldn't expect to see any connection errors from the rgw/cloud-transition suite since we're not stopping/restarting any rgws there. i'm guessing that we still want to figure out which http error is getting mapped to EIO to cause this in the first place

(edit: normal/expected http errors shouldn't cause cloud transitions to fail)

@soumyakoduri

> but i wouldn't expect to see any connection errors from the rgw/cloud-transition suite since we're not stopping/restarting any rgws there. i'm guessing that we still want to figure out which http error is getting mapped to EIO to cause this in the first place

I am not sure, as this issue is not consistently reproducible in teuthology runs. In my test environment, though, I have seen the crash with op_ret=-125, which I think may be due to ECANCELED errors (version mismatch) while trying to recreate the target bucket.
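
For reference, on Linux errno 125 is ECANCELED, so op_ret=-125 is consistent with that guess. A small sketch for decoding such negative return codes follows; `describe_op_ret` and `is_canceled` are hypothetical helpers written for this comment, not Ceph functions.

```cpp
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper: turns a negative errno-style return code into text.
std::string describe_op_ret(int op_ret) {
  if (op_ret >= 0) {
    return "success";
  }
  return std::strerror(-op_ret);
}

// Hypothetical helper: true when the code is the negated ECANCELED value.
bool is_canceled(int op_ret) {
  return op_ret == -ECANCELED;
}
```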

> (edit: normal/expected http errors shouldn't cause cloud transitions to fail)

My concern is this: suppose the cloud endpoint returns EIO when we send the HEAD request for the object (to check whether the object is already present: https://github.com/ceph/ceph/blob/main/src/rgw/driver/rados/rgw_lc_tier.cc#L1282). Since that check was added as an optimization, we ideally ignore any error from that op and proceed to transition the object. But now the endpoint_status check may fail that transition if it is tried within 2 sec, and this could happen repeatedly, right?
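
The scenario I have in mind can be modeled roughly like this. It is only a sketch of the concern, not the real code path: `EndpointStatus`, `attempt_transition` and `kExpire` are hypothetical, the 2 second window is the value assumed above, and the rule that only EIO gets recorded as an endpoint failure is also an assumption.

```cpp
#include <cerrno>
#include <chrono>
#include <optional>

using Clock = std::chrono::steady_clock;

// Assumed expiry window (stand-in for CONN_STATUS_EXPIRE_SECS).
constexpr std::chrono::seconds kExpire{2};

// Hypothetical per-endpoint status for a single-endpoint connection.
struct EndpointStatus {
  std::optional<Clock::time_point> last_failure;

  // Assumption: only EIO-mapped results are recorded as endpoint failures.
  void record(int ret, Clock::time_point now) {
    if (ret == -EIO) {
      last_failure = now;
    }
  }

  bool usable(Clock::time_point now) const {
    return !last_failure || now - *last_failure >= kExpire;
  }
};

// One transition attempt: a best-effort HEAD request followed by the PUT init.
// head_ret is the result of the HEAD request. The caller ignores it, but the
// failure it records can still block the PUT that follows immediately.
int attempt_transition(EndpointStatus& st, int head_ret, Clock::time_point now) {
  st.record(head_ret, now);
  if (!st.usable(now)) {
    return -EIO;  // PUT init finds no usable endpoint
  }
  return 0;       // PUT init succeeds
}
```

In this model, every attempt whose HEAD request returns EIO fails at the PUT init, no matter how long the worker waited before retrying, because the HEAD request refreshes the failure timestamp just before the PUT.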

@soumyakoduri

jenkins test make check arm64

@soumyakoduri

@cbodley, can we merge this PR? The teuthology test and make check arm64 failures seem unrelated to this change.

@cbodley cbodley merged commit 6f58861 into ceph:main Apr 5, 2024
@soumyakoduri

Thanks!

@soumyakoduri soumyakoduri deleted the wip-skoduri-cloud-trans branch March 6, 2026 09:12