
rgw/cloud-restore [PART2] : Add Restore support from Glacier/Tape cloud endpoints#62713

Merged
soumyakoduri merged 6 commits into ceph:main from soumyakoduri:wip-skoduri-restore-glacier
Jul 7, 2025

Conversation


@soumyakoduri soumyakoduri commented Apr 7, 2025

This PR is a continuation of #61745. With this update, we now support restoring objects from Glacier-like cloud endpoints that have longer restore times — for example, AWS Glacier Flexible Retrieval with restore type: Standard.

Summary of changes

  1. Refactored the existing restore code to consolidate all restore processing into the rgw_restore* file/class.
  2. The new class can persist restore state to be processed asynchronously by a worker thread.
  3. The restore class methods are abstracted so they can be used by all SALs, similar to LC.
  4. For SAL_RADOS, FIFO is used to store and read restore request state.

Currently, this PR handles storing the state of restore requests sent to the cloud-glacier tier type, which need async processing.
The changes were tested with AWS Glacier Flexible Retrieval with tier_type Expedited and Standard.

TODO: (pending items)

  • Store the restore state of all regular requests too, so that they can be retried after RGW service restarts.
  • Further optimize the code to avoid using a serializer while processing restore entries.

Fixes: https://tracker.ceph.com/issues/70628
Signed-off-by: Soumya Koduri skoduri@redhat.com

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)


github-actions bot commented Apr 7, 2025

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@soumyakoduri soumyakoduri requested review from adamemerson, dang, mattbenjamin and thotz and removed request for a team April 7, 2025 17:09
int num_objs;
librados::IoCtx& ioctx;
using centries = std::vector<ceph::buffer::list>;
ceph::containers::tiny_vector<LazyFIFO> fifos;
Contributor Author

@adamemerson I used LazyFIFO (from rgw_log_backing.h) to store restore entries, similar to datalog. After rebasing I noticed that you updated LazyFIFO to use neorados. Do you suggest I update this RadosRestore to use neorados too, or do I need to create a new wrapper for the FIFO class?

@soumyakoduri soumyakoduri force-pushed the wip-skoduri-restore-glacier branch from 1616429 to b5b5c7d on June 7, 2025 17:12
@soumyakoduri soumyakoduri changed the title from "[WIP]rgw/cloud-restore [PART2] : Add Restore support from Glacier/Tape cloud endpoints" to "rgw/cloud-restore [PART2] : Add Restore support from Glacier/Tape cloud endpoints" on Jun 7, 2025
@soumyakoduri soumyakoduri force-pushed the wip-skoduri-restore-glacier branch from b5b5c7d to 985d81a on June 9, 2025 06:46
@soumyakoduri (Contributor Author)

jenkins test make check arm64

@soumyakoduri (Contributor Author)

jenkins test signed

@soumyakoduri (Contributor Author)

@mattbenjamin @adamemerson @dang @thotz .. This PR is now ready. It contains all the commits merged downstream so far, along with one additional commit (the top commit, [rgw/restore: Update to neorados FIFO routines]) that handles the changes needed to use the new neorados/FIFO routines upstream.

Please review.

for (auto i=0; i < num_objs; i++) {
std::unique_ptr<rgw::cls::fifo::FIFO> fifo_tmp;
ret = rgw::cls::fifo::FIFO::create(dpp, ioctx, obj_names[i], &fifo_tmp, y);
std::unique_ptr<fifo::FIFO> fifo_tmp;
Contributor

ret is no longer used in this function

Contributor Author

cleaned this up.


return 1;
}
return 1;
Contributor

for better readability, the `return 1` should move into the if condition

Contributor Author

but we cannot return true until all the shards are processed and found empty.

Contributor

ok

done:
lock.unlock();

return 0;
Contributor

are we ignoring ret values intentionally here?

Contributor Author

thanks..addressed it.

@thotz thotz left a comment

Some minor suggestions regarding the handling of ret values in various places

@adamemerson adamemerson self-assigned this Jun 10, 2025
@soumyakoduri soumyakoduri force-pushed the wip-skoduri-restore-glacier branch 3 times, most recently from a6b1efd to c4f1305 on June 11, 2025 20:04
@thotz thotz left a comment

LGTM

@soumyakoduri soumyakoduri force-pushed the wip-skoduri-restore-glacier branch 2 times, most recently from 342130f to aad3024 on June 27, 2025 18:10
github-actions bot

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@soumyakoduri soumyakoduri force-pushed the wip-skoduri-restore-glacier branch from aad3024 to 22e8eff on June 30, 2025 18:22
@soumyakoduri (Contributor Author)

@adamemerson @thotz .. I addressed the review comments and resolved the rebase conflicts. Kindly review.

@soumyakoduri soumyakoduri force-pushed the wip-skoduri-restore-glacier branch from 22e8eff to 6a8631d on July 4, 2025 06:03
@soumyakoduri (Contributor Author)

jenkins test docs

@soumyakoduri soumyakoduri force-pushed the wip-skoduri-restore-glacier branch from 6a8631d to d4dbb6f on July 4, 2025 07:21
@soumyakoduri (Contributor Author)

jenkins test make check

…cier/Tape endpoint

Restoration of objects from certain cloud services (like Glacier/Tape) can
take a significant amount of time (even days). Hence, store the state of such restore requests
and process them periodically.

Brief summary of changes

* Refactored the existing restore code to consolidate all restore processing into the rgw_restore* file/class

* RGWRestore class is defined to manage the restoration of objects.

* Lastly, for SAL_RADOS, FIFO is used to store and read restore entries.

Currently, this PR handles storing the state of restore requests sent to the cloud-glacier tier type, which need async processing.
The changes were tested with AWS Glacier Flexible Retrieval with tier_type Expedited and Standard.

Reviewed-by: Matt Benjamin <mbenjamin@redhat.com>
Reviewed-by: Adam Emerson <aemerson@redhat.com>
Reviewed-by: Jiffin Tony Thottan <thottanjiffin@gmail.com>
Reviewed-by: Daniel Gryniewicz <dang@redhat.com>

Signed-off-by: Soumya Koduri <skoduri@redhat.com>
In case adding a restore entry to the FIFO fails, reset the `restore_status`
of that object to "RestoreFailed" so that the restore can be
retried by the S3 user.

Reviewed-by: Adam Emerson <aemerson@redhat.com>
Reviewed-by: Jiffin Tony Thottan <thottanjiffin@gmail.com>

Signed-off-by: Soumya Koduri <skoduri@redhat.com>
In addition, added some more debug statements and did some code cleanup

Reviewed-by: Adam Emerson <aemerson@redhat.com>
Reviewed-by: Jiffin Tony Thottan <thottanjiffin@gmail.com>

Signed-off-by: Soumya Koduri <skoduri@redhat.com>
Reviewed-by: Adam Emerson <aemerson@redhat.com>
Reviewed-by: Matt Benjamin <mbenjamin@redhat.com>

Signed-off-by: Soumya Koduri <skoduri@redhat.com>
Use new neorados/FIFO routines to store restore state.

Note: the old librados ioctx is still retained, as it is needed
by RestoreRadosSerializer.

Signed-off-by: Soumya Koduri <skoduri@redhat.com>
Signed-off-by: Soumya Koduri <skoduri@redhat.com>
@soumyakoduri soumyakoduri force-pushed the wip-skoduri-restore-glacier branch from d4dbb6f to a981b4c on July 4, 2025 12:48

void encode(bufferlist& bl) const {
ENCODE_START(16, 1, bl);
ENCODE_START(17, 1, bl);
Contributor Author

@cbodley @adamemerson .. while trying to backport these changes to tentacle, I observed that the ENCODE/DECODE version for restore_pool will be 16 in that branch (same as in the downstream 8.1 codebase), as opposed to 17 in main. Would this cause any issue during a later upgrade from tentacle to the next upcoming release?

To resolve it, can we include "restore_pool" along with "dedup_pool" under the same ENCODE version, i.e., 16, here in this PR?

@cbodley cbodley (Contributor), Jul 4, 2025

thanks @soumyakoduri

upstream squid is on the v15 encoding and has fields up to group_pool: https://github.com/ceph/ceph/blob/squid/src/rgw/driver/rados/rgw_zone.h#L156-L184

i see that #64264 is pending backport to tentacle, adding v16 for dedup_pool (cc @benhanokh)

does 8.1 add dedup_pool for v16 too? if not, we'd probably need tentacle to include a v16 with only restore_pool to match 8.1, and bump the dedup_pool change to v17 so we can support upgrade from 8.1 to 9

@soumyakoduri soumyakoduri (Contributor Author), Jul 6, 2025

thanks @cbodley ...

yes, 8.1 doesn't have dedup_pool; it's just restore_pool in v16.
So shall I make changes to reverse the versions (i.e., v16 for restore_pool and v17 for dedup_pool) in main and backport the changes in the same order to tentacle?

Contributor Author

#64360 is the backport of this PR to tentacle..

Contributor Author

#64360 is the backport of this PR to tentacle, and #64361 fixes the versions in main

@soumyakoduri (Contributor Author)

jenkins make check arm64

@soumyakoduri (Contributor Author)

jenkins test make check arm64

@soumyakoduri soumyakoduri merged commit a79e02a into ceph:main Jul 7, 2025
13 checks passed
ivancich (Member) commented Jul 7, 2025

@soumyakoduri @adamemerson @thotz -- This PR seems to be causing testing failures.

https://qa-proxy.ceph.com/teuthology/anuchaithra-2025-07-04_16:46:39-rgw-wip-anrao2-testing-2025-07-04-1031-distro-default-smithi/8369939/teuthology.log

@soumyakoduri (Contributor Author)

@ivancich, as you mentioned in Slack, these tests were run before this PR was merged, so this PR has not caused any regressions. Moreover, the teuthology tests passed when run with this PR's changes (#62713 (comment)).

I am unsure at the moment why all the restore tests failed in the log you linked above. I have asked Anuchaithra to rerun the tests to see if they fail consistently; another PR in that testing branch may be causing the failures.

@soumyakoduri soumyakoduri deleted the wip-skoduri-restore-glacier branch March 6, 2026 09:09