
rgw/rgw_rados: fix server side-copy orphans tail-objects #58213

Merged
cbodley merged 3 commits into ceph:main from benhanokh:ref_count on Jul 30, 2024

Conversation

@benhanokh
Contributor

@benhanokh benhanokh commented Jun 23, 2024

Server-side copy copies the manifest from SRC->DST after incrementing the ref_count on the SRC tail-objects. The problem was that it used the task-id as the ref_tag, which was lost after the ref increment. When the copied object was later deleted, the obj_tag was passed in as the ref_tag; since it differed from the tag used during ref_inc, the deref was ignored and the tail objects were never removed. This PR changes the ref_tag used for the ref increment to an MD5 of the bucket-name and the oid, which can be recreated when the copied object is deleted.

Fixes: https://tracker.ceph.com/issues/66286

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@benhanokh benhanokh self-assigned this Jun 23, 2024
@benhanokh benhanokh requested review from a team as code owners June 23, 2024 16:17
@benhanokh benhanokh requested a review from cbodley June 23, 2024 16:20
@github-actions github-actions bot added the tests label Jun 26, 2024
@benhanokh benhanokh requested a review from mattbenjamin June 26, 2024 13:06
@benhanokh
Contributor Author

jenkins test make check

@dang
Contributor

dang commented Jun 27, 2024

@benhanokh I posted #58320 that should fix the obj_ctx issue for copy_object(). Can you review and test it please?

@cbodley
Contributor

cbodley commented Jun 27, 2024

thanks @dang, i tested #58320 in a vstart cluster with:

$ s3cmd mb s3://testbucket
$ s3cmd put 128m.iso s3://testbucket
$ s3cmd cp s3://testbucket/128m.iso s3://testbucket/128m.bak
$ bin/radosgw-admin object stat --bucket testbucket --object 128m.iso
$ bin/radosgw-admin object stat --bucket testbucket --object 128m.bak

before these changes, the second object stat command output did not include "user.rgw.tail_tag". it now does, and that tag is different from the attr returned by 128m.iso as expected

$ s3cmd rm s3://testbucket/128m.iso
$ s3cmd rm s3://testbucket/128m.bak
$ bin/radosgw-admin gc process --include-all
$ bin/rados -p default.rgw.buckets.data ls

after deleting both copies and running gc, 'rados ls' shows no orphaned objects 👍

the test_rgw_orphan_list.sh changes in this pr look like a great way to validate the fix. @benhanokh if you're satisfied with Dan's approach, would you mind pulling that commit into this pr so we can get it through qa?

@benhanokh
Contributor Author

@cbodley, assuming you need the TAIL-TAG in the OBJ-COPY, Dan's approach will work better.
Do you think the ref-tag could be hashed to something smaller (like I do in my code), or is there a reason we need 64-byte string tags?

In any case, Dan's PR needs to add test cases for deleting all copies and making sure no orphans are left behind (the test cases can be taken from my PR)

@cbodley
Contributor

cbodley commented Jul 3, 2024

Do you think the ref-tag could be hashed to something smaller (like I do in my code), or is there a reason we need 64-byte string tags?

i think this would need further investigation. a hash based on bucket+object name might not do what we want in the case of overwrites:

upload object bucket/A
copy bucket/A -> bucket/B
copy bucket/A -> bucket/B

in that case, the tail objects from bucket/A would get two refs with the same bucket/B tag. but i think the intent is for each ref to have a different tag so that garbage collection is idempotent - if gc crashes in the middle of derefs, we'll restart and deref the same tail objects again. if we use a separate tag for each ref, then the second deref will be ignored (its ref count already went to 0)

@cbodley
Contributor

cbodley commented Jul 3, 2024

In any case, Dan's PR needs to add test cases for deleting all copies and making sure no orphans are left behind (the test cases can be taken from my PR)

@benhanokh i was thinking that you could take Dan's commit into this pr since we want your test cases and rados refcount command. but the opposite would also work - please coordinate with @dang

@benhanokh
Contributor Author

In any case, Dan's PR needs to add test cases for deleting all copies and making sure no orphans are left behind (the test cases can be taken from my PR)

@benhanokh i was thinking that you could take Dan's commit into this pr since we want your test cases and rados refcount command. but the opposite would also work - please coordinate with @dang

done

@Svelar
Member

Svelar commented Jul 9, 2024

jenkins test make check arm64

@Svelar
Member

Svelar commented Jul 9, 2024

jenkins test make check

@cbodley
Contributor

cbodley commented Jul 9, 2024

@benhanokh can you please remove your own changes from the branch (i.e. squash "rollback changes" into the earlier commits) and cherry-pick Dan's commit to preserve his commit title, authorship, and signoff?

@cbodley
Contributor

cbodley commented Jul 11, 2024

in the meantime, i ran this through the rgw suite: https://pulpito.ceph.com/cbodley-2024-07-10_21:34:12-rgw-wip-66286-distro-default-smithi/

in https://qa-proxy.ceph.com/teuthology/cbodley-2024-07-10_21:34:12-rgw-wip-66286-distro-default-smithi/7795944/teuthology.log, test_rgw_orphan_list.sh failed after an osd crashed with No space left on device

2024-07-11T00:43:54.835 INFO:tasks.ceph.osd.0.smithi117.stderr:2024-07-11T00:43:54.834+0000 7fcf95a40640 -1 bdev(0x559e0aec2e00 /var/lib/ceph/osd/ceph-0/block) _aio_thread got r=-28 ((28) No space left on device)
2024-07-11T00:43:54.836 INFO:tasks.ceph.osd.0.smithi117.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.0.0-4914-ga38a2372/rpm/el9/BUILD/ceph-19.0.0-4914-ga38a2372/src/blk/kernel/KernelDevice.cc: In function 'void KernelDevice::_aio_thread()' thread 7fcf95a40640 time 2024-07-11T00:43:54.835438+0000
2024-07-11T00:43:54.836 INFO:tasks.ceph.osd.0.smithi117.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.0.0-4914-ga38a2372/rpm/el9/BUILD/ceph-19.0.0-4914-ga38a2372/src/blk/kernel/KernelDevice.cc: 666: ceph_abort_msg("Unexpected IO error. This may suggest a hardware issue. Please check your kernel log!")
2024-07-11T00:43:54.836 INFO:tasks.ceph.osd.0.smithi117.stderr: ceph version 19.0.0-4914-ga38a2372 (a38a23721a6485d87d9a339a64f3539a1382b378) squid (dev)
2024-07-11T00:43:54.836 INFO:tasks.ceph.osd.0.smithi117.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xc6) [0x559e08ccef8b]

earlier in the log there were a bunch of s3cmd processes killed, i'm not sure whether that's expected or related:

2024-07-10T22:13:03.901 INFO:tasks.workunit.client.0.smithi117.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rgw/test_rgw_orphan_list.sh: line 158: 39232 Killed                  s3cmd --config=${s3config} put $local_file s3://${remote_bkt}/${remote_obj} --progress --multipart-chunk-size-mb=5 > $fifo
2024-07-10T22:14:15.379 INFO:tasks.workunit.client.0.smithi117.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rgw/test_rgw_orphan_list.sh: line 158: 39315 Killed                  s3cmd --config=${s3config} put $local_file s3://${remote_bkt}/${remote_obj} --progress --multipart-chunk-size-mb=5 > $fifo
2024-07-10T22:14:23.402 INFO:tasks.workunit.client.0.smithi117.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rgw/test_rgw_orphan_list.sh: line 158: 43339 Killed                  s3cmd --config=${s3config} put $local_file s3://${remote_bkt}/${remote_obj} --progress --multipart-chunk-size-mb=5 > $fifo

@benhanokh
Contributor Author

@cbodley,
The tests write "huge objects" of 5GB each.
My tests added 3x3 s3cmd puts, for a total of 45GB (in addition to the 30GB written by the previous 2 tests).
We perform a recursive delete after each test, which should limit the used space, but GC is only called after all the tests.

Maybe we should call GC after each test to limit the used space?

@cbodley
Contributor

cbodley commented Jul 11, 2024

Maybe we should call GC after each test to limit the used space?

sounds good

benhanokh and others added 2 commits July 15, 2024 06:56
Improve display of ref_count in the rados commandline utility

New test cases were added to detect behavior after server side copy in the following cases:
1) delete original only
2) delete destination only
3) delete original then delete destination (this will lead to orphaned tail-objects without the changes made in this PR)
4) delete destination then delete original (this will lead to orphaned tail-objects without the changes made in this PR)

Add call to GC between tests to help control the used disk space since we keep writing huge files of 5GB each
Signed-off-by: Gabriel BenHanokh <gbenhano@redhat.com>
Fixes: https://tracker.ceph.com/issues/66286
(Line added by Gabriel)

In RadosStore, the source and dest objects in the copy_object() call
used to share an obj_ctx.  When obj_ctx was removed from the SAL API,
they each got their own, but RGWRados::copy_obj() still assumed they
shared one.

Pass in each one separately, and use the correct one for further calls.

Signed-off-by: Daniel Gryniewicz <dang@fprintf.net>
Signed-off-by: Gabriel BenHanokh <gbenhano@redhat.com>
@benhanokh
Contributor Author

jenkins test make check

@cbodley
Contributor

cbodley commented Jul 16, 2024

https://pulpito.ceph.com/cbodley-2024-07-16_17:46:19-rgw-wip-66286-distro-default-smithi/

shows lots of s3test failures because the branch is missing some fixes from main:

2024-07-16T18:32:47.041 INFO:teuthology.orchestra.run.smithi096.stdout:=========================== short test summary info ============================
2024-07-16T18:32:47.041 INFO:teuthology.orchestra.run.smithi096.stdout:FAILED s3tests_boto3/functional/test_s3.py::test_multipart_single_get_part - ...
2024-07-16T18:32:47.041 INFO:teuthology.orchestra.run.smithi096.stdout:FAILED s3tests_boto3/functional/test_s3.py::test_object_checksum_sha256 - Key...
2024-07-16T18:32:47.041 INFO:teuthology.orchestra.run.smithi096.stdout:FAILED s3tests_boto3/functional/test_s3.py::test_multipart_checksum_sha256 - ...
2024-07-16T18:32:47.042 INFO:teuthology.orchestra.run.smithi096.stdout:FAILED s3tests_boto3/functional/test_s3.py::test_multipart_checksum_3parts - ...
2024-07-16T18:32:47.042 INFO:teuthology.orchestra.run.smithi096.stdout:FAILED s3tests_boto3/functional/test_s3.py::test_post_object_upload_checksum

those should go away with a rebase

but test_rgw_orphan_list.sh failed with what looks like ENOSPC again: http://qa-proxy.ceph.com/teuthology/cbodley-2024-07-16_17:46:19-rgw-wip-66286-distro-default-smithi/7804039/teuthology.log

@cbodley
Contributor

cbodley commented Jul 22, 2024

@ivancich regarding your question in https://tracker.ceph.com/issues/66286#note-6, we're trying to get these orphan test cases passing. they keep causing osds to fill up and crash with ENOSPC, despite the addition of gc process commands. i also noticed that the latest run of test_rgw_orphan_list.sh took over 3 hours before failing this way

can you think of other ways to reduce the storage requirements of this testing? does it need to operate at this scale to provide sufficient regression test coverage?

separately, i've noticed a lot of odd errors from the teuthology logs - any idea what's happening here?

2024-07-16T18:24:56.783 INFO:tasks.workunit.client.0.smithi122.stdout:Bucket 's3://incomplete-mp-bkt-1/' created
2024-07-16T18:25:05.867 INFO:tasks.workunit.client.0.smithi122.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rgw/test_rgw_orphan_list.sh: line 158: 38928 Killed                  s3cmd --config=${s3config} put $local_file s3://${remote_bkt}/${remote_obj} --progress --multipart-chunk-size-mb=5 > $fifo
2024-07-16T18:26:16.144 INFO:tasks.workunit.client.0.smithi122.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rgw/test_rgw_orphan_list.sh: line 158: 39011 Killed                  s3cmd --config=${s3config} put $local_file s3://${remote_bkt}/${remote_obj} --progress --multipart-chunk-size-mb=5 > $fifo
2024-07-16T18:26:24.154 INFO:tasks.workunit.client.0.smithi122.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rgw/test_rgw_orphan_list.sh: line 158: 43035 Killed                  s3cmd --config=${s3config} put $local_file s3://${remote_bkt}/${remote_obj} --progress --multipart-chunk-size-mb=5 > $fifo
2024-07-16T18:26:32.169 INFO:tasks.workunit.client.0.smithi122.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rgw/test_rgw_orphan_list.sh: line 158: 43050 Killed                  s3cmd --config=${s3config} put $local_file s3://${remote_bkt}/${remote_obj} --progress --multipart-chunk-size-mb=5 > $fifo
2024-07-16T18:26:40.180 INFO:tasks.workunit.client.0.smithi122.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rgw/test_rgw_orphan_list.sh: line 158: 43065 Killed                  s3cmd --config=${s3config} put $local_file s3://${remote_bkt}/${remote_obj} --progress --multipart-chunk-size-mb=5 > $fifo
2024-07-16T18:26:48.192 INFO:tasks.workunit.client.0.smithi122.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rgw/test_rgw_orphan_list.sh: line 158: 43080 Killed                  s3cmd --config=${s3config} put $local_file s3://${remote_bkt}/${remote_obj} --progress --multipart-chunk-size-mb=5 > $fifo
2024-07-16T18:26:56.210 INFO:tasks.workunit.client.0.smithi122.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rgw/test_rgw_orphan_list.sh: line 158: 43095 Killed                  s3cmd --config=${s3config} put $local_file s3://${remote_bkt}/${remote_obj} --progress --multipart-chunk-size-mb=5 > $fifo
2024-07-16T18:27:04.211 INFO:tasks.workunit.client.0.smithi122.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rgw/test_rgw_orphan_list.sh: line 158: 43110 Killed                  s3cmd --config=${s3config} put $local_file s3://${remote_bkt}/${remote_obj} --progress --multipart-chunk-size-mb=5 > $fifo
2024-07-16T18:27:12.231 INFO:tasks.workunit.client.0.smithi122.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rgw/test_rgw_orphan_list.sh: line 158: 43125 Killed                  s3cmd --config=${s3config} put $local_file s3://${remote_bkt}/${remote_obj} --progress --multipart-chunk-size-mb=5 > $fifo
2024-07-16T18:27:20.262 INFO:tasks.workunit.client.0.smithi122.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rgw/test_rgw_orphan_list.sh: line 158: 43140 Killed                  s3cmd --config=${s3config} put $local_file s3://${remote_bkt}/${remote_obj} --progress --multipart-chunk-size-mb=5 > $fifo
2024-07-16T18:27:28.264 INFO:tasks.workunit.client.0.smithi122.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rgw/test_rgw_orphan_list.sh: line 158: 43155 Killed                  s3cmd --config=${s3config} put $local_file s3://${remote_bkt}/${remote_obj} --progress --multipart-chunk-size-mb=5 > $fifo
2024-07-16T18:27:36.255 INFO:tasks.workunit.client.0.smithi122.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rgw/test_rgw_orphan_list.sh: line 158: 43170 Killed                  s3cmd --config=${s3config} put $local_file s3://${remote_bkt}/${remote_obj} --progress --multipart-chunk-size-mb=5 > $fifo

(this error is repeated ~1000 times)

@benhanokh
Contributor Author

benhanokh commented Jul 22, 2024

My machine can't test the latest code because it is CentOS 8.
I'm trying to get it moved to CentOS 9.
In the meanwhile I can add more calls to GC between tests 4 and 5, and between tests 6 and 7.

Maybe I can also reduce the huge-obj size from 5GB down to 1GB?

@cbodley
Contributor

cbodley commented Jul 22, 2024

Maybe I can also reduce the huge-obj size from 5GB down to 1GB?

is there a reason to use 'huge' objects, other than the fact that they contain at least one tail object? maybe 5MB would suffice there

@benhanokh
Contributor Author

This is how the original tests were written.
We can probably scale them down to 50MB, giving us a dozen tail objects without overloading the system.

…rom 5100MB -> 51MB

Signed-off-by: Gabriel BenHanokh <gbenhano@redhat.com>
@benhanokh
Contributor Author

benhanokh commented Jul 23, 2024

My machine is still waiting for an upgrade from CentOS 8, so I was unable to test anything.
This is a single-line change shrinking the object size used for the orphan-list test by a factor of 100, from 5100MB to 51MB.

I hope it will solve the out-of-space problem.
If not, we will need to wait until my machine is upgraded and I can run the tests locally.

obj_refcount oref;
auto p = bl.cbegin();
decode(oref, p);
for (auto itr = oref.refs.begin(); itr != oref.refs.end(); itr++) {
Contributor

cbegin() & cend()?

@cbodley
Contributor

cbodley commented Jul 23, 2024

yay, latest job succeeded with 0 potential orphans found out of a possible 4319 (0%).

i also noticed that the latest run of test_rgw_orphan_list.sh took over 3 hours before failing this way

better still, the total runtime was reduced to 0:26:55

@cbodley
Contributor

cbodley commented Jul 23, 2024

also scheduled the updated test case against packages from main to verify that it catches the refcount issue:

$ teuthology-suite -m smithi -s rgw:tools --ceph-repo https://github.com/ceph/ceph.git -c main -S 86385b1 --suite-repo https://github.com/ceph/ceph-ci.git --suite-branch wip-66286 -p 75

pending in https://pulpito.ceph.com/cbodley-2024-07-23_16:43:49-rgw:tools-main-distro-default-smithi/

@cbodley
Contributor

cbodley commented Jul 23, 2024

also scheduled the updated test case against packages from main to verify that it catches the refcount issue:

that did indeed detect the refcount orphans: 28 potential orphans found out of a possible 4347 (0%). 👍

@cbodley
Contributor

cbodley commented Jul 24, 2024

jenkins test api

@cbodley
Contributor

cbodley commented Jul 30, 2024

passed the rgw suite in https://pulpito.ceph.com/cbodley-2024-07-30_11:48:11-rgw-wip-cbodley-testing-distro-default-smithi/

we want this on squid by the next rc (which branches tomorrow) so i'm not blocking this on @ronen-fr's review

Comment on lines +2776 to +2777
string s(bl.c_str(), bl.length());
cout << s << std::endl;
Contributor

The last std::endl is superfluous and is breaking cephfs tests.
The std::endl did not exist earlier.
Is there any specific requirement to have an EOL here, or could we change it to std::flush?

Contributor

sorry! whatever it takes to fix the cephfs test. reverting to cout << s; should be fine

Contributor

looks like the rados suite caught this too, see #59025

7 participants