
rgw/notify: fix crashes in lc due to reload of bucket. #56712

Merged
cbodley merged 1 commit into ceph:main from kchheda3:wip-notif-reload-bkt on Apr 8, 2024

Conversation

@kchheda3

@kchheda3 kchheda3 commented Apr 4, 2024

Fixes https://tracker.ceph.com/issues/64571


@github-actions github-actions bot added the rgw label Apr 4, 2024
@kchheda3 kchheda3 changed the title wip-notif-reload-bkt rgw/notify: fix crashes in lc due to reload of bucket. Apr 4, 2024
@kchheda3 kchheda3 requested a review from cbodley April 4, 2024 18:36
@kchheda3 kchheda3 self-assigned this Apr 4, 2024
@kchheda3 kchheda3 marked this pull request as ready for review April 4, 2024 18:36
@kchheda3 kchheda3 requested a review from a team as a code owner April 4, 2024 18:36

@cbodley cbodley left a comment

looks good to me. planning to run the rgw/lifecycle job against this several times

@mattbenjamin

I support taking immediately if it works and is safe, but @cbodley @yuvalif @dang do we understand why the original logic was unsafe?

@cbodley

cbodley commented Apr 4, 2024

@mattbenjamin that remains a mystery. my only hypothesis from #55657 (comment):

maybe the corruption happens when we load attrs on top of existing attrs?

but i don't think anyone is actively investigating that

As part of PR #55657, publish_reserve reloaded the bucket to ensure bucket_attrs were loaded. For lc events, where the bucket attrs were already loaded, this reload caused crashes with no obvious root cause. To avoid the crashes, remove the bucket reload from publish_reserve and put the onus on callers to load the bucket before calling publish_reserve.

Signed-off-by: kchheda3 <kchheda3@bloomberg.net>
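
The contract described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the actual RGW code: the `Bucket` struct, its `load()` method, the `notif-config` attribute key, and the two caller functions are all hypothetical stand-ins, and rejecting an unloaded bucket with `-1` is an assumption made only to keep the sketch checkable.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a bucket handle; NOT the real rgw::sal::Bucket API.
struct Bucket {
  std::map<std::string, std::string> attrs;
  bool loaded = false;
  int load_count = 0;  // counts how often the backing store was hit

  // Simulates fetching bucket info and attrs from the backing store.
  int load() {
    attrs["notif-config"] = "topic-a";  // hypothetical attribute key and value
    loaded = true;
    ++load_count;
    return 0;
  }
};

// Sketch of the post-change contract: read the attrs, never reload them.
// Assumption: an unloaded bucket is rejected with -1; the real code may differ.
int publish_reserve(const Bucket& bucket) {
  if (!bucket.loaded) {
    return -1;
  }
  // Only inspect what the caller already loaded.
  return bucket.attrs.count("notif-config") > 0 ? 0 : -1;
}

// Caller whose bucket is already loaded (the lc case in the commit message):
// it calls publish_reserve directly, so no second load happens.
int notify_from_lc(const Bucket& bucket) {
  return publish_reserve(bucket);
}

// Caller that cannot assume a loaded bucket: load once, then reserve.
int notify_from_request(Bucket& bucket) {
  if (!bucket.loaded) {
    int r = bucket.load();
    if (r < 0) {
      return r;
    }
  }
  return publish_reserve(bucket);
}
```

In this sketch a bucket that was loaded before reaching `notify_from_lc` is never loaded a second time, which is the situation the commit message says was crashing when `publish_reserve` did the reload itself.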
@kchheda3 kchheda3 force-pushed the wip-notif-reload-bkt branch from 109ddc5 to fa5d370 Compare April 4, 2024 19:16
@mattbenjamin

ok

@cbodley

cbodley commented Apr 5, 2024

https://pulpito.ceph.com/?suite=rgw:lifecycle all green so far 👍 full rgw suite pending in https://pulpito.ceph.com/cbodley-2024-04-05_17:23:35-rgw-wip-64571-distro-default-smithi/

@yuvalif

yuvalif commented Apr 8, 2024

https://pulpito.ceph.com/?suite=rgw:lifecycle all green so far 👍 full rgw suite pending in https://pulpito.ceph.com/cbodley-2024-04-05_17:23:35-rgw-wip-64571-distro-default-smithi/

  • notification suite failed with a known issue: test_persistent_ps_s3_data_path_v2_migration
  • multisite crashed:
 ceph version 19.0.0-2758-gfcc393a5 (fcc393a59651a9bc9fe543803aaf84bee75cdf65) squid (dev)
 1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7fcd02506520]
 2: /lib/librados.so.2(+0x11c185) [0x7fcd03f99185]
 3: /lib/librados.so.2(+0x10f6e4) [0x7fcd03f8c6e4]
 4: (librados::v14_2_0::IoCtx::aio_operate(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, librados::v14_2_0::AioCompletion*, librados::v14_2_0::ObjectReadOperation*, int, ceph::buffer::v15_2_0::list*)+0x85) [0x7fcd03efcc35]
 5: (rgw_rados_operate(DoutPrefixProvider const*, librados::v14_2_0::IoCtx&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, librados::v14_2_0::ObjectReadOperation*, ceph::buffer::v15_2_0::list*, optional_yield, int, opentelemetry::v1::trace::SpanContext const*)+0x40c) [0x55fa7fbc739c]
 6: (rgw::cls::fifo::FIFO::list(DoutPrefixProvider const*, int, std::optional<std::basic_string_view<char, std::char_traits<char> > >, std::vector<rgw::cls::fifo::list_entry, std::allocator<rgw::cls::fifo::list_entry> >*, bool*, optional_yield)+0x5f9) [0x55fa7fe8c899]
 7: radosgw(+0xb6d26b) [0x55fa7fd9026b]
 8: (DataLogBackends::list(DoutPrefixProvider const*, int, int, std::vector<rgw_data_change_log_entry, std::allocator<rgw_data_change_log_entry> >&, std::basic_string_view<char, std::char_traits<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, bool*, optional_yield)+0x28f) [0x55fa7fd969df]
 9: (RGWOp_DATALog_List::execute(optional_yield)+0x4c3) [0x55fa7fe234c3]
 10: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, rgw::sal::Driver*, bool)+0xa3b) [0x55fa7f7ec8eb]
 11: (process_request(RGWProcessEnv const&, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWRestfulIO*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*, int*)+0x21d1) [0x55fa7f7ef3c1]
 12: radosgw(+0x10fdb27) [0x55fa80320b27]
 13: radosgw(+0x532794) [0x55fa7f755794]
 14: make_fcontext()
  • the other multisite test did not show any crash, but had 26 failures (which may indicate a crash)
13902:2024-04-05T20:13:49.010 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_object_sync ... FAIL
14264:2024-04-05T20:14:49.723 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_object_delete ... FAIL
18317:2024-04-05T20:19:47.785 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_multi_object_delete ... FAIL
19627:2024-04-05T20:21:10.350 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_versioned_object_incremental_sync ... FAIL
21647:2024-04-05T20:22:13.940 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_concurrent_versioned_object_incremental_sync ... FAIL
13335304:2024-04-05T21:33:32.638 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_bucket_sync_enable_right_after_disable ... FAIL
13335904:2024-04-05T21:34:08.471 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_bucket_sync_disable_enable ... FAIL
13336348:2024-04-05T21:34:31.202 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_multipart_object_sync ... FAIL
13336516:2024-04-05T21:34:51.669 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_encrypted_object_sync ... FAIL
13336932:2024-04-05T21:35:12.358 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_bucket_index_log_trim ... FAIL
13337370:2024-04-05T21:36:33.035 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_bucket_reshard_index_log_trim ... FAIL
13337786:2024-04-05T21:36:53.715 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_bucket_reshard_incremental ... FAIL
13338447:2024-04-05T21:39:52.088 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_bucket_reshard_full ... FAIL
13339184:2024-04-05T21:40:13.306 INFO:tasks.rgw_multisite_tests:Create several generations of objects, then run bucket sync ... FAIL
13339642:2024-04-05T21:40:34.031 INFO:tasks.rgw_multisite_tests:Create several generations of objects, trash them, then run bucket sync init ... FAIL
13340594:2024-04-05T21:44:32.376 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_bucket_full_sync_after_data_sync_init ... FAIL
13342455:2024-04-05T21:50:04.760 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_resharded_bucket_full_sync_after_data_sync_init ... FAIL
13342757:2024-04-05T21:50:25.490 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_bucket_incremental_sync_after_data_sync_init ... FAIL
13344562:2024-04-05T21:53:53.100 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_resharded_bucket_incremental_sync_latest_after_data_sync_init ... FAIL
13345018:2024-04-05T21:54:03.811 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_resharded_bucket_incremental_sync_oldest_after_data_sync_init ... FAIL
13346246:2024-04-05T21:58:37.122 INFO:tasks.rgw_multisite_tests:test_sync_flow_symmetrical_zonegroup_all: ... FAIL
13346864:2024-04-05T22:00:17.036 INFO:tasks.rgw_multisite_tests:test_sync_single_bucket: ... FAIL
13347514:2024-04-05T22:02:22.312 INFO:tasks.rgw_multisite_tests:test_sync_different_buckets: ... FAIL
13348290:2024-04-05T22:04:27.786 INFO:tasks.rgw_multisite_tests:test_sync_multiple_buckets_to_single: ... FAIL
13348892:2024-04-05T22:06:33.308 INFO:tasks.rgw_multisite_tests:test_sync_single_bucket_to_multiple: ... FAIL
13349088:2024-04-05T22:06:38.857 INFO:tasks.rgw_multisite_tests:rgw_multi.tests.test_topic_notification_sync ... FAIL
26656255:2024-04-05T22:24:17.250 INFO:tasks.rgw_multisite_tests:FAIL: rgw_multi.tests.test_object_sync
26657302:2024-04-05T22:24:17.333 INFO:tasks.rgw_multisite_tests:FAIL: rgw_multi.tests.test_object_delete
26657681:2024-04-05T22:24:17.362 INFO:tasks.rgw_multisite_tests:FAIL: rgw_multi.tests.test_multi_object_delete
26661749:2024-04-05T22:24:17.684 INFO:tasks.rgw_multisite_tests:FAIL: rgw_multi.tests.test_versioned_object_incremental_sync
26663076:2024-04-05T22:24:17.786 INFO:tasks.rgw_multisite_tests:FAIL: rgw_multi.tests.test_concurrent_versioned_object_incremental_sync
26665113:2024-04-05T22:24:17.946 INFO:tasks.rgw_multisite_tests:FAIL: rgw_multi.tests.test_bucket_sync_enable_right_after_disable
26665739:2024-04-05T22:24:17.996 INFO:tasks.rgw_multisite_tests:FAIL: rgw_multi.tests.test_bucket_sync_disable_enable
26666356:2024-04-05T22:24:18.046 INFO:tasks.rgw_multisite_tests:FAIL: rgw_multi.tests.test_multipart_object_sync
26666817:2024-04-05T22:24:18.396 INFO:tasks.rgw_multisite_tests:FAIL: rgw_multi.tests.test_encrypted_object_sync
26666996:2024-04-05T22:24:18.411 INFO:tasks.rgw_multisite_tests:FAIL: rgw_multi.tests.test_bucket_index_log_trim
26667431:2024-04-05T22:24:18.445 INFO:tasks.rgw_multisite_tests:FAIL: rgw_multi.tests.test_bucket_reshard_index_log_trim
26667888:2024-04-05T22:24:18.481 INFO:tasks.rgw_multisite_tests:FAIL: rgw_multi.tests.test_bucket_reshard_incremental
26668321:2024-04-05T22:24:18.515 INFO:tasks.rgw_multisite_tests:FAIL: rgw_multi.tests.test_bucket_reshard_full
26668999:2024-04-05T22:24:18.568 INFO:tasks.rgw_multisite_tests:FAIL: Create several generations of objects, then run bucket sync
26669690:2024-04-05T22:24:18.624 INFO:tasks.rgw_multisite_tests:FAIL: Create several generations of objects, trash them, then run bucket sync init
26670165:2024-04-05T22:24:18.663 INFO:tasks.rgw_multisite_tests:FAIL: rgw_multi.tests.test_bucket_full_sync_after_data_sync_init
26670705:2024-04-05T22:24:18.705 INFO:tasks.rgw_multisite_tests:FAIL: rgw_multi.tests.test_resharded_bucket_full_sync_after_data_sync_init
26672583:2024-04-05T22:24:18.853 INFO:tasks.rgw_multisite_tests:FAIL: rgw_multi.tests.test_bucket_incremental_sync_after_data_sync_init
26672902:2024-04-05T22:24:18.883 INFO:tasks.rgw_multisite_tests:FAIL: rgw_multi.tests.test_resharded_bucket_incremental_sync_latest_after_data_sync_init
26674724:2024-04-05T22:24:19.028 INFO:tasks.rgw_multisite_tests:FAIL: rgw_multi.tests.test_resharded_bucket_incremental_sync_oldest_after_data_sync_init
26675197:2024-04-05T22:24:19.065 INFO:tasks.rgw_multisite_tests:FAIL: test_sync_flow_symmetrical_zonegroup_all:
26675865:2024-04-05T22:24:19.119 INFO:tasks.rgw_multisite_tests:FAIL: test_sync_single_bucket:
26676499:2024-04-05T22:24:19.168 INFO:tasks.rgw_multisite_tests:FAIL: test_sync_different_buckets:
26677167:2024-04-05T22:24:19.221 INFO:tasks.rgw_multisite_tests:FAIL: test_sync_multiple_buckets_to_single:
26677961:2024-04-05T22:24:19.287 INFO:tasks.rgw_multisite_tests:FAIL: test_sync_single_bucket_to_multiple:
26678581:2024-04-05T22:24:19.337 INFO:tasks.rgw_multisite_tests:FAIL: rgw_multi.tests.test_topic_notification_sync
26678776:2024-04-05T22:24:19.352 INFO:tasks.rgw_multisite_tests:FAILED (SKIP=18, errors=4, failures=26)

@cbodley cbodley merged commit 4f58bb7 into ceph:main Apr 8, 2024
@kchheda3 kchheda3 deleted the wip-notif-reload-bkt branch April 8, 2024 19:21