
qa/rgw/multisite: remove boto2 BotoJSONEncoder #67486

Merged

cbodley merged 3 commits into ceph:main from cbodley:wip-qa-rgw-multisite-boto2-decoder on Feb 26, 2026

Conversation

@cbodley (Contributor) commented Feb 24, 2026

removes leftover boto2 stuff to resolve run-tox-qa error:

> tasks/rgw_multi/tools.py:52: error: Name "boto.s3.user.User" is not defined [name-defined]


@cbodley cbodley mentioned this pull request Feb 24, 2026
@cbodley cbodley requested review from a team and smanjara February 24, 2026 14:44
@smanjara (Contributor)

I was thinking of getting rid of this file entirely, because tests.py doesn't use it anymore after #67011, and the only method still in use is append_query_arg(), which could be moved into its user, tests_es.py

@cbodley cbodley force-pushed the wip-qa-rgw-multisite-boto2-decoder branch from 1298b35 to f111d67 Compare February 24, 2026 19:27
@cbodley (Contributor, Author) commented Feb 24, 2026

> I was thinking of getting rid of this file entirely, because tests.py doesn't use it anymore after #67011, and the only method still in use is append_query_arg(), which could be moved into its user, tests_es.py

good point. how do these new commits look?

  • qa/rgw/multisite: move append_query_arg() to zone_es.py
  • qa/rgw/multisite: remove unused assert_raises() and tools.py

@cbodley cbodley force-pushed the wip-qa-rgw-multisite-boto2-decoder branch from f111d67 to a92f3f5 Compare February 24, 2026 19:29
@cbodley (Contributor, Author) commented Feb 24, 2026

i think something's busted with multisite tests on main.. i see a bunch of tests failing in https://qa-proxy.ceph.com/teuthology/cbodley-2026-02-24_19:39:04-rgw:multisite-wip-74591-distro-default-trial/68706/teuthology.log with:

> An error occurred (AccessDenied) when calling the CreateBucket operation

and then segfaults on shutdown

i checked another recent run in https://pulpito.ceph.com/nithyab-2026-02-24_12:43:35-rgw-wip-nbalacha-lua-74219-distro-default-trial/ and see the same in https://qa-proxy.ceph.com/teuthology/nithyab-2026-02-24_12:43:35-rgw-wip-nbalacha-lua-74219-distro-default-trial/67966/teuthology.log

@cbodley (Contributor, Author) commented Feb 24, 2026

but neither rgw/multisite job in https://pulpito.ceph.com/cbodley-2026-02-24_19:39:04-rgw:multisite-wip-74591-distro-default-trial/ showed python errors, so i don't think there's a regression in this pr

@cbodley (Contributor, Author) commented Feb 24, 2026

> An error occurred (AccessDenied) when calling the CreateBucket operation

oh, must be because #67083 hasn't merged yet

@cbodley (Contributor, Author) commented Feb 24, 2026

> The following tests FAILED:
> 227 - unittest_peeringstate (Failed)

reported in https://tracker.ceph.com/issues/75144

@cbodley (Contributor, Author) commented Feb 24, 2026

jenkins test make check arm64

@smanjara (Contributor)

> An error occurred (AccessDenied) when calling the CreateBucket operation
>
> oh, must be because #67083 hasn't merged yet

yeah, #66203 is blocked as well. it's currently being tested in wip-anrao3-testing

@smanjara (Contributor)

> and then segfaults on shutdown
>
> i checked another recent run in https://pulpito.ceph.com/nithyab-2026-02-24_12:43:35-rgw-wip-nbalacha-lua-74219-distro-default-trial/ and see the same in https://qa-proxy.ceph.com/teuthology/nithyab-2026-02-24_12:43:35-rgw-wip-nbalacha-lua-74219-distro-default-trial/67966/teuthology.log

just before the shutdown here:

2026-02-24T13:27:31.910 INFO:tasks.rgw.c1.client.1.trial076.stdout: 10: (ceph::common::RefCountedObject::put() const+0x115) [0x7ff640c65d05]
2026-02-24T13:27:31.910 INFO:tasks.rgw.c1.client.1.trial076.stdout: 11: radosgw(+0x170764b) [0x55e4814fe64b]
2026-02-24T13:27:31.910 INFO:tasks.rgw.c1.client.1.trial076.stdout: 12: radosgw(+0x6e4dd1) [0x55e4804dbdd1]
2026-02-24T13:27:31.910 INFO:tasks.rgw.c1.client.1.trial076.stdout: 13: (rgw::AppMain::~AppMain()+0x4d3) [0x55e4804e93f3]

I see:

2026-02-24T13:27:31.908 INFO:tasks.rgw.c1.client.1.trial076.stdout:radosgw: ./obj-x86_64-linux-gnu/boost/include/boost/intrusive/list.hpp:1311: boost::intrusive::list_impl<ValueTraits, SizeType, ConstantTimeSize, HeaderHolder>::iterator boost::intrusive::list_impl<ValueTraits, SizeType, ConstantTimeSize, HeaderHolder>::iterator_to(reference) [with ValueTraits = boost::intrusive::bhtraits<neorados::Notifier, boost::intrusive::list_node_traits<void*>, boost::intrusive::safe_link, ceph::async::service_tag, 1>; SizeType = long unsigned int; bool ConstantTimeSize = true; HeaderHolder = void; iterator = boost::intrusive::list_iterator<boost::intrusive::bhtraits<neorados::Notifier, boost::intrusive::list_node_traits<void*>, boost::intrusive::safe_link, ceph::async::service_tag, 1>, false>; reference = neorados::Notifier&]: Assertion `!node_algorithms::inited(this->priv_value_traits().to_node_ptr(value))' failed.

following class Notifier{} -> service_list_base_hook, it's using boost::intrusive::list_base_hook. during Notifier shutdown in https://github.com/ceph/ceph/blob/main/src/common/async/service.h#L49, we clear neoref and the Client is destroyed. then, during the Notifier destructor, we invoke svc.remove() (https://github.com/ceph/ceph/blob/main/src/common/async/service.h#L66). there we do check whether the entries list is empty, and only remove the entry if it isn't. but other entries could still be in the service list, so that condition alone might not be sufficient: we could be calling iterator_to() on an element that is no longer linked.

from the first line before the crash, it looks like we are using an intrusive hook with safe_link. should we check whether the entry is still linked using is_linked() before trying to erase it from the list?

@cbodley (Contributor, Author) commented Feb 25, 2026

jenkins test make check

@cbodley (Contributor, Author) commented Feb 25, 2026

jenkins test make check arm64

@cbodley (Contributor, Author) commented Feb 25, 2026

jenkins test make check arm64

@cbodley (Contributor, Author) commented Feb 25, 2026

jenkins test make check

Signed-off-by: Casey Bodley <cbodley@redhat.com>
removes leftover boto2 stuff to resolve `run-tox-qa` error:

> tasks/rgw_multi/tools.py:52: error: Name "boto.s3.user.User" is not defined [name-defined]

Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
@cbodley cbodley force-pushed the wip-qa-rgw-multisite-boto2-decoder branch from a92f3f5 to afd5c02 Compare February 25, 2026 19:51
@cbodley (Contributor, Author) commented Feb 26, 2026

jenkins test make check

@cbodley (Contributor, Author) commented Feb 26, 2026

jenkins test make check

@cbodley cbodley merged commit ee6f2ea into ceph:main Feb 26, 2026
13 checks passed
@smanjara (Contributor)

> and then segfaults on shutdown
>
> […]
>
> from the first line before the crash, it looks like we are using Intrusive hook with safe_link. should we check if the entry exists or not using is_linked() before trying to erase it from the list?

@cbodley is this segfault already fixed in some other PR or should I dig further into this?

@cbodley (Contributor, Author) commented Feb 26, 2026

> @cbodley is this segfault already fixed in some other PR or should I dig further into this?

tracked in https://tracker.ceph.com/issues/75164 and discussed in https://ceph-storage.slack.com/archives/C05LPHSKVPG/p1771994911452119 where Seena and Adam both have patches to propose

