
infernalis: rgw: swift API returns more than real object count and bytes used when retrieving account metadata#6635

Merged
oritwas merged 1 commit into ceph:infernalis from Abhishekvrshny:wip-13735-infernalis
Jun 7, 2016

Conversation

@Abhishekvrshny

Fixes: ceph#13140

Fix a bug where the swift account stat command returned a doubled object count and bytes used

Signed-off-by: Sangdi Xu <xu.sangdi@h3c.com>
(cherry picked from commit 66d19c7)
@loic-bot

@Abhishekvrshny
Author

@yehudasa this backport passed the rgw suite (http://pulpito.ceph.com/loic-2015-11-27_13:58:16-rgw-infernalis-backports---basic-multi/), do you think it can be merged?

@Abhishekvrshny
Author

@yehudasa gentle reminder.

ghost pushed a commit that referenced this pull request Feb 8, 2016
…t count and bytes used when retrieving account metadata

Reviewed-by: Loic Dachary <ldachary@redhat.com>
@ghost

ghost commented Feb 8, 2016

@yehudasa does this backport look good to merge ? It passed a run of the infernalis rgw suite ( see http://tracker.ceph.com/issues/13750#note-21 ) except one occurrence of a known bug (http://tracker.ceph.com/issues/14361).

@ghost ghost assigned yehudasa and unassigned Abhishekvrshny Feb 8, 2016
ghost pushed a commit that referenced this pull request Feb 10, 2016
…t count and bytes used when retrieving account metadata

Reviewed-by: Loic Dachary <ldachary@redhat.com>
@ghost ghost changed the title from "rgw: swift API returns more than real object count and bytes used when retrieving account metadata" to "infernalis: rgw: swift API returns more than real object count and bytes used when retrieving account metadata" Feb 19, 2016
@yehudasa
Member

@dachary ack

@oritwas oritwas merged commit 7b2c95d into ceph:infernalis Jun 7, 2016
Aequitosh added a commit to Aequitosh/ceph that referenced this pull request Sep 12, 2025
Currently, *all* MGRs collectively segfault on Ceph v19.2.3 running on
Debian Trixie if a client requests the removal of an RBD image from
the RBD trash (ceph#6635 [0]).

After a lot of investigation, the cause of this still isn't clear to
me; the most likely culprits are some internal changes to Python
sub-interpreters that happened between Python versions 3.12 and 3.13.

What leads me to this conclusion is the following:
 1. A user on our forum noted [1] that the issue disappeared as soon
    as they set up a Ceph MGR inside a Debian Bookworm VM. Bookworm
    has Python version 3.11, which is the version before any
    substantial changes to sub-interpreters [2][3] were made.

 2. There is an upstream issue [4] regarding another segfault during
    MGR startup. The author concluded that this problem is related to
    sub-interpreters and opened another issue [5] on Python's issue
    tracker that goes into more detail.

    Even though this is for a completely different code path, it shows
    that issues related to sub-interpreters are popping up elsewhere
    at the very least.

 3. The segfault happens *inside* the Python interpreter:
    #0  0x000078e04d89e95c __pthread_kill_implementation (libc.so.6 + 0x9495c)
    #1  0x000078e04d849cc2 __GI_raise (libc.so.6 + 0x3fcc2)
    #2  0x00005ab95de92658 reraise_fatal (/usr/bin/ceph-mgr + 0x32d658)
    #3  0x000078e04d849df0 __restore_rt (libc.so.6 + 0x3fdf0)
    #4  0x000078e04ef598b0 _Py_dict_lookup (libpython3.13.so.1.0 + 0x1598b0)
    #5  0x000078e04efa1843 _PyDict_GetItemRef_KnownHash (libpython3.13.so.1.0 + 0x1a1843)
    #6  0x000078e04efa1af5 _PyType_LookupRef (libpython3.13.so.1.0 + 0x1a1af5)
    #7  0x000078e04efa216b _Py_type_getattro_impl (libpython3.13.so.1.0 + 0x1a216b)
    #8  0x000078e04ef6f60d PyObject_GetAttr (libpython3.13.so.1.0 + 0x16f60d)
    #9  0x000078e04f043f20 _PyEval_EvalFrameDefault (libpython3.13.so.1.0 + 0x243f20)
    #10 0x000078e04ef109dd _PyObject_VectorcallTstate (libpython3.13.so.1.0 + 0x1109dd)
    #11 0x000078e04f1d3442 _PyObject_VectorcallTstate (libpython3.13.so.1.0 + 0x3d3442)
    #12 0x000078e03b74ffed __pyx_f_3rbd_progress_callback (rbd.cpython-313-x86_64-linux-gnu.so + 0xacfed)
    #13 0x000078e03afcc8af _ZN6librbd19AsyncObjectThrottleINS_8ImageCtxEE13start_next_opEv (librbd.so.1 + 0x3cc8af)
    #14 0x000078e03afccfed _ZN6librbd19AsyncObjectThrottleINS_8ImageCtxEE9start_opsEm (librbd.so.1 + 0x3ccfed)
    #15 0x000078e03afafec6 _ZN6librbd9operation11TrimRequestINS_8ImageCtxEE19send_remove_objectsEv (librbd.so.1 + 0x3afec6)
    #16 0x000078e03afb0560 _ZN6librbd9operation11TrimRequestINS_8ImageCtxEE19send_copyup_objectsEv (librbd.so.1 + 0x3b0560)
    #17 0x000078e03afb2e16 _ZN6librbd9operation11TrimRequestINS_8ImageCtxEE15should_completeEi (librbd.so.1 + 0x3b2e16)
    #18 0x000078e03afae379 _ZN6librbd12AsyncRequestINS_8ImageCtxEE8completeEi (librbd.so.1 + 0x3ae379)
    #19 0x000078e03ada8c70 _ZN7Context8completeEi (librbd.so.1 + 0x1a8c70)
    #20 0x000078e03afcdb1e _ZN7Context8completeEi (librbd.so.1 + 0x3cdb1e)
    #21 0x000078e04d6e4716 _ZN8librados14CB_AioCompleteclEv (librados.so.2 + 0xd2716)
    #22 0x000078e04d6e5705 _ZN5boost4asio6detail19scheduler_operation8completeEPvRKNS_6system10error_codeEm (librados.so.2 + 0xd3705)
    #23 0x000078e04d6e5f8a _ZN5boost4asio19asio_handler_invokeINS0_6detail23strand_executor_service7invokerIKNS0_10io_context19basic_executor_typeISaIvELm0EEEvEEEEvRT_z (librados.so.2 + 0xd3f8a)
    #24 0x000078e04d6fc598 _ZN5boost4asio6detail19scheduler_operation8completeEPvRKNS_6system10error_codeEm (librados.so.2 + 0xea598)
    #25 0x000078e04d6e9a71 _ZN5boost4asio6detail9scheduler3runERNS_6system10error_codeE (librados.so.2 + 0xd7a71)
    #26 0x000078e04d6fff63 _ZN5boost4asio10io_context3runEv (librados.so.2 + 0xedf63)
    #27 0x000078e04dae1224 n/a (libstdc++.so.6 + 0xe1224)
    #28 0x000078e04d89cb7b start_thread (libc.so.6 + 0x92b7b)
    #29 0x000078e04d91a7b8 __clone3 (libc.so.6 + 0x1107b8)

    Note that in frame #12, you can see that a "progress callback" is being
    called by librbd. This callback is a plain Python function that is
    passed down via Ceph's Python/C++ bindings for librbd [6].
    (I'd provide more stack traces for the other threads here, but
    they're rather massive.)

    Then, from frame #11 down to frame #4, the entire execution happens
    within the Python interpreter: this is just the callback being
    executed. The segfault happens at frame #4 in _Py_dict_lookup(),
    a private interpreter function that looks a key up in a `dict` [7].
    A function this fundamental should never fail, yet it does, which
    suggests that some internal interpreter state is corrupted at that
    point.
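
    To illustrate what frames #8 through #4 correspond to, here is a
    minimal pure-Python sketch (the `Progress` class is invented for the
    example): every attribute access resolves the name through the
    `__dict__` of each class in the type's MRO, which is exactly the
    dict lookup that segfaults in the trace above.

```python
# Pure-Python view of the attribute-lookup path in frames #8-#4:
# PyObject_GetAttr -> _Py_type_getattro_impl -> _PyType_LookupRef
# -> _PyDict_GetItemRef_KnownHash -> _Py_dict_lookup.
class Progress:
    def update(self, offset, total):
        return 0

p = Progress()

# `p.update` makes the interpreter walk type(p).__mro__ and perform a
# dict lookup in each class __dict__ until the name is found.
resolved = None
for klass in type(p).__mro__:
    if "update" in vars(klass):
        resolved = vars(klass)["update"]
        break

assert resolved is Progress.__dict__["update"]
assert p.update(0, 100) == 0
```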

Since it's incredibly hard to debug and actually figure out what the
*real* underlying issue is, simply disable that on_progress callback
instead. I just hope that this doesn't move the problem somewhere
else.

Unless I'm mistaken, there aren't any other callbacks that get passed
through C/C++ via Cython [8] like this, so this should hopefully
prevent any further SIGSEGVs until this is fixed upstream (somehow).
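
To make the shape of the workaround concrete, here is a small
self-contained sketch; `trash_remove` below is a hypothetical stand-in
for the real Cython binding, not Ceph's API. It shows a native-style
worker thread invoking an optional on_progress callback, and how
omitting the callback avoids the C-to-Python re-entry entirely.

```python
import threading

# Hypothetical stand-in for librbd's trim loop: it invokes the Python
# on_progress callback from a worker thread, mimicking the re-entry
# path (frame #12, __pyx_f_3rbd_progress_callback) that crashes.
def trash_remove(image_id, on_progress=None, _total=4):
    results = []

    def worker():
        for done in range(1, _total + 1):
            if on_progress is not None:
                # In the real binding, C -> Python re-entry happens here.
                on_progress(done, _total)
            results.append(done)

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return results

seen = []
trash_remove("img-1", on_progress=lambda offset, total: seen.append((offset, total)))
assert seen == [(1, 4), (2, 4), (3, 4), (4, 4)]

# The workaround: pass no callback, so the worker thread never
# re-enters the Python interpreter from native code.
assert trash_remove("img-2", on_progress=None) == [1, 2, 3, 4]
```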

Note that this bug was also reported upstream [9].

[0]: https://bugzilla.proxmox.com/show_bug.cgi?id=6635
[1]: https://forum.proxmox.com/threads/ceph-managers-seg-faulting-post-upgrade-8-9-upgrade.169363/post-796315
[2]: https://docs.python.org/3.12/whatsnew/3.12.html#pep-684-a-per-interpreter-gil
[3]: python/cpython#117953
[4]: https://tracker.ceph.com/issues/67696
[5]: python/cpython#138045
[6]: https://github.com/ceph/ceph/blob/c92aebb279828e9c3c1f5d24613efca272649e62/src/pybind/rbd/rbd.pyx#L878-L907
[7]: https://github.com/python/cpython/blob/282bd0fe98bf1c3432fd5a079ecf65f165a52587/Objects/dictobject.c#L1262-L1278
[8]: https://cython.org/
[9]: https://tracker.ceph.com/issues/72713

Fixes: ceph#6635
Signed-off-by: Max R. Carrara <m.carrara@proxmox.com>
ThomasLamprecht pushed a commit to ThomasLamprecht/ceph that referenced this pull request Sep 17, 2025
Link: https://lore.proxmox.com/20250910085244.123467-1-m.carrara@proxmox.com

4 participants