Bug #73750

closed

rados/basic: Segmentation fault during neorados tests

Added by Laura Flores 4 months ago. Updated about 6 hours ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Tags (freeform):
Fixed In: v20.3.0-6282-g1330473ff4
Released In:
Upkeep Timestamp: 2026-03-20T21:51:33+00:00

Description

Description: rados:basic/{ceph clusters/{fixed-2} mon_election/connectivity msgr-failures/many msgr/async-v2only objectstore/{bluestore/{alloc$/{avl} base mem$/{low} onode-segment$/{none} write$/{random/{compr$/{yes$/{zlib}} random}}}} rados supported-random-distro$/{rpm_latest} tasks/rados_api_tests}

/a/yaarit-2025-11-06_20:06:52-rados:basic-wip-rocky10-branch-of-the-day-2025-11-05-1762369819-distro-default-smithi/8587283

2025-11-06T20:26:12.710 INFO:tasks.workunit.client.0.smithi032.stdout:                snapshots: [ RUN      ] NeoRadosSelfManagedSnaps.Rollback
2025-11-06T20:26:12.786 INFO:tasks.workunit.client.0.smithi032.stderr:bash: line 1: 41451 Segmentation fault      (core dumped) ceph_test_neorados_snapshots --gtest_output=xml:/home/ubuntu/cephtest/archive/unit_test_xml_report/neorados_snapshots.xml 2>&1
2025-11-06T20:26:12.786 INFO:tasks.workunit.client.0.smithi032.stderr:     41452 Done                    | tee ceph_test_neorados_snapshots.log
2025-11-06T20:26:12.786 INFO:tasks.workunit.client.0.smithi032.stderr:     41453 Done                    | sed "s/^/                snapshots: /" 
2025-11-06T20:26:14.943 INFO:tasks.workunit.client.0.smithi032.stdout:          read_operations: Running main() from gmock_main.cc
2025-11-06T20:26:14.943 INFO:tasks.workunit.client.0.smithi032.stdout:          read_operations: [==========] Running 15 tests from 1 test suite.
2025-11-06T20:26:14.943 INFO:tasks.workunit.client.0.smithi032.stdout:          read_operations: [----------] Global test environment set-up.
2025-11-06T20:26:14.943 INFO:tasks.workunit.client.0.smithi032.stdout:          read_operations: [----------] 15 tests from NeoRadosReadOps
2025-11-06T20:26:14.943 INFO:tasks.workunit.client.0.smithi032.stdout:          read_operations: [ RUN      ] NeoRadosReadOps.SetOpFlags
2025-11-06T20:26:14.943 INFO:tasks.workunit.client.0.smithi032.stdout:          read_operations: [       OK ] NeoRadosReadOps.SetOpFlags (2311 ms)
2025-11-06T20:26:14.944 INFO:tasks.workunit.client.0.smithi032.stdout:          read_operations: [ RUN      ] NeoRadosReadOps.AssertExists
2025-11-06T20:26:14.944 INFO:tasks.workunit.client.0.smithi032.stdout:          read_operations: [       OK ] NeoRadosReadOps.AssertExists (3034 ms)
2025-11-06T20:26:14.944 INFO:tasks.workunit.client.0.smithi032.stdout:          read_operations: [ RUN      ] NeoRadosReadOps.AssertVersion
2025-11-06T20:26:14.944 INFO:tasks.workunit.client.0.smithi032.stdout:          read_operations: [       OK ] NeoRadosReadOps.AssertVersion (3011 ms)
2025-11-06T20:26:14.944 INFO:tasks.workunit.client.0.smithi032.stdout:          read_operations: [ RUN      ] NeoRadosReadOps.CmpXattr
2025-11-06T20:26:14.944 INFO:tasks.workunit.client.0.smithi032.stdout:          read_operations: [       OK ] NeoRadosReadOps.CmpXattr (2879 ms)
2025-11-06T20:26:14.944 INFO:tasks.workunit.client.0.smithi032.stdout:          read_operations: [ RUN      ] NeoRadosReadOps.Read
2025-11-06T20:26:14.944 INFO:tasks.workunit.client.0.smithi032.stdout:          read_operations: [       OK ] NeoRadosReadOps.Read (3047 ms)
2025-11-06T20:26:14.944 INFO:tasks.workunit.client.0.smithi032.stdout:          read_operations: [ RUN      ] NeoRadosReadOps.Checksum
2025-11-06T20:26:14.944 INFO:tasks.workunit.client.0.smithi032.stdout:          read_operations: [       OK ] NeoRadosReadOps.Checksum (2964 ms)
2025-11-06T20:26:14.944 INFO:tasks.workunit.client.0.smithi032.stdout:          read_operations: [ RUN      ] NeoRadosReadOps.RWOrderedRead
2025-11-06T20:26:14.945 INFO:tasks.workunit.client.0.smithi032.stdout:          read_operations: [       OK ] NeoRadosReadOps.RWOrderedRead (3064 ms)
2025-11-06T20:26:14.945 INFO:tasks.workunit.client.0.smithi032.stdout:          read_operations: [ RUN      ] NeoRadosReadOps.ShortRead
2025-11-06T20:26:14.945 INFO:tasks.workunit.client.0.smithi032.stdout:          read_operations: [       OK ] NeoRadosReadOps.ShortRead (3170 ms)
2025-11-06T20:26:14.945 INFO:tasks.workunit.client.0.smithi032.stdout:          read_operations: [ RUN      ] NeoRadosReadOps.Exec
2025-11-06T20:26:14.945 INFO:tasks.workunit.client.0.smithi032.stdout:          read_operations: [       OK ] NeoRadosReadOps.Exec (2833 ms)
2025-11-06T20:26:14.945 INFO:tasks.workunit.client.0.smithi032.stdout:          read_operations: [ RUN      ] NeoRadosReadOps.Stat
2025-11-06T20:26:14.945 INFO:tasks.workunit.client.0.smithi032.stdout:          read_operations: [       OK ] NeoRadosReadOps.Stat (3127 ms)
2025-11-06T20:26:14.945 INFO:tasks.workunit.client.0.smithi032.stdout:          read_operations: [ RUN      ] NeoRadosReadOps.Omap
2025-11-06T20:26:14.945 INFO:tasks.workunit.client.0.smithi032.stdout:          read_operations: [       OK ] NeoRadosReadOps.Omap (2954 ms)
2025-11-06T20:26:14.945 INFO:tasks.workunit.client.0.smithi032.stdout:          read_operations: [ RUN      ] NeoRadosReadOps.OmapNuls
2025-11-06T20:26:15.112 INFO:tasks.workunit.client.0.smithi032.stderr:bash: line 1: 41434 Segmentation fault      (core dumped) ceph_test_neorados_read_operations --gtest_output=xml:/home/ubuntu/cephtest/archive/unit_test_xml_report/neorados_read_operations.xml 2>&1
2025-11-06T20:26:15.113 INFO:tasks.workunit.client.0.smithi032.stderr:     41435 Done                    | tee ceph_test_neorados_read_operations.log
2025-11-06T20:26:15.113 INFO:tasks.workunit.client.0.smithi032.stderr:     41436 Done                    | sed "s/^/          read_operations: /" 

This occurred during initial testing on Rocky 10, which has not yet been officially added to the suite. The test uses Rocky 10 packages, so the failure could be related.

Two coredumps are available at /a/yaarit-2025-11-06_20:06:52-rados:basic-wip-rocky10-branch-of-the-day-2025-11-05-1762369819-distro-default-smithi/8587283/remote/smithi032/coredump.


Related issues 4 (1 open, 3 closed)

  • Related to RADOS - Bug #50371: Segmentation fault (core dumped) ceph_test_rados_api_watch_notify_pp (Resolved, Brad Hubbard)
  • Related to Ceph QA - QA Run #73749: wip-lflores-testing-4-2025-12-01-1527 (old wip-rocky10-branch-of-the-day-2025-11-05-1762369819) (QA Closed, Laura Flores)
  • Has duplicate rgw - Bug #73758: rocky 10: test_rgw_datalog.sh fails with segfault (Duplicate, Adam Emerson)
  • Blocks mgr - Bug #73930: ceph-mgr modules rely on deprecated python subinterpreters (New)

Actions #1

Updated by Laura Flores 4 months ago

  • Related to Bug #50371: Segmentation fault (core dumped) ceph_test_rados_api_watch_notify_pp added
Actions #2

Updated by Laura Flores 4 months ago

Added a similar bug we solved in the past; perhaps the same type of analysis can be used here.

Actions #3

Updated by Laura Flores 4 months ago

  • Description updated (diff)
Actions #4

Updated by Laura Flores 4 months ago

  • Description updated (diff)
Actions #5

Updated by Laura Flores 4 months ago · Edited

Steps to create an environment to analyze the coredump:

1. Visit https://quay.ceph.io/repository/ceph-ci/ceph

2. Click "Tags" 

3. Search for the branch name "wip-rocky10-branch-of-the-day-2025-11-05-1762369819" in "Filter Tags..." 

4. Select the "Fetch Tags" dropdown and copy the podman/docker command (I used podman)

5. Go to a machine that has docker/podman installed and has access to the core file (use "scp" to copy the core file onto the machine if needed)

6. Run:
$ podman pull quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:wip-rocky10-branch-of-the-day-2025-11-05-1762369819-rockylinux-10
$ podman run -it quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:wip-rocky10-branch-of-the-day-2025-11-05-1762369819-rockylinux-10

7. Create a repo for the packages:
$ vi /etc/yum.repos.d/ceph-dev.repo

8. Copy the following into the file:
[ceph]
name=ceph packages for $basearch
baseurl=https://1.chacra.ceph.com/r/ceph/wip-rocky10-branch-of-the-day-2025-11-05-1762369819/e96cbb7d09b133c651085a973a70c7d75650b6a0/rocky/10/flavors/default/$basearch
enabled=1
gpgcheck=0
type=rpm-md

[ceph-noarch]
name=ceph noarch packages
baseurl=https://1.chacra.ceph.com/r/ceph/wip-rocky10-branch-of-the-day-2025-11-05-1762369819/e96cbb7d09b133c651085a973a70c7d75650b6a0/rocky/10/flavors/default/noarch
enabled=1
gpgcheck=0
type=rpm-md

[ceph-source]
name=ceph source packages
baseurl=https://1.chacra.ceph.com/r/ceph/wip-rocky10-branch-of-the-day-2025-11-05-1762369819/e96cbb7d09b133c651085a973a70c7d75650b6a0/rocky/10/flavors/default/SRPMS
enabled=1
gpgcheck=0
type=rpm-md

9. Update package manager:
$ dnf update

10. Install executable and debuginfo:
$ dnf install ceph-test ceph-test-debuginfo

11. Run gdb:
$ gdb /usr/bin/ceph_test_neorados_snapshots -c 1762460772.41451.core -d /root/rpmbuild/BUILD/ceph-20.3.0-3904-ge96cbb7d
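
Steps 7 and 8 can also be done without an interactive editor. The following is a minimal sketch, assuming the chacra URL layout shown in step 8 holds for other branches and commits as well. The function names render_ceph_repo and repo_section are hypothetical helpers made up for this example and are not part of any Ceph tooling.

```shell
#!/bin/sh
# Sketch only: print a ceph-dev.repo file for one CI branch and commit.
# render_ceph_repo and repo_section are hypothetical helper names.

repo_section() {
  # $1 = section id, $2 = description, $3 = full baseurl
  printf '[%s]\nname=%s\nbaseurl=%s\nenabled=1\ngpgcheck=0\ntype=rpm-md\n\n' \
    "$1" "$2" "$3"
}

render_ceph_repo() {
  # $1 = branch name, $2 = full commit sha1
  # Assumption: same URL layout as the repo file in step 8.
  base="https://1.chacra.ceph.com/r/ceph/$1/$2/rocky/10/flavors/default"
  # $basearch stays literal so that dnf expands it, not the shell.
  repo_section ceph 'ceph packages for $basearch' "$base/\$basearch"
  repo_section ceph-noarch 'ceph noarch packages' "$base/noarch"
  repo_section ceph-source 'ceph source packages' "$base/SRPMS"
}
```

Inside the container, the output would be redirected into place, for example: render_ceph_repo wip-rocky10-branch-of-the-day-2025-11-05-1762369819 e96cbb7d09b133c651085a973a70c7d75650b6a0 > /etc/yum.repos.d/ceph-dev.repo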

For the first core file, I got the following backtrace. The local variable "ss" appears to contain corrupted memory:

$ gdb /usr/bin/ceph_test_neorados_snapshots -c 1762460772.41451.core -d /root/rpmbuild/BUILD/ceph-20.3.0-3904-ge96cbb7d
...
...
Core was generated by `ceph_test_neorados_snapshots --gtest_output=xml:/home/ubuntu/cephtest/archive/u'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000558710f219b4 in NeoRadosSelfManagedSnaps_Rollback_Test::CoTestBody(_ZN38NeoRadosSelfManagedSnaps_Rollback_Test10CoTestBodyEv.Frame *) (
    frame_ptr=0x55874cb9d470) at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/test/neorados/snapshots.cc:193
193      neorados::SnapSet ss;
[Current thread is 1 (LWP 41451)]
(gdb) bt
#0  0x0000558710f219b4 in NeoRadosSelfManagedSnaps_Rollback_Test::CoTestBody(_ZN38NeoRadosSelfManagedSnaps_Rollback_Test10CoTestBodyEv.Frame *) (
    frame_ptr=0x55874cb9d470) at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/test/neorados/snapshots.cc:193
#1  0x0000558710f408e7 in std::__n4861::coroutine_handle<void>::resume (this=<optimized out>) at /usr/include/c++/14/coroutine:137
#2  boost::asio::detail::awaitable_frame_base<boost::asio::any_io_executor>::resume (this=<optimized out>)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/redhat-linux-build/boost/include/boost/asio/impl/awaitable.hpp:499
#3  boost::asio::detail::awaitable_thread<boost::asio::any_io_executor>::pump (this=0x7fffca0ea7d0)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/redhat-linux-build/boost/include/boost/asio/impl/awaitable.hpp:769
#4  boost::asio::detail::awaitable_handler<boost::asio::any_io_executor, boost::system::error_code>::operator()<boost::system::error_code> (this=0x7fffca0ea7d0, 
    arg=...) at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/redhat-linux-build/boost/include/boost/asio/impl/use_awaitable.hpp:103
#5  0x0000558710f345c9 in boost::asio::detail::consign_handler<boost::asio::detail::awaitable_handler<boost::asio::any_io_executor, boost::system::error_code>, std::pair<boost::asio::executor_work_guard<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>, void, void>, std::shared_ptr<neorados::detail::Client> > >::operator()<boost::system::error_code> (this=0x7fffca0ea7d0)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/redhat-linux-build/boost/include/boost/asio/impl/consign.hpp:49
#6  boost::asio::detail::any_completion_handler_impl<boost::asio::detail::consign_handler<boost::asio::detail::awaitable_handler<boost::asio::any_io_executor, boost::system::error_code>, std::pair<boost::asio::executor_work_guard<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>, void, void>, std::shared_ptr<neorados::detail::Client> > > >::call<boost::system::error_code> (this=<optimized out>)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/redhat-linux-build/boost/include/boost/asio/any_completion_handler.hpp:190
#7  boost::asio::detail::any_completion_handler_call_fn<void (boost::system::error_code)>::impl<boost::asio::detail::consign_handler<boost::asio::detail::awaitable_handler<boost::asio::any_io_executor, boost::system::error_code>, std::pair<boost::asio::executor_work_guard<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>, void, void>, std::shared_ptr<neorados::detail::Client> > > >(boost::asio::detail::any_completion_handler_impl_base*, boost::system::error_code) (
    impl=<optimized out>, args#0=...)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/redhat-linux-build/boost/include/boost/asio/any_completion_handler.hpp:220
#8  0x00007f17b4cb4fac in void boost::asio::detail::executor_function::complete<boost::asio::detail::binder0<boost::asio::detail::append_handler<boost::asio::any_completion_handler<void (boost::system::error_code)>, boost::system::error_code> >, std::allocator<void> >(boost::asio::detail::executor_function::impl_base*, bool) ()
   from /usr/lib64/ceph/libceph-common.so.2
#9  0x0000558710f34798 in boost::asio::detail::executor_function::operator() (this=<synthetic pointer>)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/redhat-linux-build/boost/include/boost/asio/detail/executor_function.hpp:61
#10 boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>::execute<boost::asio::detail::executor_function> (this=0x55874ca67520, f=...)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/redhat-linux-build/boost/include/boost/asio/impl/io_context.hpp:192
#11 0x00007f17b4cb1e88 in boost::asio::detail::work_dispatcher<boost::asio::detail::append_handler<boost::asio::any_completion_handler<void (boost::system::error_code)>, boost::system::error_code>, boost::asio::any_completion_executor, void>::operator()() () from /usr/lib64/ceph/libceph-common.so.2
#12 0x00007f17b4cb6caa in boost::asio::detail::executor_op<boost::asio::detail::work_dispatcher<boost::asio::detail::append_handler<boost::asio::any_completion_handler<void (boost::system::error_code)>, boost::system::error_code>, boost::asio::any_completion_executor, void>, boost::asio::any_completion_handler_allocator<void, void (boost::system::error_code)>, boost::asio::detail::scheduler_operation>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) () from /usr/lib64/ceph/libceph-common.so.2
#13 0x0000558710fcc38d in boost::asio::detail::scheduler_operation::complete (this=0x7f1714039b90, owner=0x55874b551bf0, ec=..., bytes_transferred=<optimized out>)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/redhat-linux-build/boost/include/boost/asio/detail/scheduler_operation.hpp:40
#14 boost::asio::detail::scheduler::do_run_one (this=0x55874b551bf0, lock=..., this_thread=..., ec=...)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/redhat-linux-build/boost/include/boost/asio/detail/impl/scheduler.ipp:492
#15 boost::asio::detail::scheduler::run(boost::system::error_code&) [clone .constprop.0] (this=0x55874b551bf0, ec=...)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/redhat-linux-build/boost/include/boost/asio/detail/impl/scheduler.ipp:208
#16 0x0000558710f19277 in boost::asio::io_context::run (this=0x55874ca43598)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/redhat-linux-build/boost/include/boost/asio/impl/io_context.ipp:71
#17 CoroTest::TestBody (this=0x55874ca43580) at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/test/neorados/common_tests.h:226
#18 0x0000558710fc4082 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (location=0x558710fe3a9d "the test body", 
    object=0x55874ca43580, method=<optimized out>) at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/googletest/googletest/src/gtest.cc:2653
#19 testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) [clone .constprop.0] (
    object=0x55874ca43580, method=<optimized out>, location=0x558710fe3a9d "the test body")
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/googletest/googletest/src/gtest.cc:2689
#20 0x0000558710fb5163 in testing::Test::Run (this=0x55874ca43580)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/googletest/googletest/src/gtest.cc:2728
#21 testing::Test::Run (this=0x55874ca43580) at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/googletest/googletest/src/gtest.cc:2718
#22 0x0000558710fb541d in testing::TestInfo::Run (this=0x55874b5d50c0)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/googletest/googletest/src/gtest.cc:2874
#23 0x0000558710fc3e1a in testing::TestSuite::Run (this=0x55874b564700)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/googletest/googletest/src/gtest.cc:3052
#24 0x0000558710fbe89a in testing::TestSuite::Run (this=<optimized out>)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/googletest/googletest/src/gtest.cc:3007
#25 testing::internal::UnitTestImpl::RunAllTests (this=this@entry=0x55874b585be0)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/googletest/googletest/src/gtest.cc:6004
#26 0x0000558710fbeee0 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (
    location=0x558710fecd90 "auxiliary test code (environments or event listeners)", object=0x55874b585be0, method=<optimized out>)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/googletest/googletest/src/gtest.cc:2642
#27 testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (
    location=0x558710fecd90 "auxiliary test code (environments or event listeners)", object=0x55874b585be0, method=<optimized out>)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/googletest/googletest/src/gtest.cc:2689
#28 testing::UnitTest::Run (this=<optimized out>) at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/googletest/googletest/src/gtest.cc:5583
#29 0x0000558710f1326d in RUN_ALL_TESTS () at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/googletest/googletest/include/gtest/gtest.h:2334
#30 main (argc=<optimized out>, argv=0x7fffca0eb058) at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/googletest/googlemock/src/gmock_main.cc:71
...
...
(gdb) list 185,200
185      readioc.set_read_snap(neorados::snap_dir);
186    
187      co_await new_selfmanaged_snap(rados(), my_snaps, ioc);
188      const auto bl1 = filled_buffer_list(0xcc, len);
189      co_await execute(oid, WriteOp{}.write(0, bl1), ioc);
190      co_await execute(oid, WriteOp{}.write(len, bl1), ioc);
191      co_await execute(oid, WriteOp{}.write(len * 2, bl1), ioc);
192    
193      neorados::SnapSet ss;
194      co_await execute(oid, ReadOp{}.list_snaps(&ss), readioc);
195      EXPECT_EQ(1u, ss.clones.size());
196      EXPECT_EQ(neorados::snap_head, ss.clones[0].cloneid);
197      EXPECT_EQ(0u, ss.clones[0].snaps.size());
198      EXPECT_EQ(0u, ss.clones[0].overlap.size());
199      EXPECT_EQ(len * 3, ss.clones[0].size);
200    
(gdb) info locals
my_snaps = std::vector of length 1, capacity 1 = {2}
ioc = {static impl_size = 128, impl = {data = {149, 0, 0, 0, 0, 0, 0, 0, 8, 213, 185, 76, 135, 85, 0 <repeats 18 times>, 107, 101, 121, 0, 95, 111, 115, 100, 40, 
      213, 185, 76, 135, 85, 0 <repeats 18 times>, 40, 0, 0, 0, 0, 0, 0, 0, 255, 255, 255, 255, 255, 255, 255, 255, 254, 255, 255, 255, 255, 255, 255, 255, 2, 0, 0, 
      0, 0, 0, 0, 0, 224, 182, 187, 76, 135, 85, 0, 0, 232, 182, 187, 76, 135, 85, 0, 0, 232, 182, 187, 76, 135, 85, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}}}
readioc = {static impl_size = 128, impl = {data = {149, 0, 0, 0, 0, 0, 0, 0, 136, 213, 185, 76, 135, 85, 0 <repeats 18 times>, 160, 213, 185, 76, 135, 85, 0, 0, 168, 
      213, 185, 76, 135, 85, 0 <repeats 26 times>, 255 <repeats 16 times>, 0 <repeats 40 times>}}}
bl1 = {_buffers = {_root = {next = 0x55874ca622b0}, _tail = 0x55874ca622b0}, _carriage = 0x55871102e760 <ceph::buffer::v15_2_0::list::always_empty_bptr>, _len = 128, 
  _num = 1, static always_empty_bptr = {<ceph::buffer::v15_2_0::ptr_hook> = {next = 0x0}, <ceph::buffer::v15_2_0::ptr> = {_raw = 0x0, _off = 0, 
      _len = 0}, <No data fields>}}
ss = {clones = std::vector of length 350204646, capacity 350204656 = {<error reading variable: Cannot access memory at address 0x558214cd1c0d>
bl2 = {_buffers = {_root = {next = 0x55874cb9d640}, _tail = 0x8}, _carriage = 0x72676d5f73706163, _len = 0, _num = 0, 
  static always_empty_bptr = {<ceph::buffer::v15_2_0::ptr_hook> = {next = 0x0}, <ceph::buffer::v15_2_0::ptr> = {_raw = 0x0, _off = 0, _len = 0}, <No data fields>}}
resbl = {_buffers = {_root = {next = 0x55874cb9d660}, _tail = 0x7}, _carriage = 0x7220776f6c6c61, _len = 0, _num = 0, 
  static always_empty_bptr = {<ceph::buffer::v15_2_0::ptr_hook> = {next = 0x0}, <ceph::buffer::v15_2_0::ptr> = {_raw = 0x0, _off = 0, _len = 0}, <No data fields>}}
_Coro_resume_fn = 0x558710f20c20 <NeoRadosSelfManagedSnaps_Rollback_Test::CoTestBody(_ZN38NeoRadosSelfManagedSnaps_Rollback_Test10CoTestBodyEv.Frame *)>
_Coro_destroy_fn = 0x558710f24e00 <NeoRadosSelfManagedSnaps_Rollback_Test::CoTestBody(_ZN38NeoRadosSelfManagedSnaps_Rollback_Test10CoTestBodyEv.Frame *)>
this = 0x55874ca43580
_Coro_promise = {<boost::asio::detail::awaitable_frame_base<boost::asio::any_io_executor>> = {coro_ = {_M_fr_ptr = 0x55874cb9d470}, 
    attached_thread_ = 0x7fffca0ea7d0, caller_ = 0x7f172805ff80, pending_exception_ = {_M_exception_object = 0x0}, 
    resume_context_ = 0x7fffca0ea740}, <No data fields>}
_Coro_self_handle = {_M_fr_ptr = 0x55874cb9d470}
...
...
(gdb) p ss
$1 = {clones = std::vector of length 350204646, capacity 350204656 = {<error reading variable: Cannot access memory at address 0x558214cd1c0d>
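
As a side observation on the locals above: some of the odd pointer-sized values decode to printable ASCII when their bytes are read in little-endian order, which is how x86_64 stores them. The sketch below is only an aid for reading the dump, not a diagnosis. decode_le is a hypothetical helper name, and non-printable bytes are shown as ".".

```shell
#!/bin/sh
# Sketch only: show a 64-bit value from gdb output as little-endian ASCII.
# decode_le is a hypothetical helper name.

decode_le() {
  hex=${1#0x}
  # Left-pad with zeros to 16 hex digits (one 64-bit word).
  while [ "${#hex}" -lt 16 ]; do hex="0$hex"; done
  out=''
  i=15
  # Walk from the least significant byte (lowest address) upward.
  while [ "$i" -ge 1 ]; do
    byte=$(printf '%s' "$hex" | cut -c"$i"-"$((i + 1))")
    dec=$((0x$byte))
    if [ "$dec" -ge 32 ] && [ "$dec" -le 126 ]; then
      # Emit the character through an octal escape, which POSIX printf supports.
      out="$out$(printf "\\$(printf '%03o' "$dec")")"
    else
      out="$out."
    fi
    i=$((i - 2))
  done
  printf '%s\n' "$out"
}

decode_le 0x72676d5f73706163   # bl2._carriage   -> caps_mgr
decode_le 0x7220776f6c6c61     # resbl._carriage -> allow r.
```

The "ioc" bytes 107, 101, 121, 0, 95, 111, 115, 100 likewise read as "key", a NUL, then "_osd". One possible reading is that these slots hold leftover string data from an earlier use of the same memory; "bl2" and "resbl" do not appear in the listed lines 185-200, so they may not have been constructed yet at line 193. This does not by itself explain the crash.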

I got this backtrace for the second core file:

$ gdb /usr/bin/ceph_test_neorados_read_operations -c 1762460774.41434.core -d /root/rpmbuild/BUILD/ceph-20.3.0-3904-ge96cbb7d
...
...
Core was generated by `ceph_test_neorados_read_operations --gtest_output=xml:/home/ubuntu/cephtest/arc'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  NeoRadosReadOps_OmapNuls_Test::CoTestBody(_ZN29NeoRadosReadOps_OmapNuls_Test10CoTestBodyEv.Frame *) (frame_ptr=0x55b425a2a750)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/test/neorados/read_operations.cc:546
546        while (truncated) {
[Current thread is 1 (LWP 41434)]
(gdb) bt
#0  NeoRadosReadOps_OmapNuls_Test::CoTestBody(_ZN29NeoRadosReadOps_OmapNuls_Test10CoTestBodyEv.Frame *) (frame_ptr=0x55b425a2a750)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/test/neorados/read_operations.cc:546
#1  0x000055b3edfa7216 in std::__n4861::coroutine_handle<void>::resume (this=<optimized out>) at /usr/include/c++/14/coroutine:137
#2  boost::asio::detail::awaitable_frame_base<boost::asio::any_io_executor>::resume (this=<optimized out>)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/redhat-linux-build/boost/include/boost/asio/impl/awaitable.hpp:499
#3  boost::asio::detail::awaitable_thread<boost::asio::any_io_executor>::pump (this=0x7ffed47af930)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/redhat-linux-build/boost/include/boost/asio/impl/awaitable.hpp:769
#4  0x000055b3edfa714b in boost::asio::detail::consign_handler<boost::asio::detail::awaitable_handler<boost::asio::any_io_executor, boost::system::error_code>, std::pair<boost::asio::executor_work_guard<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>, void, void>, std::shared_ptr<neorados::detail::Client> > >::operator()<boost::system::error_code> (this=0x7ffed47af930)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/redhat-linux-build/boost/include/boost/asio/impl/consign.hpp:49
#5  boost::asio::detail::any_completion_handler_impl<boost::asio::detail::consign_handler<boost::asio::detail::awaitable_handler<boost::asio::any_io_executor, boost::system::error_code>, std::pair<boost::asio::executor_work_guard<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>, void, void>, std::shared_ptr<neorados::detail::Client> > > >::call<boost::system::error_code> (this=<optimized out>)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/redhat-linux-build/boost/include/boost/asio/any_completion_handler.hpp:190
#6  boost::asio::detail::any_completion_handler_call_fn<void (boost::system::error_code)>::impl<boost::asio::detail::consign_handler<boost::asio::detail::awaitable_handler<boost::asio::any_io_executor, boost::system::error_code>, std::pair<boost::asio::executor_work_guard<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>, void, void>, std::shared_ptr<neorados::detail::Client> > > >(boost::asio::detail::any_completion_handler_impl_base*, boost::system::error_code) (
    impl=<optimized out>, args#0=...)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/redhat-linux-build/boost/include/boost/asio/any_completion_handler.hpp:220
#7  0x00007fdaab6b4fac in void boost::asio::detail::executor_function::complete<boost::asio::detail::binder0<boost::asio::detail::append_handler<boost::asio::any_completion_handler<void (boost::system::error_code)>, boost::system::error_code> >, std::allocator<void> >(boost::asio::detail::executor_function::impl_base*, bool) ()
   from /usr/lib64/ceph/libceph-common.so.2
#8  0x000055b3edfa7f38 in boost::asio::detail::executor_function::operator() (this=<synthetic pointer>)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/redhat-linux-build/boost/include/boost/asio/detail/executor_function.hpp:61
#9  boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>::execute<boost::asio::detail::executor_function> (this=0x7fd99003c380, f=...)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/redhat-linux-build/boost/include/boost/asio/impl/io_context.hpp:192
#10 0x00007fdaab6b1e88 in boost::asio::detail::work_dispatcher<boost::asio::detail::append_handler<boost::asio::any_completion_handler<void (boost::system::error_code)>, boost::system::error_code>, boost::asio::any_completion_executor, void>::operator()() () from /usr/lib64/ceph/libceph-common.so.2
#11 0x00007fdaab6b6caa in boost::asio::detail::executor_op<boost::asio::detail::work_dispatcher<boost::asio::detail::append_handler<boost::asio::any_completion_handler<void (boost::system::error_code)>, boost::system::error_code>, boost::asio::any_completion_executor, void>, boost::asio::any_completion_handler_allocator<void, void (boost::system::error_code)>, boost::asio::detail::scheduler_operation>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) () from /usr/lib64/ceph/libceph-common.so.2
#12 0x000055b3ee05873d in boost::asio::detail::scheduler_operation::complete (this=0x7fd99009a180, owner=0x55b423b8c0e0, ec=..., bytes_transferred=<optimized out>)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/redhat-linux-build/boost/include/boost/asio/detail/scheduler_operation.hpp:40
#13 boost::asio::detail::scheduler::do_run_one (this=0x55b423b8c0e0, lock=..., this_thread=..., ec=...)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/redhat-linux-build/boost/include/boost/asio/detail/impl/scheduler.ipp:492
#14 boost::asio::detail::scheduler::run(boost::system::error_code&) [clone .constprop.0] (this=0x55b423b8c0e0, ec=...)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/redhat-linux-build/boost/include/boost/asio/detail/impl/scheduler.ipp:208
#15 0x000055b3edf8772c in boost::asio::io_context::run (this=0x55b4258d7918)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/redhat-linux-build/boost/include/boost/asio/impl/io_context.ipp:71
#16 CoroTest::TestBody (this=0x55b4258d7900) at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/test/neorados/common_tests.h:226
#17 0x000055b3ee038162 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (location=0x55b3ee09b90c "the test body", 
    object=0x55b4258d7900, method=<optimized out>) at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/googletest/googletest/src/gtest.cc:2653
#18 testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) [clone .constprop.0] (
    object=0x55b4258d7900, method=<optimized out>, location=0x55b3ee09b90c "the test body")
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/googletest/googletest/src/gtest.cc:2689
#19 0x000055b3ee0293a3 in testing::Test::Run (this=0x55b4258d7900)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/googletest/googletest/src/gtest.cc:2728
#20 testing::Test::Run (this=0x55b4258d7900) at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/googletest/googletest/src/gtest.cc:2718
#21 0x000055b3ee02965d in testing::TestInfo::Run (this=0x55b423b7cbf0)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/googletest/googletest/src/gtest.cc:2874
#22 0x000055b3ee037efa in testing::TestSuite::Run (this=0x55b423bb5700)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/googletest/googletest/src/gtest.cc:3052
#23 0x000055b3ee032c9a in testing::TestSuite::Run (this=<optimized out>)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/googletest/googletest/src/gtest.cc:3007
#24 testing::internal::UnitTestImpl::RunAllTests (this=this@entry=0x55b423bb0be0)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/googletest/googletest/src/gtest.cc:6004
#25 0x000055b3ee0332e0 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (
    location=0x55b3ee0a53c8 "auxiliary test code (environments or event listeners)", object=0x55b423bb0be0, method=<optimized out>)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/googletest/googletest/src/gtest.cc:2642
#26 testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (
    location=0x55b3ee0a53c8 "auxiliary test code (environments or event listeners)", object=0x55b423bb0be0, method=<optimized out>)
    at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/googletest/googletest/src/gtest.cc:2689
#27 testing::UnitTest::Run (this=<optimized out>) at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/googletest/googletest/src/gtest.cc:5583
#28 0x000055b3edf8176d in RUN_ALL_TESTS () at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/googletest/googletest/include/gtest/gtest.h:2334
#29 main (argc=<optimized out>, argv=0x7ffed47b01b8) at /usr/src/debug/ceph-20.3.0-3904.ge96cbb7d.el10.x86_64/src/googletest/googlemock/src/gmock_main.cc:71
...
...
(gdb) list 535,555
535                    }));
536      }
537    
538      // Check iteration and truncation
539      {
540        std::unordered_set<std::string> keys;
541        for (const auto& [key, value] : omap) {
542          keys.insert(key);
543        }
544        bool truncated = true;
545        std::optional<std::string> lastkey;
546        while (truncated) {
547          ctnr::flat_set<std::string> keys2;
548          ctnr::flat_map<std::string, buffer::list> omap2;
549          bool truncated2;
550          ReadOp op;
551          op.get_omap_vals(lastkey, {}, 1, &omap2, &truncated);
552          op.get_omap_keys(lastkey, 1, &keys2, &truncated2);
553          co_await execute(oid, std::move(op));
554          EXPECT_EQ(1, std::ssize(keys2));
555          EXPECT_EQ(1, std::ssize(omap2));
...
...
(gdb) info locals
keys = std::unordered_set with 3 elements = {[0] = "3baa\000rr", [1] = "2baar", [2] = "1\000bar"}
truncated = true
lastkey = std::optional [no contained value]
omap = {m_flat_tree = {
    m_data = {<boost::container::dtl::flat_tree_value_compare<std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list>, boost::container::dtl::select1st<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >> = {<std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >> = {<std::binary_function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool>> = {<No data fields>}, <No data fields>}, <No data fields>}, m_seq = {
        m_holder = {<boost::container::new_allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list> >> = {<No data fields>}, m_start = 0x7fda800fd110, m_size = 3, m_capacity = 3}}}, static has_stored_allocator_type = true}}
_Coro_resume_fn = 0x55b3edf9a060 <NeoRadosReadOps_OmapNuls_Test::CoTestBody(_ZN29NeoRadosReadOps_OmapNuls_Test10CoTestBodyEv.Frame *)>
_Coro_destroy_fn = 0x55b3edf9ee50 <NeoRadosReadOps_OmapNuls_Test::CoTestBody(_ZN29NeoRadosReadOps_OmapNuls_Test10CoTestBodyEv.Frame *)>
this = 0x55b4258d7900
_Coro_promise = {<boost::asio::detail::awaitable_frame_base<boost::asio::any_io_executor>> = {coro_ = {_M_fr_ptr = 0x55b425a2a750}, 
    attached_thread_ = 0x7ffed47af930, caller_ = 0x7fd9a80a3ab0, pending_exception_ = {_M_exception_object = 0x0}, 
    resume_context_ = 0x7ffed47af8e0}, <No data fields>}
_Coro_self_handle = {_M_fr_ptr = 0x55b425a2a750}
_Coro_resume_index = 10
_Coro_frame_needs_free = true
_Coro_initial_await_resume_called = true

Actions #6

Updated by Laura Flores 4 months ago

It looks like both tests are using `co_await`, so it's possible that something in the C library changed between centos/ubuntu and rocky10.

@Adam Emerson WDYT?

Actions #7

Updated by Laura Flores 4 months ago

  • Status changed from New to In Progress
  • Assignee set to Adam Emerson

@Adam Emerson assigning to you.

Actions #8

Updated by Laura Flores 4 months ago

Bump up

Actions #9

Updated by Laura Flores 4 months ago

  • Priority changed from Normal to High
Actions #10

Updated by Radoslaw Zarzynski 4 months ago

It's a part of the Rocky 10 effort.

Actions #11

Updated by Adam Emerson 4 months ago

Just as an update: I am currently investigating this and may have found a local reproducer, which I'm hammering on.

Actions #12

Updated by Yaarit Hatuka 4 months ago

  • Blocks Bug #73930: ceph-mgr modules rely on deprecated python subinterpreters added
Actions #13

Updated by Casey Bodley 4 months ago

i wasn't able to reproduce the crashes until i added the rpm hardening flags used by our shaman package builds:

CXXFLAGS="-O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Wno-complain-wrong-lang -Werror=format-security -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -m64 -march=x86-64-v3 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -mtls-dialect=gnu2"

of these, i narrowed the culprit down to -march=x86-64-v3. with that present, i see the same crashes on fedora 43 with gcc 15.2.1

cbodley@fedora ~/ceph/build $ gdb --args bin/ceph_test_neorados_read_operations --gtest_repeat=100 --gtest_filter=NeoRadosReadOps.Omap

Thread 1 "ceph_test_neora" received signal SIGSEGV, Segmentation fault.
0x000055555558700d in NeoRadosReadOps_Omap_Test::CoTestBody (frame_ptr=0x55555584e650)
    at /home/cbodley/ceph/src/test/neorados/read_operations.cc:412
412         std::optional<std::string> lastkey;
(gdb) disassemble /s
Dump of assembler code for function NeoRadosReadOps_Omap_Test::CoTestBody():
...
/home/cbodley/ceph/src/test/neorados/read_operations.cc:
411         bool truncated = true;
   0x0000555555586fcc <+4204>:  lea    0x20b8(%r15),%rax
   0x0000555555586fd3 <+4211>:  movb   $0x1,0x20b8(%r15)

412         std::optional<std::string> lastkey;
   0x0000555555586fdb <+4219>:  vpxor  %xmm0,%xmm0,%xmm0
   0x0000555555586fdf <+4223>:  lea    0x2400(%r15),%rbx
   0x0000555555586fe6 <+4230>:  mov    %rax,-0x4f0(%rbp)
   0x0000555555586fed <+4237>:  lea    0x2130(%r15),%rax
   0x0000555555586ff4 <+4244>:  lea    0x2118(%r15),%r13
   0x0000555555586ffb <+4251>:  movq   $0x0,0x20e0(%r15)

413         while (truncated) {
   0x0000555555587006 <+4262>:  mov    %rax,-0x4c0(%rbp)

412         std::optional<std::string> lastkey;
=> 0x000055555558700d <+4269>:  vmovdqa %ymm0,0x20c0(%r15)
   0x0000555555587016 <+4278>:  vzeroupper
...
(gdb) p $r15
$1 = 93824995354192
(gdb) p/a $r15
$2 = 0x55555584e650
(gdb) p lastkey
$3 = std::optional [no contained value]
(gdb) p &lastkey
$4 = (std::optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > *) 0x555555850710

from vmovdqa %ymm0,0x20c0(%r15), the r15 register corresponds to NeoRadosReadOps_Omap_Test::CoTestBody (frame_ptr=0x55555584e650), and 0x20c0 is the offset to the lastkey variable. the size of optional<string> here is 40 bytes, so its end offset is 0x20e8. so this instruction is copying the 32-byte ymm0 register into the low 32 bytes of memory for lastkey

vpxor %xmm0,%xmm0,%xmm0 above is zeroing the 16 byte xmm0 register which corresponds to the low 16 bytes of ymm0, but the high 16 bytes of ymm0 may be uninitialized? the vzeroupper instruction would zero those, but it comes after vmovdqa. regardless, that instruction shouldn't crash on uninitialized bits in the register. it's just copying those into lastkey's memory address, which should be a valid pointer to stack memory

movq $0x0,0x20e0(%r15) is what zeroes the final 8 bytes of lastkey
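
The offset arithmetic above can be double-checked at compile time. This is only a sketch: the constants are copied from the gdb session in this comment, and the names (`kFramePtr`, `kLastkeyOffset`, `kLastkeySize`, `lastkey_begin`, `lastkey_end_offset`) are made up for illustration and do not appear in the Ceph tree.

```cpp
#include <cstdint>

// Values copied from the gdb output above (assumed to be from the same run).
constexpr std::uint64_t kFramePtr      = 0x55555584e650; // r15 / frame_ptr
constexpr std::uint64_t kLastkeyOffset = 0x20c0;         // displacement used by vmovdqa
constexpr std::uint64_t kLastkeySize   = 40;             // size of optional<string> as stated above

// Address of lastkey inside the coroutine frame.
constexpr std::uint64_t lastkey_begin() { return kFramePtr + kLastkeyOffset; }

// Offset one past the last byte of lastkey.
constexpr std::uint64_t lastkey_end_offset() { return kLastkeyOffset + kLastkeySize; }

// Matches the address printed by `p &lastkey`.
static_assert(lastkey_begin() == 0x555555850710);
static_assert(lastkey_end_offset() == 0x20e8);

// The 32-byte vmovdqa at 0x20c0 ends exactly where the 8-byte movq at 0x20e0
// begins, so the two stores together cover all 40 bytes of lastkey.
static_assert(kLastkeyOffset + 32 == 0x20e0);
static_assert(0x20e0 + 8 == lastkey_end_offset());
```

If any of the numbers were transcribed wrongly, the translation unit would fail to compile.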

the same test without -march specified doesn't crash, with code gen:

/home/cbodley/ceph/src/test/neorados/read_operations.cc:
411         bool truncated = true;
   0x00005555555a8f30 <+7808>:  lea    0x2130(%r14),%rax
   0x00005555555a8f37 <+7815>:  movb   $0x1,0x20b8(%r14)

412         std::optional<std::string> lastkey;
=> 0x00005555555a8f3f <+7823>:  pxor   %xmm0,%xmm0
   0x00005555555a8f43 <+7827>:  lea    0x2400(%r14),%rbx
   0x00005555555a8f4a <+7834>:  mov    %rax,-0x388(%rbp)
   0x00005555555a8f51 <+7841>:  lea    0x2118(%r14),%r12
   0x00005555555a8f58 <+7848>:  movq   $0x0,0x20e0(%r14)

413         while (truncated) {
   0x00005555555a8f63 <+7859>:  movaps %xmm0,0x20c0(%r14)
   0x00005555555a8f6b <+7867>:  movaps %xmm0,0x20d0(%r14)

the same test on the tentacle branch does not crash with -march=x86-64-v3, so i'm working to bisect

Actions #14

Updated by Casey Bodley 4 months ago

i started bisect at commit 83a82c51682caafaea5cd9ccf8e77b7250448c81, just before an early March pr https://github.com/ceph/ceph/pull/61084 which bumped boost to 1.87:

$ git reset --hard origin/main
$ git bisect start
$ git bisect bad
$ git bisect good 83a82c51682caafaea5cd9ccf8e77b7250448c81
Bisecting: 2893 revisions left to test after this (roughly 12 steps)
[bc7600417e9a8e60225f66fa4b3fca14f6e8af3f] Merge pull request #64069 from phlogistonjohn/jjm-bwc-test-tweak
$ git bisect bad
Bisecting: 1440 revisions left to test after this (roughly 11 steps)
warning: unable to rmdir 'src/breakpad': Directory not empty
warning: unable to rmdir 'src/lss': Directory not empty
[84a42f7c76b19f9532136b22d4a64b2aad8b3257] Merge pull request #56336 from pritha-srivastava/wip-rgw-d4n-next
$ git bisect good
Bisecting: 720 revisions left to test after this (roughly 10 steps)
[ef03debd4c6fc1e3866f6daed1cf449050fbb191] Merge pull request #63629 from zdover23/wip-doc-2025-06-02-mgr-localpool-63419-followup
$ git bisect good
Bisecting: 359 revisions left to test after this (roughly 9 steps)
[94478005e72c0f9ca496b828c439b4843c88f3a0] Merge pull request #58881 from cbodley/wip-gcc-13-lto
$ git bisect bad
Bisecting: 180 revisions left to test after this (roughly 8 steps)
[406a8e8e7643fa1735d6d354ce25c33a3a5ffe7a] Merge pull request #63160 from yuvalif/wip-yuval-71219
$ git bisect good
Bisecting: 90 revisions left to test after this (roughly 7 steps)
[ed2b694e4d9479eb7a9bef84863d5e9b253f4099] Merge pull request #63975 from tchaikov/wip-cmake-find_program
$ git bisect good
Bisecting: 45 revisions left to test after this (roughly 6 steps)
[6aa6c77961548a14b058e81507f3b93205955095] Merge pull request #64007 from tchaikov/wip-update-ceph-object-corpus
$ git bisect good
Bisecting: 19 revisions left to test after this (roughly 5 steps)
[db2a4a672e6622481e75e7a9d96744e9ab9faec5] Merge pull request #63149 from ajarr/wip-ajarr-fix-mirror-image-get-mode
$ git bisect good
Bisecting: 9 revisions left to test after this (roughly 3 steps)
[18c98576fc2cb67e9ecaaa38ca1b935fa2f92fc3] Merge pull request #62568 from Matan-B/wip-matanb-fmt-11.1.4
$ git bisect good
Bisecting: 5 revisions left to test after this (roughly 2 steps)
[2cddd4d09178b69babbd2412672d0620932d786b] Merge pull request #64042 from tchaikov/wip-rgw-no-aligned_storage
$ git bisect good
Bisecting: 0 revisions left to test after this (roughly 1 step)
[ebceb95ffc1907014ad8d22344fac6de15eba3d2] Merge pull request #64037 from tchaikov/wip-neorados-alignedas
$ git bisect bad
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[ba7b42983cc6e1966f9149cc6160a4ae6154f9e0] neorados: avoid using std::aligned_storage_t

which came from https://github.com/ceph/ceph/pull/64037 for c++23 support

Actions #15

Updated by Casey Bodley 4 months ago · Edited

[ba7b42983cc6e1966f9149cc6160a4ae6154f9e0] neorados: avoid using std::aligned_storage_t

after this change, alignof(neorados::Op) is significantly different. the aligned_storage template introduced by Kefu defaults to Alignment = std::bit_ceil(S), which gives 1024 for S = 680

but the original std::aligned_storage_t<680> defaults to 16-byte alignment:

ceph/src/include/neorados/RADOS.hpp:393:37: warning: ‘using std::aligned_storage_t = struct std::aligned_storage<680, 16>::type’ is deprecated [-Wdeprecated-declarations]
  393 |   std::aligned_storage_t<impl_size> impl;
      |                                     ^~~~

the crash goes away if i specify the alignment as 16 (which also matches alignof(OpImpl)):

   static constexpr std::size_t impl_size = 85 * 8;
-  detail::aligned_storage<impl_size> impl;
+  detail::aligned_storage<impl_size, 16> impl;

from vmovdqa %ymm0,0x20c0(%r15), the r15 register corresponds to NeoRadosReadOps_Omap_Test::CoTestBody (frame_ptr=0x55555584e650), and 0x20c0 is the offset to the lastkey variable. the size of optional<string> here is 40 bytes, so its end offset is 0x20e8. so this instruction is copying the 32-byte ymm0 register into the low 32 bytes of memory for lastkey

vpxor %xmm0,%xmm0,%xmm0 above is zeroing the 16 byte xmm0 register which corresponds to the low 16 bytes of ymm0, but the high 16 bytes of ymm0 may be uninitialized? the vzeroupper instruction would zero those, but it comes after vmovdqa. regardless, that instruction shouldn't crash on uninitialized bits in the register. it's just copying those into lastkey's memory address, which should be a valid pointer to stack memory

@Shilpa MJ pointed out that vmovdqa on a 32-byte register probably expects the memory to have 32-byte alignment, but the address of lastkey (0x555555850710) only has 16-byte alignment
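
The alignment observation can be checked with plain arithmetic on the addresses from the gdb session. This is a sketch only; `is_aligned` and the constant names are hypothetical helpers, not part of Ceph.

```cpp
#include <cstdint>

// True when addr is a multiple of alignment.
constexpr bool is_aligned(std::uint64_t addr, std::uint64_t alignment) {
  return addr % alignment == 0;
}

// Values copied from the gdb session in note 13.
constexpr std::uint64_t kFrameAddr   = 0x55555584e650; // frame_ptr / r15
constexpr std::uint64_t kStoreOffset = 0x20c0;         // displacement in vmovdqa
constexpr std::uint64_t kLastkeyAddr = 0x555555850710; // p &lastkey

// lastkey is 16-byte aligned but not 32-byte aligned.
static_assert(is_aligned(kLastkeyAddr, 16));
static_assert(!is_aligned(kLastkeyAddr, 32));

// The offset within the frame is a multiple of 32, while the frame base is
// only 16-byte aligned. The store would have been aligned if the frame
// itself had started on a 32-byte boundary.
static_assert(is_aligned(kStoreOffset, 32));
static_assert(is_aligned(kFrameAddr, 16));
static_assert(!is_aligned(kFrameAddr, 32));
```

This is consistent with the compiler laying out the frame as if its base had a larger alignment than the allocation actually provided, though the numbers alone do not prove that.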

@Matt Benjamin found https://www.cs.ubbcluj.ro/~vancea/asc/practic/html/MOVDQA.html which says for VMOVDQA,

When the source or destination operand is a memory operand, the operand must be aligned on a 32-byte boundary or a general-protection exception (#GP) will be generated. To move integer data to and from unaligned memory locations, use the VMOVDQU instruction.

so i assume the use of VMOVDQA over VMOVDQU is due to a compiler bug?

edit: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104177 looks potentially related, specific to coroutine frames

edit2: that bug talks about "extended alignment" which is greater than alignof(std::max_align_t) = 16

Actions #16

Updated by Casey Bodley 4 months ago

  • Pull request ID set to 66396
Actions #17

Updated by Laura Flores 4 months ago

Scrub note: Approved, but needs to be tested.

Actions #18

Updated by Laura Flores 4 months ago

  • Related to QA Run #73749: wip-lflores-testing-4-2025-12-01-1527 (old wip-rocky10-branch-of-the-day-2025-11-05-1762369819) added
Actions #19

Updated by Casey Bodley 4 months ago

Casey Bodley wrote in #note-15:

edit: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104177 looks potentially related, specific to coroutine frames

edit2: that bug talks about "extended alignment" which is greater than alignof(std::max_align_t) = 16

digging further, it sounds like gcc is conforming to the c++ standard which mandates the use of a specific (unaligned) overload of operator new for coroutine frames. https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p2014r2.html proposes the use of std::align_val_t overloads of operator new/delete for coroutine frames that require extended alignment. however, there's been no movement on that paper since 2024 and according to https://github.com/cplusplus/papers/issues/750#issuecomment-2657897866,

Author says the paper is no longer pursued.

i guess we just need to be very careful about inventing/using types with "extended alignment"
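
One way to be careful about this is a compile-time check on types that may end up as coroutine locals. The following is a sketch under assumptions: `has_extended_alignment`, `DefaultAligned` and `OverAligned` are hypothetical names for illustration, and nothing here claims that Ceph uses such a guard.

```cpp
#include <cstddef>

// True when T has "extended alignment", i.e. alignment greater than
// alignof(std::max_align_t), as discussed in the gcc bug above.
template <typename T>
inline constexpr bool has_extended_alignment =
    alignof(T) > alignof(std::max_align_t);

// Example types (hypothetical): one with fundamental alignment and one
// over-aligned the way bit_ceil(680) = 1024 would make it.
struct alignas(std::max_align_t) DefaultAligned { unsigned char data[680]; };
struct alignas(1024) OverAligned { unsigned char data[680]; };

static_assert(!has_extended_alignment<DefaultAligned>);
static_assert(has_extended_alignment<OverAligned>);
```

A line such as `static_assert(!has_extended_alignment<SomeType>);` placed next to a type definition would turn an accidental alignment bump into a build failure instead of a runtime crash.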

Actions #20

Updated by Radoslaw Zarzynski 3 months ago

  • Status changed from In Progress to Fix Under Review
Actions #21

Updated by Laura Flores 3 months ago

Note from bug scrub: QA in progress.

Actions #22

Updated by Laura Flores 3 months ago

Scrub note: Checking with Nitzan about test results for this.

Actions #23

Updated by Yaarit Hatuka 3 months ago

QA run results in https://tracker.ceph.com/issues/74070 might be related, need to take a look.

Actions #24

Updated by Laura Flores 2 months ago

Scrub note: QA evaluation delayed due to lab migration.

Actions #25

Updated by Yaarit Hatuka about 1 month ago

  • Has duplicate Bug #73758: rocky 10: test_rgw_datalog.sh fails with segfault added
Actions #26

Updated by Upkeep Bot about 6 hours ago

  • Status changed from Fix Under Review to Resolved
  • Merge Commit set to 1330473ff40fa90dbbd11ff75a20bf27cc262e4c
  • Fixed In set to v20.3.0-6282-g1330473ff4
  • Upkeep Timestamp set to 2026-03-20T21:51:33+00:00