
common/ceph_context.h: reserve space for breakpad in CephContext #64829

Merged
aclamk merged 1 commit into ceph:main from aclamk:aclamk-jenkins-fix-make-check
Sep 12, 2025

@aclamk
Contributor

@aclamk aclamk commented Aug 5, 2025

For cases when HAVE_BREAKPAD is off, supply exactly the same space in CephContext struct.

While it should not happen, jenkins seems to link binaries with different variants.

The noticeable artefacts of this misbehaviour are:
208 - unittest_bluefs (Bus error)
209 - unittest_bluefs_ex (Failed)
211 - unittest_bdev (Bus error)

Above mentioned unittests are failing because
ceph_context.h :

ceph::PluginRegistry *get_plugin_registry() {
return _plugin_registry;
}
^ _plugin_registry returned is at !!!offset off by 8 bytes!!! to the location of _plugin_registry as constructed at ceph_context.cc :

743: _plugin_registry = new PluginRegistry(this);

This causes fatal error in
src/extblkdev/ExtBlkDevPlugin.cc :

227 auto registry = cct->get_plugin_registry();
228 std::lock_guard l(registry->lock);

Sometimes lock_guard hangs, sometimes lock_guard segfaults.
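
To illustrate the failure mode described above, here is a minimal sketch of how one optional member shifts every member declared after it. The struct and function names are hypothetical stand-ins, not the real Ceph types, and raw pointers replace the real members to keep the structs standard-layout so that `offsetof` is well defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the plugin registry.
struct Registry { int lock; };

// Layout as seen by a translation unit built WITH the optional member.
struct CtxWithHandler {
  void* ex_handler;
  Registry* plugin_registry;
};

// Layout as seen by a translation unit built WITHOUT it.
struct CtxWithoutHandler {
  Registry* plugin_registry;
};

// Distance between the two views of plugin_registry.
constexpr std::size_t registry_offset_skew() {
  return offsetof(CtxWithHandler, plugin_registry) -
         offsetof(CtxWithoutHandler, plugin_registry);
}

// Reads the pointer-sized slot at a given byte offset inside an object.
// This mimics what an accessor compiled against a different layout does.
inline void* slot_at(const void* obj, std::size_t offset) {
  void* out;
  std::memcpy(&out, static_cast<const char*>(obj) + offset, sizeof out);
  return out;
}

// On common ABIs the skew equals one pointer (8 bytes on 64-bit targets).
static_assert(registry_offset_skew() == sizeof(void*),
              "the optional member shifts plugin_registry by one pointer");
```

Under these assumptions, an accessor built against the smaller layout reads the slot that actually holds `ex_handler`, so a later `registry->lock` touches unrelated memory.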


Contributor

@rzarzynski rzarzynski left a comment

Generally LGTM apart from a tiny nit.

std::unique_ptr<google_breakpad::ExceptionHandler> _ex_handler;
static_assert(sizeof(std::unique_ptr<google_breakpad::ExceptionHandler>) == sizeof(std::unique_ptr<char>));
#else
std::unique_ptr<char> _ex_handler;

I would add a comment saying that it has a purpose. Perhaps `[[maybe_unused]]` would be enough to communicate "do not remove!" to humans.
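
A sketch of the pattern under review, assuming hypothetical stand-in types (`FakeExceptionHandler`, `FakeRegistry`, `SketchContext`) in place of the real ones. A `bool` template parameter stands in for the `HAVE_BREAKPAD` switch so that both variants can be compared in a single translation unit.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical stand-ins; the real header uses google_breakpad::ExceptionHandler.
struct FakeExceptionHandler { int state = 0; };
struct FakeRegistry {};

template <bool kHaveHandler>
struct SketchContext;

// Variant built with the crash handler enabled.
template <>
struct SketchContext<true> {
  std::unique_ptr<FakeExceptionHandler> ex_handler;
  FakeRegistry* plugin_registry = nullptr;
};

// Variant built with the crash handler disabled.
template <>
struct SketchContext<false> {
  // Placeholder that reserves the handler's slot. Intentionally unused:
  // removing it would change the offsets of all members below.
  [[maybe_unused]] std::unique_ptr<char> ex_handler;
  FakeRegistry* plugin_registry = nullptr;
};

static_assert(sizeof(std::unique_ptr<FakeExceptionHandler>) ==
                  sizeof(std::unique_ptr<char>),
              "placeholder must occupy the same space as the real member");
static_assert(sizeof(SketchContext<true>) == sizeof(SketchContext<false>),
              "both variants must have the same size");

// Byte offset of plugin_registry inside a concrete object.
template <class Ctx>
std::ptrdiff_t registry_offset(const Ctx& c) {
  return reinterpret_cast<const char*>(&c.plugin_registry) -
         reinterpret_cast<const char*>(&c);
}
```

With the placeholder in place, `registry_offset` returns the same value for both variants, which is the property the change relies on.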

For cases when HAVE_BREAKPAD is off, supply exactly the same space in
CephContext struct.

While it should happen, jenkins seems to link binaries with different variants.

The noticeable artefacts of this misbehaviour are:
	208 - unittest_bluefs (Bus error)
	209 - unittest_bluefs_ex (Failed)
	211 - unittest_bdev (Bus error)

Above mentioned unittests are failing because
ceph_context.h :

  ceph::PluginRegistry *get_plugin_registry() {
    return _plugin_registry;
  }
^ _plugin_registry returned is at !!!offset off by 8 bytes!!! to the location of _plugin_registry as constructed at
ceph_context.cc :

743:   _plugin_registry = new PluginRegistry(this);

This causes fatal error in
src/extblkdev/ExtBlkDevPlugin.cc :

227      auto registry = cct->get_plugin_registry();
228      std::lock_guard l(registry->lock);

Sometimes lock_guard hangs, sometimes lock_guard segfaults.

Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
@aclamk aclamk force-pushed the aclamk-jenkins-fix-make-check branch from f4f9808 to 39af0bc Compare August 5, 2025 12:43
@tchaikov
Contributor

tchaikov commented Aug 6, 2025

@aclamk is it possible to reproduce this issue? HAVE_BREAKPAD is defined in config.h, under what circumstances that we could have

  • two different config.h ?
  • a compilation unit which fails to include config.h, but it should have included it?

@tchaikov
Contributor

tchaikov commented Aug 6, 2025

in the commit message:

While it should happen,

might want to put "While it should not happen". but still, i'd like to understand why it happens.

in the title of the commit:

common/ceph_context.h: Jenkins builder fix: breakpad reserve space

i'd suggest putting: "common/ceph_context.h: reserve space for breakpad in CephContext"

@aclamk aclamk changed the title from "common/ceph_context.h: Jenkins builder fix: breakpad reserve space" to "common/ceph_context.h: reserve space for breakpad in CephContext" Aug 7, 2025
@aclamk
Contributor Author

aclamk commented Aug 7, 2025

@aclamk is it possible to reproduce this issue? HAVE_BREAKPAD is defined in config.h, under what circumstances that we could have

* two different `config.h` ?

* a compilation unit which fails to include `config.h`, but it should have included it?

@tchaikov
I have not been able to replicate the corrupted builds. I expect it is a case of not including config.h.
I tried to locate such a compilation unit by inserting #error into ceph_context.h for when HAVE_BREAKPAD is not defined.
But jenkins always compiled it without problems.
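
One way such a probe could look, as a minimal sketch. `SKETCH_HAVE_BREAKPAD` is a hypothetical macro defined inline here to stand in for what config.h would provide; the exact check used in the experiment is not shown in this thread.

```cpp
#include <cassert>

// Stand-in for the definition normally supplied by config.h (hypothetical name).
#define SKETCH_HAVE_BREAKPAD 1

// Probe placed at the top of the layout-sensitive header: any translation
// unit that reaches this point without the macro stops the build.
#ifndef SKETCH_HAVE_BREAKPAD
#error "layout-sensitive header included without the build configuration"
#endif

// Reaching this declaration means the probe did not fire.
constexpr bool sketch_guard_passed() { return true; }
```

A probe like this is only meaningful in builds where the feature is expected to be enabled everywhere; in a build that legitimately disables it, every translation unit would trip the `#error`.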

@neha-ojha neha-ojha added the core label Aug 18, 2025
@idryomov
Contributor

Above mentioned unittests are failing because
ceph_context.h :

@aclamk Just curious, if you weren't able to reproduce, how did you arrive at this conclusion from

208 - unittest_bluefs (Bus error)
209 - unittest_bluefs_ex (Failed)
211 - unittest_bdev (Bus error)

Is there a way to extract a core dump along with the matching binaries from Jenkins?

@dmick
Member

dmick commented Aug 18, 2025

It certainly makes me very nervous not to have a root cause. Who knows what will break next?

Contributor

@tchaikov tchaikov left a comment

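this issue is tracked by https://tracker.ceph.com/issues/71547, and i already have a pull request addressing it, see #64273.

recently, the test failure surfaced three times in a row when testing another pull request at #65075

  • https://jenkins.ceph.com/job/ceph-pull-requests/165332/
  • https://jenkins.ceph.com/job/ceph-pull-requests/165281/
  • https://jenkins.ceph.com/job/ceph-pull-requests/165325/

after including #64273 in #65075, the test result is now green again. see

  • https://jenkins.ceph.com/job/ceph-pull-requests/165334/

based on the observation above, i'm inclined to reject this change.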

@idryomov
Contributor

recently, the test failure surfaced three times in a row when testing another pull request at #65075

* https://jenkins.ceph.com/job/ceph-pull-requests/165332/

* https://jenkins.ceph.com/job/ceph-pull-requests/165281/

* https://jenkins.ceph.com/job/ceph-pull-requests/165325/

after including #64273 in #65075, the test result is now green again. see

* https://jenkins.ceph.com/job/ceph-pull-requests/165334/

All of these jobs are x86, but #64273 talks exclusively about arm64. Are you sure the green job isn't just a coincidence?

@aclamk
Contributor Author

aclamk commented Aug 26, 2025

@idryomov

Just curious, if you weren't able to reproduce, how did you arrive at this conclusion from

I logged in to a specific jenkins host that was doing "make check" and had hung on one of the unittests, and did:

  1. reran the binary; it hung again
  2. opened a gdb session and compared the offsets of members read by direct access to ceph_context with the offsets retrieved by accessor functions

Obviously, this was only possible when the jenkins runner had at least one of those unit tests hung; otherwise I wasn't able to log in in time.

@idryomov
Contributor

I logged in to a specific jenkins host that was doing "make check"

Ah, I didn't realize this was allowed/possible. Thanks for the explanation!

@aclamk
Contributor Author

aclamk commented Aug 27, 2025

@idryomov

Ah, I didn't realize this was allowed/possible. Thanks for the explanation!

Not really, one needs blessing (and access) from David Galloway.

@cbodley
Contributor

cbodley commented Aug 27, 2025

The noticeable artefacts of this misbehaviour are:
208 - unittest_bluefs (Bus error)
209 - unittest_bluefs_ex (Failed)
211 - unittest_bdev (Bus error)

these load libceph-common.so as a shared library. so if something is installing ceph system packages, it's possible that the (pre-breakpad) system version gets loaded instead. but surely that would break lots of other tests - any theories why only these targets seem to be affected?

@djgalloway
Contributor

The noticeable artefacts of this misbehaviour are:
208 - unittest_bluefs (Bus error)
209 - unittest_bluefs_ex (Failed)
211 - unittest_bdev (Bus error)

these load libceph-common.so as a shared library. so if something is installing ceph system packages

Like ceph-libboost*?

@cbodley
Contributor

cbodley commented Aug 27, 2025

Like ceph-libboost*?

@djgalloway i was thinking https://packages.ubuntu.com/jammy/main/ceph-common, which provides a libceph-common.so from the quincy release long before the WITH_BREAKPAD stuff was added

@dmick
Member

dmick commented Aug 27, 2025

The installed package version and size/etc. of libcommon can certainly be logged

@aclamk
Contributor Author

aclamk commented Aug 29, 2025

@cbodley

these load libceph-common.so as a shared library. so if something is installing ceph system packages, it's possible that the (pre-breakpad) system version gets loaded instead. but surely that would break lots of other tests - any theories why only these targets seem to be affected?

I remember I checked shared libraries. I think I would have noticed linking against system ceph libraries.
We can easily verify it when "make check" is running.

@aclamk
Contributor Author

aclamk commented Aug 29, 2025

@cbodley
PR #65303 failing on unittest_bluefs:

ldd ./bin/unittest_bluefs
	linux-vdso.so.1 (0x00007fff1b90b000)
	....
	libfuse.so.2 => /lib/x86_64-linux-gnu/libfuse.so.2 (0x00007f6b9d1b5000)
	libceph-common.so.2 => /home/jenkins-build/build/workspace/ceph-pull-requests/build/lib/libceph-common.so.2 (0x00007f6b9a966000)
	...
	libboost_thread.so.1.87.0 => /opt/ceph/lib/x86_64-linux-gnu/libboost_thread.so.1.87.0 (0x00007f6b9a7fc000)
	libboost_program_options.so.1.87.0 => /opt/ceph/lib/x86_64-linux-gnu/libboost_program_options.so.1.87.0 (0x00007f6b9a7b7000)
	libboost_date_time.so.1.87.0 => /opt/ceph/lib/x86_64-linux-gnu/libboost_date_time.so.1.87.0 (0x00007f6b9a7b2000)
	libboost_iostreams.so.1.87.0 => /opt/ceph/lib/x86_64-linux-gnu/libboost_iostreams.so.1.87.0 (0x00007f6b9a79a000)
	libboost_random.so.1.87.0 => /opt/ceph/lib/x86_64-linux-gnu/libboost_random.so.1.87.0 (0x00007f6b9a791000)
	libboost_system.so.1.87.0 => /opt/ceph/lib/x86_64-linux-gnu/libboost_system.so.1.87.0 (0x00007f6b9a78a000)
	libboost_regex.so.1.87.0 => /opt/ceph/lib/x86_64-linux-gnu/libboost_regex.so.1.87.0 (0x00007f6b9a740000)
	libblkid.so.1 => /lib/x86_64-linux-gnu/libblkid.so.1 (0x00007f6b9a709000)
	...

So it fails with the libceph-common.so that it compiled itself, not a system-installed one.

@lee-j-sanders
Member

This PR was tested as part of QA Run:
https://tracker.ceph.com/issues/72627

Unfortunately there were quite a few new failures as documented in the wiki here:
https://tracker.ceph.com/projects/rados/wiki/MAIN

New Issues raised:
8451410 - https://tracker.ceph.com/issues/72873 - rados/singleton-nomsgr - test_health_warnings.sh - PG 1.5 is not active+clean
8451391 - https://tracker.ceph.com/issues/72871 - rados/thrash-old-clients - [cephadm ERROR orchestrator._interface] Command timed out on host cephadm deploy (osd daemon) (default 900 second timeout)
8451624 - https://tracker.ceph.com/issues/72874 - rados/thrash-old-clients Stuck doing _try_send injecting socket failure and then nothing for 8 hours
8451462 - https://tracker.ceph.com/issues/72888 - rados/singleton-bluestore cluster [ERR] overall HEALTH_ERR 1 auth entities have invalid capabilities
8451467 - https://tracker.ceph.com/issues/72889 - rados/basic CephSQLiteTest.InsertBulk4096 hung for 8 hours
8451470 - https://tracker.ceph.com/issues/72890 - rados/thrash-old-clients rados/thrash-old-clients cluster create and lots of scrubs then timed out
8451510 - https://tracker.ceph.com/issues/72891 - rados/thrash-erasure-code-overwrites ceph pg dump hung for 2 minutes
8451583 - https://tracker.ceph.com/issues/72892 - rados/cephadm rm-cluster hung after mgr daemon was recovered

@rzarzynski @ljflores fyi

@lee-j-sanders
Member

lee-j-sanders commented Sep 10, 2025

New trackers analysed, new issues unrelated

Rados approved: https://tracker.ceph.com/projects/rados/wiki/MAIN#httpstrackercephcomissues72627

@yuriw
Contributor

yuriw commented Sep 10, 2025

@aclamk I can't merge it till @tchaikov comments resolved

@yuriw
Contributor

yuriw commented Sep 11, 2025

@aclamk pls merge at will when @tchaikov comments are resolved
ref: https://tracker.ceph.com/issues/72627

@aclamk aclamk dismissed tchaikov’s stale review September 12, 2025 12:07

The mentioned race on ARM64 cannot affect underlying offset mismatch.

@aclamk aclamk merged commit 0473c8f into ceph:main Sep 12, 2025
17 of 19 checks passed