Skip to content

Breakpad Crash Handler#61264

Merged
chardan merged 7 commits intoceph:mainfrom
irq0:wip/breakpad
Jun 2, 2025
Merged

Breakpad Crash Handler#61264
chardan merged 7 commits intoceph:mainfrom
irq0:wip/breakpad

Conversation

@irq0
Copy link
Member

@irq0 irq0 commented Jan 8, 2025

This adds Google Breakpad as another option to handle crashes.

In terms of crash dump fidelity it is close to coredumps+system metadata, but at a fraction of the size. Typically 10s of MB.

Why Breakpad?

  1. The generated minidumps are smaller, but still contain valuable data like all thread backtraces
  2. It is widely adopted by OpenTTD, Chromium, Firefox, and many more
  3. There is great tooling around processing minidumps ranging from Sentry to the https://github.com/rust-minidump suite of command line tools

Demo

https://asciinema.org/a/KMiIt0XKkgsE2s1kaRiuqMeUm

Demos the whole process from symbol file generation, crashing a daemon, to looking at the minidump.

Tools: dump_syms minidump-stackwalk
https://github.com/rust-minidump https://github.com/mozilla/dump_syms

References

https://chromium.googlesource.com/breakpad/breakpad/+/HEAD/docs/getting_started_with_breakpad.md

https://chromium.googlesource.com/breakpad/breakpad

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@irq0 irq0 requested a review from a team as a code owner January 8, 2025 13:43
@adamemerson
Copy link
Contributor

I think you have to put the word 'submodule' somewhere in the commit message where you update the submodule.

@adamemerson
Copy link
Contributor

jenkins test make check

@adamemerson
Copy link
Contributor

Overall I'm very much in favor of this once you get it compiling everywhere.

@irq0 irq0 mentioned this pull request Jan 10, 2025
14 tasks
@irq0
Copy link
Member Author

irq0 commented Jan 13, 2025

jenkins test make check

@irq0 irq0 requested a review from a team as a code owner January 13, 2025 15:02
@irq0
Copy link
Member Author

irq0 commented Jan 14, 2025

jenkins test make check

@irq0
Copy link
Member Author

irq0 commented Jan 16, 2025

Overall I'm very much in favor of this once you get it compiling everywhere.

Tests just passed 🥳. The last push added a build option and ifdefs.
The default is set to not build on windows.

@adamemerson
Copy link
Contributor

@mattbenjamin @chardan I like this idea and pulling in a better crash handler was something I'd been thinking of off and on. Would you take a look?

Copy link
Contributor

@chardan chardan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a useful facility to have! Thanks for your work on it. LGTM.

#ifdef HAVE_BREAKPAD
if (g_conf()->breakpad) {
google_breakpad::MinidumpDescriptor descriptor(g_conf()->crash_dir);
g_ceph_context->_ex_handler.reset(
Copy link
Contributor

@chardan chardan Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, but also consider:
_ex_handler = std::make_unique<google_breakpad::ExceptionHandler>(...your args...);

@chardan
Copy link
Contributor

chardan commented Jan 17, 2025

One question I do have: should this have any documentation? An update to -D options listed somewhere?

@irq0
Copy link
Member Author

irq0 commented Jan 20, 2025

One question I do have: should this have any documentation? An update to -D options listed somewhere?

What kind of docs do you have in mind?

I could do a writeup of the demo linked above. This is however just one way to use the feature. A Sentry-based workflow would look quite different.

@chardan
Copy link
Contributor

chardan commented Jan 20, 2025

@irq0 I don't think we want to repeat the docs that come with the tool itself so much, so much as the install guide in README.md if there seemed to be an appropriate section to mention it. I didn't have anything too extensive in mind; and, lots of things are AFAIK currently discovered by looking through the build scripts.

@chardan
Copy link
Contributor

chardan commented Jan 20, 2025

@adamemerson @mattbenjamin Any other thoughts?

@chardan
Copy link
Contributor

chardan commented Jan 22, 2025

@irq0 I'm trying to get this into the test lab, but there are some technical issues. Thanks for your patience!

@irq0
Copy link
Member Author

irq0 commented Jan 23, 2025

@irq0 I'm trying to get this into the test lab, but there are some technical issues. Thanks for your patience!

Sure, let me know if I can help.

@chardan
Copy link
Contributor

chardan commented Jan 28, 2025

@adamemerson
Copy link
Contributor

Some of them are just vanilla failures.

g++ -DHAVE_CONFIG_H -I. -I./src  -I./src   -Wmissing-braces -Wnon-virtual-dtor -Woverloaded-virtual -Wreorder -Wsign-compare -Wunused-local-typedefs -Wunused-variable -Wvla -Werror -fPIC -O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong   -m64 -march=x86-64-v2 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -MT src/common/string_conversion.o -MD -MP -MF $depbase.Tpo -c -o src/common/string_conversion.o src/common/string_conversion.cc &&\
mv -f $depbase.Tpo $depbase.Po
src/common/string_conversion.cc: In function ‘google_breakpad::UTF32ToUTF16Char(wchar_t, unsigned short*)’:
src/common/string_conversion.cc:107:16: error: array subscript 2 is outside array bounds of ‘wchar_t[1]’ [-Werror=array-bounds]
  107 |   const UTF32* source_end_ptr = source_ptr + 1;
      |                ^~~~~~~~~~~~~~
src/common/string_conversion.cc:105:31: note: while referencing ‘in’
  105 | void UTF32ToUTF16Char(wchar_t in, uint16_t out[2]) {
      |                       ~~~~~~~~^~
cc1plus: all warnings being treated as errors

Marcel Lauhoff and others added 6 commits May 9, 2025 17:06
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@clyso.com>
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@clyso.com>
Add option WITH_BREAKPAD defaulting to TRUE on non-windows platforms.

Signed-off-by: Marcel Lauhoff <marcel.lauhoff@clyso.com>
Only build breakpad code if we build with breakpad support. Print
warning if breakpad enabled in config, but not built.

Signed-off-by: Marcel Lauhoff <marcel.lauhoff@clyso.com>
Explicitly disable incompatible `-Warray-bounds` and
`-Wmaybe-uninitialized`. These aren't normally enabled as
compile-failing errors, except when building RPMs on CentOS9.

Signed-off-by: Marcel Lauhoff <marcel.lauhoff@clyso.com>
Test that with enabled Breakpad crash handling Ceph creates correct
minidumps.

Signed-off-by: Marcel Lauhoff <marcel.lauhoff@clyso.com>
@chardan
Copy link
Contributor

chardan commented May 13, 2025

jenkins test api

@irq0
Copy link
Member Author

irq0 commented Jun 2, 2025

Is this good to merge?

@chardan
Copy link
Contributor

chardan commented Jun 2, 2025

@irq0 This looks good to me at this point; if you'd like to add a Death Test, I should point out that Catch2 is now supported, so you can do that if you would like to! (See cmake/modules/AddCephTest.cmake and test/rgw/CMakeLists.txt for examples).

@irq0
Copy link
Member Author

irq0 commented Jun 2, 2025

@irq0 This looks good to me at this point; if you'd like to add a Death Test, I should point out that Catch2 is now supported, so you can do that if you would like to!

The last push added the death test using gtest :)

@chardan chardan merged commit 5611721 into ceph:main Jun 2, 2025
12 checks passed
@chardan
Copy link
Contributor

chardan commented Jun 2, 2025

@irq0 Merged! Thank you for all of your work on this, and also for your considerable patience!

@irq0
Copy link
Member Author

irq0 commented Jun 2, 2025

@irq0 Merged! Thank you for all of your work on this, and also for your considerable patience!

Thanks! Any chance to get this into tentacle?

@chardan
Copy link
Contributor

chardan commented Jun 2, 2025

I don't think so, unfortunately.

@chardan chardan mentioned this pull request Jun 2, 2025
14 tasks
@cbodley
Copy link
Contributor

cbodley commented Jun 4, 2025

ubuntu builds on https://shaman.ceph.com/builds/ceph/main/ have been failing for a couple days with:

In file included from /build/ceph-20.3.0-762-gc71d7eeb/src/common/ceph_context.h:38,
                 from /build/ceph-20.3.0-762-gc71d7eeb/src/common/dout.h:29,
                 from /build/ceph-20.3.0-762-gc71d7eeb/src/java/native/libcephfs_jni.cc:33:
/build/ceph-20.3.0-762-gc71d7eeb/src/breakpad/src/client/linux/handler/exception_handler.h:39:10: fatal error: client/linux/crash_generation/crash_generation_client.h: No such file or directory
   39 | #include "client/linux/crash_generation/crash_generation_client.h"
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.

example logs:
https://jenkins.ceph.com/job/ceph-dev-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=jammy,DIST=jammy,MACHINE_SIZE=gigantic/57721//consoleFull
https://jenkins.ceph.com/job/ceph-dev-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=jammy,DIST=jammy,MACHINE_SIZE=gigantic/57737//consoleFull

@cbodley
Copy link
Contributor

cbodley commented Jun 4, 2025

@chardan
Copy link
Contributor

chardan commented Jun 4, 2025

@cbodley This is my fault, I didn't notice that and shouldn't have merged-- apologies for the confusion here, I'll work on being more thorough when the merge button is involved!

@ivancich
Copy link
Member

ivancich commented Jun 4, 2025

@chardan and @cbodley : Given that there were modifications to global_init.cc, shouldn't this have gone through teuthology prior to merge also? (I don't even know what suites are used for common code.)

@cbodley
Copy link
Contributor

cbodley commented Jun 4, 2025

@chardan and @cbodley : Given that there were modifications to global_init.cc, shouldn't this have gone through teuthology prior to merge also?

agreed, yes

I don't even know what suites are used for common code.

the rados suite is a good default for common changes - you can add the 'core' and 'needs-qa' labels to get picked up for testing/merging

@athanatos
Copy link
Contributor

athanatos commented Jun 4, 2025

This PR re-added an extra line to .gitmodules.

	shallow = true

This breaks the nvmeof submodule on fresh checkouts generally.

athanatos added a commit to athanatos/ceph that referenced this pull request Jun 4, 2025
ceph#61264 reintroduced
https://tracker.ceph.com/issues/67640 fixed by
383091e.

Setting shallow=true for the nvmeof/gateway submodule
is problematic because the ceph.git submodule sha1
is only very rarely the head sha1 of the default
branch.

Fixes: https://tracker.ceph.com/issues/71568
Signed-off-by: Samuel Just <sjust@redhat.com>
@ivancich
Copy link
Member

ivancich commented Jun 5, 2025

@chardan Did you see comment about re: need for teuthology run in this case?

@chardan
Copy link
Contributor

chardan commented Jun 5, 2025

@ivancich I've submitted it to shaman for build, then will run through teuthology. Might take a bit, things are backed up...

aainscow pushed a commit to aainscow/ceph that referenced this pull request Jun 10, 2025
ceph#61264 reintroduced
https://tracker.ceph.com/issues/67640 fixed by
383091e.

Setting shallow=true for the nvmeof/gateway submodule
is problematic because the ceph.git submodule sha1
is only very rarely the head sha1 of the default
branch.

Fixes: https://tracker.ceph.com/issues/71568
Signed-off-by: Samuel Just <sjust@redhat.com>
samarahu pushed a commit to samarahu/ceph that referenced this pull request Jun 25, 2025
ceph#61264 reintroduced
https://tracker.ceph.com/issues/67640 fixed by
383091e.

Setting shallow=true for the nvmeof/gateway submodule
is problematic because the ceph.git submodule sha1
is only very rarely the head sha1 of the default
branch.

Fixes: https://tracker.ceph.com/issues/71568
Signed-off-by: Samuel Just <sjust@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants