Conversation
6ec88a2 to
03564cb
Compare
03564cb to
d046a5c
Compare
I'll push the revert as a separate PR and address the comments here.
This should be reproducible on a vstart cluster with --crimson ( |
d046a5c to
8db56f6
Compare
| if (WIFSIGNALED(status)) { | ||
| int signal = WTERMSIG(status); | ||
| std::cerr << "get_early_config: child signaled " << strsignal(signal) << std::endl; | ||
| return tl::unexpected(-signal); |
There was a problem hiding this comment.
we are using exit(-ret.error()) in get_early_config(). so signal would be return as the status code of the crimson process, can we instead follow the UNIX convention by returning the status instead, so that the toolings like systemd can understand why a process exits?
There was a problem hiding this comment.
also, would be better to explain why we need this change, and how it's connected with ASan.
There was a problem hiding this comment.
we are using
exit(-ret.error())inget_early_config(). sosignalwould be return as the status code of the crimson process, can we instead follow the UNIX convention by returning thestatusinstead, so that the toolings like systemd can understand why a process exits?
@Matan-B hi Matan, could you please reconsider this comment?
There was a problem hiding this comment.
@tchaikov, please help me understand this case better, iiuc waitpid does not return a plain exit code.
It stores an encoded status value that must be interpreted using the provided macros (e.g WIFSIGNALED).
From the Linux man page:
If status is not NULL, wait() and waitpid() store status information in the int to which it points.
This integer can be inspected with the following macros
So returning the raw status value is not meaningful if we didn't decode it.
IIRC, tools like systemd can understand if the program exited normally or was killed by a signal (such as this case).
There was a problem hiding this comment.
@Matan-B You're absolutely correct that returning the raw status value wouldn't work - it's encoded and needs the macros to decode.
However, looking at the flow: when the child is killed by signal 11, you return tl::unexpected(11), which becomes the exit status of crimson-osd via main.cc:96. The issue is that exit status 11 doesn't clearly indicate "killed by signal 11" to external tools.
Good news: Ceph already has a standard pattern for this! Check out
Line 280 in a76c01c
ceph/src/common/fork_function.h
Line 45 in a76c01c
This follows the shell convention where signal termination is reported as 128 + signal_number. So for SIGSEGV (11), the exit status would be 139.
Suggestion: Change to return tl::unexpected(128 + signal) to match the existing Ceph convention and make the exit status meaningful to scripts and monitoring tools.
context-impl is moved into the per-module jam file in 1.87 which resulted in the following error when trying to set the context-impl: ``` Boost/tools/build/src/build/feature.jam:327: in validate-feature from module feature error: unknown feature "<context-impl> ``` context-impl setting was brought back in 1.90 boostorg/context@12ac945 context-impl is only set in our cmake files if WITH_ASAN is on. Hence, FTBFS, were not common so far. Signed-off-by: Matan Breizman <mbreizma@redhat.com>
on vstart+asan:
```
==764060==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 11696 byte(s) in 20 object(s) allocated from:
#0 0x7f8402cedcab in malloc (/usr/lib64/libasan.so.8+0xedcab) (BuildId: 84eaebdb799a67003ebf64e4287c60bd45cf5b03)
#1 0x7f840291ff54 in _PyObject_GC_NewVar (/lib64/libpython3.9.so.1.0+0x11ff54) (BuildId: d5e7f95ca612f1fb71c0a7a1fed9c4c11b007bda)
Direct leak of 5892 byte(s) in 8 object(s) allocated from:
#0 0x7f8402cedcab in malloc (/usr/lib64/libasan.so.8+0xedcab) (BuildId: 84eaebdb799a67003ebf64e4287c60bd45cf5b03)
#1 0x7f840291aa22 in unicode_decode_utf8 (/lib64/libpython3.9.so.1.0+0x11aa22) (BuildId: d5e7f95ca612f1fb71c0a7a1fed9c4c11b007bda)
Direct leak of 3872 byte(s) in 7 object(s) allocated from:
#0 0x7f8402cedcab in malloc (/usr/lib64/libasan.so.8+0xedcab) (BuildId: 84eaebdb799a67003ebf64e4287c60bd45cf5b03)
#1 0x7f84029df4e4 in _PyObject_Malloc.part.0 (/lib64/libpython3.9.so.1.0+0x1df4e4) (BuildId: d5e7f95ca612f1fb71c0a7a1fed9c4c11b007bda)
Indirect leak of 3032 byte(s) in 4 object(s) allocated from:
#0 0x7f8402cedcab in malloc (/usr/lib64/libasan.so.8+0xedcab) (BuildId: 84eaebdb799a67003ebf64e4287c60bd45cf5b03)
#1 0x7f840291ff54 in _PyObject_GC_NewVar (/lib64/libpython3.9.so.1.0+0x11ff54) (BuildId: d5e7f95ca612f1fb71c0a7a1fed9c4c11b007bda)
Indirect leak of 1776 byte(s) in 3 object(s) allocated from:
#0 0x7f8402cedcab in malloc (/usr/lib64/libasan.so.8+0xedcab) (BuildId: 84eaebdb799a67003ebf64e4287c60bd45cf5b03)
#1 0x7f84029df4e4 in _PyObject_Malloc.part.0 (/lib64/libpython3.9.so.1.0+0x1df4e4) (BuildId: d5e7f95ca612f1fb71c0a7a1fed9c4c11b007bda)
```
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
child Without this dedicated check, the bootstrap process would proceed to decoding the buffer from the child proccess - and only then would fail. This would result in a possibly misleading "failed to decode" instead of "child signaled". See the following example where ASAN caught a SIGSEV in the child process: ``` Jan 20 09:27:16 ceph-node-0 ceph-e818662e-f5e1-11f0-b263-525400908ba7-osd-1[12300]: AddressSanitizer:DEADLYSIGNAL Jan 20 09:27:16 ceph-node-0 ceph-e818662e-f5e1-11f0-b263-525400908ba7-osd-1[12300]: ================================================================= Jan 20 09:27:17 ceph-node-0 ceph-e818662e-f5e1-11f0-b263-525400908ba7-osd-1[12300]: SUMMARY: AddressSanitizer: stack-overflow (/usr/bin/ceph-osd-crimson+0x46e7a72) (BuildId: 2a86043f51c9be9cb19801e276fb3ee36239556a) in get_global_> Jan 20 09:27:17 ceph-node-0 ceph-e818662e-f5e1-11f0-b263-525400908ba7-osd-1[12300]: ==3==ABORTING Jan 20 09:27:17 ceph-node-0 ceph-e818662e-f5e1-11f0-b263-525400908ba7-osd-1[12300]: get_early_config: parent failed to decode Jan 20 09:27:17 ceph-node-0 ceph-e818662e-f5e1-11f0-b263-525400908ba7-osd-1[12300]: do_early_config returned error: -22 Jan 20 09:27:19 ceph-node-0 ceph-e818662e-f5e1-11f0-b263-525400908ba7-osd-1[12300]: --------------------------------------- ``` Signed-off-by: Matan Breizman <mbreizma@redhat.com>
8db56f6 to
ca60f8a
Compare
Related fixes that helped with #67020
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an
xbetween the brackets:[x]. Spaces and capitalization matter when checking off items this way.Checklist
Show available Jenkins commands
jenkins test classic perfJenkins Job | Jenkins Job Definitionjenkins test crimson perfJenkins Job | Jenkins Job Definitionjenkins test signedJenkins Job | Jenkins Job Definitionjenkins test make checkJenkins Job | Jenkins Job Definitionjenkins test make check arm64Jenkins Job | Jenkins Job Definitionjenkins test submodulesJenkins Job | Jenkins Job Definitionjenkins test dashboardJenkins Job | Jenkins Job Definitionjenkins test dashboard cephadmJenkins Job | Jenkins Job Definitionjenkins test apiJenkins Job | Jenkins Job Definitionjenkins test docsReadTheDocs | Github Workflow Definitionjenkins test ceph-volume allJenkins Jobs | Jenkins Jobs Definitionjenkins test windowsJenkins Job | Jenkins Job Definitionjenkins test rook e2eJenkins Job | Jenkins Job DefinitionYou must only issue one Jenkins command per-comment. Jenkins does not understand
comments with more than one command.