Skip to content

cmake,qa,crimson: misc ASAN usage fixes#66994

Open
Matan-B wants to merge 3 commits intoceph:mainfrom
Matan-B:wip-matanb-crimson-asan-fixes
Open

cmake,qa,crimson: misc ASAN usage fixes#66994
Matan-B wants to merge 3 commits intoceph:mainfrom
Matan-B:wip-matanb-crimson-asan-fixes

Conversation

@Matan-B
Copy link
Contributor

@Matan-B Matan-B commented Jan 20, 2026

Related fixes that helped with #67020

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands

You must only issue one Jenkins command per-comment. Jenkins does not understand
comments with more than one command.

@Matan-B Matan-B requested review from a team as code owners January 20, 2026 13:41
@Matan-B Matan-B requested a review from shraddhaag January 20, 2026 13:41
@Matan-B Matan-B requested a review from tchaikov January 20, 2026 13:41
@Matan-B Matan-B force-pushed the wip-matanb-crimson-asan-fixes branch from 6ec88a2 to 03564cb Compare January 20, 2026 13:47
@Matan-B Matan-B requested review from athanatos and ronen-fr January 20, 2026 13:58
@Matan-B Matan-B force-pushed the wip-matanb-crimson-asan-fixes branch from 03564cb to d046a5c Compare January 20, 2026 14:01
@Matan-B Matan-B marked this pull request as draft January 20, 2026 17:13
Copy link
Contributor

@tchaikov tchaikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd suggest split this pull request into smaller ones. i am fine reverting 1ab0a8c -- i will take a closer look to see how to address it.

@Matan-B do you happen to know how to reproduce this?

@Matan-B
Copy link
Contributor Author

Matan-B commented Jan 21, 2026

i'd suggest split this pull request into smaller ones. i am fine reverting 1ab0a8c -- i will take a closer look to see how to address it.

I'll push the revert as a separate PR and address the comments here.

@Matan-B do you happen to know how to reproduce this?

This should be reproducible on a vstart cluster with --crimson (can confirm later See: https://tracker.ceph.com/issues/74481) - the backtrace is not from a vstart cluster. In case in helps, the problematic user was _get_early_config.

@Matan-B Matan-B force-pushed the wip-matanb-crimson-asan-fixes branch from d046a5c to 8db56f6 Compare January 21, 2026 12:38
@Matan-B Matan-B changed the title crimson/osd: Fix stack-overflow on deploy cmake,qa,crimson: misc ASAN usage fixes Jan 21, 2026
@Matan-B Matan-B requested a review from tchaikov January 21, 2026 12:41
@Matan-B Matan-B marked this pull request as ready for review January 21, 2026 12:41
if (WIFSIGNALED(status)) {
int signal = WTERMSIG(status);
std::cerr << "get_early_config: child signaled " << strsignal(signal) << std::endl;
return tl::unexpected(-signal);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are using exit(-ret.error()) in get_early_config(). so signal would be return as the status code of the crimson process, can we instead follow the UNIX convention by returning the status instead, so that the toolings like systemd can understand why a process exits?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, would be better to explain why we need this change, and how it's connected with ASan.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are using exit(-ret.error()) in get_early_config(). so signal would be return as the status code of the crimson process, can we instead follow the UNIX convention by returning the status instead, so that the toolings like systemd can understand why a process exits?

@Matan-B hi Matan, could you please reconsider this comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tchaikov, please help me understand this case better, iiuc waitpid does not return a plain exit code.
It stores an encoded status value that must be interpreted using the provided macros (e.g WIFSIGNALED).
From the Linux man page:

If status is not NULL, wait() and waitpid() store status information in the int to which it points.
This integer can be inspected with the following macros 

So returning the raw status value is not meaningful if we didn't decode it.
IIRC, tools like systemd can understand if the program exited normally or was killed by a signal (such as this case).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Matan-B You're absolutely correct that returning the raw status value wouldn't work - it's encoded and needs the macros to decode.

However, looking at the flow: when the child is killed by signal 11, you return tl::unexpected(11), which becomes the exit status of crimson-osd via main.cc:96. The issue is that exit status 11 doesn't clearly indicate "killed by signal 11" to external tools.

Good news: Ceph already has a standard pattern for this! Check out

return 128 + WTERMSIG(status);
and
return 128 + WTERMSIG(status);

This follows the shell convention where signal termination is reported as 128 + signal_number. So for SIGSEGV (11), the exit status would be 139.

Suggestion: Change to return tl::unexpected(128 + signal) to match the existing Ceph convention and make the exit status meaningful to scripts and monitoring tools.

context-impl is moved into the per-module jam file in 1.87 which
resulted in the following error when trying to set the context-impl:

```
Boost/tools/build/src/build/feature.jam:327: in validate-feature from
module feature
error: unknown feature "<context-impl>
```

context-impl setting was brought back in 1.90 boostorg/context@12ac945

context-impl is only set in our cmake files if WITH_ASAN is on. Hence,
FTBFS, were not common so far.

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
on vstart+asan:

```
==764060==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 11696 byte(s) in 20 object(s) allocated from:
    #0 0x7f8402cedcab in malloc (/usr/lib64/libasan.so.8+0xedcab) (BuildId: 84eaebdb799a67003ebf64e4287c60bd45cf5b03)
    #1 0x7f840291ff54 in _PyObject_GC_NewVar (/lib64/libpython3.9.so.1.0+0x11ff54) (BuildId: d5e7f95ca612f1fb71c0a7a1fed9c4c11b007bda)

Direct leak of 5892 byte(s) in 8 object(s) allocated from:
    #0 0x7f8402cedcab in malloc (/usr/lib64/libasan.so.8+0xedcab) (BuildId: 84eaebdb799a67003ebf64e4287c60bd45cf5b03)
    #1 0x7f840291aa22 in unicode_decode_utf8 (/lib64/libpython3.9.so.1.0+0x11aa22) (BuildId: d5e7f95ca612f1fb71c0a7a1fed9c4c11b007bda)

Direct leak of 3872 byte(s) in 7 object(s) allocated from:
    #0 0x7f8402cedcab in malloc (/usr/lib64/libasan.so.8+0xedcab) (BuildId: 84eaebdb799a67003ebf64e4287c60bd45cf5b03)
    #1 0x7f84029df4e4 in _PyObject_Malloc.part.0 (/lib64/libpython3.9.so.1.0+0x1df4e4) (BuildId: d5e7f95ca612f1fb71c0a7a1fed9c4c11b007bda)

Indirect leak of 3032 byte(s) in 4 object(s) allocated from:
    #0 0x7f8402cedcab in malloc (/usr/lib64/libasan.so.8+0xedcab) (BuildId: 84eaebdb799a67003ebf64e4287c60bd45cf5b03)
    #1 0x7f840291ff54 in _PyObject_GC_NewVar (/lib64/libpython3.9.so.1.0+0x11ff54) (BuildId: d5e7f95ca612f1fb71c0a7a1fed9c4c11b007bda)

Indirect leak of 1776 byte(s) in 3 object(s) allocated from:
    #0 0x7f8402cedcab in malloc (/usr/lib64/libasan.so.8+0xedcab) (BuildId: 84eaebdb799a67003ebf64e4287c60bd45cf5b03)
    #1 0x7f84029df4e4 in _PyObject_Malloc.part.0 (/lib64/libpython3.9.so.1.0+0x1df4e4) (BuildId: d5e7f95ca612f1fb71c0a7a1fed9c4c11b007bda)
```

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
child

Without this dedicated check, the bootstrap process would proceed to
decoding the buffer from the child proccess - and only then would fail.

This would result in a possibly misleading "failed to decode" instead of
"child signaled". See the following example where ASAN caught a SIGSEV in the child process:

```
Jan 20 09:27:16 ceph-node-0 ceph-e818662e-f5e1-11f0-b263-525400908ba7-osd-1[12300]: AddressSanitizer:DEADLYSIGNAL
Jan 20 09:27:16 ceph-node-0 ceph-e818662e-f5e1-11f0-b263-525400908ba7-osd-1[12300]: =================================================================
Jan 20 09:27:17 ceph-node-0 ceph-e818662e-f5e1-11f0-b263-525400908ba7-osd-1[12300]: SUMMARY: AddressSanitizer: stack-overflow (/usr/bin/ceph-osd-crimson+0x46e7a72) (BuildId: 2a86043f51c9be9cb19801e276fb3ee36239556a) in get_global_>
Jan 20 09:27:17 ceph-node-0 ceph-e818662e-f5e1-11f0-b263-525400908ba7-osd-1[12300]: ==3==ABORTING
Jan 20 09:27:17 ceph-node-0 ceph-e818662e-f5e1-11f0-b263-525400908ba7-osd-1[12300]: get_early_config: parent failed to decode
Jan 20 09:27:17 ceph-node-0 ceph-e818662e-f5e1-11f0-b263-525400908ba7-osd-1[12300]: do_early_config returned error: -22
Jan 20 09:27:19 ceph-node-0 ceph-e818662e-f5e1-11f0-b263-525400908ba7-osd-1[12300]: ---------------------------------------
```

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
@Matan-B Matan-B force-pushed the wip-matanb-crimson-asan-fixes branch from 8db56f6 to ca60f8a Compare January 26, 2026 08:00
@Matan-B Matan-B requested a review from tchaikov January 26, 2026 08:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants