Flush mailbox before exit to prevent OOM during crash report formatting#27
Merged
michaelklishin merged 1 commit intomasterfrom Mar 25, 2026
Merged
Flush mailbox before exit to prevent OOM during crash report formatting#27michaelklishin merged 1 commit intomasterfrom
michaelklishin merged 1 commit intomasterfrom
Conversation
### What?
When a gen_batch_server process exits abnormally, proc_lib:exit_p/3
generates a crash report that includes the process mailbox contents
(via proc_lib:get_messages/1). The default OTP logger handler then
formats this crash report in the dying process itself
(logger_h_common:log_to_binary/2) using io_lib_pretty with potentially
unlimited depth and character limits.
For high-throughput gen_batch_server processes like the Ra WAL, the
mailbox can contain thousands of messages with large binary payloads
(e.g. AMQP message bodies). Pretty-printing these binaries byte-by-byte
via io_lib_pretty:print_length_binary causes unbounded memory growth,
observed at 10-25+ GB before the node runs out of memory.
The fix drains the mailbox after terminate/2 but before exit/1 or
erlang:raise/3 on every exit path. This way proc_lib's crash report
finds an empty mailbox, keeping the report small. The exit reason
semantics are preserved so supervisors still see the abnormal exit
and restart the process normally.
### Repro steps
**Without** this commit, the following repro steps caused RabbitMQ nodes to
run out of memory. In tanzu-rabbitmq:
1. put `rabbit_chaos` module in `deps/rabbit/src/`
2. `make start-cluster NODES=3 ENABLED_PLUGINS="rabbitmq_jms_management" FULL=1`
3.
```
omq amqp --uri amqp://localhost:5672/,amqp://localhost:5673/,amqp://localhost:5674/ \
--queues jms --queue-args 'x-selector-fields=*',x-queue-leader-locator=balanced \
-t /queues/jmsq-%d -T /queues/jmsq-%d -s '{{ randInt 1000 100000 }}' \
-x 10 -y 9 --consumer-credits 1000 \
--amqp-app-property 'color=red,blue,green,yellow' -c 1000 \
--amqp-jms-selector "color <> 'red'"
```
4.
```
for i in 1 2 3; do rabbitmqctl -n rabbit-$i eval 'rabbit_chaos:begin_jms_chaos().'; done
```
After 30-90 seconds, process memory of `ra_jms_queues_log_wal` grew huge
(several GBs).
This process had the following stacktrace:
```
./sbin/rabbitmqctl -n rabbit-3 eval 'Pid = whereis(ra_jms_queues_log_wal), erlang:process_info(Pid, [current_function, current_location, current_stacktrace, initial_call, message_queue_len, registered_name, status,stack_size, backtrace]).'
```
shows the following:
```
[{current_function,{lists,reverse,2}},
{current_location,{lists,reverse,2,[]}},
{current_stacktrace,[{lists,reverse,1,[{file,"lists.erl"},{line,298}]},
{io_lib,write_binary,3,[{file,"io_lib.erl"},{line,860}]},
{io_lib_pretty,print_length_binary,7,
[{file,"io_lib_pretty.erl"},{line,1139}]},
{io_lib_pretty,print_length_list1,7,
[{file,"io_lib_pretty.erl"},{line,1058}]},
{io_lib_pretty,print_length_list1,7,
[{file,"io_lib_pretty.erl"},{line,1059}]},
{io_lib_pretty,print_length_list,7,
[{file,"io_lib_pretty.erl"},{line,1043}]},
{io_lib_pretty,print_length,7,
[{file,"io_lib_pretty.erl"},
{line,906}]}]},
{initial_call,{proc_lib,init_p,5}},
{message_queue_len,5479},
{registered_name,ra_jms_queues_log_wal},
{status,running},
{stack_size,443},
...
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0... \ny(9) Catch 0x000070608720a8e9 (logger_backend:call_handlers/3 + 665)\n\n0x0000705ce667de68 Return addr 0x0000706087248774 (proc_lib:exit_p/3 + 364)\ny(0) [{gen_batch_server,handle_batch_result,3,[{file,\"src/gen_batch_server.erl\"},{line,424}]},{gen_batch_server,loop_batched,2,[{file,\"src/gen_batch_server.erl\"},{line,354}]},{proc_lib,init_p_do_apply,3,[{file,\"proc_lib.erl\"},{line,333}]}]\ny(1) chaos\ny(2) exit\n\n0x0000705ce667de88 Return addr 0x0000706087001248 (<terminate process normally>)\n">>}]
```
michaelklishin
approved these changes
Mar 25, 2026
the-mikedavis
approved these changes
Mar 25, 2026
Contributor
Contributor
|
FTR, as part of reviewing this change for |
Contributor
|
Ra |
jimsynz
pushed a commit
to jimsynz/neonfs
that referenced
this pull request
Mar 29, 2026
This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [ra](https://hex.pm/packages/ra) ([source](https://github.com/rabbitmq/ra)) | prod | patch | `3.1.0` → `3.1.2` | --- ### Release Notes <details> <summary>rabbitmq/ra (ra)</summary> ### [`v3.1.2`](https://github.com/rabbitmq/ra/releases/tag/v3.1.2) [Compare Source](rabbitmq/ra@v3.1.1...v3.1.2) #### What's Changed - `gen_batch_server` was bumped to `0.9.2` to significantly reduce the risk of an [OOM scenario](rabbitmq/gen-batch-server#27) that affects Ra-based systems **Full Changelog**: <rabbitmq/ra@v3.1.1...v3.1.2> ### [`v3.1.1`](https://github.com/rabbitmq/ra/releases/tag/v3.1.1) [Compare Source](rabbitmq/ra@v3.1.0...v3.1.1) #### What's Changed - Use the Unicode translation modifier to log server IDs and cluster names by [@​the-mikedavis](https://github.com/the-mikedavis) in [#​599](rabbitmq/ra#599) - Export `ra:membership()` type by [@​the-mikedavis](https://github.com/the-mikedavis) in [#​603](rabbitmq/ra#603) - Do not clear pending if points to the checkpoint by [@​deadtrickster](https://github.com/deadtrickster) in [#​602](rabbitmq/ra#602) - Fix doubly-wrapped log entries after sparse write and recovery by [@​ansd](https://github.com/ansd) in [#​601](rabbitmq/ra#601) **Full Changelog**: <rabbitmq/ra@v3.1.0...v3.1.1> </details> --- ### Configuration 📅 **Schedule**: Branch creation - Between 12:00 AM and 03:59 AM ( * 0-3 * * * ) in timezone Pacific/Auckland, Automerge - Between 12:00 AM and 03:59 AM ( * 0-3 * * * ) in timezone Pacific/Auckland. 🚦 **Automerge**: Enabled. ♻ **Rebasing**: Whenever PR is behind base branch, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiI0My45MC4xIiwidXBkYXRlZEluVmVyIjoiNDMuOTAuMSIsInRhcmdldEJyYW5jaCI6Im1haW4iLCJsYWJlbHMiOlsicmVub3ZhdGUiXX0=--> Reviewed-on: https://harton.dev/project-neon/neonfs/pulls/84 Co-authored-by: Renovate Bot <bot@harton.nz> Co-committed-by: Renovate Bot <bot@harton.nz>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What?
When a gen_batch_server process exits abnormally, proc_lib:exit_p/3 generates a crash report that includes the process mailbox contents (via proc_lib:get_messages/1). The default OTP logger handler then formats this crash report in the dying process itself (logger_h_common:log_to_binary/2) using io_lib_pretty with potentially unlimited depth and character limits.
For high-throughput gen_batch_server processes like the Ra WAL, the mailbox can contain thousands of messages with large binary payloads (e.g. AMQP message bodies). Pretty-printing these binaries byte-by-byte via io_lib_pretty:print_length_binary causes unbounded memory growth, observed at 10-25+ GB before the node runs out of memory.
The fix drains the mailbox after terminate/2 but before exit/1 or erlang:raise/3 on every exit path. This way proc_lib's crash report finds an empty mailbox, keeping the report small. The exit reason semantics are preserved so supervisors still see the abnormal exit and restart the process normally.
Repro steps
Without this commit, the following repro steps caused RabbitMQ nodes to run out of memory. In tanzu-rabbitmq:
rabbit_chaosmodule indeps/rabbit/src/make start-cluster NODES=3 ENABLED_PLUGINS="rabbitmq_jms_management" FULL=1After 30-90 seconds, process memory of
ra_jms_queues_log_walgrew huge (several GBs).This process had the following stacktrace:
shows the following: