
Flush mailbox before exit to prevent OOM during crash report formatting #27

Merged
michaelklishin merged 1 commit into master from exit-oom on Mar 25, 2026
Conversation

@ansd
Member

@ansd ansd commented Mar 25, 2026


### What?

When a gen_batch_server process exits abnormally, proc_lib:exit_p/3
generates a crash report that includes the process mailbox contents
(via proc_lib:get_messages/1). The default OTP logger handler then
formats this crash report in the dying process itself
(logger_h_common:log_to_binary/2) using io_lib_pretty with potentially
unlimited depth and character limits.

For high-throughput gen_batch_server processes like the Ra WAL, the
mailbox can contain thousands of messages with large binary payloads
(e.g. AMQP message bodies). Pretty-printing these binaries byte-by-byte
via io_lib_pretty:print_length_binary causes unbounded memory growth,
observed at 10-25+ GB before the node runs out of memory.

The fix drains the mailbox after terminate/2 but before exit/1 or
erlang:raise/3 on every exit path. This way proc_lib's crash report
finds an empty mailbox, keeping the report small. The exit reason
semantics are preserved so supervisors still see the abnormal exit
and restart the process normally.
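The drain described above can be done with a selective receive and a zero timeout, so it never blocks. A minimal sketch of the idea (illustrative only, not the exact code in this PR):

```erlang
%% Illustrative sketch: drain every pending message without blocking.
%% Intended to run after terminate/2 but before exit/1 or
%% erlang:raise/3, so that proc_lib's crash report finds an empty
%% mailbox instead of thousands of large binaries.
flush_mailbox() ->
    receive
        _Msg ->
            %% Discard the message and keep draining.
            flush_mailbox()
    after 0 ->
        %% Mailbox is empty; return immediately.
        ok
    end.
```

Because the exit reason is computed before the drain and passed unchanged to `exit/1` or `erlang:raise/3`, supervisors still observe the same abnormal exit.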

### Repro steps

**Without** this commit, the following repro steps caused RabbitMQ nodes to
run out of memory. In tanzu-rabbitmq:
1. Put the `rabbit_chaos` module in `deps/rabbit/src/`
2. `make start-cluster NODES=3 ENABLED_PLUGINS="rabbitmq_jms_management" FULL=1`
3. Run the `omq` load generator:
```
omq amqp --uri amqp://localhost:5672/,amqp://localhost:5673/,amqp://localhost:5674/ \
                     --queues jms --queue-args 'x-selector-fields=*',x-queue-leader-locator=balanced \
                     -t /queues/jmsq-%d -T /queues/jmsq-%d -s '{{ randInt 1000 100000 }}' \
                     -x 10 -y 9 --consumer-credits 1000 \
                     --amqp-app-property 'color=red,blue,green,yellow' -c  1000 \
                     --amqp-jms-selector "color <> 'red'"
```
4. Start the chaos routine on each node:
```
for i in 1 2 3; do rabbitmqctl -n rabbit-$i eval 'rabbit_chaos:begin_jms_chaos().'; done
```

After 30-90 seconds, the memory of the `ra_jms_queues_log_wal` process grew
to several GB.

This process had the following stacktrace:
```
./sbin/rabbitmqctl -n rabbit-3 eval 'Pid = whereis(ra_jms_queues_log_wal), erlang:process_info(Pid, [current_function, current_location, current_stacktrace, initial_call, message_queue_len, registered_name, status,stack_size, backtrace]).'
```
shows the following:
```
[{current_function,{lists,reverse,2}},
 {current_location,{lists,reverse,2,[]}},
 {current_stacktrace,[{lists,reverse,1,[{file,"lists.erl"},{line,298}]},
                      {io_lib,write_binary,3,[{file,"io_lib.erl"},{line,860}]},
                      {io_lib_pretty,print_length_binary,7,
                                     [{file,"io_lib_pretty.erl"},{line,1139}]},
                      {io_lib_pretty,print_length_list1,7,
                                     [{file,"io_lib_pretty.erl"},{line,1058}]},
                      {io_lib_pretty,print_length_list1,7,
                                     [{file,"io_lib_pretty.erl"},{line,1059}]},
                      {io_lib_pretty,print_length_list,7,
                                     [{file,"io_lib_pretty.erl"},{line,1043}]},
                      {io_lib_pretty,print_length,7,
                                     [{file,"io_lib_pretty.erl"},
                                      {line,906}]}]},
 {initial_call,{proc_lib,init_p,5}},
 {message_queue_len,5479},
 {registered_name,ra_jms_queues_log_wal},
 {status,running},
 {stack_size,443},

...

 <long run of zeros elided>
 y(9)     Catch 0x000070608720a8e9 (logger_backend:call_handlers/3 + 665)

 0x0000705ce667de68 Return addr 0x0000706087248774 (proc_lib:exit_p/3 + 364)
 y(0)     [{gen_batch_server,handle_batch_result,3,[{file,"src/gen_batch_server.erl"},{line,424}]},
           {gen_batch_server,loop_batched,2,[{file,"src/gen_batch_server.erl"},{line,354}]},
           {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,333}]}]
 y(1)     chaos
 y(2)     exit

 0x0000705ce667de88 Return addr 0x0000706087001248 (<terminate process normally>)
```
@ansd ansd requested review from kjnilsson and mkuratczyk March 25, 2026 17:13
@michaelklishin michaelklishin added this to the 0.9.2 milestone Mar 25, 2026
@michaelklishin michaelklishin merged commit 972ae98 into master Mar 25, 2026
3 checks passed
@michaelklishin michaelklishin deleted the exit-oom branch March 25, 2026 23:41
@michaelklishin
Contributor

0.9.2

@michaelklishin
Contributor

FTR, as part of reviewing this change for 0.9.2, @the-mikedavis and I hypothesized that this very change might also be necessary when recovering the Ra log, but testing and further analysis suggest that it is not the case.

@michaelklishin
Contributor

Ra 3.1.2 now requires gen_batch_server 0.9.2, which includes this change, as the minimum.

jimsynz pushed a commit to jimsynz/neonfs that referenced this pull request Mar 29, 2026
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [ra](https://hex.pm/packages/ra) ([source](https://github.com/rabbitmq/ra)) | prod | patch | `3.1.0` → `3.1.2` |

---

### Release Notes

<details>
<summary>rabbitmq/ra (ra)</summary>

### [`v3.1.2`](https://github.com/rabbitmq/ra/releases/tag/v3.1.2)

[Compare Source](rabbitmq/ra@v3.1.1...v3.1.2)

#### What's Changed

- `gen_batch_server` was bumped to `0.9.2` to significantly reduce the risk of an [OOM scenario](rabbitmq/gen-batch-server#27) that affects Ra-based systems

**Full Changelog**: <rabbitmq/ra@v3.1.1...v3.1.2>

### [`v3.1.1`](https://github.com/rabbitmq/ra/releases/tag/v3.1.1)

[Compare Source](rabbitmq/ra@v3.1.0...v3.1.1)

#### What's Changed

- Use the Unicode translation modifier to log server IDs and cluster names by [@&#8203;the-mikedavis](https://github.com/the-mikedavis) in [#&#8203;599](rabbitmq/ra#599)
- Export `ra:membership()` type by [@&#8203;the-mikedavis](https://github.com/the-mikedavis) in [#&#8203;603](rabbitmq/ra#603)
- Do not clear pending if points to the checkpoint by [@&#8203;deadtrickster](https://github.com/deadtrickster) in [#&#8203;602](rabbitmq/ra#602)
- Fix doubly-wrapped log entries after sparse write and recovery by [@&#8203;ansd](https://github.com/ansd) in [#&#8203;601](rabbitmq/ra#601)

**Full Changelog**: <rabbitmq/ra@v3.1.0...v3.1.1>

</details>


Reviewed-on: https://harton.dev/project-neon/neonfs/pulls/84
Co-authored-by: Renovate Bot <bot@harton.nz>
Co-committed-by: Renovate Bot <bot@harton.nz>
