
Flush mailbox before exit to prevent OOM during crash report formatting #27

Merged
michaelklishin merged 1 commit into master from exit-oom on Mar 25, 2026
Conversation

@ansd
Member

@ansd ansd commented Mar 25, 2026


### What?

When a gen_batch_server process exits abnormally, proc_lib:exit_p/3
generates a crash report that includes the process mailbox contents
(via proc_lib:get_messages/1). The default OTP logger handler then
formats this crash report in the dying process itself
(logger_h_common:log_to_binary/2) using io_lib_pretty with potentially
unlimited depth and character limits.

For high-throughput gen_batch_server processes like the Ra WAL, the
mailbox can contain thousands of messages with large binary payloads
(e.g. AMQP message bodies). Pretty-printing these binaries byte-by-byte
via io_lib_pretty:print_length_binary causes unbounded memory growth,
observed at 10-25+ GB before the node runs out of memory.

The fix drains the mailbox after terminate/2 but before exit/1 or
erlang:raise/3 on every exit path. This way proc_lib's crash report
finds an empty mailbox, keeping the report small. The exit reason
semantics are preserved so supervisors still see the abnormal exit
and restart the process normally.
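The drain described above can be done with a selective receive and a zero timeout, so it never blocks. A minimal sketch of the idea (illustrative only, not the exact code in this PR):

```erlang
%% Illustrative sketch: drain every pending message without blocking.
%% Intended to run after terminate/2 but before exit/1 or
%% erlang:raise/3, so that proc_lib's crash report finds an empty
%% mailbox instead of thousands of large binaries.
flush_mailbox() ->
    receive
        _Msg ->
            %% Discard the message and keep draining.
            flush_mailbox()
    after 0 ->
        %% Mailbox is empty; return immediately.
        ok
    end.
```

Because the exit reason is computed before the drain and passed unchanged to `exit/1` or `erlang:raise/3`, supervisors still observe the same abnormal exit.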

### Repro steps

**Without** this commit, the following repro steps caused RabbitMQ nodes to
run out of memory. In tanzu-rabbitmq:
1. Put the `rabbit_chaos` module in `deps/rabbit/src/`
2. `make start-cluster NODES=3 ENABLED_PLUGINS="rabbitmq_jms_management" FULL=1`
3. Run the `omq` load generator:
```
omq amqp --uri amqp://localhost:5672/,amqp://localhost:5673/,amqp://localhost:5674/ \
                     --queues jms --queue-args 'x-selector-fields=*',x-queue-leader-locator=balanced \
                     -t /queues/jmsq-%d -T /queues/jmsq-%d -s '{{ randInt 1000 100000 }}' \
                     -x 10 -y 9 --consumer-credits 1000 \
                     --amqp-app-property 'color=red,blue,green,yellow' -c  1000 \
                     --amqp-jms-selector "color <> 'red'"
```
4. Start the chaos routine on each node:
```
for i in 1 2 3; do rabbitmqctl -n rabbit-$i eval 'rabbit_chaos:begin_jms_chaos().'; done
```

After 30-90 seconds, the memory of the `ra_jms_queues_log_wal` process grew
to several GB.

This process had the following stacktrace:
```
./sbin/rabbitmqctl -n rabbit-3 eval 'Pid = whereis(ra_jms_queues_log_wal), erlang:process_info(Pid, [current_function, current_location, current_stacktrace, initial_call, message_queue_len, registered_name, status,stack_size, backtrace]).'
```
shows the following:
```
[{current_function,{lists,reverse,2}},
 {current_location,{lists,reverse,2,[]}},
 {current_stacktrace,[{lists,reverse,1,[{file,"lists.erl"},{line,298}]},
                      {io_lib,write_binary,3,[{file,"io_lib.erl"},{line,860}]},
                      {io_lib_pretty,print_length_binary,7,
                                     [{file,"io_lib_pretty.erl"},{line,1139}]},
                      {io_lib_pretty,print_length_list1,7,
                                     [{file,"io_lib_pretty.erl"},{line,1058}]},
                      {io_lib_pretty,print_length_list1,7,
                                     [{file,"io_lib_pretty.erl"},{line,1059}]},
                      {io_lib_pretty,print_length_list,7,
                                     [{file,"io_lib_pretty.erl"},{line,1043}]},
                      {io_lib_pretty,print_length,7,
                                     [{file,"io_lib_pretty.erl"},
                                      {line,906}]}]},
 {initial_call,{proc_lib,init_p,5}},
 {message_queue_len,5479},
 {registered_name,ra_jms_queues_log_wal},
 {status,running},
 {stack_size,443},

...

 <long run of zeros elided>
 y(9)     Catch 0x000070608720a8e9 (logger_backend:call_handlers/3 + 665)

 0x0000705ce667de68 Return addr 0x0000706087248774 (proc_lib:exit_p/3 + 364)
 y(0)     [{gen_batch_server,handle_batch_result,3,[{file,"src/gen_batch_server.erl"},{line,424}]},
           {gen_batch_server,loop_batched,2,[{file,"src/gen_batch_server.erl"},{line,354}]},
           {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,333}]}]
 y(1)     chaos
 y(2)     exit

 0x0000705ce667de88 Return addr 0x0000706087001248 (<terminate process normally>)
```
@ansd ansd requested review from kjnilsson and mkuratczyk March 25, 2026 17:13
@michaelklishin michaelklishin added this to the 0.9.2 milestone Mar 25, 2026
@michaelklishin michaelklishin merged commit 972ae98 into master Mar 25, 2026
3 checks passed
@michaelklishin michaelklishin deleted the exit-oom branch March 25, 2026 23:41
@michaelklishin
Contributor

0.9.2

@michaelklishin
Contributor

FTR, as part of reviewing this change for 0.9.2, @the-mikedavis and I hypothesized that this very change might also be necessary when recovering the Ra log, but testing and further analysis suggest that it is not the case.

@michaelklishin
Contributor

Ra 3.1.2 now requires gen_batch_server 0.9.2, which includes this change, as the minimum.

jimsynz pushed a commit to jimsynz/neonfs that referenced this pull request Mar 29, 2026
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [ra](https://hex.pm/packages/ra) ([source](https://github.com/rabbitmq/ra)) | prod | patch | `3.1.0` → `3.1.2` |

---

### Release Notes

<details>
<summary>rabbitmq/ra (ra)</summary>

### [`v3.1.2`](https://github.com/rabbitmq/ra/releases/tag/v3.1.2)

[Compare Source](rabbitmq/ra@v3.1.1...v3.1.2)

#### What's Changed

- `gen_batch_server` was bumped to `0.9.2` to significantly reduce the risk of an [OOM scenario](rabbitmq/gen-batch-server#27) that affects Ra-based systems

**Full Changelog**: <rabbitmq/ra@v3.1.1...v3.1.2>

### [`v3.1.1`](https://github.com/rabbitmq/ra/releases/tag/v3.1.1)

[Compare Source](rabbitmq/ra@v3.1.0...v3.1.1)

#### What's Changed

- Use the Unicode translation modifier to log server IDs and cluster names by [@&#8203;the-mikedavis](https://github.com/the-mikedavis) in [#&#8203;599](rabbitmq/ra#599)
- Export `ra:membership()` type by [@&#8203;the-mikedavis](https://github.com/the-mikedavis) in [#&#8203;603](rabbitmq/ra#603)
- Do not clear pending if points to the checkpoint by [@&#8203;deadtrickster](https://github.com/deadtrickster) in [#&#8203;602](rabbitmq/ra#602)
- Fix doubly-wrapped log entries after sparse write and recovery by [@&#8203;ansd](https://github.com/ansd) in [#&#8203;601](rabbitmq/ra#601)

**Full Changelog**: <rabbitmq/ra@v3.1.0...v3.1.1>

</details>


Reviewed-on: https://harton.dev/project-neon/neonfs/pulls/84
Co-authored-by: Renovate Bot <bot@harton.nz>
Co-committed-by: Renovate Bot <bot@harton.nz>
