
Prevent federation links from restarting during node shutdown#15258

Merged
michaelklishin merged 4 commits into main from mk-federation-shutdown-link-reconnection-guard
Jan 14, 2026


@michaelklishin michaelklishin commented Jan 14, 2026

This applies to plugin shutdown as well.

With this guardrail in place, nodes with hundreds or thousands of federation links will avoid potentially significant shutdown delays caused by links being restarted while the node as a whole is preparing to shut down.

The guard state is node-local, as is the shutdown state itself, so it will not prevent links that migrate between nodes (under mirrored_supervisor) from starting.

Per discussion with @dcorbacho @ansd.

Note that this PR cannot be backported exactly to v4.1.x and earlier branches. The federation plugin split in main first shipped in 4.2.0.


@ansd ansd left a comment


I tried this PR out using a single node as follows:

make run-broker
./sbin/rabbitmq-plugins enable rabbitmq_exchange_federation
./sbin/rabbitmqctl set_parameter federation-upstream origin '{"uri":"amqp://localhost:5672"}'
./sbin/rabbitmqctl set_policy exchange-federation "^amq.direct" '{"federation-upstream-set":"all"}' --priority 10 --apply-to exchanges
./sbin/rabbitmqctl stop

Stopping RabbitMQ as above errors:

2026-01-14 09:14:38.488679+01:00 [info] <0.860.0> RabbitMQ is asked to stop...
2026-01-14 09:14:38.514449+01:00 [info] <0.860.0> Stopping RabbitMQ applications and their dependencies in the following order:
2026-01-14 09:14:38.514449+01:00 [info] <0.860.0>     rabbitmq_exchange_federation
2026-01-14 09:14:38.514449+01:00 [info] <0.860.0>     rabbitmq_management
2026-01-14 09:14:38.514449+01:00 [info] <0.860.0>     rabbitmq_management_agent
2026-01-14 09:14:38.514449+01:00 [info] <0.860.0>     rabbitmq_web_dispatch
2026-01-14 09:14:38.514449+01:00 [info] <0.860.0>     rabbitmq_federation_common
2026-01-14 09:14:38.514449+01:00 [info] <0.860.0>     rabbit
2026-01-14 09:14:38.514449+01:00 [info] <0.860.0>     khepri
2026-01-14 09:14:38.514449+01:00 [info] <0.860.0>     ra
2026-01-14 09:14:38.514449+01:00 [info] <0.860.0>     cowboy
2026-01-14 09:14:38.514449+01:00 [info] <0.860.0>     oauth2_client
2026-01-14 09:14:38.514449+01:00 [info] <0.860.0>     sysmon_handler
2026-01-14 09:14:38.514449+01:00 [info] <0.860.0>     rabbitmq_prelaunch
2026-01-14 09:14:38.514449+01:00 [info] <0.860.0>     osiris
2026-01-14 09:14:38.514449+01:00 [info] <0.860.0>     amqp_client
2026-01-14 09:14:38.514449+01:00 [info] <0.860.0>     rabbit_common
2026-01-14 09:14:38.514449+01:00 [info] <0.860.0>     jose
2026-01-14 09:14:38.514449+01:00 [info] <0.860.0>     os_mon
2026-01-14 09:14:38.514449+01:00 [info] <0.860.0>     mnesia
2026-01-14 09:14:38.514449+01:00 [info] <0.860.0>
2026-01-14 09:14:38.514536+01:00 [info] <0.860.0> Stopping application 'rabbitmq_exchange_federation'
2026-01-14 09:14:43.515924+01:00 [error] <0.672.0>     application_master: shutdown_error
2026-01-14 09:14:43.515924+01:00 [error] <0.672.0>     rabbit_exchange_federation_app: {prep_stop,[[]]}
2026-01-14 09:14:43.515924+01:00 [error] <0.672.0>     error_info: {timeout,
2026-01-14 09:14:43.515924+01:00 [error] <0.672.0>                     {gen_server,call,
2026-01-14 09:14:43.515924+01:00 [error] <0.672.0>                         [application_controller,
2026-01-14 09:14:43.515924+01:00 [error] <0.672.0>                          {set_env,rabbitmq_federation_common,shutting_down,
2026-01-14 09:14:43.515924+01:00 [error] <0.672.0>                              true,[]}]}}
2026-01-14 09:14:43.516567+01:00 [debug] <0.672.0> Stopping pg scope rabbitmq_exchange_federation_pg_scope
2026-01-14 09:14:43.519106+01:00 [alert] <0.672.0> Member <0.697.0> stopped: normal
2026-01-14 09:14:43.519291+01:00 [info] <0.724.0> closing AMQP connection (127.0.0.1:61862 -> 127.0.0.1:5672 - Federation link (upstream: origin, policy: exchange-federation), vhost: '/', user: 'guest', duration: '1M, 2s')
2026-01-14 09:14:43.520981+01:00 [notice] <0.45.0> Application rabbitmq_exchange_federation exited with reason: stopped
2026-01-14 09:14:43.521082+01:00 [info] <0.860.0> Stopping application 'rabbitmq_management'
2026-01-14 09:14:43.523838+01:00 [warning] <0.474.0> HTTP listener registry could not find context rabbitmq_management_tls
2026-01-14 09:14:43.525296+01:00 [notice] <0.45.0> Application rabbitmq_management exited with reason: stopped
2026-01-14 09:14:43.525471+01:00 [info] <0.860.0> Stopping application 'rabbitmq_management_agent'
2026-01-14 09:14:43.527990+01:00 [notice] <0.45.0> Application rabbitmq_management_agent exited with reason: stopped
2026-01-14 09:14:43.528065+01:00 [info] <0.860.0> Stopping application 'rabbitmq_web_dispatch'
2026-01-14 09:14:43.529537+01:00 [notice] <0.45.0> Application rabbitmq_web_dispatch exited with reason: stopped
2026-01-14 09:14:43.529602+01:00 [info] <0.860.0> Stopping application 'rabbitmq_federation_common'
2026-01-14 09:14:43.530772+01:00 [notice] <0.45.0> Application rabbitmq_federation_common exited with reason: stopped
2026-01-14 09:14:43.530803+01:00 [info] <0.860.0> Stopping application 'rabbit'
2026-01-14 09:14:43.530847+01:00 [debug] <0.217.0> Change boot state to `stopping`

This avoids a classic deadlock in Erlang: when
the application_controller (AC) invokes a callback,
such as prep_stop/1, the function invoked cannot
use any OTP functions that would ultimately require
an AC response, because the AC is blocked waiting
for that very callback to return.

application:set_env/3 is one such function
(it is a gen_server call to the AC, as the timeout above shows),
so with this commit we switch to a persistent term.
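
A minimal sketch of the idea, not the code from this PR: the module name, the key, and the `maybe_restart_link/1` helper below are all hypothetical. It only illustrates that `persistent_term:put/2` and `persistent_term:get/2` do not go through the application_controller, so a flag set this way can be written from `prep_stop/1` and read by links before they reconnect.

```erlang
-module(fed_shutdown_guard).
-export([mark_shutting_down/0, reset/0, is_shutting_down/0,
         maybe_restart_link/1]).

%% Hypothetical key; the key used by the real plugin is not shown in this PR.
-define(KEY, {?MODULE, shutting_down}).

%% Intended to be called from an application's prep_stop/1 callback.
%% persistent_term:put/2 does not send a message to application_controller,
%% so it cannot deadlock against the process that invoked the callback.
mark_shutting_down() ->
    persistent_term:put(?KEY, true).

%% Clears the flag, e.g. when the application starts again.
reset() ->
    _ = persistent_term:erase(?KEY),
    ok.

%% Node-local read; defaults to false when the flag was never set.
is_shutting_down() ->
    persistent_term:get(?KEY, false).

%% RestartFun is a zero-arity fun that would (re)start a link.
%% It is only invoked when the node is not shutting down.
maybe_restart_link(RestartFun) when is_function(RestartFun, 0) ->
    case is_shutting_down() of
        true  -> {skipped, shutting_down};
        false -> {restarted, RestartFun()}
    end.
```

Persistent terms are stored per node, which matches the node-local behaviour described in the PR description. Updating or erasing an existing persistent term can trigger a global garbage collection pass, which is acceptable for a flag that changes only at application start and stop.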

I have re-created the branch to make GitHub pick up on commit a8347b2.

@michaelklishin michaelklishin merged commit ff4efe9 into main Jan 14, 2026
574 of 575 checks passed
@michaelklishin michaelklishin deleted the mk-federation-shutdown-link-reconnection-guard branch January 14, 2026 18:35
mergify bot pushed a commit that referenced this pull request Jan 14, 2026
(cherry picked from commit 80c8d7b)
michaelklishin added a commit that referenced this pull request Jan 14, 2026
This change guards against significant shutdown delays in nodes
managing hundreds or thousands of federation links that would
otherwise restart while the node prepares to shut down.

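Uses a persistent term in rabbit_federation_app_state to avoid
a classic Erlang deadlock: when the application_controller
invokes a callback such as prep_stop/1, that callback must not
call functions (application:set_env, for example) that require
an application_controller response.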

Also makes forget_binding/2 more defensive by handling the case
where a binding key is not found in the map.

Backport of #15258 to v4.1.x.
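
The defensive lookup mentioned above can be illustrated with a small sketch. This is not the plugin's `forget_binding/2`: the module, the function name `forget_key/2`, and the map shape (key to list of bindings) are assumptions made for illustration. The point is only that `maps:take/2` returns `error` for a missing key, which lets the caller handle that case instead of crashing.

```erlang
-module(binding_map_sketch).
-export([forget_key/2]).

%% Bindings is assumed to be a map of Key => [Binding] (hypothetical shape).
%% Returns {RemovedBindings, RemainingMap}.
forget_key(Key, Bindings) when is_map(Bindings) ->
    case maps:take(Key, Bindings) of
        {Removed, Rest} ->
            %% Key was present: hand back what was removed and the smaller map.
            {Removed, Rest};
        error ->
            %% Key was not found: nothing to forget, keep the map unchanged.
            {[], Bindings}
    end.
```

A lookup with `maps:get/2` would raise `{badkey, Key}` in the second case, which is the kind of crash a more defensive version avoids.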
michaelklishin added a commit that referenced this pull request Jan 14, 2026
Prevent federation links from restarting during node shutdown (backport #15258)
michaelklishin added a commit that referenced this pull request Feb 24, 2026