rabbitmq_*_federation: Stop links during plugin stop #14054

Merged
dumbbell merged 1 commit into main from terminate-links-when-federation-plugins-stop on Jun 11, 2025

@dumbbell dumbbell commented Jun 10, 2025

Why

Links are started by the plugins but put under the rabbit supervision tree. Unfortunately, the federation plugins' own supervision trees are empty.

Links are stopped by a boot step executed by rabbit, as a consequence of unregistering the plugins' parameters.

Unfortunately, links can also be terminated if their channel, and implicitly the connection, stops. This happens when the amqp_client application stops.

We end up with a race here:

  • Because the federation plugins' supervision trees are empty and the application stop functions merely stop the pg group (which doesn't terminate the group members), nothing waits for the links to stop. Therefore, rabbit can stop amqp_client, which is a dependency of the federation plugins, and the links' underlying channels and connections are stopped.

  • rabbit unregisters the federation parameters, terminating the links. The exchange link's terminate/2 function needs the channel to delete the remote queue, but the channel and the underlying connection might already be gone.

When the second step loses the race, the failure is only logged as a badmatch exception:

[error] <0.884.0> Federation link could not create a disposable (one-off) channel due to an error error: {badmatch,
[error] <0.884.0>                                                                                         {error,
[error] <0.884.0>                                                                                          {noproc,
[error] <0.884.0>                                                                                           {gen_server,
[error] <0.884.0>                                                                                            call,
[error] <0.884.0>                                                                                            [<0.911.0>,
[error] <0.884.0>                                                                                             {command,
[error] <0.884.0>                                                                                              {open_channel,
[error] <0.884.0>                                                                                               none,
[error] <0.884.0>                                                                                               {amqp_selective_consumer,
[error] <0.884.0>                                                                                                []}}},
[error] <0.884.0>                                                                                             130000]}}}}

How

The solution is to make sure links are stopped as part of the stop of the plugins.

rabbit_federation_pg:stop_scope/1 is expanded to stop all members of all groups in this scope, before terminating the pg scope itself. The new code waits for the stopped processes to exit.
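
To illustrate the idea, here is a minimal sketch, not the code from this PR. The module name `pg_scope_stop_sketch`, the function `stop_members/2` and the timeout handling are assumptions made for the example; only the `pg` calls and the monitor/exit primitives are standard OTP.

```erlang
-module(pg_scope_stop_sketch).
-export([stop_members/2]).

%% Hypothetical helper: ask every local member of every group in Scope
%% to shut down, then wait for each of them to exit. Timeout (in ms) is
%% applied per process.
stop_members(Scope, Timeout) ->
    Pids = lists:usort(
             [Pid || Group <- pg:which_groups(Scope),
                     Pid <- pg:get_local_members(Scope, Group)]),
    %% Monitor before sending the exit signal so no 'DOWN' is missed.
    Refs = [{erlang:monitor(process, Pid), Pid} || Pid <- Pids],
    lists:foreach(fun({_Ref, Pid}) -> exit(Pid, shutdown) end, Refs),
    wait_for_exits(Refs, Timeout).

wait_for_exits([], _Timeout) ->
    ok;
wait_for_exits([{Ref, Pid} | Rest], Timeout) ->
    receive
        {'DOWN', Ref, process, Pid, _Reason} ->
            wait_for_exits(Rest, Timeout)
    after Timeout ->
        erlang:demonitor(Ref, [flush]),
        {error, {timeout, Pid}}
    end.
```

A member that traps exits receives the `shutdown` signal as a message instead of dying, so it has to stop itself for the wait to complete.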

We have to handle the EXIT signal in the link processes and change their restart type in their parent supervisor from permanent to transient. This ensures they are restarted only if they crash. This also avoids an error log message for each stopped link.
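
As a rough sketch of that combination, assuming a plain `gen_server` (the module `link_sketch` and its child spec are hypothetical and much simpler than the real link processes):

```erlang
-module(link_sketch).
-behaviour(gen_server).

-export([start_link/0, child_spec/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2]).

%% Hypothetical child spec: with `transient`, the supervisor restarts the
%% process only after an abnormal exit, not after `normal`, `shutdown`
%% or `{shutdown, Term}`.
child_spec() ->
    #{id => ?MODULE,
      start => {?MODULE, start_link, []},
      restart => transient,
      shutdown => 5000,
      type => worker}.

start_link() ->
    gen_server:start_link(?MODULE, [], []).

init([]) ->
    %% Trap exits so an exit signal arrives as a message and terminate/2
    %% gets a chance to run cleanup.
    process_flag(trap_exit, true),
    {ok, #{}}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% An exit signal from a process other than the parent shows up here;
%% one from the parent is handled by gen_server itself.
handle_info({'EXIT', _From, Reason}, State) ->
    {stop, Reason, State};
handle_info(_Info, State) ->
    {noreply, State}.

terminate(_Reason, _State) ->
    %% Cleanup would go here (for a real link: releasing remote resources).
    ok.
```

Stopping with reason `shutdown` counts as a clean exit, so a transient child is left stopped instead of being restarted.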

@dumbbell dumbbell requested review from dcorbacho and mkuratczyk June 10, 2025 12:25
@dumbbell dumbbell self-assigned this Jun 10, 2025
@dumbbell dumbbell force-pushed the terminate-links-when-federation-plugins-stop branch from bdf095c to 033ab45 Compare June 11, 2025 06:21
@dumbbell dumbbell marked this pull request as ready for review June 11, 2025 07:17
@dumbbell dumbbell merged commit f84828e into main Jun 11, 2025
564 of 565 checks passed
@dumbbell dumbbell deleted the terminate-links-when-federation-plugins-stop branch June 11, 2025 07:17
@michaelklishin michaelklishin added this to the 4.2.0 milestone Jun 11, 2025
ansd added a commit that referenced this pull request Jan 15, 2026
 ## What?

Federation links started in the federation plugins are put
under the `rabbit` app supervision tree (unfortunately).

This commit ensures that the entire federation supervision hierarchies
(including all federation links) are stopped **before** stopping app
`rabbit` when stopping RabbitMQ.

 ## Why?

Previously, we've seen cases where hundreds of federation links are
stopped during the shutdown procedure in app `rabbit` leading to
federation link restarts happening in parallel to vhosts being stopped.
In one case, the shutdown of app `rabbit` even got stuck (although there
is no evidence that federation was the problem).

Either way, the cleaner approach is to gracefully stop all federation
links, i.e. the entire supervision hierarchy under
`rabbit_exchange_federation_sup` and `rabbit_queue_federation_sup`
when stopping the federation apps, i.e. **before** proceeding to stop
app `rabbit`.

 ## How?

The boot step cleanup steps for the federation plugins are skipped when
stopping RabbitMQ.

Hence, this commit ensures that the supervisors are stopped in the
stop/1 application callback.

This commit does something similar to #14054
but uses a simpler approach.
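
A minimal sketch of stopping supervisors from a `stop/1` application callback, assuming they are children of some parent supervisor. The module `federation_app_sketch`, the helper `stop_link_sups/2` and its arguments are hypothetical; the actual commit may be structured differently.

```erlang
-module(federation_app_sketch).
-export([stop_link_sups/2]).

%% Hypothetical helper meant to be called from an application's stop/1
%% callback. ParentSup and ChildIds are assumptions; the real supervisor
%% names and layout may differ.
stop_link_sups(ParentSup, ChildIds) ->
    lists:foreach(
      fun(ChildId) ->
              %% terminate_child/2 is synchronous: it returns once the
              %% child (and, for a supervisor child, its subtree) stopped.
              case supervisor:terminate_child(ParentSup, ChildId) of
                  ok ->
                      %% Drop the child spec so it is not restarted later.
                      _ = supervisor:delete_child(ParentSup, ChildId),
                      ok;
                  {error, _Reason} ->
                      %% Unknown child (or unsupported parent): nothing to do.
                      ok
              end
      end, ChildIds).
```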
michaelklishin pushed a commit that referenced this pull request Jan 16, 2026

(cherry picked from commit 8bffa58)
mergify bot pushed a commit that referenced this pull request Jan 16, 2026

(cherry picked from commit 8bffa58)
(cherry picked from commit 512553e)

# Conflicts:
#	deps/rabbitmq_federation_common/src/rabbit_federation_pg.erl
michaelklishin pushed a commit that referenced this pull request Feb 24, 2026

(cherry picked from commit 8bffa58)