Fix federation supervisor crash during upgrade to 4.2.x on multi-node cluster #15252

Merged
michaelklishin merged 2 commits into rabbitmq:main from cloudamqp:fed_backward_compat_4_2 on Jan 13, 2026
@gomoripeti (Contributor) commented Jan 12, 2026

Proposed Changes

In a multi-node cluster, after a rolling upgrade from a version below 4.2
to 4.2, the supervisor rabbit_federation_exchange_link_sup_sup crashed because
rabbit_federation_link_sup:start_link had arity 1 through 4.1.x. The
mirrored supervisor preserves the child definitions, which still
include a call with arity 1 (without the link module).

To keep old child specs valid, add back a start_link/1 function in rabbit_federation_link_sup.
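
The idea can be illustrated with a minimal, self-contained sketch. This is not the code from the PR: the module name `link_sup_compat_sketch`, the helper `default_link_mod/0`, the atom `example_link_mod`, and the return values are all hypothetical stand-ins. It only shows how an arity-1 wrapper keeps a stored `{M, F, A}` tuple with one argument callable after the main entry point has moved to arity 2.

```erlang
%% Sketch only: a simplified stand-in for rabbit_federation_link_sup.
%% Module name, default_link_mod/0 and the return values are hypothetical.
-module(link_sup_compat_sketch).
-export([start_link/1, start_link/2, run_stored_mfa/1]).

%% Current entry point (arity 2): the link module is passed explicitly.
start_link(LinkMod, Resource) ->
    {ok, {started, LinkMod, Resource}}.

%% Legacy entry point (arity 1), kept so that child specs persisted by
%% nodes running an older version can still be started.
%% It fills in the link module and delegates to start_link/2.
start_link(Resource) ->
    start_link(default_link_mod(), Resource).

%% Hypothetical default; a real module would select the appropriate
%% link module for the resource.
default_link_mod() ->
    example_link_mod.

%% A supervisor starts a child by applying the {M, F, A} tuple stored in
%% its child spec. If start_link/1 did not exist, applying an old spec
%% with a single argument would fail with undef.
run_stored_mfa({M, F, A}) ->
    apply(M, F, A).
```

In this sketch, `link_sup_compat_sketch:run_stored_mfa({link_sup_compat_sketch, start_link, [my_exchange]})` goes through the arity-1 wrapper and returns `{ok, {started, example_link_mod, my_exchange}}`.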

Fixes #15239

Run the test with

SECONDARY_DIST=$PWD/secondary/rabbitmq_server-4.1.7 make -C deps/rabbitmq_exchange_federation ct-exchange t=rolling_upgrade:child_id_format

Without the patch the test case rolling_upgrade:child_id_format fails with:

=== Location: [{erpc,call,1366},
              {exchange_SUITE,'-child_id_format/1-fun-5-',675},
              {lists,foreach_1,2310},
              {exchange_SUITE,child_id_format,670},
              {test_server,ts_tc,1794},
              {test_server,run_test_case_eval1,1303},
              {test_server,run_test_case_eval,1235}]
=== === Reason: {exception,
                     {noproc,
                         {gen_server,call,
                             [rabbit_federation_exchange_link_sup_sup,
                              which_children,infinity]}}}

@gomoripeti force-pushed the fed_backward_compat_4_2 branch from 532ae82 to 4eca000 on January 12, 2026 22:05
@gomoripeti marked this pull request as ready for review on January 12, 2026 22:07
@gomoripeti (Contributor, Author)

The fix commit makes sense on main as well, since the legacy child spec can be preserved indefinitely across rolling upgrades to future RabbitMQ versions.

However, the part of the test case that enables the rabbitmq_federation plugin only makes sense if the secondary version is below 4.2 (i.e. on the v4.2.x branch).

Should I manually create two PRs: one for v4.2.x with the test enabling the rabbitmq_federation plugin, and another for main without this line in the test?

@michaelklishin (Collaborator)

@gomoripeti sure, that works for me.

@michaelklishin modified the milestones: 4.2.3, 4.3.0 on Jan 12, 2026
@michaelklishin added a commit that referenced this pull request on Jan 13, 2026
@michaelklishin merged commit 4eca000 into rabbitmq:main on Jan 13, 2026
1141 of 1147 checks passed
@michaelklishin (Collaborator)

@gomoripeti note that I have updated a comment in 04c39b3.

Please submit a new PR for v4.2.x. We have until EOD Pacific time tomorrow to include it in the upcoming 4.2.3 release.

Thank you.

@gomoripeti (Contributor, Author)

Ah, your comment change is enlightening.

So in short: the current PR can be automatically backported as is to 4.2.

In long:

At the start of the test case, this is what the plugins look like on the new nodes:

Listing plugins with pattern ".*" ...
 Configured: E = explicitly enabled; e = implicitly enabled
 | Status: * = running on rmq-ct-rolling_upgrade-5-21216@localhost
 |/
[E*] rabbitmq_exchange_federation 4.2.0+beta.4.202.g61afe63.dirty
[E*] rabbitmq_federation_common   4.2.0+beta.3.17.g24e6825

On the old nodes, if the secondary is 4.1.x, the plugins look like this:

Listing plugins with pattern "fed" ...
WARNING - plugins currently enabled but missing: rabbitmq_exchange_federation

 Configured: E = explicitly enabled; e = implicitly enabled
 | Status: * = running on rmq-ct-rolling_upgrade-2-21054@localhost
 |/
[  ] rabbitmq_federation            4.1.7
[  ] rabbitmq_federation_management 4.1.7
[  ] rabbitmq_federation_prometheus 4.1.7

The plugins enable/disable commands don't work on the old nodes because of the missing plugin:

Enabling plugins on node rmq-ct-rolling_upgrade-2-21054@localhost:
rabbitmq_federation
Error:
{:plugins_not_found, [:rabbitmq_exchange_federation]}

That is why I had to use plugins set, and this is why the rabbitmq_federation plugin needs to be enabled when the secondary is 4.1.x (i.e. on the v4.2.x branch).

On the other hand, on main, where the secondary is 4.2.x, the plugins look like this on the old nodes at the beginning of the test case:

Listing plugins with pattern "fed" ...
 Configured: E = explicitly enabled; e = implicitly enabled
 | Status: * = running on rmq-ct-rolling_upgrade-2-21054@localhost
 |/
[E*] rabbitmq_exchange_federation   4.2.2
[  ] rabbitmq_federation            4.2.2
[e*] rabbitmq_federation_common     4.2.2
[  ] rabbitmq_federation_management 4.2.2
[  ] rabbitmq_federation_prometheus 4.2.2
[  ] rabbitmq_queue_federation      4.2.2

So it is not necessary to enable the rabbitmq_federation plugin there (that's why I thought two different PRs were necessary).
But it is harmless, as the plugin is available (that is what I did not realise).

Note that it is not possible to enable rabbitmq_federation on the new nodes, as that plugin is not available when RabbitMQ is started from the tests of rabbitmq_exchange_federation:

Listing plugins with pattern "fed" ...
 Configured: E = explicitly enabled; e = implicitly enabled
 | Status: * = running on rmq-ct-rolling_upgrade-3-21108@localhost
 |/
[E*] rabbitmq_exchange_federation 4.2.0+beta.4.268.g532ae82
[E*] rabbitmq_federation_common   4.2.0+beta.4.268.gea04171.dirty

@michaelklishin (Collaborator)

@Mergifyio backport v4.2.x

@mergify (bot) commented Jan 13, 2026

backport v4.2.x

✅ Backports have been created


@michaelklishin (Collaborator)

rabbitmq_federation is available on new nodes. It's an umbrella plugin that enables two "new" (separate) federation plugins.

michaelklishin added a commit that referenced this pull request Jan 13, 2026
Fix federation supervisor crash during upgrade to 4.2.x on multi-node cluster (backport #15252)
michaelklishin added a commit that referenced this pull request Feb 24, 2026