AWS peer discovery: filter EC2 instances by state by lukebakken · Pull Request #15388 · rabbitmq/rabbitmq-server

lukebakken · 2026-02-03T19:41:31Z

Summary

Adds instance state filtering to the AWS peer discovery plugin to prevent dead or dying EC2 instances from being included in cluster membership during node joins. This addresses a race condition where a new node joining the cluster could sync dead nodes from existing members before a removal command completes.

Problem

When using AWS peer discovery with AutoScaling or EC2 tags, the ec2:DescribeInstances API returns instances in all states, including stopping, stopped, shutting-down, and terminated. During instance replacement, this can cause a race condition:

1. Old instance enters stopping or shutting-down state
2. New instance starts and calls list_nodes()
3. DescribeInstances returns both old (dying) and new (running) instances
4. New node joins cluster with dead node in membership
5. Removal command issued later fails because dead node was already re-added

This race was observed in practice: a node removal failed because the new node's Mnesia schema sync re-added the dead node 318ms before the removal command was issued.

Solution

Adds a new aws_ec2_instance_states configuration option that filters EC2 instances by state during peer discovery. The default value is ["running", "pending"], which excludes dead or dying instances while including operational and starting instances.

Configuration

Via rabbitmq.conf:

cluster_formation.aws.ec2_instance_states.1 = running
cluster_formation.aws.ec2_instance_states.2 = pending

Via environment variable:

AWS_EC2_INSTANCE_STATES="running,pending"

Via advanced.config:

[{rabbit, [
   {cluster_formation, [
       {peer_discovery_aws, [
           {aws_ec2_instance_states, [running, pending]}
       ]}
   ]}
]}].

Implementation Details

Adds Filter.N.Name=instance-state-name with multiple values to ec2:DescribeInstances API calls
Applies to both AutoScaling mode (get_hostname_by_instance_ids/2) and tag-based mode (get_hostname_by_tags/1)
Validates configured states against AWS EC2 valid states, filtering out invalid values with a warning
Normalizes both atom and string inputs for consistent handling
Filter format verified against AWS EC2 API specification and tested with live AWS API
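
To make the filter format concrete, here is a minimal Python sketch of how a single multi-value filter could be laid out as query parameters. It is an illustration only, not the plugin's Erlang code; the function name `instance_state_filter_params` and the `first_index` parameter are hypothetical.

```python
def instance_state_filter_params(states, first_index=1):
    """Return (name, value) query parameters for one instance-state-name filter.

    `first_index` is the N in Filter.N.*; it lets the state filter follow
    any filters (for example tag filters) that were added earlier.
    """
    if not states:
        # An empty list means "do not filter by state" (the old behavior).
        return []
    params = [(f"Filter.{first_index}.Name", "instance-state-name")]
    for n, state in enumerate(states, start=1):
        params.append((f"Filter.{first_index}.Value.{n}", state))
    return params


# The documented default, placed after one hypothetical earlier filter.
for name, value in instance_state_filter_params(["running", "pending"], first_index=2):
    print(f"{name}={value}")
```

One filter with several values matches instances in any of the listed states, which is why a single `Filter.N` entry is enough for the default of running and pending.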

Testing

Unit tests for filter building logic
Integration tests with mocked EC2 API calls for both discovery modes
Validation tests for both string and atom inputs
Configuration schema tests for rabbitmq.conf parsing

Backward Compatibility

Default behavior changes to filter by running or pending state (safer default)
Users can configure empty list [] to disable filtering (restore old behavior)
Existing configurations without this setting get the safe default

Adds a new configuration option to filter EC2 instances by state during peer discovery. This prevents dead or dying instances from being included in cluster membership during node joins.

The configuration accepts a list of instance state names to include in discovery results. The default value is `["running", "pending"]`, which excludes instances in `stopping`, `stopped`, `shutting-down`, and `terminated` states.

Configuration can be set via `rabbitmq.conf`:

cluster_formation.aws.ec2_instance_states.1 = running
cluster_formation.aws.ec2_instance_states.2 = pending

Or via environment variable:

AWS_EC2_INSTANCE_STATES="running,pending"

The schema file includes cuttlefish mappings to support the `rabbitmq.conf` syntax, and test snippets verify the configuration parsing works correctly for both single and multiple state values.

Implements instance state filtering for both AutoScaling and tag-based discovery modes. The `ec2:DescribeInstances` API calls now include filters based on the `aws_ec2_instance_states` configuration.

The implementation adds `maybe_add_instance_state_filters/2` which checks the configuration and conditionally applies state filters. When states are configured, `add_instance_state_filters/3` builds a single filter with multiple values in the format:

Filter.N.Name=instance-state-name
Filter.N.Value.1=running
Filter.N.Value.2=pending

This format matches the AWS EC2 API specification and was verified with the AWS CLI.

The default configuration filters to `running` and `pending` instances, excluding dead or dying instances (`stopping`, `stopped`, `shutting-down`, `terminated`).

Both `get_hostname_by_instance_ids/2` (AutoScaling mode) and `get_hostname_by_tags/1` (tag-based mode) apply the state filters consistently. The filters are applied after tag filters and before the final query string is built.

A unit test verifies the filter format is correct and matches the expected AWS API parameter structure.

Adds tests for `get_hostname_by_instance_ids/2` and `get_hostname_by_tags/1` to verify instance state filters are correctly applied in both AutoScaling and tag-based discovery modes.

The tests use `meck` to mock `rabbitmq_aws:api_get_request/2`, avoiding real EC2 API calls while verifying the query string format. Each test confirms that:

- Instance state filters are present in the request path
- Filter format matches AWS API specification
- Tag filters and state filters work together correctly
- Hostname extraction from mock responses works as expected

The `get_hostname_by_tags_with_state_filter` test checks for URL-encoded tag filters (`tag%3Aservice`) since the query string is URL-encoded before the API call.

A helper function `mock_describe_instances_response/0` provides properly formatted EC2 API response data for testing.
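
The testing approach described above (mock the API call, then inspect the request path) can be sketched in Python with `unittest.mock` standing in for `meck`. This is a rough analogue under stated assumptions, not the actual Erlang test suite: `describe_instances_path`, `hostnames_by_tags`, the tag value `rabbitmq`, and the simplified response shape are all hypothetical.

```python
from unittest.mock import Mock
from urllib.parse import urlencode


def describe_instances_path(tags, states):
    """Build a query path: tag filters first, then one instance-state filter."""
    params = [("Action", "DescribeInstances")]
    index = 0
    for index, (key, value) in enumerate(sorted(tags.items()), start=1):
        params.append((f"Filter.{index}.Name", f"tag:{key}"))
        params.append((f"Filter.{index}.Value.1", value))
    if states:  # an empty list adds no state filter
        index += 1
        params.append((f"Filter.{index}.Name", "instance-state-name"))
        for n, state in enumerate(states, start=1):
            params.append((f"Filter.{index}.Value.{n}", state))
    # urlencode percent-encodes the ":" in "tag:service" as %3A.
    return "/?" + urlencode(params)


def hostnames_by_tags(api_get, tags, states):
    """Call the injected API function and extract hostnames from its response.

    The response here is a simplified list of dicts, not real EC2 XML.
    """
    response = api_get(describe_instances_path(tags, states))
    return [item["privateDnsName"] for item in response]


# The mock replaces the real API call, so no network traffic happens.
api_get = Mock(return_value=[{"privateDnsName": "ip-10-0-0-1.ec2.internal"}])
hosts = hostnames_by_tags(api_get, {"service": "rabbitmq"}, ["running", "pending"])
requested_path = api_get.call_args[0][0]

assert "Filter.1.Name=tag%3Aservice" in requested_path
assert "Filter.2.Name=instance-state-name" in requested_path
```

Asserting on the encoded form (`tag%3Aservice`) rather than the raw `tag:service` mirrors the point made above: the check runs against the path as it is actually sent.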

Adds validation for the `aws_ec2_instance_states` configuration to ensure only valid EC2 instance state names are used. The validation filters out invalid states and logs a warning, allowing the node to start with the valid states.

The `validate_instance_states/1` function checks each configured state against the list of valid EC2 instance states defined in the `?VALID_EC2_INSTANCE_STATES` macro. Invalid states are discarded and logged as a warning.

The `normalize_state/1` function handles both atom and string inputs, converting atoms to strings for consistent handling. This supports configuration via `advanced.config` with atoms (`[running, pending]`) or via `rabbitmq.conf` with strings (`["running", "pending"]`).

Tests verify validation works correctly for:

- All valid states (strings)
- Mixed valid and invalid states (filters out invalid)
- Atom inputs (normalizes to strings)
- Mixed valid and invalid atoms
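
As a rough illustration of the keep-valid, warn-on-invalid behavior described above, here is a minimal Python sketch. It is not the plugin's Erlang implementation: the function body and log message are assumptions, and because Python has no atoms the normalization step is omitted.

```python
import logging

# The six instance state names defined by EC2.
VALID_EC2_INSTANCE_STATES = (
    "pending",
    "running",
    "shutting-down",
    "terminated",
    "stopping",
    "stopped",
)

log = logging.getLogger("peer_discovery_sketch")


def validate_instance_states(states):
    """Keep valid states in their configured order; warn about the rest."""
    valid = [s for s in states if s in VALID_EC2_INSTANCE_STATES]
    invalid = [s for s in states if s not in VALID_EC2_INSTANCE_STATES]
    if invalid:
        # Startup is not blocked: invalid entries are dropped with a warning.
        log.warning("Ignoring invalid EC2 instance states: %s", invalid)
    return valid


print(validate_instance_states(["running", "bogus", "pending"]))
```

Dropping invalid entries instead of failing means a typo in the configuration narrows the filter rather than preventing the node from starting.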

AWS peer discovery: filter EC2 instances by state (backport #15388)

lukebakken · 2026-02-04T00:56:39Z

Thank you @michaelklishin

lukebakken added 4 commits February 3, 2026 19:29

lukebakken requested review from michaelklishin and the-mikedavis February 3, 2026 19:41

lukebakken self-assigned this Feb 3, 2026

michaelklishin added backport-v4.2.x bug usability labels Feb 3, 2026

michaelklishin added this to the 4.3.0 milestone Feb 3, 2026

michaelklishin changed the title ~~Filter EC2 instances by state in AWS peer discovery~~ AWS peer discovery: filter EC2 instances by state Feb 3, 2026

the-mikedavis approved these changes Feb 3, 2026

michaelklishin approved these changes Feb 3, 2026

michaelklishin merged commit 00ac2ae into rabbitmq:main Feb 3, 2026
289 checks passed

mergify bot mentioned this pull request Feb 3, 2026

AWS peer discovery: filter EC2 instances by state (backport #15388) #15389

Merged

michaelklishin added a commit that referenced this pull request Feb 3, 2026

Merge pull request #15389 from rabbitmq/mergify/bp/v4.2.x/pr-15388

006c461

AWS peer discovery: filter EC2 instances by state (backport #15388)

lukebakken deleted the fix/aws-peer-discovery-instance-states branch February 4, 2026 00:56

michaelklishin merged 4 commits into rabbitmq:main from amazon-mq:fix/aws-peer-discovery-instance-states
