AWS peer discovery: filter EC2 instances by state #15388
Merged
michaelklishin merged 4 commits into rabbitmq:main on Feb 3, 2026
Adds a new configuration option to filter EC2 instances by state during
peer discovery. This prevents dead or dying instances from being included
in cluster membership during node joins.
The configuration accepts a list of instance state names to include in
discovery results. The default value is `["running", "pending"]`, which
excludes instances in `stopping`, `stopped`, `shutting-down`, and
`terminated` states.
Configuration can be set via `rabbitmq.conf`:
cluster_formation.aws.ec2_instance_states.1 = running
cluster_formation.aws.ec2_instance_states.2 = pending
Or via environment variable:
AWS_EC2_INSTANCE_STATES="running,pending"
The schema file includes cuttlefish mappings to support the
`rabbitmq.conf` syntax, and test snippets verify the configuration
parsing works correctly for both single and multiple state values.
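
As a rough illustration of the two configuration forms above, the following Python sketch (not the plugin's Erlang code) shows how a comma-separated `AWS_EC2_INSTANCE_STATES` value and the numbered `rabbitmq.conf` keys could both reduce to the same list of state names. The helper names `states_from_env` and `states_from_conf` are hypothetical, and the whitespace trimming and ordering by numeric suffix are assumptions made for the example.

```python
import re

def states_from_env(value):
    """Split a comma-separated value such as "running,pending" into a list."""
    # Assumption: surrounding whitespace and empty items are ignored.
    return [item.strip() for item in value.split(",") if item.strip()]

def states_from_conf(lines):
    """Collect cluster_formation.aws.ec2_instance_states.N = value entries."""
    pattern = re.compile(
        r"^cluster_formation\.aws\.ec2_instance_states\.(\d+)\s*=\s*(\S+)\s*$"
    )
    found = []
    for line in lines:
        match = pattern.match(line.strip())
        if match:
            found.append((int(match.group(1)), match.group(2)))
    # Assumption: entries are ordered by their numeric suffix.
    return [state for _, state in sorted(found)]
```

Both helpers return `["running", "pending"]` for the examples shown in this commit message.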
Implements instance state filtering for both AutoScaling and tag-based
discovery modes. The `ec2:DescribeInstances` API calls now include
filters based on the `aws_ec2_instance_states` configuration.
The implementation adds `maybe_add_instance_state_filters/2` which
checks the configuration and conditionally applies state filters. When
states are configured, `add_instance_state_filters/3` builds a single
filter with multiple values in the format:
Filter.N.Name=instance-state-name
Filter.N.Value.1=running
Filter.N.Value.2=pending
This format matches the AWS EC2 API specification and was verified with
the AWS CLI. The default configuration filters to `running` and
`pending` instances, excluding dead or dying instances (`stopping`,
`stopped`, `shutting-down`, `terminated`).
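
The filter layout described above can be sketched in Python as follows. This is an illustration of the parameter format only, not the plugin's Erlang implementation: the function names are borrowed from the commit message, but their argument order, the list-of-pairs representation, and the way the next filter index is computed are all assumptions.

```python
def add_instance_state_filters(params, states, index):
    """Append one instance-state-name filter carrying every state as a value."""
    result = list(params)
    result.append((f"Filter.{index}.Name", "instance-state-name"))
    for position, state in enumerate(states, start=1):
        result.append((f"Filter.{index}.Value.{position}", state))
    return result

def maybe_add_instance_state_filters(params, states):
    """Return the parameters unchanged when no states are configured."""
    if not states:
        return list(params)
    # Assumption: the next filter index is one past the filters already present.
    existing = sum(
        1 for key, _ in params
        if key.startswith("Filter.") and key.endswith(".Name")
    )
    return add_instance_state_filters(params, states, existing + 1)
```

With the default states and no prior filters, this yields `Filter.1.Name=instance-state-name`, `Filter.1.Value.1=running`, and `Filter.1.Value.2=pending`.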
Both `get_hostname_by_instance_ids/2` (AutoScaling mode) and
`get_hostname_by_tags/1` (tag-based mode) apply the state filters
consistently. The filters are applied after tag filters and before the
final query string is built.
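
To show the ordering in tag-based mode, here is a small Python sketch of assembling the query string: tag filters first, then the state filter, then URL encoding. It is a hypothetical reconstruction, not the plugin's code. The `build_query` name and the `service` tag are made up for the example, and `urllib.parse.urlencode` stands in for whatever encoding the plugin performs.

```python
from urllib.parse import urlencode

def build_query(tags, states):
    """Build a DescribeInstances query: tag filters, then the state filter."""
    params = [("Action", "DescribeInstances")]
    index = 0
    for key, value in tags.items():
        index += 1
        params.append((f"Filter.{index}.Name", f"tag:{key}"))
        params.append((f"Filter.{index}.Value.1", value))
    if states:
        # The state filter takes the next free index after the tag filters.
        index += 1
        params.append((f"Filter.{index}.Name", "instance-state-name"))
        for position, state in enumerate(states, start=1):
            params.append((f"Filter.{index}.Value.{position}", state))
    # URL encoding turns "tag:service" into "tag%3Aservice".
    return urlencode(params)
```

Because encoding happens last, any check against the final request path has to look for the encoded form of the tag filter name.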
A unit test verifies the filter format is correct and matches the
expected AWS API parameter structure.
Adds tests for `get_hostname_by_instance_ids/2` and `get_hostname_by_tags/1` to verify instance state filters are correctly applied in both AutoScaling and tag-based discovery modes.
The tests use `meck` to mock `rabbitmq_aws:api_get_request/2`, avoiding real EC2 API calls while verifying the query string format. Each test confirms that:
- Instance state filters are present in the request path
- Filter format matches AWS API specification
- Tag filters and state filters work together correctly
- Hostname extraction from mock responses works as expected
The `get_hostname_by_tags_with_state_filter` test checks for URL-encoded tag filters (`tag%3Aservice`) since the query string is URL-encoded before the API call.
A helper function `mock_describe_instances_response/0` provides properly formatted EC2 API response data for testing.
Adds validation for the `aws_ec2_instance_states` configuration to ensure only valid EC2 instance state names are used. The validation filters out invalid states and logs a warning, allowing the node to start with the valid states.
The `validate_instance_states/1` function checks each configured state against the list of valid EC2 instance states defined in the `?VALID_EC2_INSTANCE_STATES` macro. Invalid states are discarded and logged as a warning.
The `normalize_state/1` function handles both atom and string inputs, converting atoms to strings for consistent handling. This supports configuration via `advanced.config` with atoms (`[running, pending]`) or via `rabbitmq.conf` with strings (`["running", "pending"]`).
Tests verify validation works correctly for:
- All valid states (strings)
- Mixed valid and invalid states (filters out invalid)
- Atom inputs (normalizes to strings)
- Mixed valid and invalid atoms
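
The validation behaviour described above can be approximated in Python as shown below. This is a sketch under stated assumptions rather than the plugin's Erlang code: Python has no atoms, so `bytes` stands in for the "other" input type that gets normalized to a string, and the warning text is invented. The six state names are the instance states defined by EC2.

```python
import logging

logger = logging.getLogger(__name__)

# The six EC2 instance state names.
VALID_EC2_INSTANCE_STATES = (
    "pending", "running", "shutting-down",
    "terminated", "stopping", "stopped",
)

def normalize_state(state):
    """Convert a state to str (bytes stands in for Erlang atoms here)."""
    if isinstance(state, bytes):
        return state.decode("utf-8")
    return str(state)

def validate_instance_states(states):
    """Keep valid states, drop the rest, and log a warning for the dropped ones."""
    normalized = [normalize_state(state) for state in states]
    valid = [s for s in normalized if s in VALID_EC2_INSTANCE_STATES]
    invalid = [s for s in normalized if s not in VALID_EC2_INSTANCE_STATES]
    if invalid:
        # Hypothetical message; the plugin's actual log text is not shown here.
        logger.warning("Ignoring invalid EC2 instance states: %s", invalid)
    return valid
```

Dropping invalid entries instead of raising mirrors the stated goal of letting the node start with whatever valid states remain.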
the-mikedavis approved these changes on Feb 3, 2026
michaelklishin approved these changes on Feb 3, 2026
michaelklishin added a commit that referenced this pull request on Feb 3, 2026:
AWS peer discovery: filter EC2 instances by state (backport #15388)
Comment from the pull request author (Collaborator):
Thank you @michaelklishin
Summary
Adds instance state filtering to the AWS peer discovery plugin to prevent dead or dying EC2 instances from being included in cluster membership during node joins. This addresses a race condition where a new node joining the cluster could sync dead nodes from existing members before a removal command completes.
Problem
When using AWS peer discovery with AutoScaling or EC2 tags, the `ec2:DescribeInstances` API returns instances in all states, including `stopping`, `stopped`, `shutting-down`, and `terminated`. During instance replacement, this can cause a race condition:
- … `stopping` or `shutting-down` state
- … `list_nodes()` …
- `DescribeInstances` returns both old (dying) and new (running) instances
This race condition was observed in a case where a node removal failed because the new node's Mnesia schema sync re-added the dead node 318 ms before the removal command was issued.
Solution
Adds a new `aws_ec2_instance_states` configuration option that filters EC2 instances by state during peer discovery. The default value is `["running", "pending"]`, which excludes dead or dying instances while including operational and starting instances.
Configuration
Via `rabbitmq.conf`:
cluster_formation.aws.ec2_instance_states.1 = running
cluster_formation.aws.ec2_instance_states.2 = pending
Via environment variable:
AWS_EC2_INSTANCE_STATES="running,pending"
Via `advanced.config`:
[{rabbit, [
  {cluster_formation, [
    {peer_discovery_aws, [
      {aws_ec2_instance_states, [running, pending]}
    ]}
  ]}
]}].
Implementation Details
- … `Filter.N.Name=instance-state-name` with multiple values to `ec2:DescribeInstances` API calls
- … AutoScaling mode (`get_hostname_by_instance_ids/2`) and tag-based mode (`get_hostname_by_tags/1`)
Testing
- … `rabbitmq.conf` parsing
Backward Compatibility
- … `running` or `pending` state (safer default)
- … `[]` to disable filtering (restore old behavior)