AWS peer discovery: filter EC2 instances by state #15388

Merged
michaelklishin merged 4 commits into rabbitmq:main from amazon-mq:fix/aws-peer-discovery-instance-states
Feb 3, 2026

@lukebakken
Collaborator

Summary

Adds instance state filtering to the AWS peer discovery plugin to prevent dead or dying EC2 instances from being included in cluster membership during node joins. This addresses a race condition where a new node joining the cluster could sync dead nodes from existing members before a removal command completes.

Problem

When AWS peer discovery is used with AutoScaling or EC2 tags, the ec2:DescribeInstances API returns instances in all states, including stopping, stopped, shutting-down, and terminated. During instance replacement, this can cause a race condition:

  1. Old instance enters stopping or shutting-down state
  2. New instance starts and calls list_nodes()
  3. DescribeInstances returns both old (dying) and new (running) instances
  4. New node joins cluster with dead node in membership
  5. Removal command issued later fails because dead node was already re-added

This race was observed in practice: a node removal failed because the new node's Mnesia schema sync re-added the dead node 318 ms before the removal command was issued.

Solution

Adds a new aws_ec2_instance_states configuration option that filters EC2 instances by state during peer discovery. The default value is ["running", "pending"], which excludes dead or dying instances while including operational and starting instances.
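To illustrate the effect of the default, here is a minimal Python sketch that applies the same state filter client-side to a hand-written instance list. It is not the plugin's code: the plugin is written in Erlang and passes the filter to the EC2 API, and the data shape and the `discoverable` helper are hypothetical.

```python
# Illustrative only: in the plugin the filtering is done by EC2 itself.
DEFAULT_STATES = ("running", "pending")


def discoverable(instances, states=DEFAULT_STATES):
    """Return hostnames of instances whose state is one of `states`."""
    return [i["hostname"] for i in instances if i["state"] in states]


# Instance replacement in progress: the old node is dying, the new one is up.
instances = [
    {"hostname": "old-node", "state": "shutting-down"},
    {"hostname": "new-node", "state": "running"},
]

print(discoverable(instances))
```

With the default states only the new node is returned, so the dying node never reaches the joining node's candidate list.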

Configuration

Via rabbitmq.conf:

    cluster_formation.aws.ec2_instance_states.1 = running
    cluster_formation.aws.ec2_instance_states.2 = pending

Via environment variable:

    AWS_EC2_INSTANCE_STATES="running,pending"

Via advanced.config:

    [{rabbit, [
       {cluster_formation, [
           {peer_discovery_aws, [
               {aws_ec2_instance_states, [running, pending]}
           ]}
       ]}
    ]}].
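All three forms are meant to resolve to the same list of state names. As a rough illustration of the environment-variable form, this Python sketch splits a comma-separated value into a list. The `parse_states` helper and its whitespace trimming are assumptions made for the example, not the plugin's documented parsing rules.

```python
import os


def parse_states(raw):
    """Split a comma-separated value into state names, dropping blanks."""
    return [s.strip() for s in raw.split(",") if s.strip()]


# Same value as in the environment variable example above.
os.environ["AWS_EC2_INSTANCE_STATES"] = "running,pending"
states = parse_states(os.environ["AWS_EC2_INSTANCE_STATES"])
print(states)
```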

Implementation Details

  • Adds Filter.N.Name=instance-state-name with multiple values to ec2:DescribeInstances API calls
  • Applies to both AutoScaling mode (get_hostname_by_instance_ids/2) and tag-based mode (get_hostname_by_tags/1)
  • Validates configured states against AWS EC2 valid states, filtering out invalid values with a warning
  • Normalizes both atom and string inputs for consistent handling
  • Filter format verified against AWS EC2 API specification and tested with live AWS API

Testing

  • Unit tests for filter building logic
  • Integration tests with mocked EC2 API calls for both discovery modes
  • Validation tests for both string and atom inputs
  • Configuration schema tests for rabbitmq.conf parsing

Backward Compatibility

  • Default behavior changes: discovery now returns only instances in the running or pending state (a safer default)
  • Setting the option to an empty list [] disables filtering and restores the old behavior
  • Existing configurations that do not set this option pick up the new default
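The three cases above can be sketched in Python as follows. This is a simplified model under stated assumptions: `config` is a plain dict standing in for the plugin's application environment, and `effective_states` and `should_filter` are hypothetical helper names.

```python
DEFAULT_STATES = ["running", "pending"]


def effective_states(config):
    """Return the configured states, or the default when the key is absent."""
    return config.get("aws_ec2_instance_states", DEFAULT_STATES)


def should_filter(config):
    """An explicitly empty list means 'do not filter by state'."""
    return len(effective_states(config)) > 0


unset = {}                                        # existing config, no setting
disabled = {"aws_ec2_instance_states": []}        # opt out of filtering
custom = {"aws_ec2_instance_states": ["running"]}

for cfg in (unset, disabled, custom):
    print(effective_states(cfg), should_filter(cfg))
```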

Adds a new configuration option to filter EC2 instances by state during
peer discovery. This prevents dead or dying instances from being included
in cluster membership during node joins.

The configuration accepts a list of instance state names to include in
discovery results. The default value is `["running", "pending"]`, which
excludes instances in `stopping`, `stopped`, `shutting-down`, and
`terminated` states.

Configuration can be set via `rabbitmq.conf`:

    cluster_formation.aws.ec2_instance_states.1 = running
    cluster_formation.aws.ec2_instance_states.2 = pending

Or via environment variable:

    AWS_EC2_INSTANCE_STATES="running,pending"

The schema file includes cuttlefish mappings to support the
`rabbitmq.conf` syntax, and test snippets verify the configuration
parsing works correctly for both single and multiple state values.

Implements instance state filtering for both AutoScaling and tag-based
discovery modes. The `ec2:DescribeInstances` API calls now include
filters based on the `aws_ec2_instance_states` configuration.

The implementation adds `maybe_add_instance_state_filters/2` which
checks the configuration and conditionally applies state filters. When
states are configured, `add_instance_state_filters/3` builds a single
filter with multiple values in the format:

    Filter.N.Name=instance-state-name
    Filter.N.Value.1=running
    Filter.N.Value.2=pending

This format matches the AWS EC2 API specification and was verified with
the AWS CLI. The default configuration filters to `running` and
`pending` instances, excluding dead or dying instances (`stopping`,
`stopped`, `shutting-down`, `terminated`).
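As a sketch of the parameter layout described above, the following Python function builds one filter with several values. It is not a port of `add_instance_state_filters/3`; the function name, its arguments, and the list-of-pairs return type are assumptions for illustration.

```python
def state_filter_params(states, n):
    """Build one EC2 filter at index n with one Value.M entry per state.

    Returns an empty list when no states are configured, so no filter
    is sent at all.
    """
    if not states:
        return []
    params = [(f"Filter.{n}.Name", "instance-state-name")]
    params += [
        (f"Filter.{n}.Value.{m}", state)
        for m, state in enumerate(states, start=1)
    ]
    return params


for name, value in state_filter_params(["running", "pending"], 1):
    print(f"{name}={value}")
```

A single filter with multiple values is matched as "any of these states", which is why one `Filter.N` entry is enough.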

Both `get_hostname_by_instance_ids/2` (AutoScaling mode) and
`get_hostname_by_tags/1` (tag-based mode) apply the state filters
consistently. The filters are applied after tag filters and before the
final query string is built.
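The ordering described above (tag filters first, then the state filter, then query string encoding) might look like this in Python. The `build_query` helper is hypothetical, and other request parameters a real call would need, such as the API version, are omitted.

```python
from urllib.parse import urlencode


def build_query(tags, states):
    """Append tag filters, then one state filter, then URL-encode."""
    params = [("Action", "DescribeInstances")]
    for n, (key, value) in enumerate(sorted(tags.items()), start=1):
        params.append((f"Filter.{n}.Name", f"tag:{key}"))
        params.append((f"Filter.{n}.Value.1", value))
    if states:
        n = len(tags) + 1  # the state filter takes the next free index
        params.append((f"Filter.{n}.Name", "instance-state-name"))
        for m, state in enumerate(states, start=1):
            params.append((f"Filter.{n}.Value.{m}", state))
    return urlencode(params)


query = build_query({"service": "rabbitmq"}, ["running", "pending"])
print(query)
```

Because encoding happens last, the colon in `tag:service` appears as `%3A` in the final query string.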

A unit test verifies the filter format is correct and matches the
expected AWS API parameter structure.

Adds tests for `get_hostname_by_instance_ids/2` and
`get_hostname_by_tags/1` to verify instance state filters are correctly
applied in both AutoScaling and tag-based discovery modes.

The tests use `meck` to mock `rabbitmq_aws:api_get_request/2`, avoiding
real EC2 API calls while verifying the query string format. Each test
confirms that:

- Instance state filters are present in the request path
- Filter format matches AWS API specification
- Tag filters and state filters work together correctly
- Hostname extraction from mock responses works as expected

The `get_hostname_by_tags_with_state_filter` test checks for
URL-encoded tag filters (`tag%3Aservice`) since the query string is
URL-encoded before the API call.
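For readers unfamiliar with `meck`, the same testing idea can be sketched with Python's `unittest.mock`: replace the function that performs the request, then assert on the path it was called with. Everything here is a hypothetical analogue. The `Ec2Client` class, the simplified response shape, and the Python `get_hostname_by_tags` are assumptions, not the plugin's Erlang code or the real EC2 response format.

```python
from unittest import mock
from urllib.parse import urlencode


class Ec2Client:
    """Hypothetical stand-in for the module that issues the API request."""

    def api_get_request(self, service, path):
        raise RuntimeError("real network call; must be mocked in tests")


def get_hostname_by_tags(client, tags, states):
    """Build the query, call the (mocked) API, and extract hostnames."""
    params = [("Action", "DescribeInstances")]
    for n, (key, value) in enumerate(sorted(tags.items()), start=1):
        params += [(f"Filter.{n}.Name", f"tag:{key}"),
                   (f"Filter.{n}.Value.1", value)]
    if states:
        n = len(tags) + 1
        params.append((f"Filter.{n}.Name", "instance-state-name"))
        params += [(f"Filter.{n}.Value.{m}", s)
                   for m, s in enumerate(states, start=1)]
    response = client.api_get_request("ec2", "/?" + urlencode(params))
    return [item["privateDnsName"] for item in response["instances"]]


# Simplified mock response; the real EC2 response is nested much deeper.
fake_response = {"instances": [{"privateDnsName": "ip-10-0-0-1.ec2.internal"}]}

client = Ec2Client()
with mock.patch.object(client, "api_get_request",
                       return_value=fake_response) as fake:
    hosts = get_hostname_by_tags(client, {"service": "rabbitmq"},
                                 ["running", "pending"])
    path = fake.call_args.args[1]  # second positional argument: the path

print(hosts)
print(path)
```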

A helper function `mock_describe_instances_response/0` provides properly
formatted EC2 API response data for testing.

Adds validation for the `aws_ec2_instance_states` configuration to
ensure only valid EC2 instance state names are used. The validation
filters out invalid states and logs a warning, allowing the node to
start with the valid states.

The `validate_instance_states/1` function checks each configured state
against the list of valid EC2 instance states defined in the
`?VALID_EC2_INSTANCE_STATES` macro. Invalid states are discarded and
logged as a warning.

The `normalize_state/1` function handles both atom and string inputs,
converting atoms to strings for consistent handling. This supports
configuration via `advanced.config` with atoms (`[running, pending]`)
or via `rabbitmq.conf` with strings (`["running", "pending"]`).

Tests verify validation works correctly for:
- All valid states (strings)
- Mixed valid and invalid states (filters out invalid)
- Atom inputs (normalizes to strings)
- Mixed valid and invalid atoms
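The validation behavior described in this commit can be approximated in Python as below. This is a sketch, not a port: Python has no atoms, so normalization is reduced to `str()`, and the log message wording is invented. The six state names are the instance states defined by EC2.

```python
import logging

log = logging.getLogger("peer_discovery_sketch")

# The instance state names EC2 defines.
VALID_EC2_INSTANCE_STATES = (
    "pending", "running", "shutting-down",
    "terminated", "stopping", "stopped",
)


def normalize_state(state):
    """Coerce a configured value to a string (stand-in for atom handling)."""
    return state if isinstance(state, str) else str(state)


def validate_instance_states(states):
    """Keep valid states, discard the rest, and warn about what was dropped."""
    normalized = [normalize_state(s) for s in states]
    invalid = [s for s in normalized if s not in VALID_EC2_INSTANCE_STATES]
    if invalid:
        log.warning("Ignoring invalid EC2 instance states: %s", invalid)
    return [s for s in normalized if s in VALID_EC2_INSTANCE_STATES]


print(validate_instance_states(["running", "bogus", "pending"]))
```

Dropping invalid entries instead of failing matches the stated goal of letting the node start with whatever valid states remain.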
@lukebakken lukebakken self-assigned this Feb 3, 2026
@michaelklishin michaelklishin added this to the 4.3.0 milestone Feb 3, 2026
@michaelklishin michaelklishin changed the title Filter EC2 instances by state in AWS peer discovery AWS peer discovery: filter EC2 instances by state Feb 3, 2026
@michaelklishin michaelklishin merged commit 00ac2ae into rabbitmq:main Feb 3, 2026
289 checks passed
michaelklishin added a commit that referenced this pull request Feb 3, 2026
AWS peer discovery: filter EC2 instances by state (backport #15388)
@lukebakken lukebakken deleted the fix/aws-peer-discovery-instance-states branch February 4, 2026 00:56
@lukebakken
Collaborator Author

Thank you @michaelklishin
