
AWS peer discovery: filter EC2 instances by state (backport #15388) #15389

Merged
michaelklishin merged 4 commits into v4.2.x from mergify/bp/v4.2.x/pr-15388
Feb 3, 2026

@mergify mergify bot commented Feb 3, 2026

Summary

Adds instance state filtering to the AWS peer discovery plugin to prevent dead or dying EC2 instances from being included in cluster membership during node joins. This addresses a race condition where a new node joining the cluster could sync dead nodes from existing members before a removal command completes.

Problem

When using AWS peer discovery with AutoScaling or EC2 tags, the ec2:DescribeInstances API returns instances in all states, including stopping, stopped, shutting-down, and terminated. During instance replacement, this can cause a race condition:

  1. Old instance enters stopping or shutting-down state
  2. New instance starts and calls list_nodes()
  3. DescribeInstances returns both old (dying) and new (running) instances
  4. New node joins cluster with dead node in membership
  5. Removal command issued later fails because dead node was already re-added

This race was observed in practice: a node removal failed because the new node's Mnesia schema sync re-added the dead node 318 ms before the removal command was issued.

Solution

Adds a new aws_ec2_instance_states configuration option that filters EC2 instances by state during peer discovery. The default value is ["running", "pending"], which excludes dead or dying instances while including operational and starting instances.
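The effect of the default can be illustrated with a minimal Python sketch. This is not the plugin's Erlang code: in the plugin the filtering is done by EC2 itself through request filters, whereas this sketch filters a local list. The record fields and the `discoverable` helper are hypothetical and exist only for illustration.

```python
# Hypothetical instance records; field names are assumptions for illustration.
DEFAULT_STATES = ["running", "pending"]


def discoverable(instances, states=DEFAULT_STATES):
    """Return hostnames of instances whose state is in `states`.

    An empty `states` list disables filtering, mirroring the [] setting
    described under Backward Compatibility.
    """
    if not states:
        return [i["hostname"] for i in instances]
    return [i["hostname"] for i in instances if i["state"] in states]


instances = [
    {"hostname": "old-node", "state": "shutting-down"},
    {"hostname": "new-node", "state": "pending"},
    {"hostname": "peer-node", "state": "running"},
]

print(discoverable(instances))             # dying instance is excluded
print(discoverable(instances, states=[]))  # filtering disabled
```

With the default states, the instance that is shutting down never reaches the joining node's peer list, which is the step of the race the change is meant to cut off.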

Configuration

Via rabbitmq.conf:

cluster_formation.aws.ec2_instance_states.1 = running
cluster_formation.aws.ec2_instance_states.2 = pending

Via environment variable:

AWS_EC2_INSTANCE_STATES="running,pending"

Via advanced.config:

[{rabbit, [
   {cluster_formation, [
       {peer_discovery_aws, [
           {aws_ec2_instance_states, [running, pending]}
       ]}
   ]}
]}].
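To make the rabbitmq.conf and environment variable forms above concrete, here is a minimal Python sketch of how each could reduce to the same list of state names. It is not the plugin's Erlang code and does not describe how cuttlefish works internally. `states_from_conf` and `states_from_env` are hypothetical helpers, and both the ordering by numeric suffix and the comma splitting are assumptions.

```python
import re

# Matches lines such as: cluster_formation.aws.ec2_instance_states.1 = running
_KEY = re.compile(
    r"^cluster_formation\.aws\.ec2_instance_states\.(\d+)\s*=\s*(\S+)$"
)


def states_from_conf(text):
    """Collect state names from rabbitmq.conf-style lines, ordered by index."""
    found = []
    for line in text.splitlines():
        match = _KEY.match(line.strip())
        if match:
            found.append((int(match.group(1)), match.group(2)))
    return [value for _, value in sorted(found)]


def states_from_env(value):
    """Split a comma-separated value (assumed format) into state names."""
    return [part.strip() for part in value.split(",") if part.strip()]


conf = """
cluster_formation.aws.ec2_instance_states.1 = running
cluster_formation.aws.ec2_instance_states.2 = pending
"""

print(states_from_conf(conf))
print(states_from_env("running,pending"))
```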

Implementation Details

  • Adds Filter.N.Name=instance-state-name with multiple values to ec2:DescribeInstances API calls
  • Applies to both AutoScaling mode (get_hostname_by_instance_ids/2) and tag-based mode (get_hostname_by_tags/1)
  • Validates configured states against AWS EC2 valid states, filtering out invalid values with a warning
  • Normalizes both atom and string inputs for consistent handling
  • Filter format verified against AWS EC2 API specification and tested with live AWS API

Testing

  • Unit tests for filter building logic
  • Integration tests with mocked EC2 API calls for both discovery modes
  • Validation tests for both string and atom inputs
  • Configuration schema tests for rabbitmq.conf parsing

Backward Compatibility

  • Default behavior changes to filter by running or pending state (safer default)
  • Users can configure empty list [] to disable filtering (restore old behavior)
  • Existing configurations without this setting get the safe default

This is an automatic backport of pull request "AWS peer discovery: filter EC2 instances by state" #15388, done by Mergify.

Adds a new configuration option to filter EC2 instances by state during
peer discovery. This prevents dead or dying instances from being included
in cluster membership during node joins.

The configuration accepts a list of instance state names to include in
discovery results. The default value is `["running", "pending"]`, which
excludes instances in `stopping`, `stopped`, `shutting-down`, and
`terminated` states.

Configuration can be set via `rabbitmq.conf`:

    cluster_formation.aws.ec2_instance_states.1 = running
    cluster_formation.aws.ec2_instance_states.2 = pending

Or via environment variable:

    AWS_EC2_INSTANCE_STATES="running,pending"

The schema file includes cuttlefish mappings to support the
`rabbitmq.conf` syntax, and test snippets verify the configuration
parsing works correctly for both single and multiple state values.

(cherry picked from commit a3d80dd)

Implements instance state filtering for both AutoScaling and tag-based
discovery modes. The `ec2:DescribeInstances` API calls now include
filters based on the `aws_ec2_instance_states` configuration.

The implementation adds `maybe_add_instance_state_filters/2` which
checks the configuration and conditionally applies state filters. When
states are configured, `add_instance_state_filters/3` builds a single
filter with multiple values in the format:

    Filter.N.Name=instance-state-name
    Filter.N.Value.1=running
    Filter.N.Value.2=pending

This format matches the AWS EC2 API specification and was verified with
the AWS CLI. The default configuration filters to `running` and
`pending` instances, excluding dead or dying instances (`stopping`,
`stopped`, `shutting-down`, `terminated`).

Both `get_hostname_by_instance_ids/2` (AutoScaling mode) and
`get_hostname_by_tags/1` (tag-based mode) apply the state filters
consistently. The filters are applied after tag filters and before the
final query string is built.
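The filter layout described above can be sketched in Python as follows. This is an illustration, not the Erlang implementation: the function names echo the ones in the commit message, but their arguments, the dictionary representation of query parameters, and the way the next free filter index is derived are all assumptions.

```python
def add_instance_state_filters(params, states, n):
    """Return a copy of `params` with one instance-state-name filter at index n.

    All states go into a single filter as Filter.n.Value.1, .2, and so on.
    """
    out = dict(params)
    out[f"Filter.{n}.Name"] = "instance-state-name"
    for i, state in enumerate(states, start=1):
        out[f"Filter.{n}.Value.{i}"] = state
    return out


def maybe_add_instance_state_filters(params, states):
    """Add the state filter only when at least one state is configured."""
    if not states:
        return dict(params)
    # Assumption: the next index is one past the filters already present.
    existing = sum(
        1 for key in params if key.startswith("Filter.") and key.endswith(".Name")
    )
    return add_instance_state_filters(params, states, existing + 1)


# A tag filter is already in place, so the state filter takes index 2.
params = {
    "Action": "DescribeInstances",
    "Filter.1.Name": "tag:service",
    "Filter.1.Value.1": "rabbitmq",
}
result = maybe_add_instance_state_filters(params, ["running", "pending"])
for key, value in result.items():
    print(f"{key}={value}")
```

Using one filter with several values, rather than one filter per state, matters because EC2 combines the values of a single filter with OR and combines separate filters with AND.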

A unit test verifies the filter format is correct and matches the
expected AWS API parameter structure.

(cherry picked from commit 1a4da14)

Adds tests for `get_hostname_by_instance_ids/2` and
`get_hostname_by_tags/1` to verify instance state filters are correctly
applied in both AutoScaling and tag-based discovery modes.

The tests use `meck` to mock `rabbitmq_aws:api_get_request/2`, avoiding
real EC2 API calls while verifying the query string format. Each test
confirms that:

- Instance state filters are present in the request path
- Filter format matches AWS API specification
- Tag filters and state filters work together correctly
- Hostname extraction from mock responses works as expected

The `get_hostname_by_tags_with_state_filter` test checks for
URL-encoded tag filters (`tag%3Aservice`) since the query string is
URL-encoded before the API call.
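The same testing idea, mocking the request function and asserting on the encoded path, can be sketched in Python with `unittest.mock`. This is an analogy to the `meck`-based Erlang tests, not a copy of them: `describe_instances` and the shape of the mocked client are hypothetical.

```python
from unittest import mock
from urllib.parse import urlencode


def describe_instances(client, params):
    """Build the URL-encoded query path and hand it to the (mocked) client."""
    path = "/?" + urlencode(params)
    return client.api_get_request("ec2", path)


# The mock stands in for the real request function, so no AWS call is made.
client = mock.Mock()
client.api_get_request.return_value = {"reservationSet": []}

params = {
    "Action": "DescribeInstances",
    "Filter.1.Name": "tag:service",
    "Filter.1.Value.1": "rabbitmq",
    "Filter.2.Name": "instance-state-name",
    "Filter.2.Value.1": "running",
    "Filter.2.Value.2": "pending",
}
describe_instances(client, params)

# Inspect what the code under test actually sent.
service, path = client.api_get_request.call_args.args
print(path)
```

The colon in `tag:service` is percent-encoded to `tag%3Aservice`, which is why an assertion on the raw, unencoded tag name would not match the request path.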

A helper function `mock_describe_instances_response/0` provides properly
formatted EC2 API response data for testing.

(cherry picked from commit d459bef)

Adds validation for the `aws_ec2_instance_states` configuration to
ensure only valid EC2 instance state names are used. The validation
filters out invalid states and logs a warning, allowing the node to
start with the valid states.

The `validate_instance_states/1` function checks each configured state
against the list of valid EC2 instance states defined in the
`?VALID_EC2_INSTANCE_STATES` macro. Invalid states are discarded and
logged as a warning.

The `normalize_state/1` function handles both atom and string inputs,
converting atoms to strings for consistent handling. This supports
configuration via `advanced.config` with atoms (`[running, pending]`)
or via `rabbitmq.conf` with strings (`["running", "pending"]`).
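A minimal Python sketch of the normalize-then-validate flow is shown below. It is not the Erlang implementation. Python has no atoms, so `bytes` values stand in for the second input type purely for illustration, and the warning text is an assumption.

```python
import logging

logger = logging.getLogger("peer_discovery_sketch")

# The six instance state names defined by EC2.
VALID_EC2_INSTANCE_STATES = (
    "pending",
    "running",
    "shutting-down",
    "terminated",
    "stopping",
    "stopped",
)


def normalize_state(state):
    """Convert either input type to str (bytes stands in for Erlang atoms)."""
    if isinstance(state, bytes):
        return state.decode("utf-8")
    return str(state)


def validate_instance_states(states):
    """Keep valid states, drop invalid ones, and warn about what was dropped."""
    normalized = [normalize_state(s) for s in states]
    valid = [s for s in normalized if s in VALID_EC2_INSTANCE_STATES]
    invalid = [s for s in normalized if s not in VALID_EC2_INSTANCE_STATES]
    if invalid:
        logger.warning("Ignoring invalid EC2 instance states: %s", invalid)
    return valid


print(validate_instance_states(["running", b"pending", "bogus"]))
```

Dropping invalid entries instead of failing keeps a typo in one state name from preventing the node from starting, at the cost of a narrower filter than the operator may have intended.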

Tests verify validation works correctly for:
- All valid states (strings)
- Mixed valid and invalid states (filters out invalid)
- Atom inputs (normalizes to strings)
- Mixed valid and invalid atoms

(cherry picked from commit 41d5030)

michaelklishin added this to the 4.2.4 milestone Feb 3, 2026
michaelklishin merged commit 006c461 into v4.2.x Feb 3, 2026
289 of 290 checks passed
michaelklishin deleted the mergify/bp/v4.2.x/pr-15388 branch February 3, 2026 21:31