Handle empty list case in mirrored_supervisor:child/2 #15229

Merged
michaelklishin merged 1 commit into rabbitmq:main from
amazon-mq:fix-mirrored-supervisor-delete-child-race
Jan 8, 2026

@lukebakken
Collaborator

During production testing of `amazon-mq/rabbitmq-queue-migration`, a
badmatch exception was observed during shovel cleanup:

```
exit:{{{badmatch,[]},[{mirrored_supervisor,child,2,...}]},
     {gen_server2,call,[<0.1346.0>,{delete_child,...},infinity]}}
```

The exception occurs in `mirrored_supervisor:child/2` when the list
comprehension returns an empty list instead of a single-element list.
The function pattern-matches `[Pid] = [...]`, which raises `badmatch`
when no matching child is found in the supervisor's children list.

This change updates `child/2` to use a case expression that returns
`undefined` when the list is empty, matching the behavior expected by
`check_stop/3`, which already handles `undefined` as "child not found".
The empty list case is safe to treat as `undefined` because it indicates
the child has already been removed from the supervisor, which is the
desired end state for deletion operations.
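
The difference between the two lookups can be sketched roughly as
follows. This is a minimal illustration, not the actual
`mirrored_supervisor` source: the module name `child_sketch` and the
function names `child_strict/2` and `child_safe/2` are hypothetical, and
the children list is passed in directly (assumed to have the shape
returned by `supervisor:which_children/1`) rather than fetched from a
running supervisor.

```erlang
-module(child_sketch).
-export([child_strict/2, child_safe/2]).

%% Children is assumed to look like the result of
%% supervisor:which_children/1: [{Id, Pid, Type, Modules}].

%% Strict lookup (hypothetical name): the match on [Pid] fails with
%% {badmatch, []} when no child has the given Id.
child_strict(Children, Id) ->
    [Pid] = [P || {Id1, P, _, _} <- Children, Id1 =:= Id],
    Pid.

%% Defensive lookup (hypothetical name): an empty result is reported
%% as undefined instead of crashing the caller.
child_safe(Children, Id) ->
    case [P || {Id1, P, _, _} <- Children, Id1 =:= Id] of
        [Pid] -> Pid;
        []    -> undefined
    end.
```

With a children list that does not contain the requested id,
`child_strict/2` exits with `{badmatch, []}` while `child_safe/2`
returns `undefined`, which the caller can treat as "already deleted".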

While we could not reliably reproduce the race condition in testing, the
fix is defensive and aligns with how `terminate_child` can return
`{error, not_found}` when a child doesn't exist. This change makes
`delete_child` operations more robust when children are removed through
other means (supervisor EXIT handling, distributed coordination, etc).

@michaelklishin michaelklishin merged commit c340c2a into rabbitmq:main Jan 8, 2026
575 of 577 checks passed
@lukebakken lukebakken deleted the fix-mirrored-supervisor-delete-child-race branch January 8, 2026 21:45
@lukebakken
Collaborator Author

Thank you @michaelklishin and @the-mikedavis

michaelklishin added a commit that referenced this pull request Jan 9, 2026
Handle empty list case in `mirrored_supervisor:child/2` (backport #15229)
michaelklishin added a commit that referenced this pull request Jan 9, 2026
Handle empty list case in `mirrored_supervisor:child/2` (backport #15229) (backport #15231)
@lukebakken
Collaborator Author

Just FYI, it turns out that if this badmatch is hit frequently enough, it can cause the supervisor to exceed its restart intensity which equals 💥 ... no more shovels.
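
For readers unfamiliar with restart intensity: an OTP supervisor gives up and terminates itself and its children once more than `intensity` restarts happen within `period` seconds. The sketch below is a simplified model of that bookkeeping, not the OTP or RabbitMQ implementation; the module name `intensity_sketch`, the function `add_restart/4`, and the plain integer timestamps are all assumptions made for illustration.

```erlang
-module(intensity_sketch).
-export([add_restart/4]).

%% Simplified model of supervisor restart intensity.
%% Now       - current time in seconds (plain integer, for illustration)
%% Restarts  - timestamps of earlier restarts, newest first
%% Intensity - maximum number of restarts tolerated within Period
%% Period    - length of the sliding window in seconds
%%
%% Returns {ok, RecentRestarts} while the limit holds, or shutdown once
%% more than Intensity restarts fall inside the window.
add_restart(Now, Restarts, Intensity, Period) ->
    Recent = [T || T <- [Now | Restarts], Now - T =< Period],
    case length(Recent) > Intensity of
        true  -> shutdown;
        false -> {ok, Recent}
    end.
```

In this model, with `Intensity = 2` and `Period = 5`, a third crash inside the same five-second window yields `shutdown`, which corresponds to the "no more shovels" outcome described above.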

lukebakken added a commit to amazon-mq/rabbitmq-queue-migration that referenced this pull request Mar 27, 2026
HTTP_API.md:
- Fix 404 error response body: was {"error": "Object Not Found",
  "reason": "Not Found"}, actually {"error": "Migration not found"}
- Add missing instance_id field to snapshot response examples and
  field description list
- Document all vhost response shape for check endpoint: returns
  {"vhost": "all", "vhost_results": [...]} not the single-vhost shape
- Add active_alarms and memory_usage to system_checks response example
  and System Check Types list
- Fix concurrent migration error: remove incorrect 409 status code row,
  fix error body to {"error": "bad_request", "reason": "Migration
  validation failed: in_progress"}
- Fix Validation Failed, No Eligible Queues, and Insufficient Disk
  Space error bodies to match actual rqm_mgmt.erl output
- Remove invalid Parameter error example: batch_size=-10 is silently
  ignored by the parser, not rejected with a 400
- Remove internal AGENTS.md link from See Also section

API_EXAMPLES.md:
- Add missing instance_id field to snapshot response example
- Add active_alarms and memory_usage to system_checks response example
- Replace invalid unsynchronized queue issue type in compat checker
  results (unsynchronized is a system-level check, not a per-queue
  issue type); replace with queue_expires example
- Fix unsuitable_overflow and too_many_queues reason strings to match
  actual code output
- Add missing queue_expires and message_ttl to Skip Reasons list
- Fix concurrent migration error body
- Fix Migration Not Found 404 body

CONFIGURATION.md:
- Add missing usage example for shovel_prefetch_count

INTEGRATION_TESTING.md:
- Add missing quorum_queue.property_equivalence.relaxed_checks_on_redeclaration
  and queue_migration.snapshot_mode to cluster configuration example

MIGRATION_GUIDE.md:
- Remove "gracefully" from connection closing description: connections
  are closed by stopping TCP listeners, not via graceful handshake

SKIP_UNSUITABLE_QUEUES.md:
- Fix broken link: INTEGRATION_TESTS.md -> INTEGRATION_TESTING.md

TROUBLESHOOTING.md:
- Remove duplicate "Completed queues remain as quorum queues" line
- Document root cause of shovel noproc failure: race condition in
  mirrored_supervisor:child/2 that can exhaust shovel supervisor
  restart intensity; reference upstream fix
  rabbitmq/rabbitmq-server#15229 (merged into 4.1.x+); note that
  Amazon MQ for RabbitMQ includes this fix