Transient failures in instance state monitor cause instances to be lost at sea #2727

@gjcolombo

Sled agent spawns a state monitor task per instance that calls Propolis's instance_state_monitor endpoint, watches for VM state transitions, processes them, and relays any resulting instance state changes to Nexus. Any error in Instance::monitor_state_task or any of its callees bubbles up through the task and causes it to exit. After this, nothing monitors Propolis for subsequent state changes or notifies Nexus if they occur.

The DNS resolution error in #2726 brought this to my immediate attention, but any failure in the monitoring task will do the trick; another not-too-esoteric one is a Propolis panic that causes the call to instance_state_monitor to fail.
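
To make the failure mode concrete, here is a minimal, synchronous sketch. It is not the sled agent code: `VmState`, `MonitorError`, `monitor_fragile`, and `monitor_resilient` are hypothetical names, the iterator of poll results stands in for repeated calls to instance_state_monitor, and pushing onto `seen` stands in for notifying Nexus. The first function exits on any error, as described above; the second shows one possible mitigation (tolerating a bounded number of consecutive transient failures), which is an assumption for illustration and not necessarily the fix this issue should get.

```rust
/// Hypothetical VM states; the real Propolis state set is richer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum VmState {
    Starting,
    Running,
    Stopped,
}

/// Hypothetical error classification, not the sled agent's error type.
#[derive(Debug, Clone, PartialEq, Eq)]
enum MonitorError {
    /// Worth retrying, e.g. a DNS lookup that failed once.
    Transient(String),
    /// Not worth retrying.
    Fatal(String),
}

/// Mirrors the behavior described in the issue: the first error of any
/// kind propagates out via `?`, and no later state is ever observed.
fn monitor_fragile<I>(polls: I, seen: &mut Vec<VmState>) -> Result<(), MonitorError>
where
    I: IntoIterator<Item = Result<VmState, MonitorError>>,
{
    for poll in polls {
        let state = poll?;
        seen.push(state); // stand-in for relaying the change to Nexus
    }
    Ok(())
}

/// One possible mitigation (an assumption, not the project's fix): keep
/// polling through transient errors, giving up only after more than
/// `max_consecutive_failures` of them in a row or on a fatal error.
fn monitor_resilient<I>(
    polls: I,
    max_consecutive_failures: usize,
    seen: &mut Vec<VmState>,
) -> Result<(), MonitorError>
where
    I: IntoIterator<Item = Result<VmState, MonitorError>>,
{
    let mut failures = 0;
    for poll in polls {
        match poll {
            Ok(state) => {
                failures = 0;
                seen.push(state);
            }
            Err(MonitorError::Transient(msg)) => {
                failures += 1;
                if failures > max_consecutive_failures {
                    return Err(MonitorError::Transient(msg));
                }
                // A real task would back off here before polling again.
            }
            Err(fatal) => return Err(fatal),
        }
    }
    Ok(())
}
```

Fed the sequence `Ok(Starting)`, `Err(Transient)`, `Ok(Running)`, `Ok(Stopped)`, `monitor_fragile` returns an error after recording only `Starting`, so the later transitions are never reported. `monitor_resilient` with a nonzero failure budget records all three states.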

Labels: Sled Agent (Related to the Per-Sled Configuration and Management), bug (Something that isn't working)
