Failing to create ec2 instances -> lots of queueing #159651

@clee2000

Description

NOTE: Remember to label this issue with "ci: sev"

Current Status

Status could be: preemptive, ongoing, mitigated, closed. Also tell people if they need to take action to fix it (i.e. rebase).

mitigated

Error looks like

Provide some way users can tell that this SEV is causing their issue.

Jobs queueing/not running

Incident timeline (all times pacific)

Include when the incident began, when it was detected, mitigated, root caused, and finally closed.

Thursday

  • 6:00pm higher-than-usual number of scale-up failures visible in the CloudWatch dashboard
  • 7:00pm AWS CloudWatch alert sent in chat

Friday

  • 12:00am first queue alert fired for linux.2xlarge, 88 machines, 0.43 hours?
  • 8:53am I notice recent queueing alerts and see that linux.2xlarge has a queue on the HUD metrics page

User impact

How does this affect users of PyTorch CI?

Jobs queueing/not running

Root cause

What was the root cause of this issue?

Mitigation

How did we mitigate the issue?
Mitigation 1: Redirected traffic to LF runners
Mitigation 2: Temporarily disabled rules that were blocking scale up

Prevention/followups

How do we prevent issues like this in the future?

Metadata

Assignees

No one assigned

    Labels

    ci: sev (critical failure affecting PyTorch CI)
    ci: sev-mitigated (This label marks a sev as mitigated and suppress "ci: sev")
    triaged (This issue has been looked at a team member, and triaged and prioritized into an appropriate module)

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
