Failing to create EC2 instances -> lots of queueing #159651

> NOTE: Remember to label this issue with "`ci: sev`"

## Current Status
*Status could be: preemptive, ongoing, mitigated, closed. Also tell people if they need to take action to fix it (e.g. rebase).*

mitigated

## Error looks like
*Provide some way users can tell that this SEV is causing their issue.*

Jobs sit in the queue and do not start running.

## Incident timeline (all times pacific)
*Include when the incident began, when it was detected, mitigated, root caused, and finally closed.*

Thursday
* 6:00pm: A higher-than-usual number of scale-up failures is visible in the CloudWatch dashboard
* 7:00pm: AWS CloudWatch alert sent in chat

Friday
* 12:00am: First queue alert fired for linux.2xlarge (88 machines, 0.43 hours?)
* 8:53am: I notice recent queueing alerts and see that linux.2xlarge has a queue on the HUD metrics page


## User impact
*How does this affect users of PyTorch CI?*

Jobs queue for extended periods or do not run at all.

## Root cause
*What was the root cause of this issue?*

## Mitigation
*How did we mitigate the issue?*

* Mitigation 1: Redirected traffic to LF runners
* Mitigation 2: Temporarily disabled rules that were blocking scale-up

## Prevention/followups
*How do we prevent issues like this in the future?*
