Failing to create ec2 instances -> lots of queueing #159651
Closed
Labels
ci: sev (critical failure affecting PyTorch CI)
ci: sev-mitigated (this label marks a sev as mitigated and suppresses "ci: sev")
triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Projects
Status
Done
Current Status
Status could be: preemptive, ongoing, mitigated, closed. Also tell people if they need to take action to fix it (e.g. rebase).
mitigated
Error looks like
Provide some way users can tell that this SEV is causing their issue.
Jobs queueing/not running
Incident timeline (all times Pacific)
Include when the incident began, when it was detected, mitigated, root caused, and finally closed.
Thursday
Friday
User impact
How does this affect users of PyTorch CI?
Jobs queueing/not running
Root cause
What was the root cause of this issue?
Mitigation
How did we mitigate the issue?
Mitigation 1: Redirected traffic to LF runners
Mitigation 2: Temporarily disabled rules that were blocking scale-up
Prevention/followups
How do we prevent issues like this in the future?