pytorch_linux_xenial_py3_6_gcc5_4_test failing due to cert issues #65931
Description
Current status: mitigated. Rebase onto master to pick up the fix (#65934).
The error looks like this:
Fetched 4466 kB in 4s (1113 kB/s)
Reading package lists...
W: The repository 'https://deb.nodesource.com/node_12.x xenial Release' does not have a Release file.
E: Failed to fetch https://deb.nodesource.com/node_12.x/dists/xenial/main/source/Sources server certificate verification failed. CAfile: /etc/ssl/certs/ca-certificates.crt CRLfile: none
E: Some index files failed to download. They have been ignored, or old ones used instead.
Incident timeline (times Pacific)
- 9/30 7:27a: Issues started: https://fburl.com/scuba/opensource_ci_jobs/h75x17or
- ~10:00a: Detected on the HUD.
- 10:12a: Googling surfaces an active upstream issue, "The certificate for deb.nodesource seems to be expired" (nodesource/distributions#1266).
- 11:24a: Mitigation landed: "[ci] try installing libgnutls to fix cert error" (#65934).
User impact
pytorch_linux_xenial_py3_6_gcc5_4_test failed consistently on both master and PR CI.
Root cause
Let's Encrypt's old root certificate (DST Root CA X3) expired on 9/30. This broke our HTTPS requests to the nodesource package repository, which we were using to install some tools at test time.
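As a rough illustration of the timing, the sketch below (Python, standard library only) converts the notAfter date of DST Root CA X3 into Pacific time. The value `Sep 30 14:01:15 2021 GMT` is the publicly documented expiry, not something taken from our logs; the fixed UTC-7 offset is an assumption that holds for late September; and the helper names are hypothetical. It shows the certificate lapsed at 7:01a Pacific, shortly before the first failures at 7:27a.

```python
import ssl
from datetime import datetime, timedelta, timezone

# Publicly documented notAfter of DST Root CA X3, in the same string format
# that ssl.getpeercert() reports for certificate dates.
DST_ROOT_CA_X3_NOT_AFTER = "Sep 30 14:01:15 2021 GMT"

# Assumption: Pacific Daylight Time as a fixed UTC-7 offset (valid in late September).
PDT = timezone(timedelta(hours=-7), "PDT")


def expiry_utc(not_after: str) -> datetime:
    """Hypothetical helper: turn a certificate notAfter string into an aware UTC datetime."""
    # ssl.cert_time_to_seconds parses the "%b %d %H:%M:%S %Y GMT" form as UTC.
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)


def is_expired(not_after: str, now: datetime) -> bool:
    """Hypothetical helper: True if a certificate with this notAfter is invalid at `now`."""
    return now > expiry_utc(not_after)


if __name__ == "__main__":
    print(expiry_utc(DST_ROOT_CA_X3_NOT_AFTER).astimezone(PDT))  # 2021-09-30 07:01:15-07:00
    # 7:27a Pacific on 9/30, when CI failures began
    print(is_expired(DST_ROOT_CA_X3_NOT_AFTER, datetime(2021, 9, 30, 7, 27, tzinfo=PDT)))  # True
```

Comparing aware datetimes keeps the check independent of the machine's local timezone.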
Mitigation
The issue was mitigated on trunk by forcing an update of gnutls, the TLS library that was rejecting the certificate, to a version that accepts the new Let's Encrypt certificate chain.
Prevention/follow-ups
Pre-install dependencies in the Docker images. We should not be installing anything during the core build and test phases. Docker image rebuilds would still eventually fail and need the same mitigation, but a broken image rebuild is far less urgent than a CI breakage.
cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @seemethere @malfet @pytorch/pytorch-dev-infra