docker: use a short timeout when inspecting images#783
Conversation
Docker Desktop seems to be prone to occasionally entering states where it is nominally available (in the sense that it listens for connections), but can't actually do anything useful. Memory exhaustion events from previous `src batch` runs tend to be the culprit here. This seems to be most easily detectable by looking at the behaviour of `docker image inspect`. This command doesn't have to hit the network, nor does it perform any real work, so it should always be "quick". If it's not, that implies that Docker is not in a happy place. This commit adds a five second timeout to `docker image inspect` invocations, along with a secret environment variable to adjust this behaviour. If `docker` times out, then a (hopefully) helpful error message is generated to hint towards the things that the user should investigate next (basically, "is Docker really working?"). The only real concern I have with this is that I've basically picked five seconds out of thin air: practically, Docker for Mac seems to be able to do this in tens of milliseconds, but Sourcegraph has bought me a rather nice laptop. I have trouble imagining a scenario where multiple second delays are common where there aren't other serious issues, but the world is a large, weird place.
| // TODO!(sqs): is image id the right thing to use here? it is NOT | ||
| // the digest. but the digest is not calculated for all images | ||
| // (unless they are pulled/pushed from/to a registry), see | ||
| // https://github.com/moby/moby/issues/32016. |
There was a problem hiding this comment.
This TODO was outdated: there's actually a lengthy comment earlier in the file explaining what "digest" means in this context, and how it relates to the public digest. Whoever last touched this really should have removed this then. 😬
courier-new
left a comment
There was a problem hiding this comment.
This seems like a great improvement. And worst case if 5s is too sensitive, it's fortunately very easy to release a patch release for src-cli to bump it up. 🙂
| // of the timeout, and we don't want to slow down the test, so we're | ||
| // going to construct a context that has already exceeded its deadline | ||
| // at the point it is provided to Digest. | ||
| ctx, cancel := context.WithTimeout(context.Background(), -1*time.Second) |
There was a problem hiding this comment.
Lol, didn't know that's valid!
There was a problem hiding this comment.
Turns out I didn't even invent it! I particularly like that it's used in the runtime's own unit tests.
There was a problem hiding this comment.
Nice, so they would break their own tests before it ships in a go release 😆
Docker Desktop seems to be prone to occasionally entering states where it is nominally available (in the sense that it listens for connections), but can't actually do anything useful. Memory exhaustion events from previous `src batch` runs tend to be the culprit here. This seems to be most easily detectable by looking at the behaviour of `docker image inspect`. This command doesn't have to hit the network, nor does it perform any real work, so it should always be "quick". If it's not, that implies that Docker is not in a happy place. This commit adds a five second timeout to `docker image inspect` invocations, along with a secret environment variable to adjust this behaviour. If `docker` times out, then a (hopefully) helpful error message is generated to hint towards the things that the user should investigate next (basically, "is Docker really working?"). The only real concern I have with this is that I've basically picked five seconds out of thin air: practically, Docker for Mac seems to be able to do this in tens of milliseconds, but Sourcegraph has bought me a rather nice laptop. I have trouble imagining a scenario where multiple second delays are common where there aren't other serious issues, but the world is a large, weird place.
Docker Desktop seems to be prone to occasionally entering states where it is nominally available (in the sense that it listens for connections), but can't actually do anything useful. Memory exhaustion events from previous
src batchruns tend to be the culprit here.This seems to be most easily detectable by looking at the behaviour of
docker image inspect. This command doesn't have to hit the network, nor does it perform any real work, so it should always be "quick". If it's not, that implies that Docker is not in a happy place.This commit adds a five second timeout to
docker image inspectinvocations, along with a secret environment variable to adjust this behaviour. Ifdockertimes out, then a (hopefully) helpful error message is generated to hint towards the things that the user should investigate next (basically, "is Docker really working?").The only real concern I have with this is that I've basically picked five seconds out of thin air: practically, Docker for Mac seems to be able to do this in tens of milliseconds, but Sourcegraph has bought me a rather nice laptop. I have trouble imagining a scenario where multiple second delays are common where there aren't other serious issues, but the world is a large, weird place.
Test plan
There's decent test coverage here, and I've added a new test case for this specific change.
I also tested this manually in these scenarios:
dockerfails to respond (bykillall -STOP dockerdin the Docker Desktop VM)0via the environment variable sosrcnever actually waits and immediately fails