docker: use a short timeout when inspecting images by LawnGnome · Pull Request #783 · sourcegraph/src-cli

LawnGnome · 2022-06-15T00:22:55Z

Docker Desktop seems to be prone to occasionally entering states where it is nominally available (in the sense that it listens for connections), but can't actually do anything useful. Memory exhaustion events from previous `src batch` runs tend to be the culprit here.

This seems to be most easily detectable by looking at the behaviour of `docker image inspect`. This command doesn't have to hit the network, nor does it perform any real work, so it should always be "quick". If it's not, that implies that Docker is not in a happy place.

This commit adds a five-second timeout to `docker image inspect` invocations, along with a secret environment variable to adjust this behaviour. If `docker` times out, then a (hopefully) helpful error message is generated to hint towards the things that the user should investigate next (basically, "is Docker really working?").

The only real concern I have with this is that I've basically picked five seconds out of thin air: practically, Docker for Mac seems to be able to do this in tens of milliseconds, but Sourcegraph has bought me a rather nice laptop. I have trouble imagining a scenario where multi-second delays are common without there being other serious issues, but the world is a large, weird place.

Test plan

There's decent test coverage here, and I've added a new test case for this specific change.

I also tested this manually in these scenarios:

- Regular operation (to confirm nothing actually changed)
- `docker` fails to respond (by `killall -STOP dockerd` in the Docker Desktop VM)
- Set the timeout to 0 via the environment variable so `src` never actually waits and immediately fails

LawnGnome · 2022-06-15T00:23:40Z

-				// TODO!(sqs): is image id the right thing to use here? it is NOT
-				// the digest. but the digest is not calculated for all images
-				// (unless they are pulled/pushed from/to a registry), see
-				// https://github.com/moby/moby/issues/32016.


This TODO was outdated: there's actually a lengthy comment earlier in the file explaining what "digest" means in this context, and how it relates to the public digest. Whoever last touched this really should have removed this then. 😬

courier-new

This seems like a great improvement. And worst case if 5s is too sensitive, it's fortunately very easy to release a patch release for src-cli to bump it up. 🙂

eseliger

Nice approach!

eseliger · 2022-06-15T00:41:03Z

+		// of the timeout, and we don't want to slow down the test, so we're
+		// going to construct a context that has already exceeded its deadline
+		// at the point it is provided to Digest.
+		ctx, cancel := context.WithTimeout(context.Background(), -1*time.Second)


Lol, didn't know that's valid!

Turns out I didn't even invent it! I particularly like that it's used in the runtime's own unit tests.

Nice, so they would break their own tests before it ships in a go release 😆

LawnGnome requested a review from a team June 15, 2022 00:24

LawnGnome marked this pull request as ready for review June 15, 2022 00:24

courier-new approved these changes Jun 15, 2022

eseliger approved these changes Jun 15, 2022

Piszmog approved these changes Jun 15, 2022

LawnGnome merged commit 4723b4b into main Jun 15, 2022

LawnGnome deleted the aharvey/short-timeout branch June 15, 2022 16:58

LawnGnome mentioned this pull request Jun 15, 2022

batches: use Docker CPU count as default parallelism, not GOMAXPROCS #786

Merged

LawnGnome self-assigned this Jun 15, 2022
