Skip to content

docker: use a short timeout when inspecting images#783

Merged
LawnGnome merged 1 commit into
mainfrom
aharvey/short-timeout
Jun 15, 2022
Merged

docker: use a short timeout when inspecting images#783
LawnGnome merged 1 commit into
mainfrom
aharvey/short-timeout

Conversation

@LawnGnome

Copy link
Copy Markdown
Contributor

Docker Desktop seems to be prone to occasionally entering states where it is nominally available (in the sense that it listens for connections), but can't actually do anything useful. Memory exhaustion events from previous src batch runs tend to be the culprit here.

This seems to be most easily detectable by looking at the behaviour of docker image inspect. This command doesn't have to hit the network, nor does it perform any real work, so it should always be "quick". If it's not, that implies that Docker is not in a happy place.

This commit adds a five second timeout to docker image inspect invocations, along with a secret environment variable to adjust this behaviour. If docker times out, then a (hopefully) helpful error message is generated to hint towards the things that the user should investigate next (basically, "is Docker really working?").

The only real concern I have with this is that I've basically picked five seconds out of thin air: practically, Docker for Mac seems to be able to do this in tens of milliseconds, but Sourcegraph has bought me a rather nice laptop. I have trouble imagining a scenario where multiple second delays are common where there aren't other serious issues, but the world is a large, weird place.

Test plan

There's decent test coverage here, and I've added a new test case for this specific change.

I also tested this manually in these scenarios:

  • Regular operation (to confirm nothing actually changed)
  • docker fails to respond (by killall -STOP dockerd in the Docker Desktop VM)
  • Set the timeout to 0 via the environment variable so src never actually waits and immediately fails

Docker Desktop seems to be prone to occasionally entering states where
it is nominally available (in the sense that it listens for
connections), but can't actually do anything useful. Memory exhaustion
events from previous `src batch` runs tend to be the culprit here.

This seems to be most easily detectable by looking at the behaviour of
`docker image inspect`. This command doesn't have to hit the network,
nor does it perform any real work, so it should always be "quick". If
it's not, that implies that Docker is not in a happy place.

This commit adds a five second timeout to `docker image inspect`
invocations, along with a secret environment variable to adjust this
behaviour. If `docker` times out, then a (hopefully) helpful error
message is generated to hint towards the things that the user should
investigate next (basically, "is Docker really working?").

The only real concern I have with this is that I've basically picked
five seconds out of thin air: practically, Docker for Mac seems to be
able to do this in tens of milliseconds, but Sourcegraph has bought me a
rather nice laptop. I have trouble imagining a scenario where multiple
second delays are common where there aren't other serious issues, but
the world is a large, weird place.
Comment on lines -69 to -72
// TODO!(sqs): is image id the right thing to use here? it is NOT
// the digest. but the digest is not calculated for all images
// (unless they are pulled/pushed from/to a registry), see
// https://github.com/moby/moby/issues/32016.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This TODO was outdated: there's actually a lengthy comment earlier in the file explaining what "digest" means in this context, and how it relates to the public digest. Whoever last touched this really should have removed this then. 😬

@LawnGnome LawnGnome requested a review from a team June 15, 2022 00:24
@LawnGnome LawnGnome marked this pull request as ready for review June 15, 2022 00:24

@courier-new courier-new left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a great improvement. And worst case if 5s is too sensitive, it's fortunately very easy to release a patch release for src-cli to bump it up. 🙂

@eseliger eseliger left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice approach!

// of the timeout, and we don't want to slow down the test, so we're
// going to construct a context that has already exceeded its deadline
// at the point it is provided to Digest.
ctx, cancel := context.WithTimeout(context.Background(), -1*time.Second)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lol, didn't know that's valid!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out I didn't even invent it! I particularly like that it's used in the runtime's own unit tests.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, so they would break their own tests before it ships in a go release 😆

@LawnGnome LawnGnome merged commit 4723b4b into main Jun 15, 2022
@LawnGnome LawnGnome deleted the aharvey/short-timeout branch June 15, 2022 16:58
@LawnGnome LawnGnome self-assigned this Jun 15, 2022
scjohns pushed a commit that referenced this pull request Apr 24, 2023
Docker Desktop seems to be prone to occasionally entering states where
it is nominally available (in the sense that it listens for
connections), but can't actually do anything useful. Memory exhaustion
events from previous `src batch` runs tend to be the culprit here.

This seems to be most easily detectable by looking at the behaviour of
`docker image inspect`. This command doesn't have to hit the network,
nor does it perform any real work, so it should always be "quick". If
it's not, that implies that Docker is not in a happy place.

This commit adds a five second timeout to `docker image inspect`
invocations, along with a secret environment variable to adjust this
behaviour. If `docker` times out, then a (hopefully) helpful error
message is generated to hint towards the things that the user should
investigate next (basically, "is Docker really working?").

The only real concern I have with this is that I've basically picked
five seconds out of thin air: practically, Docker for Mac seems to be
able to do this in tens of milliseconds, but Sourcegraph has bought me a
rather nice laptop. I have trouble imagining a scenario where multiple
second delays are common where there aren't other serious issues, but
the world is a large, weird place.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants