dl.google.com is currently a single point of failure #3136

@fishy

What version of rules_go are you using?

v0.31.0

What version of gazelle are you using?

v0.24.0

What version of Bazel are you using?

5.1.1

Does this issue reproduce with the latest releases of all the above?

Yes

What operating system and processor architecture are you using?

linux/amd64

Any other potentially useful information about your toolchain?

What did you do?

There was (I assume) a bad deploy that broke dl.google.com yesterday afternoon around 3:45pm (Pacific time) and was fixed about 15 minutes later. During those ~15 minutes, all builds on our CI/CD system failed with:

WARNING: Download from https://dl.google.com/go/go1.18.1.linux-amd64.tar.gz failed: class com.google.devtools.build.lib.bazel.repository.downloader.UnrecoverableHttpException GET returned 502 Bad Gateway
ERROR: An error occurred during the fetch of repository 'go_sdk':
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/fc07cdbdb3ccc5391e01bb1a31f63d3c/external/io_bazel_rules_go/go/private/sdk.bzl", line 100, column 16, in _go_download_sdk_impl
_remote_sdk(ctx, [url.format(filename) for url in ctx.attr.urls], ctx.attr.strip_prefix, sha256)
File "/root/.cache/bazel/_bazel_root/fc07cdbdb3ccc5391e01bb1a31f63d3c/external/io_bazel_rules_go/go/private/sdk.bzl", line 205, column 21, in _remote_sdk
ctx.download(
Error in download: java.io.IOException: Error downloading [https://dl.google.com/go/go1.18.1.linux-amd64.tar.gz] to /root/.cache/bazel/_bazel_root/fc07cdbdb3ccc5391e01bb1a31f63d3c/external/go_sdk/go_sdk.tar.gz: GET returned 502 Bad Gateway

We happened to be doing a deploy around that time and had to keep retrying it until dl.google.com was fixed before we could finally continue.

For context, this is the part of our WORKSPACE file related to go_sdk:

GO_VERSION = "1.18.1"

...

go_register_toolchains(version = GO_VERSION)

Here are some ideas/suggestions/feature requests to make dl.google.com no longer a single point of failure:

  1. Make the go_sdk cacheable by the remote cache

We do have a Bazel remote cache set up (backed by an S3 bucket) for our CI/CD system. Since we have pinned the Go version, the download URL for go_sdk is fixed, so if rules_go could fetch it from the remote cache instead of the original source, that would fix most of the problem.

I assume rules_go might still need the index file containing the checksum, which is dynamic by nature and not cacheable, so that might make this less feasible in practice?

  2. Add urls arg to go_register_toolchains

During the dl.google.com outage I tried to see if I could override the download URL used by rules_go, and found that there's a urls arg for go_download_sdk, but not for go_register_toolchains. If we add a urls arg to go_register_toolchains so I can set it to both https://dl.google.com/go/{} and https://go.dev/dl/{}, it might have helped during this outage and could help during similar ones. (I'm not 100% sure whether go.dev was affected by the same outage: by the time I tried go.dev and found that it worked, dl.google.com also recovered shortly after.) Alternatively, we could run an internal mirror.

  3. Better documentation for the sdks arg of go_download_sdk

As an alternative to 2, the only way I can find to avoid using the index file for the checksum is the sdks arg of go_download_sdk, but the documentation says "see description" and I don't see any example in the description showing what this string_list_dict is supposed to look like. If we can add an example to the documentation, I guess I can also switch to go_download_sdk in order to pin the mirrors and checksums.

So our WORKSPACE would probably look like this:

GO_VERSION = "1.18.1"
GO_LINUX_AMD64_SHA256 = "..."
GO_DARWIN_AMD64_SHA256 = "..."
GO_DARWIN_ARM64_SHA256 = "..."

...

go_download_sdk(
    name = "go_sdk",
    version = GO_VERSION,
    urls = [
        "https://dl.google.com/go/{}",
        "https://go.dev/dl/{}",
        # internal mirror here
    ],
    sdks = ...,
)

go_register_toolchains()
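
To make the question in 3 concrete, here is a rough sketch of how I would expect to build the values for sdks and urls. It is written in the Python-compatible subset of Starlark so it can be tried outside Bazel. Everything in it is an assumption on my side: the helper names (sdk_filename, sdks_dict, candidate_urls) are hypothetical and not part of rules_go, and the "platform key to [filename, sha256]" layout is only my guess at what the string_list_dict is supposed to contain, so it needs confirming against the rules_go documentation. The filename pattern and the url.format(filename) expansion are taken from the error output above, and the "..." checksums are placeholders.

```python
# Hypothetical helpers for illustration only; none of these names exist in rules_go.

def sdk_filename(version, goos, goarch):
    # Follows the archive name seen in the failing URL above,
    # e.g. go1.18.1.linux-amd64.tar.gz (tar.gz archives only).
    return "go{}.{}-{}.tar.gz".format(version, goos, goarch)

def sdks_dict(version, checksums):
    # checksums maps a "goos_goarch" key to a SHA-256 hex string.
    # ASSUMPTION: sdks maps each platform key to [filename, sha256].
    result = {}
    for platform, sha256 in checksums.items():
        goos, goarch = platform.split("_")
        result[platform] = [sdk_filename(version, goos, goarch), sha256]
    return result

def candidate_urls(url_templates, filename):
    # Same expansion as the url.format(filename) call in the traceback:
    # one candidate URL per template, in the order given.
    return [template.format(filename) for template in url_templates]

GO_VERSION = "1.18.1"

SDKS = sdks_dict(GO_VERSION, {
    "linux_amd64": "...",   # placeholder for GO_LINUX_AMD64_SHA256
    "darwin_amd64": "...",  # placeholder for GO_DARWIN_AMD64_SHA256
    "darwin_arm64": "...",  # placeholder for GO_DARWIN_ARM64_SHA256
})

URLS = candidate_urls(
    [
        "https://dl.google.com/go/{}",
        "https://go.dev/dl/{}",
        # an internal mirror template would go here
    ],
    SDKS["linux_amd64"][0],
)
```

If that layout is right, SDKS is what I would pass as sdks, and URLS shows the two locations the linux/amd64 archive could be fetched from with the templates above. In a real setup the helpers would have to live in a .bzl file, since WORKSPACE files do not allow def.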

What did you expect to see?

What did you see instead?
