GitLab Runner build failures for Docker deployments (Docker 29)
## Summary

I'm using self-hosted GitLab runners that were set up using this documentation: https://docs.gitlab.com/runner/configuration/runner_autoscale_aws/

I noticed today that our deployments fail because the health check for the DIND service has suddenly started failing. This hasn't been a problem for months but started to occur a few days ago.

Basically, we build Docker images and push them to the GitLab registry; for deployment, we pull them from GitLab and push them to AWS ECR.

## Steps to reproduce

Execute a job that's defined like the job in the following details section:

<!-- Please add the definition of the job from `.gitlab-ci.yml` that is failing inside of the code blocks (```) below. -->

<details>
<summary> .gitlab-ci.yml </summary>

```yml
default:
  image: docker:24.0.5-cli
  services:
    - name: docker:24.0.5-dind
      variables:
        HEALTHCHECK_TCP_PORT: "2375"
  before_script:
    - docker info

variables:
  # 1) Name of directory where restore and build objects are stored.
  OBJECTS_DIRECTORY: 'obj'
  # 2) Name of directory used for keeping restored dependencies.
  NUGET_PACKAGES_DIRECTORY: '.nuget'
  # 3) A relative path to the source code from project repository root.
  # NOTE: Please edit this path so it matches the structure of your project!
  SOURCE_CODE_PATH: 'src/*/'

  # Docker
  DOCKER_DRIVER: overlay2
  # When using dind service, you must instruct Docker to talk with
  # the daemon started inside of the service. The daemon is available
  # with a network connection instead of the default
  # /var/run/docker.sock socket.
  DOCKER_HOST: tcp://docker:2375
  #
  # The 'docker' hostname is the alias of the service container as described at
  # https://docs.gitlab.com/ee/ci/services/#accessing-the-services.
  #
  # This instructs Docker not to start over TLS.
  DOCKER_TLS_CERTDIR: ""

  # Other variables redacted...
  # ...
deploy:aws:
  stage: deploy
  image: registry.gitlab.com/gitlab-org/cloud-deploy/aws-base:latest
  before_script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
  script:
    # 1) Pull from GitLab, retag, push to ECR
    - docker pull --platform ${SERVER_PLATFORM_AWS} $CI_REGISTRY_IMAGE/api:${DOCKER_TAG}
    - docker tag $CI_REGISTRY_IMAGE/api:${DOCKER_TAG} $AWS_ECR_REGISTRY/group/backend:${DOCKER_TAG}
    - aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $AWS_ECR_REGISTRY
    - docker push $AWS_ECR_REGISTRY/group/backend:${DOCKER_TAG}
```

</details>

## Actual behavior

Job fails with error:

```
Cannot connect to the Docker daemon at tcp://docker:2375. Is the docker daemon running?
```

## Expected behavior

The job should execute successfully, as it has for the last months.

## Relevant logs and/or screenshots

<!-- Paste the job logs inside of the code blocks (```) below so it would be easier to read. -->

<details>
<summary> job log </summary>

```sh
Running with gitlab-runner 18.3.0 (9ba718cd)
  on gitlab-aws-autoscaler iEYRzzd8s, system ID: s_e59dad7ef83b
Preparing the "docker+machine" executor 00:03
Using Docker executor with image docker:24.0.5-cli ...
Starting service docker:24.0.5-dind...
Using effective pull policy of [always] for container docker:24.0.5-dind
Pulling docker image docker:24.0.5-dind ...
Using docker image sha256:7015f2c475d511a251955877c2862016a4042512ba625ed905e69202f87e1a21 for docker:24.0.5-dind with digest docker@sha256:3c6e4dca7a63c9a32a4e00da40461ce067f255987ccc9721cf18ffa087bcd1ef ...
Waiting for services to be up and running (timeout 180 seconds)...
*** WARNING: Service runner-ieyrzzd8s-project-486-concurrent-0-81da683409ed8f1b-docker-0 probably didn't start properly.
Health check error:
service "runner-ieyrzzd8s-project-486-concurrent-0-81da683409ed8f1b-docker-0-wait-for-service" health check: exit code 1
Health check container logs:
2025-11-11T07:59:46.462387831Z FATAL: No HOST or PORT found
Service container logs:
2025-11-11T07:59:46.295360051Z time="2025-11-11T07:59:46.295243880Z" level=info msg="Starting up"
2025-11-11T07:59:46.295762312Z time="2025-11-11T07:59:46.295656887Z" level=warning msg="Binding to IP address without --tlsverify is insecure and gives root access on this machine to everyone who has access to your network." host="tcp://0.0.0.0:2375"
2025-11-11T07:59:46.295778187Z time="2025-11-11T07:59:46.295682203Z" level=warning msg="Binding to an IP address, even on localhost, can also give access to scripts run in a browser. Be safe out there!" host="tcp://0.0.0.0:2375"
*********
Using effective pull policy of [always] for container docker:24.0.5-cli
Pulling docker image docker:24.0.5-cli ...
Using docker image sha256:99c502855bab44eb998644c302407cbbcebfb6dc2a6d9c892acb00c412ca1902 for docker:24.0.5-cli with digest docker@sha256:21d8477f7dcd514414b1ffea6670d9963f0c81d23303452bb3ad7f93fedacb64 ...
Preparing environment 00:01
Using effective pull policy of [always] for container sha256:446e9bb1f9f503abc0a8b81b04acbdceca703007eb5bd10f827b0292a88e9787
Running on runner-ieyrzzd8s-project-486-concurrent-0 via runner-ieyrzzd8s-gitlab-docker-machine-1762842181-5ae87b1b...
Getting source from Git repository
```

and

```sh
Executing "step_script" stage of the job script 00:01
Using effective pull policy of [always] for container docker:24.0.5-cli
Using docker image sha256:99c502855bab44eb998644c302407cbbcebfb6dc2a6d9c892acb00c412ca1902 for docker:24.0.5-cli with digest docker@sha256:21d8477f7dcd514414b1ffea6670d9963f0c81d23303452bb3ad7f93fedacb64 ...
$ docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
WARNING! Your password will be stored unencrypted in /root/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store
Login Succeeded
$ eval $(ssh-agent -s)
Agent pid 29
$ ssh-add <(echo "$SSH_PRIVATE_KEY")
Identity added: /dev/fd/64 (redacted@redacted.local)
$ mkdir -p ~/.ssh
$ echo "$SSH_PRIVATE_KEY" >> ~/.ssh/id_rsa
$ chmod 600 ~/.ssh/id_rsa
$ echo "Host $REMOTE_HOST" >> ~/.ssh/config
$ echo "IdentityFile ~/.ssh/id_rsa" >> ~/.ssh/config
$ [[ -f /.dockerenv ]] && echo -e "Host *\n\tStrictHostKeyChecking no\n\n" > ~/.ssh/config
$ apk add rsync
fetch https://dl-cdn.alpinelinux.org/alpine/v3.18/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.18/community/x86_64/APKINDEX.tar.gz
(1/6) Installing libacl (2.3.1-r3)
(2/6) Installing lz4-libs (1.9.4-r4)
(3/6) Installing popt (1.19-r2)
(4/6) Installing libxxhash (0.8.2-r0)
(5/6) Installing zstd-libs (1.5.5-r4)
(6/6) Installing rsync (3.4.0-r0)
Executing busybox-1.36.1-r2.trigger
OK: 14 MiB in 28 packages
$ docker pull --platform ${SERVER_PLATFORM_DO} $CI_REGISTRY_IMAGE/api:${DOCKER_TAG}
Cannot connect to the Docker daemon at tcp://docker:2375. Is the docker daemon running?
Cleaning up project directory and file based variables 00:00
ERROR: Job failed: exit code 1
```

</details>

## Environment description

<!-- Are you using shared Runners on GitLab.com? Or is it a custom installation? Which executors are used? Please also provide the versions of related tools like `docker info` if you are using the Docker executor. -->

<!-- Please add the contents of `config.toml` inside of the code blocks (```) below, remember to remove any secret tokens!
-->

<details>
<summary> config.toml contents </summary>

```toml
concurrent = 4
check_interval = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "gitlab-aws-autoscaler"
  limit = 4
  url = "https://gitlab.dev.local"
  token = "redacted"
  executor = "docker+machine"
  [runners.cache]
    Type = "s3"
    Shared = true
    [runners.cache.s3]
      ServerAddress = "redacted"
      AccessKey = "redacted"
      SecretKey = "redacted"
      BucketName = "redacted"
      BucketLocation = "redacted"
  [runners.docker]
    tls_verify = false
    image = "docker:27.4"
    privileged = true
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = true
    shm_size = 0
    environment = ["LC_ALL=en_US.UTF-8", "TERM=xterm"]
    wait_for_services_timeout = 180
  [runners.machine]
    IdleCount = 0
    IdleTime = 1800
    MaxBuilds = 25
    MachineDriver = "amazonec2"
    MachineName = "gitlab-docker-machine-%s"
    MachineOptions = [
      "amazonec2-access-key= redacted",
      "amazonec2-secret-key= redacted",
      "amazonec2-region=eu-central-1",
      "amazonec2-vpc-id=redacted",
      "amazonec2-subnet-id=redacted",
      "amazonec2-use-private-address=true",
      "amazonec2-zone=b",
      "amazonec2-tags=runner-manager-name,gitlab-aws-autoscaler,gitlab,true,gitlab-runner-autoscale,true",
      "amazonec2-security-group=docker-machine-scaler",
      "amazonec2-instance-type=m4.xlarge",
      "amazonec2-ami=ami-0faab6bdbac9486fb",
      "amazonec2-root-size=24",
      "amazonec2-request-spot-instance=true",
    ]
```

</details>

### Used GitLab Runner version

<!-- Please run and paste the output of `gitlab-runner --version`. If you are using a Runner where you don't have access to, please paste at least the first lines the from build log, like:

```
Running with gitlab-ci-multi-runner 1.4.2 (bcc1794)
Using Docker executor with image golang:1.8 ...
```
-->

## Workarounds

This issue is due to Docker further dropping support for [`links`](https://docs.docker.com/engine/network/links/), which is a legacy method for Docker containers to find each other (such as our build and service containers).
When Docker announced the deprecation of links, we hoped that `FF_NETWORK_PER_BUILD` would be ready to roll out for everybody. However, it had issues in some environments, and Docker's default network ranges often conflicted with network ranges already in use inside corporations. Enabling `FF_NETWORK_PER_BUILD` will likely solve this for most customers, though.

The workarounds at the moment are therefore:

- Enable the `FF_NETWORK_PER_BUILD` feature flag.
- Start the Docker daemon on the host with `DOCKER_KEEP_DEPRECATED_LEGACY_LINKS_ENV_VARS=1`. This starts the Docker daemon with links support still enabled.

## Possible fixes

<!-- (If you can, link to the line of code that might be responsible for the problem) --->
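
As a minimal sketch of the first workaround listed under Workarounds, the feature flag could be set as a CI/CD variable in `.gitlab-ci.yml`. This assumes the runner picks up feature flags from job variables and that the `docker` service alias still resolves on the per-build network; neither has been verified against the setup described in this issue.

```yml
# Sketch only, not verified on this runner setup.
# Add the key to the existing top-level `variables:` block
# rather than declaring a second one.
variables:
  # Ask the runner to create a dedicated network per job instead of
  # relying on legacy container links between build and service containers.
  FF_NETWORK_PER_BUILD: "true"
```

With the flag set for the whole pipeline, the `deploy:aws` job should not need any other change; `DOCKER_HOST: tcp://docker:2375` stays as it is.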