
Docker daemon threads can become affixed to a container's namespace, causing network errors later when the container is stopped #1113

@alindeman

Description


Background

I'm able to consistently reproduce an issue where some operating system threads of the Docker daemon start out in a container's network namespace and stay in that namespace indefinitely. If or when the Go runtime later decides to schedule code on one of these threads, that network namespace may be in a broken state, especially if the container has since been torn down. In many of the instances I've observed, the network namespace no longer has any routes in its route table.

The symptoms can manifest as docker commands on the host (e.g., docker pull or docker push) failing with errors like network is unreachable or unknown host.

Requisite Version & System Information

[root@devenv proc]# docker info
Containers: 31
 Running: 0
 Paused: 0
 Stopped: 31
Images: 180
Server Version: 1.11.0
Storage Driver: devicemapper
 Pool Name: docker-8:16-67174528-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 10.86 GB
 Data Space Total: 107.4 GB
 Data Space Available: 23.19 GB
 Metadata Space Used: 19.17 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.128 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Either use `--storage-opt dm.thinpooldev` or use `--storage-opt dm.no_warn_on_loop_devices=true` to suppress this warning.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.107-RHEL7 (2015-10-14)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null host bridge
Kernel Version: 3.10.0-327.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 3.703 GiB
Name: devenv
ID: XEWN:LJUD:FPKB:6Q4A:24K2:UIXX:DIDD:OQSO:D6PH:AT45:GNN6:RC7X
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/

Steps to Reproduce

I find it's easiest to reproduce when the Docker daemon has been freshly started and the Go runtime has only spawned a few operating system threads.

I also find that it's easiest to reproduce when starting a decent number of containers all at once. I threw together a docker-compose.yml file that starts 10 Redis containers:

version: "2"
services:
  redis0:
    image: redis:latest
  redis1:
    image: redis:latest
  redis2:
    image: redis:latest
  redis3:
    image: redis:latest
  redis4:
    image: redis:latest
  redis5:
    image: redis:latest
  redis6:
    image: redis:latest
  redis7:
    image: redis:latest
  redis8:
    image: redis:latest
  redis9:
    image: redis:latest

Start all the containers at once with:

docker-compose up -d

Tear down the containers and the network with:

docker-compose down

Navigate to /proc/<docker pid>/task and view the network namespace for each thread:

[root@devenv task]# ls -al */ns/net
lrwxrwxrwx. 1 root root 0 Apr 14 02:01 6382/ns/net -> net:[4026531956]
[ ... elided ... ]
lrwxrwxrwx. 1 root root 0 Apr 14 02:01 6992/ns/net -> net:[4026531956]
lrwxrwxrwx. 1 root root 0 Apr 14 02:01 7044/ns/net -> net:[4026533029]

If the bug has manifested, there will be a thread or two in a different namespace from the rest. In my example, thread 7044 is in a namespace previously used by one of the containers; since the container has been killed and the network torn down, the namespace is no longer functional:

[root@devenv task]# nsenter --net=7044/ns/net ip route
[root@devenv task]# nsenter --net=7044/ns/net ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
[root@devenv task]# nsenter --net=7044/ns/net ping google.com
ping: unknown host google.com

The bug doesn't always manifest, as there is non-determinism involved (which I try to explain later on). Sometimes I need to restart the Docker daemon and try again to get it to appear.

Because thread 7044 (or any thread like it) runs as part of the Docker daemon and on top of the Go runtime, the Go runtime is free to schedule code to execute on it. If it does--for instance, to handle a Remote API request to pull an image--that code will fail with a network error.

Analysis

I believe these rogue threads are created when the libnetwork code starts the embedded DNS server. I'll do my best to step through the code path I believe to be problematic.

When a new container in a non-default network is created, the sandbox.SetKey function is invoked to, among other duties, set up the DNS resolver. Specifically, the function returned by sb.resolver.SetupFunc() is invoked within the network namespace of the container in order to start the resolver.

libnetwork provides functions like InvokeFunc and nsInvoke to execute code in a given network namespace.

nsInvoke invokes the function provided as its first argument just before switching namespaces, then switches namespaces, then invokes the function provided as its second argument (now within the new namespace), and finally switches back to the original namespace. Because the Go runtime could ordinarily preempt the code and start execution of a different piece of code on the same operating system thread, nsInvoke calls runtime.LockOSThread() to make sure the current goroutine is the only one that can use the operating system thread. This prevents other code from inadvertently executing in the network namespace while the operating system thread is switched over to the container namespace.

However, I've determined in my analysis that it is also not safe to start a new goroutine in the critical section where the namespace is switched. Starting a new goroutine in that region has the potential to prompt the Go runtime to start a new operating system thread--cloned from the current one, which inherits any network namespace settings--if the Go scheduler doesn't have any other available threads to run code on.
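To make the pattern concrete, here is a minimal sketch of an nsInvoke-style helper, assuming golang.org/x/sys/unix for the setns(2) wrapper. The helper name and comments are mine, not libnetwork's actual code, but it shows both the LockOSThread protection and the hazard described above: any goroutine started inside the critical section may land on a freshly cloned OS thread that inherits the container namespace.

package nsutil

import (
	"fmt"
	"runtime"

	"golang.org/x/sys/unix"
)

// nsInvokeSketch is a simplified approximation of libnetwork's nsInvoke.
// nsPath would be something like /var/run/docker/netns/<id>.
func nsInvokeSketch(nsPath string, inNamespace func() error) error {
	// Pin this goroutine to its OS thread so no other goroutine can run on
	// the thread while its network namespace is switched.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	// Remember the thread's current (host) network namespace.
	origNS, err := unix.Open(fmt.Sprintf("/proc/self/task/%d/ns/net", unix.Gettid()), unix.O_RDONLY, 0)
	if err != nil {
		return err
	}
	defer unix.Close(origNS)

	targetNS, err := unix.Open(nsPath, unix.O_RDONLY, 0)
	if err != nil {
		return err
	}
	defer unix.Close(targetNS)

	// Switch this OS thread into the container's network namespace.
	if err := unix.Setns(targetNS, unix.CLONE_NEWNET); err != nil {
		return err
	}

	// Critical section: everything here runs inside the container namespace.
	// If inNamespace (or anything it calls) starts a goroutine, the Go
	// scheduler may clone(2) a brand-new OS thread from this one; that thread
	// inherits the container namespace and is never switched back.
	callErr := inNamespace()

	// Switch the thread back to the original namespace before unlocking.
	if err := unix.Setns(origNS, unix.CLONE_NEWNET); err != nil {
		return err
	}
	return callErr
}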

And, unfortunately, there are pieces of code that (inadvertently?) start goroutines during the setup of the embedded DNS server. Specifically, the resolver.SetupFunc function invokes iptables.RawCombinedOutputNative. Following the call chain, we arrive at some code that invokes exec.Command(...).CombinedOutput().

The exec package is implemented in Go and is pretty easy to read. The CombinedOutput method invokes the Run method, which in turn invokes the Start method. And within the Start method, several goroutines are spawned to monitor for errors (via the errch channel).
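The relevant part of Start, paraphrased from the os/exec source of that era (not a verbatim copy), looks roughly like this: one goroutine per stdin/stdout/stderr copier, each reporting its result on the error channel that Wait later drains.

// Paraphrased sketch of what exec.(*Cmd).Start does after forking the child:
// spawn one goroutine per I/O copier and collect their errors on c.errch.
c.errch = make(chan error, len(c.goroutine))
for _, fn := range c.goroutine {
	go func(fn func() error) {
		c.errch <- fn()
	}(fn)
}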

Unfortunately, these goroutines are spawned while the network namespace is switched; and when a goroutine is spawned, the Go runtime is free to clone(2) a new operating system thread to run it if it doesn't have any other threads available. Early in the Docker daemon's lifetime, I believe this can happen easily.

To verify this was the issue, I ran sysdig to follow the syscalls leading up to the spawning of a thread that I observed get 'stuck' in the container namespace:

First, the setns call is invoked to switch namespaces on a parent thread, 29917:

17426 02:35:02.592507436 0 docker (29917) > setns fd=67(<f>/var/run/docker/netns/14d91abe4631) nstype=16(CLONE_NEWNET)
17427 02:35:02.592507720 0 docker (29917) < setns res=0

The thread reports that it is invoking iptables:

17457 02:35:02.592936615 0 docker (29917) < write res=177 data=time="2016-04-14T02:35:02.592919131Z" level=debug msg="/usr/sbin/iptables, [--wa
17458 02:35:02.592939980 0 docker (29917) > openat dirfd=-100(ENETDOWN) name=/dev/null flags=4097(O_RDONLY|O_CLOEXEC) mode=0

It does a typical clone/execve procedure to start iptables. This thread's image is replaced by iptables, so it is of no concern:

17492 02:35:02.600341433 0 docker (30070) < clone res=0 exe=/usr/bin/docker args=daemon.-H.fd://.-H.tcp://0.0.0.0:2375.--log-level=debug. tid=30070(docker) pid=30070(docker) ptid=29917(docker) cwd=/ fdlimit=1048576 pgft_maj=0 pgft_min=0 vm_si
ze=765528 vm_rss=24336 vm_swap=0 comm=docker cgroups=cpuset=/.cpu_cgroup=/system.slice/docker.service.cpuacct=/system.slice/docker... flags=0 uid=0 gid=0 vtid=30070(docker) vpid=30070(docker)
17493 02:35:02.600354571 0 docker (30070) > dup fd=70(<f>/dev/null)
17494 02:35:02.600355669 0 docker (30070) < dup res=0(<f>/dev/null)
17495 02:35:02.600355937 0 docker (30070) > dup fd=72(<p>)
17496 02:35:02.600356455 0 docker (30070) < dup res=1(<p>)
17497 02:35:02.600356626 0 docker (30070) > dup fd=72(<p>)
17498 02:35:02.600356737 0 docker (30070) < dup res=2(<p>)
17499 02:35:02.600357022 0 docker (30070) > execve

Next, however, a new operating system thread is created as part of the goroutine(s) spun up in the exec.Start method:

17522 02:35:02.602428507 0 docker (29917) < clone res=30077(docker) exe=/usr/bin/docker args=daemon.-H.fd://.-H.tcp://0.0.0.0:2375.--log-level=debug. tid=29917(docker) pid=29717(docker) ptid=1(systemd) cwd=/ fdlimit=1048576 pgft_maj=0 pgft_min=371 vm_size=773724 vm_rss=36432 vm_swap=0 comm=docker cgroups=cpuset=/.cpu_cgroup=/system.slice/docker.service.cpuacct=/system.slice/docker... flags=47619(CLONE_FILES|CLONE_FS|CLONE_PARENT_SETTID|CLONE_SIGHAND|CLONE_SYSVSEM|CLONE_THREAD|CLONE_VM) uid=0 gid=0 vtid=29917(docker) vpid=29717(docker)

This thread, 30077, is started in the regular course of the Go runtime's operation. It's not used to execve another program: it's used to send to the errch channel in exec.Start. When the goroutine finishes, the Go runtime is free to use the operating system thread for another purpose. And because it was started while its parent thread had switched into the container's namespace, it's more-or-less permanently stuck there.

The main thread, 29917, goes on to setns itself back to the original namespace:

18586 02:35:02.620067754 0 docker (29917) > setns fd=11(<f>net:[4026531956]) nstype=16(CLONE_NEWNET)
18587 02:35:02.620069887 0 docker (29917) < setns res=0

All is well for thread 29917. But no such setns happens for thread 30077, because the Go runtime abstracts threads away from the program. As far as I know, there's no way to even be aware that a new operating system thread has been spawned.

If and when thread 30077 is used to run other code in the normal course of the Docker daemon's operation, that code will run in the network context of the container. When and if the container and/or network is destroyed, the route table for that network namespace is destroyed too, leaving it with only a loopback interface and an empty route table (shown above in the nsenter output). At that point, any code running on thread 30077 will be unable to reach the local network or the Internet, so operations like docker pull or docker push will return errors.

Because of the nature of the Go runtime and scheduler, these errors occur only sporadically, and only when code happens to be scheduled onto an affected thread.
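Since there's no way to be notified when the runtime clones a thread, the damage can at least be detected after the fact. The following is a small, hypothetical diagnostic (not part of Docker) that automates the /proc inspection shown earlier: it compares every daemon thread's network namespace to the main thread's and prints any mismatches.

package main

import (
	"fmt"
	"io/ioutil"
	"os"
	"path/filepath"
)

// Usage: rogue-threads <docker daemon pid>
// Prints every thread whose net namespace differs from the main thread's.
func main() {
	pid := os.Args[1]
	mainNS, err := os.Readlink(filepath.Join("/proc", pid, "ns", "net"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	tasks, err := ioutil.ReadDir(filepath.Join("/proc", pid, "task"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, t := range tasks {
		ns, err := os.Readlink(filepath.Join("/proc", pid, "task", t.Name(), "ns", "net"))
		if err != nil {
			continue // the thread may have exited
		}
		if ns != mainNS {
			fmt.Printf("thread %s is stuck in %s (main thread: %s)\n", t.Name(), ns, mainNS)
		}
	}
}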

Thanks to @lstoll and @wfarr for helping me debug this issue.
