Decouple removing the fileWatcher from reading#27782
Conversation
dc0b73e to
2763344
Compare
2763344 to
5d06603
Compare
5d06603 to
5a446f7
Compare
|
I added in the initial commit because |
|
Nice, I've observed this deadlock as well. |
|
|
@cpuguy83 see my comment above, I changed the signature of a reader because FYI without the removal of the cast the tests all happily passed locally. |
|
I could do something like: That should keep both lint and the build happy. |
5a446f7 to
f78e713
Compare
|
I've pushed that, so lets see what happens 🍀 |
daemon/logger/jsonfilelog/read.go
Outdated
There was a problem hiding this comment.
There should be no need to cast here.
Maybe just for simplicity (removing the else) we can do:
var rdr io.Reader
rdr = f
if tail > 0 {
...
}And the linter may be happy?
There was a problem hiding this comment.
I'll let this build run to see if everything is happy and then make that change.
There was a problem hiding this comment.
I've pushed the change. Last time it passed all test suites, so as this update has got passed lint I am hopeful 😄
`golint` had the following issue when linting this file: ``` daemon/logger/jsonfilelog/read.go:116:10: should omit type io.Reader from declaration of var rdr; it will be inferred from the right-hand side ``` In order to keep it happy changing it to an indirect assignment will still maintain the same functionality. Signed-off-by: Tom Booth <tombooth@gmail.com>
f78e713 to
814f5ed
Compare
daemon/logger/jsonfilelog/read.go
Outdated
There was a problem hiding this comment.
Looking at the watch loop below, I think we need some way to cancel this goroutine since the loop may exit without logwatcher closing. May be time to whip out a context.
There was a problem hiding this comment.
Good shout! I've just pushed an update to the code, does it look a little better? If it's not quite how you imagined let me know.
There was a problem hiding this comment.
What do you reckon the chances of this making it into v1.13.0 are @cpuguy83 ?
9c69552 to
bb1500c
Compare
daemon/logger/jsonfilelog/read.go
Outdated
There was a problem hiding this comment.
Maybe instead of creating this channel, we can do context.WithCancel(ctx) to create a nested context.
bb1500c to
8d9380c
Compare
daemon/logger/jsonfilelog/read.go
Outdated
There was a problem hiding this comment.
Sorry, one more thing, I think this need to run in either case
There was a problem hiding this comment.
It's no problem, I had assumed the deferred fileWatcher.Close() would have taken care of it but looking at the source it doesn't clear up any open inotify watchers so we would be leaking descriptors. Fix incoming.
There was a problem hiding this comment.
:) defer filewatcher.Remove above the select, but no worries.
There was a problem hiding this comment.
Sorry, it is a little late over here. I've just been having a think about the order of execution. If a fileWatcher.Remove happens after fileWatcher.Close it would likely hang as Remove would be waiting for an IN_IGNORE event but readEvents (in https://github.com/docker/docker/blob/362eb4cfbb09b5dc0fe648360a2e6d0546bf6613/vendor/src/gopkg.in/fsnotify.v1/inotify.go) would have exited because the watcher was closed so it would never receive it. This makes me a little concerned about the Remove after ctx.Done() as the fileWatcher.Close() in the second defer could have run by the time Remove is called.
I'm fairly new to Go so if that doesn't sounds right (could also be because I'm tried) let me know, otherwise I might switch to a fileWatcher.Remove before the Close in the defer above the go block. If it has already been removed it should have no issue, it will just return an error that we ignore anyway: https://github.com/docker/docker/blob/362eb4cfbb09b5dc0fe648360a2e6d0546bf6613/vendor/src/gopkg.in/fsnotify.v1/inotify.go#L135
8d9380c to
b96f139
Compare
|
LGTM if janky is happy |
b96f139 to
68c3c21
Compare
daemon/logger/jsonfilelog/read.go
Outdated
There was a problem hiding this comment.
this is not compatible with go1.6 I think, you should use golang.org/x/net/context I think.
There was a problem hiding this comment.
I've switched this, thanks for the pointer.
daemon/logger/jsonfilelog/read.go
Outdated
There was a problem hiding this comment.
I think top context is unnecessary(maybe we'll pass it from api call later).
So, I think this line should look like:
ctx, cancel := context.WithCancel(ctx)
defer cancel()
There was a problem hiding this comment.
We need one to notify that the watcher is closed, and the other to notify that the goroutine needs to exit.
There was a problem hiding this comment.
I don't really understand why we can't use one for both?
There was a problem hiding this comment.
Other than multiple owners of cancel()? It's probably ok. I prefer the separation, but assuming nothing panics it seems ok.
There was a problem hiding this comment.
I've moved it to a single context.
|
LGTM |
Perhaps. Unfortunately in the last months I never saw the TestLogsFollowSlowStdoutConsumer testcase failing again. I would be fine with closing #16462. |
|
thanks @michael-holzheu, let me close it for now, we can reopen if it occurs again |
`golint` had the following issue when linting this file: ``` daemon/logger/jsonfilelog/read.go:116:10: should omit type io.Reader from declaration of var rdr; it will be inferred from the right-hand side ``` In order to keep it happy changing it to an indirect assignment will still maintain the same functionality. fix DTS2017062809604 backport from moby#27782 without conflicts. Signed-off-by: Tom Booth <tombooth@gmail.com> Signed-off-by: Yuanhong Peng <pengyuanhong@huawei.com>
Fixes moby#27779 Currently `followLogs` can get into a deadlock if we receive an inotify IN_MODIFY event while we are trying to close the `fileWatcher`. This is due to the fact that closing the `fileWatcher` happens in the same block as consumes events from the `fileWatcher`. We are trying to run `fileWatcher.Close`, which is waiting for an IN_IGNORE event to come in over inotify to confirm the watch was been removed. But, because an IN_MODIFY event has appeared after `Close` was entered but before the IN_IGNORE, the broadcast never comes. The IN_MODIFY cannot be consumed as the events channel is unbuffered and the only `select` that reads from it is busy waiting for the IN_IGNORE event. In order to try and fix this race condition I've moved the removal of the `fileWatcher` out to a separate go block that waits for a signal to close, removes the watcher and then signals to the previous selects on the close signal. This has introduced a `fileWatcher.Remove` in the final case, but if we try and remove a watcher that does not exist it will just return an error saying so. We are not doing any checking on the return of `Remove` so this shouldn't cause any side-effects. fix DTS2017062809604 backport from moby#27782 without conflicts. Signed-off-by: Tom Booth <tombooth@gmail.com> Signed-off-by: Yuanhong Peng <pengyuanhong@huawei.com>
Fix docker logs hang
Steps to reproduce the issue:
1. Run the attached script, eventually the `docker logs -f` command will hang:
```
#!/bin/bash
while true
do
container_id="$(docker run -d ubuntu:latest /bin/bash -e -c "echo -n \"foo\"")"
docker logs -f "${container_id}"
done
```
fix DTS2017062809604
backport from moby#31856
backport from moby#27782
backport from moby#22088
conflicts:
daemon/attach.go
daemon/logs.go
daemon/logger/journald/read.go
daemon/logger/jsonfilelog/read.go
daemon/logs.go
Signed-off-by: Jim Minter <jminter@redhat.com>
Signed-off-by: Tom Booth <tombooth@gmail.com>
Signed-off-by: Amit Krishnan <krish.amit@gmail.com>
Signed-off-by: Yuanhong Peng <pengyuanhong@huawei.com>
See merge request docker/docker!625
|
For the sake of people reading this: the bug causing the issue is a bug in fsnotify package (fsnotify/fsnotify#195, fsnotify/fsnotify#123, fsnotify/fsnotify#115), which has since been fixed by commit fsnotify/fsnotify@4da3e2c "Fix deadlock in Remove (linux/inotify)". The fix has appeared in fsnotify v.1.4.7, which was vendored to moby/moby in commit 9f016c0. The fix in this PR was just a workaround for the above fsnotify bug. All this means the workarond can be now removed. |
When daemon.ContainerLogs() is called with options.follow=true
(as in "docker logs --follow"), the "loggerutils.followLogs()"
function never returns (even then the logs consumer is gone).
As a result, all the resources associated with it (including
an opened file descriptor for the log file being read, two FDs
for a pipe, and two FDs for inotify watch) are never released.
If this is repeated (such as by running "docker logs --follow"
and pressing Ctrl-C a few times), this results in DoS caused by
either hitting the limit of inotify watches, or the limit of
opened files. The only cure is daemon restart.
Apparently, what happens is:
1. logs producer (a container) is gone, calling (*LogWatcher).Close()
for all its readers (daemon/logger/jsonfilelog/jsonfilelog.go:175).
2. WatchClose() is properly handled by a dedicated goroutine in
followLogs(), cancelling the context.
3. Upon receiving the ctx.Done(), the code in followLogs()
(daemon/logger/loggerutils/logfile.go#L626-L638) keeps to
send messages _synchronously_ (which is OK for now).
4. Logs consumer is gone (Ctrl-C is pressed on a terminal running
"docker logs --follow"). Method (*LogWatcher).Close() is properly
called (see daemon/logs.go:114). Since it was called before and
due to to once.Do(), nothing happens (which is kinda good, as
otherwise it will panic on closing a closed channel).
5. A goroutine (see item 3 above) keeps sending log messages
synchronously to the logWatcher.Msg channel. Since the
channel reader is gone, the channel send operation blocks forever,
and resource cleanup set up in defer statements at the beginning
of followLogs() never happens.
Alas, the fix is somewhat complicated:
1. Distinguish between close from logs producer and logs consumer.
To that effect,
- yet another channel is added to LogWatcher();
- {Watch,}Close() are renamed to {Watch,}ProducerGone();
- {Watch,}ConsumerGone() are added;
*NOTE* that ProducerGone()/WatchProducerGone() pair is ONLY needed
in order to stop ConsumerLogs(follow=true) when a container is stopped;
otherwise we're not interested in it. In other words, we're only
using it in followLogs().
2. Code that was doing (logWatcher*).Close() is modified to either call
ProducerGone() or ConsumerGone(), depending on the context.
3. Code that was waiting for WatchClose() is modified to wait for
either ConsumerGone() or ProducerGone(), or both, depending on the
context.
4. followLogs() are modified accordingly:
- context cancellation is happening on WatchProducerGone(),
and once it's received the FileWatcher is closed and waitRead()
returns errDone on EOF (i.e. log rotation handling logic is disabled);
- due to this, code that was writing synchronously to logWatcher.Msg
can be and is removed as the code above it handles this case;
- function returns once ConsumerGone is received, freeing all the
resources -- this is the bugfix itself.
While at it,
1. Let's also remove the ctx usage to simplify the code a bit.
It was introduced by commit a69a59f ("Decouple removing the
fileWatcher from reading") in order to fix a bug. The bug was actually
a deadlock in fsnotify, and the fix was just a workaround. Since then
the fsnofify bug has been fixed, and a new fsnotify was vendored in.
For more details, please see
moby#27782 (comment)
2. Since `(*filePoller).Close()` is fixed to remove all the files
being watched, there is no need to explicitly call
fileWatcher.Remove(name) anymore, so get rid of the extra code.
Should fix moby#37391
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When daemon.ContainerLogs() is called with options.follow=true
(as in "docker logs --follow"), the "loggerutils.followLogs()"
function never returns (even then the logs consumer is gone).
As a result, all the resources associated with it (including
an opened file descriptor for the log file being read, two FDs
for a pipe, and two FDs for inotify watch) are never released.
If this is repeated (such as by running "docker logs --follow"
and pressing Ctrl-C a few times), this results in DoS caused by
either hitting the limit of inotify watches, or the limit of
opened files. The only cure is daemon restart.
Apparently, what happens is:
1. logs producer (a container) is gone, calling (*LogWatcher).Close()
for all its readers (daemon/logger/jsonfilelog/jsonfilelog.go:175).
2. WatchClose() is properly handled by a dedicated goroutine in
followLogs(), cancelling the context.
3. Upon receiving the ctx.Done(), the code in followLogs()
(daemon/logger/loggerutils/logfile.go#L626-L638) keeps to
send messages _synchronously_ (which is OK for now).
4. Logs consumer is gone (Ctrl-C is pressed on a terminal running
"docker logs --follow"). Method (*LogWatcher).Close() is properly
called (see daemon/logs.go:114). Since it was called before and
due to to once.Do(), nothing happens (which is kinda good, as
otherwise it will panic on closing a closed channel).
5. A goroutine (see item 3 above) keeps sending log messages
synchronously to the logWatcher.Msg channel. Since the
channel reader is gone, the channel send operation blocks forever,
and resource cleanup set up in defer statements at the beginning
of followLogs() never happens.
Alas, the fix is somewhat complicated:
1. Distinguish between close from logs producer and logs consumer.
To that effect,
- yet another channel is added to LogWatcher();
- {Watch,}Close() are renamed to {Watch,}ProducerGone();
- {Watch,}ConsumerGone() are added;
*NOTE* that ProducerGone()/WatchProducerGone() pair is ONLY needed
in order to stop ConsumerLogs(follow=true) when a container is stopped;
otherwise we're not interested in it. In other words, we're only
using it in followLogs().
2. Code that was doing (logWatcher*).Close() is modified to either call
ProducerGone() or ConsumerGone(), depending on the context.
3. Code that was waiting for WatchClose() is modified to wait for
either ConsumerGone() or ProducerGone(), or both, depending on the
context.
4. followLogs() are modified accordingly:
- context cancellation is happening on WatchProducerGone(),
and once it's received the FileWatcher is closed and waitRead()
returns errDone on EOF (i.e. log rotation handling logic is disabled);
- due to this, code that was writing synchronously to logWatcher.Msg
can be and is removed as the code above it handles this case;
- function returns once ConsumerGone is received, freeing all the
resources -- this is the bugfix itself.
While at it,
1. Let's also remove the ctx usage to simplify the code a bit.
It was introduced by commit a69a59f ("Decouple removing the
fileWatcher from reading") in order to fix a bug. The bug was actually
a deadlock in fsnotify, and the fix was just a workaround. Since then
the fsnofify bug has been fixed, and a new fsnotify was vendored in.
For more details, please see
moby#27782 (comment)
2. Since `(*filePoller).Close()` is fixed to remove all the files
being watched, there is no need to explicitly call
fileWatcher.Remove(name) anymore, so get rid of the extra code.
Should fix moby#37391
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit 916eabd)
When daemon.ContainerLogs() is called with options.follow=true
(as in "docker logs --follow"), the "loggerutils.followLogs()"
function never returns (even then the logs consumer is gone).
As a result, all the resources associated with it (including
an opened file descriptor for the log file being read, two FDs
for a pipe, and two FDs for inotify watch) are never released.
If this is repeated (such as by running "docker logs --follow"
and pressing Ctrl-C a few times), this results in DoS caused by
either hitting the limit of inotify watches, or the limit of
opened files. The only cure is daemon restart.
Apparently, what happens is:
1. logs producer (a container) is gone, calling (*LogWatcher).Close()
for all its readers (daemon/logger/jsonfilelog/jsonfilelog.go:175).
2. WatchClose() is properly handled by a dedicated goroutine in
followLogs(), cancelling the context.
3. Upon receiving the ctx.Done(), the code in followLogs()
(daemon/logger/loggerutils/logfile.go#L626-L638) keeps to
send messages _synchronously_ (which is OK for now).
4. Logs consumer is gone (Ctrl-C is pressed on a terminal running
"docker logs --follow"). Method (*LogWatcher).Close() is properly
called (see daemon/logs.go:114). Since it was called before and
due to to once.Do(), nothing happens (which is kinda good, as
otherwise it will panic on closing a closed channel).
5. A goroutine (see item 3 above) keeps sending log messages
synchronously to the logWatcher.Msg channel. Since the
channel reader is gone, the channel send operation blocks forever,
and resource cleanup set up in defer statements at the beginning
of followLogs() never happens.
Alas, the fix is somewhat complicated:
1. Distinguish between close from logs producer and logs consumer.
To that effect,
- yet another channel is added to LogWatcher();
- {Watch,}Close() are renamed to {Watch,}ProducerGone();
- {Watch,}ConsumerGone() are added;
*NOTE* that ProducerGone()/WatchProducerGone() pair is ONLY needed
in order to stop ConsumerLogs(follow=true) when a container is stopped;
otherwise we're not interested in it. In other words, we're only
using it in followLogs().
2. Code that was doing (logWatcher*).Close() is modified to either call
ProducerGone() or ConsumerGone(), depending on the context.
3. Code that was waiting for WatchClose() is modified to wait for
either ConsumerGone() or ProducerGone(), or both, depending on the
context.
4. followLogs() are modified accordingly:
- context cancellation is happening on WatchProducerGone(),
and once it's received the FileWatcher is closed and waitRead()
returns errDone on EOF (i.e. log rotation handling logic is disabled);
- due to this, code that was writing synchronously to logWatcher.Msg
can be and is removed as the code above it handles this case;
- function returns once ConsumerGone is received, freeing all the
resources -- this is the bugfix itself.
While at it,
1. Let's also remove the ctx usage to simplify the code a bit.
It was introduced by commit a69a59f ("Decouple removing the
fileWatcher from reading") in order to fix a bug. The bug was actually
a deadlock in fsnotify, and the fix was just a workaround. Since then
the fsnofify bug has been fixed, and a new fsnotify was vendored in.
For more details, please see
moby#27782 (comment)
2. Since `(*filePoller).Close()` is fixed to remove all the files
being watched, there is no need to explicitly call
fileWatcher.Remove(name) anymore, so get rid of the extra code.
Should fix moby#37391
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit 916eabd)
[Conflict while cherry-picking due to missing commit 94a1015]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When daemon.ContainerLogs() is called with options.follow=true
(as in "docker logs --follow"), the "loggerutils.followLogs()"
function never returns (even then the logs consumer is gone).
As a result, all the resources associated with it (including
an opened file descriptor for the log file being read, two FDs
for a pipe, and two FDs for inotify watch) are never released.
If this is repeated (such as by running "docker logs --follow"
and pressing Ctrl-C a few times), this results in DoS caused by
either hitting the limit of inotify watches, or the limit of
opened files. The only cure is daemon restart.
Apparently, what happens is:
1. logs producer (a container) is gone, calling (*LogWatcher).Close()
for all its readers (daemon/logger/jsonfilelog/jsonfilelog.go:175).
2. WatchClose() is properly handled by a dedicated goroutine in
followLogs(), cancelling the context.
3. Upon receiving the ctx.Done(), the code in followLogs()
(daemon/logger/loggerutils/logfile.go#L626-L638) keeps to
send messages _synchronously_ (which is OK for now).
4. Logs consumer is gone (Ctrl-C is pressed on a terminal running
"docker logs --follow"). Method (*LogWatcher).Close() is properly
called (see daemon/logs.go:114). Since it was called before and
due to to once.Do(), nothing happens (which is kinda good, as
otherwise it will panic on closing a closed channel).
5. A goroutine (see item 3 above) keeps sending log messages
synchronously to the logWatcher.Msg channel. Since the
channel reader is gone, the channel send operation blocks forever,
and resource cleanup set up in defer statements at the beginning
of followLogs() never happens.
Alas, the fix is somewhat complicated:
1. Distinguish between close from logs producer and logs consumer.
To that effect,
- yet another channel is added to LogWatcher();
- {Watch,}Close() are renamed to {Watch,}ProducerGone();
- {Watch,}ConsumerGone() are added;
*NOTE* that ProducerGone()/WatchProducerGone() pair is ONLY needed
in order to stop ConsumerLogs(follow=true) when a container is stopped;
otherwise we're not interested in it. In other words, we're only
using it in followLogs().
2. Code that was doing (logWatcher*).Close() is modified to either call
ProducerGone() or ConsumerGone(), depending on the context.
3. Code that was waiting for WatchClose() is modified to wait for
either ConsumerGone() or ProducerGone(), or both, depending on the
context.
4. followLogs() are modified accordingly:
- context cancellation is happening on WatchProducerGone(),
and once it's received the FileWatcher is closed and waitRead()
returns errDone on EOF (i.e. log rotation handling logic is disabled);
- due to this, code that was writing synchronously to logWatcher.Msg
can be and is removed as the code above it handles this case;
- function returns once ConsumerGone is received, freeing all the
resources -- this is the bugfix itself.
While at it,
1. Let's also remove the ctx usage to simplify the code a bit.
It was introduced by commit a69a59f ("Decouple removing the
fileWatcher from reading") in order to fix a bug. The bug was actually
a deadlock in fsnotify, and the fix was just a workaround. Since then
the fsnofify bug has been fixed, and a new fsnotify was vendored in.
For more details, please see
moby/moby#27782 (comment)
2. Since `(*filePoller).Close()` is fixed to remove all the files
being watched, there is no need to explicitly call
fileWatcher.Remove(name) anymore, so get rid of the extra code.
Should fix moby/moby#37391
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Upstream-commit: 916eabd459fe707b5c4a86377d12e2ad1871b353
Component: engine
When daemon.ContainerLogs() is called with options.follow=true
(as in "docker logs --follow"), the "loggerutils.followLogs()"
function never returns (even then the logs consumer is gone).
As a result, all the resources associated with it (including
an opened file descriptor for the log file being read, two FDs
for a pipe, and two FDs for inotify watch) are never released.
If this is repeated (such as by running "docker logs --follow"
and pressing Ctrl-C a few times), this results in DoS caused by
either hitting the limit of inotify watches, or the limit of
opened files. The only cure is daemon restart.
Apparently, what happens is:
1. logs producer (a container) is gone, calling (*LogWatcher).Close()
for all its readers (daemon/logger/jsonfilelog/jsonfilelog.go:175).
2. WatchClose() is properly handled by a dedicated goroutine in
followLogs(), cancelling the context.
3. Upon receiving the ctx.Done(), the code in followLogs()
(daemon/logger/loggerutils/logfile.go#L626-L638) keeps to
send messages _synchronously_ (which is OK for now).
4. Logs consumer is gone (Ctrl-C is pressed on a terminal running
"docker logs --follow"). Method (*LogWatcher).Close() is properly
called (see daemon/logs.go:114). Since it was called before and
due to to once.Do(), nothing happens (which is kinda good, as
otherwise it will panic on closing a closed channel).
5. A goroutine (see item 3 above) keeps sending log messages
synchronously to the logWatcher.Msg channel. Since the
channel reader is gone, the channel send operation blocks forever,
and resource cleanup set up in defer statements at the beginning
of followLogs() never happens.
Alas, the fix is somewhat complicated:
1. Distinguish between close from logs producer and logs consumer.
To that effect,
- yet another channel is added to LogWatcher();
- {Watch,}Close() are renamed to {Watch,}ProducerGone();
- {Watch,}ConsumerGone() are added;
*NOTE* that ProducerGone()/WatchProducerGone() pair is ONLY needed
in order to stop ConsumerLogs(follow=true) when a container is stopped;
otherwise we're not interested in it. In other words, we're only
using it in followLogs().
2. Code that was doing (logWatcher*).Close() is modified to either call
ProducerGone() or ConsumerGone(), depending on the context.
3. Code that was waiting for WatchClose() is modified to wait for
either ConsumerGone() or ProducerGone(), or both, depending on the
context.
4. followLogs() are modified accordingly:
- context cancellation is happening on WatchProducerGone(),
and once it's received the FileWatcher is closed and waitRead()
returns errDone on EOF (i.e. log rotation handling logic is disabled);
- due to this, code that was writing synchronously to logWatcher.Msg
can be and is removed as the code above it handles this case;
- function returns once ConsumerGone is received, freeing all the
resources -- this is the bugfix itself.
While at it,
1. Let's also remove the ctx usage to simplify the code a bit.
It was introduced by commit a69a59f ("Decouple removing the
fileWatcher from reading") in order to fix a bug. The bug was actually
a deadlock in fsnotify, and the fix was just a workaround. Since then
the fsnofify bug has been fixed, and a new fsnotify was vendored in.
For more details, please see
moby/moby#27782 (comment)
2. Since `(*filePoller).Close()` is fixed to remove all the files
being watched, there is no need to explicitly call
fileWatcher.Remove(name) anymore, so get rid of the extra code.
Should fix moby/moby#37391
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit 916eabd459fe707b5c4a86377d12e2ad1871b353)
Upstream-commit: 84a5b528aede5579861201e869870d10fc98c07c
Component: engine
When daemon.ContainerLogs() is called with options.follow=true
(as in "docker logs --follow"), the "loggerutils.followLogs()"
function never returns (even then the logs consumer is gone).
As a result, all the resources associated with it (including
an opened file descriptor for the log file being read, two FDs
for a pipe, and two FDs for inotify watch) are never released.
If this is repeated (such as by running "docker logs --follow"
and pressing Ctrl-C a few times), this results in DoS caused by
either hitting the limit of inotify watches, or the limit of
opened files. The only cure is daemon restart.
Apparently, what happens is:
1. logs producer (a container) is gone, calling (*LogWatcher).Close()
for all its readers (daemon/logger/jsonfilelog/jsonfilelog.go:175).
2. WatchClose() is properly handled by a dedicated goroutine in
followLogs(), cancelling the context.
3. Upon receiving the ctx.Done(), the code in followLogs()
(daemon/logger/loggerutils/logfile.go#L626-L638) keeps to
send messages _synchronously_ (which is OK for now).
4. Logs consumer is gone (Ctrl-C is pressed on a terminal running
"docker logs --follow"). Method (*LogWatcher).Close() is properly
called (see daemon/logs.go:114). Since it was called before and
due to to once.Do(), nothing happens (which is kinda good, as
otherwise it will panic on closing a closed channel).
5. A goroutine (see item 3 above) keeps sending log messages
synchronously to the logWatcher.Msg channel. Since the
channel reader is gone, the channel send operation blocks forever,
and resource cleanup set up in defer statements at the beginning
of followLogs() never happens.
Alas, the fix is somewhat complicated:
1. Distinguish between close from logs producer and logs consumer.
To that effect,
- yet another channel is added to LogWatcher();
- {Watch,}Close() are renamed to {Watch,}ProducerGone();
- {Watch,}ConsumerGone() are added;
*NOTE* that ProducerGone()/WatchProducerGone() pair is ONLY needed
in order to stop ConsumerLogs(follow=true) when a container is stopped;
otherwise we're not interested in it. In other words, we're only
using it in followLogs().
2. Code that was doing (logWatcher*).Close() is modified to either call
ProducerGone() or ConsumerGone(), depending on the context.
3. Code that was waiting for WatchClose() is modified to wait for
either ConsumerGone() or ProducerGone(), or both, depending on the
context.
4. followLogs() are modified accordingly:
- context cancellation is happening on WatchProducerGone(),
and once it's received the FileWatcher is closed and waitRead()
returns errDone on EOF (i.e. log rotation handling logic is disabled);
- due to this, code that was writing synchronously to logWatcher.Msg
can be and is removed as the code above it handles this case;
- function returns once ConsumerGone is received, freeing all the
resources -- this is the bugfix itself.
While at it,
1. Let's also remove the ctx usage to simplify the code a bit.
It was introduced by commit 9de9efe ("Decouple removing the
fileWatcher from reading") in order to fix a bug. The bug was actually
a deadlock in fsnotify, and the fix was just a workaround. Since then
the fsnofify bug has been fixed, and a new fsnotify was vendored in.
For more details, please see
moby/moby#27782 (comment)
2. Since `(*filePoller).Close()` is fixed to remove all the files
being watched, there is no need to explicitly call
fileWatcher.Remove(name) anymore, so get rid of the extra code.
Should fix moby/moby#37391
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Fixes #27779
Description
When a container exits the state change will bubble up through libcontainerd into the daemon causing the container to be reset. As part of this it will close the LogDriver associated with the container and the driver will close all of its children. We have observed a race condition between closing a reader and receiving the final data from the log.
followLogswill receive a IN_MODIFY event while it is trying to close the fileWatcher. This causes a deadlock as https://github.com/docker/docker/blob/master/vendor/src/gopkg.in/fsnotify.v1/inotify.go#L266 is trying to put the modify event on the Events channel, but because https://github.com/docker/docker/blob/master/daemon/logger/jsonfilelog/read.go#L235 is trying to process the close it cannot take the event off the channel.Removeis blocked because it is waiting on a condition to be broadcast https://github.com/docker/docker/blob/master/vendor/src/gopkg.in/fsnotify.v1/inotify.go#L157, this would be triggered by receiving an IN_IGNORE event https://github.com/docker/docker/blob/master/vendor/src/gopkg.in/fsnotify.v1/inotify.go#L289, but due to the code stuck trying to push the IN_MODIFY onto the Events channel this is never processed.We initially saw this behaviour with v1.12.1, but have replicated it building the daemon from master. Additionally, we first observed this using the HTTP API to follow the logs, but the same behaviour can been seen with
docker logs -fHere is a gist that should show you the order of events that is causing this deadlock: https://gist.github.com/tombooth/e38175f07233c683704528cf406c3120
Steps to reproduce the issue:
docker logs -fcommand will hangDescribe the results you received:
The logs function hung
Describe the results you expected:
It should have exited
Additional information you deem important (e.g. issue happens only occasionally):
This doesn't occur on every execution of the the loop block. On my local machine, running Linux in a container, it seems to hang one in two-three times.
Output of
docker version:Client:
Version: 1.12.2
API version: 1.24
Go version: go1.6.3
Git commit: bb80604
Built: Tue Oct 11 18:29:41 2016
OS/Arch: linux/amd64
Server:
Version: 1.12.2
API version: 1.24
Go version: go1.6.3
Git commit: bb80604
Built: Tue Oct 11 18:29:41 2016
OS/Arch: linux/amd64
Output of
docker info:Containers: 209
Running: 1
Paused: 0
Stopped: 208
Images: 109
Server Version: 1.12.2
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 505
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-45-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 3.859 GiB
Name: vagrant
ID: QXD3:X4JO:E52F:CWCS:IY3J:RN36:EV36:I2KK:7JH4:WOJ3:MSL5:7O7P
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
127.0.0.0/8
Additional environment details (AWS, VirtualBox, physical, etc.):
We've seen this on AWS, in VMWare Fusion and on a physical machine.