Make log reading more robust to errors #47983
Conversation
Force-pushed from 94d4f6d to 32fed58
The issue on Windows seems to be caused by the frequency at which time is sampled, which is every 15ms by default. I looked at using the file mtime instead, but on Linux filesystems the precision is in ms, not ns, so it's not reliable (tests fail on Linux here but pass on Windows, actually). I'm trying to avoid having to digest every log file when we compress it.
Force-pushed from e903561 to 60b09de
Ok, I think this is ready.
corhere left a comment:
I have noticed that the refactor also makes it possible for log drivers to implement "skip to next file on decode failure" behavior themselves, simply by having their (loggerutils.Decoder).Decode() implementation return io.EOF instead of the JSON decode error or whatever. And drivers which use a self-synchronizing stream format could implement more intelligent recovery logic since the decoder is in charge of advancing the reader. But I have convinced myself that it is still worthwhile to lift the "skip file" logic into loggerutils because it works as a fallback for all log drivers, including hypothetical drivers with intelligent recovery logic. It'll be there as a fallback recovery mechanism if the driver's resynchronization logic falls over (returns an unexpected error) on a particularly corrupt file.
Force-pushed from 5bf148f to d47c742
Force-pushed from d47c742 to c99cdbb
So unfortunately it's not that straightforward to have the forwarder manage everything and still return just a bool, because logfile following needs different error-handling semantics from tailing. What I ended up doing here is wrapping the decoder's …
Force-pushed from c99cdbb to 1be9331
```diff
-func (s *journald) ReadLogs(config logger.ReadConfig) *logger.LogWatcher {
+func (s *journald) ReadLogs(ctx context.Context, config logger.ReadConfig) *logger.LogWatcher {
```
Plumbing contexts fully into the journald reader with cancellation is pretty straightforward: corhere@b603933
Feel free to cherry-pick that commit into this PR. Otherwise I would be happy to rebase and open a follow-up PR once this one has landed.
Pulled in that commit (with some minor changes).
Force-pushed from 4d4679d to 7df30a5
Times cannot be compared with `==` and should instead use the `t.Equal` function.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
When there is an error parsing an individual log file, just close that log and move on to the next one instead of erroring out the entire request. I investigated error-correcting by scanning ahead in corrupted log files, but found there is too much risk of parsing things we shouldn't, which is why this just drops the rest of the file.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This plumbs a context down the stack and handles cancellation as needed so that we can have correlated traces from the API.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Force-pushed from 7df30a5 to 8b40a64
@corhere I think I've addressed everything here.
Force-pushed from 8b40a64 to 8df4553
daemon/logs.go (outdated)
```go
if retErr != nil {
	span.SetStatus(retErr)
}
```

nit: `SetStatus` checks for nil and acts accordingly on its own (see `moby/vendor/github.com/containerd/containerd/tracing/tracing.go`, lines 99 to 108 in fcf03cd), so this can be simplified to:

```go
span.SetStatus(retErr)
```
```go
	span.AddEvent("watch consumer gone")
	return false
case <-ctx.Done():
	span.AddEvent(ctx.Err().Error())
```
Why not use `span.SetStatus`?
I think the reason I did this is that it's not really an error when the caller cancels the context... but I'm fine either way.
This allows an individual decode operation to be cancelled while the log reader is reading data from a log file, by closing the underlying file.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Co-authored-by: Cory Snider <csnider@mirantis.com>
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Force-pushed from 8df4553 to 6d94122
Hmmm, interesting panic. Unrelated to this PR, but what's strange here is that this stack trace matches the code from …

Code from this PR: moby/daemon/containerd/image_list.go, lines 356 to 358 in 6d94122
Code from master: moby/daemon/containerd/image_list.go, lines 356 to 358 in d8f079d
vvoland left a comment:
LGTM; but please try to add a changelog description 😅
Added changelog description.
Opened backport PR here: #48842
Cross-referencing, in case this patch makes its way into other branches:
Fundamentally this change (beyond the minor refactor and tracing spans) is to stop treating errors reading from log files as fatal to providing logs to the client.
This change is needed because log files are often corrupted, either due to old bugs in dockerd or for other reasons.
One specific cause we have seen: ext4 by default does not do full journaling, so on power loss or kernel panic the log data can end up corrupted.
We see this mostly on IoT devices which are often not on battery backup or frequently hard cut from power.
Instead of erroring out the stream on errors from a particular file, this just skips on to the next file.
I had previously attempted to be a little more clever and scrub the file to try to find a place where parsing could resume, but realized this can cause serious issues and is generally error-prone.
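The fallback behavior that replaced the scrubbing attempt amounts to a loop like the one below. `decodeFile` and `tailFiles` are illustrative stand-ins for the loggerutils logic, using a toy `ok:`-prefixed line format rather than the real on-disk encoding.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strings"
)

// decodeFile returns the entries it could parse, plus an error if the
// file turned out to be corrupt partway through (here "corrupt" just
// means a line not starting with "ok:" — purely illustrative).
func decodeFile(sc *bufio.Scanner) ([]string, error) {
	var msgs []string
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "ok:") {
			return msgs, errors.New("corrupt entry")
		}
		msgs = append(msgs, strings.TrimPrefix(line, "ok:"))
	}
	return msgs, nil
}

// tailFiles streams every file in order; a decode error drops the
// remainder of that file and moves on instead of failing the stream.
func tailFiles(files []string) []string {
	var out []string
	for i, f := range files {
		msgs, err := decodeFile(bufio.NewScanner(strings.NewReader(f)))
		out = append(out, msgs...)
		if err != nil {
			fmt.Printf("file %d corrupt, skipping to next file\n", i)
		}
	}
	return out
}

func main() {
	files := []string{
		"ok:a\nok:b\n",
		"ok:c\ngarbage\nok:never-reached\n", // corrupt mid-file
		"ok:d\n",
	}
	fmt.Println(tailFiles(files)) // prints: [a b c d]
}
```

The valid entries before the corruption ("c") are still delivered; only the remainder of the corrupt file is dropped.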
This also takes care of a long-standing TODO to handle log files lazily (or at least as lazily as possible), so we only end up decompressing files just-in-time.
Unfortunately we don't have a way to return warnings to the client when we had to skip files; however, with the tracing changes the skips are at least visible in traces and in the daemon logs.