
Make log reading more robust to errors#47983

Merged
cpuguy83 merged 6 commits into moby:master from cpuguy83:skip_corrupted_log_files
Sep 3, 2024
Conversation


@cpuguy83 cpuguy83 commented Jun 14, 2024

Fundamentally this change (beyond the minor refactor and tracing spans) is to not treat errors reading from log files as fatal to providing logs to the client.

This change is needed because log files can often be corrupted, whether due to old bugs in dockerd or for other reasons.
One specific cause we have seen appears to be that ext4 does not do full journaling by default; on power loss or kernel panic this can leave the log data corrupted.
We see this mostly on IoT devices, which are often not on battery backup or are frequently hard-cut from power.

Instead of erroring out the stream when a particular file fails, this just skips on to the next file.

I made a previous attempt to be a little more clever and scrub the file to try to find a place where parsing could resume, but realized this can cause serious issues and is generally error-prone.

This also takes care of a long-standing TODO to handle log files lazily (or at least as lazily as possible), so we only end up decompressing files just-in-time.

Unfortunately we don't have a way to return warnings to the client that files had to be skipped; however, with the tracing changes this can at least be seen in traces and in the daemon logs.

When reading logs with the `jsonfile` or `local` log drivers, any error while reading or parsing an underlying log file now causes the rest of that file to be skipped, moving on to the next log file (if one exists), rather than returning an error to the client and closing the stream.
The errors are viewable in the dockerd logs and exported to traces when tracing is configured.

When reading log files, compressed log files are now only decompressed when needed rather than decompressing all files before starting the log stream.

@cpuguy83 cpuguy83 force-pushed the skip_corrupted_log_files branch 5 times, most recently from 94d4f6d to 32fed58 on June 14, 2024 22:15

cpuguy83 commented Jun 18, 2024

The issue with Windows seems to be caused by the frequency at which time is sampled on Windows, which is every 15ms by default.
We can request greater precision from Windows (e.g. 1ms, which is what Linux has), but this has performance implications.

I looked at using mod time, but on Linux filesystems the precision is in milliseconds, not nanoseconds, so it's not reliable (tests fail on Linux here but pass on Windows, actually).

I'm trying to avoid having to digest every log file when we compress it.
Sticking with the time theme, we could add another field to our gzip header storing a precise timestamp recorded at the moment we rotate the file for compression.

@cpuguy83 cpuguy83 force-pushed the skip_corrupted_log_files branch 5 times, most recently from e903561 to 60b09de on June 25, 2024 21:37
@cpuguy83 cpuguy83 requested review from corhere and vvoland June 26, 2024 16:41
@cpuguy83
Member Author

Ok I think this is ready.
@corhere I put back sharedtemp as discussed.


@corhere corhere left a comment


I have noticed that the refactor also makes it possible for log drivers to implement "skip to next file on decode failure" behavior themselves, simply by having their (loggerutils.Decoder).Decode() implementation return io.EOF instead of the JSON decode error or whatever. And drivers which use a self-synchronizing stream format could implement more intelligent recovery logic since the decoder is in charge of advancing the reader. But I have convinced myself that it is still worthwhile to lift the "skip file" logic into loggerutils because it works as a fallback for all log drivers, including hypothetical drivers with intelligent recovery logic. It'll be there as a fallback recovery mechanism if the driver's resynchronization logic falls over (returns an unexpected error) on a particularly corrupt file.

@cpuguy83 cpuguy83 force-pushed the skip_corrupted_log_files branch 2 times, most recently from 5bf148f to d47c742 on June 27, 2024 21:00

corhere commented Jun 28, 2024

```
=== FAIL: github.com/docker/docker/daemon/logger/loggerutils TestTailFiles/handle_corrupted_data (10.01s)
    logfile_test.go:189: timeout waiting for log message
```

@vvoland vvoland added this to the 28.0.0 milestone Jul 5, 2024
@cpuguy83 cpuguy83 force-pushed the skip_corrupted_log_files branch from d47c742 to c99cdbb on July 5, 2024 17:57

cpuguy83 commented Jul 5, 2024

So unfortunately it's not that straightforward to have the forwarder manage everything and still return just a bool, because logfile following needs different error-handling semantics from tailing.

What I ended up doing here is wrapping the decoder's Decode function from tailFiles so that if the underlying decoder returns an error, the error is wrapped in io.EOF, which allows for these different error-handling semantics.

@cpuguy83 cpuguy83 force-pushed the skip_corrupted_log_files branch from c99cdbb to 1be9331 on July 5, 2024 18:02
```diff
-func (s *journald) ReadLogs(config logger.ReadConfig) *logger.LogWatcher {
+func (s *journald) ReadLogs(ctx context.Context, config logger.ReadConfig) *logger.LogWatcher {
```
Contributor


Plumbing contexts fully into the journald reader with cancellation is pretty straightforward: corhere@b603933
Feel free to cherry-pick that commit into this PR. Otherwise I would be happy to rebase and open a follow-up PR once this one has landed.

Member Author


Pulled in that commit (with some minor changes).

@cpuguy83 cpuguy83 force-pushed the skip_corrupted_log_files branch 2 times, most recently from 4d4679d to 7df30a5 on July 22, 2024 19:19
Times cannot be compared with `==` and instead should use the `t.Equal`
function.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
cpuguy83 added 2 commits July 22, 2024 19:20
When there is an error parsing an individual log file, just close the
log and move on to the next one instead of erroring out the entire
request.

I investigated trying to error-correct and scan ahead in corrupted log
files, but found it carries too much risk of parsing things we
shouldn't, which is why this just drops the rest of the file.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This plumbs a context down the stack and handles cancellation as needed
so that we can have correlated traces from the API.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
@cpuguy83 cpuguy83 force-pushed the skip_corrupted_log_files branch from 7df30a5 to 8b40a64 on July 22, 2024 19:22
@cpuguy83
Member Author

@corhere I think I've addressed everything here.

@cpuguy83 cpuguy83 force-pushed the skip_corrupted_log_files branch from 8b40a64 to 8df4553 on July 22, 2024 19:27

@vvoland vvoland left a comment


Some superficial comments

daemon/logs.go Outdated
Comment on lines +30 to +32
```go
if retErr != nil {
	span.SetStatus(retErr)
}
```
Contributor


nit: SetStatus checks for nil and acts accordingly on its own:

```go
// SetStatus sets the status of the current span.
// If an error is encountered, it records the error and sets span status to Error.
func (s *Span) SetStatus(err error) {
	if err != nil {
		s.otelSpan.RecordError(err)
		s.otelSpan.SetStatus(codes.Error, err.Error())
	} else {
		s.otelSpan.SetStatus(codes.Ok, "")
	}
}
```

Suggested change:

```diff
-if retErr != nil {
-	span.SetStatus(retErr)
-}
+span.SetStatus(retErr)
```

```go
	span.AddEvent("watch consumer gone")
	return false
case <-ctx.Done():
	span.AddEvent(ctx.Err().Error())
```
Contributor


Why not use span.SetStatus?

Member Author


I think the reason I did this was that it's not really an error when the caller cancels the context... but I'm fine either way.

cpuguy83 and others added 2 commits August 22, 2024 21:33
This allows for an individual decode operation to be cancelled while the
log reader is reading data from a log file by closing the underlying file.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Co-authored-by: Cory Snider <csnider@mirantis.com>
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
@cpuguy83 cpuguy83 force-pushed the skip_corrupted_log_files branch from 8df4553 to 6d94122 on August 22, 2024 21:34

vvoland commented Aug 23, 2024

Hmmm, interesting panic:

```
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x556232a85fdd]

goroutine 57732 [running]:
github.com/docker/docker/daemon/containerd.(*ImageService).imageSummary.func1.2(0xc0003c9b80?)
	/go/src/github.com/docker/docker/daemon/containerd/image_list.go:358 +0xfd
github.com/docker/docker/container.(*memoryStore).ApplyAll.func1(0x0?)
	/go/src/github.com/docker/docker/container/memory_store.go:77 +0x23
created by github.com/docker/docker/container.(*memoryStore).ApplyAll in goroutine 57632
	/go/src/github.com/docker/docker/container/memory_store.go:76 +0x52
[rootlesskit:child ] error: command [/usr/local/bin/dockerd-rootless.sh --debug --host unix:///go/src/github.com/docker/docker/bundles/test-integration/docker.sock --storage-driver overlayfs --pidfile bundles/test-integration/docker.pid --userland-proxy=true --experimental] exited: exit status 2
```

Unrelated to this PR, but what's strange here is that this stack trace matches the code from master, even though this PR is based off a master from before that code was present.

```
github.com/docker/docker/daemon/containerd.(*ImageService).imageSummary.func1.2(0xc0003c9b80?)
	/go/src/github.com/docker/docker/daemon/containerd/image_list.go:358 +0xfd
```

Code from this PR:

```go
}
// totalSize is the size of the image's packed layers and snapshots
```

Code from master:

```go
i.containers.ApplyAll(func(c *container.Container) {
	if c.ImageManifest != nil && c.ImageManifest.Digest == target.Digest {
		mfstSummary.ImageData.Containers = append(mfstSummary.ImageData.Containers, c.ID)
```


@vvoland vvoland left a comment


LGTM; but please try to add a changelog description 😅

@cpuguy83
Member Author

Added changelog description.

@cpuguy83 cpuguy83 requested a review from corhere August 30, 2024 21:30

cpuguy83 commented Nov 9, 2024

Opened backport PR here: #48842

@thaJeztah
Member

Cross-referencing, in case this patch makes its way into other branches:
