Skip to content

daemon/logger/journald: quit waiting when the logger closes#47256

Merged
cpuguy83 merged 5 commits intomoby:masterfrom
corhere:journald/quit-when-youre-ahead
Feb 2, 2024
Merged

daemon/logger/journald: quit waiting when the logger closes#47256
cpuguy83 merged 5 commits intomoby:masterfrom
corhere:journald/quit-when-youre-ahead

Conversation

@corhere
Copy link
Contributor

@corhere corhere commented Jan 30, 2024

- What I did
Fixed a regression in the journald log reader which manifested as rare test failures. The root cause was that it waited for nothing after the logger is closed. In the process I found and fixed some issues with the tests that were getting in the way of debugging. One of the fixes made the Follow tests consistently catch the bug without needing to write a new regression test.

- How I did it
I ran the journald tests repeatedly (e.g. using -count=100) in an attempt to replicate the TestLogRead/Follow/Tail=0 failure seen once in CI. While that test did not fail locally, TestLogRead/Follow/Concurrent would fail ~1% of the time with test output that provided almost no useful information. I ran the tests again with the race detector enabled and discovered that there was a data race in the journald reader harness. I fixed the data race to stop it from obscuring any data races in the journald code (there were no other data races detected) and to make sure that the race was not causing the tests to flake out. Fixing it had no effect on the TestConcurrent failure rate. So I made some improvements to TestConcurrent to make it easier to debug and to address deficiencies I had suspected of being the source of flakiness. One of the interim improvements was to add an explicit flush before closing the logger. It did fix the flakiness, but not in the way I had expected: the test was now consistently failing! That allowed me to run the tests under a debugger and identify the reader bug.

- How to verify it
go test -race ./daemon/logger/journald passes

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

The Go race detector was detecting a data race when running the
TestLogRead/Follow/Concurrent test against the journald logging driver.
The race was in the test harness, specifically syncLogger. The waitOn
field would be reassigned each time a log entry is sent to the journal,
which is not concurrency-safe. Make it concurrency-safe using the same
patterns that are used in the log follower implementation to synchronize
with the logger.

Signed-off-by: Cory Snider <csnider@mirantis.com>
Writing the systemd-journal-remote command output directly to os.Stdout
and os.Stderr makes it nearly impossible to tell which test case the
output is related to when the tests are not run in verbose mode. Extend
the journald sender fake to redirect output to the test log so they
interleave with the rest of the test output.

Signed-off-by: Cory Snider <csnider@mirantis.com>
- Check the return value when logging messages
- Log the stream (stdout/stderr) and list of messages that were not read
- Wait until the logger is closed before returning early (panic/fatal)

Signed-off-by: Cory Snider <csnider@mirantis.com>
The journald reader test harness injects an artificial asynchronous
delay into the logging pipeline: a logged message won't be written to
the journal until at least 150ms after the Log() call returns. If a test
returns while log messages are still in flight to be written, the logs
may attempt to be written after the TempDir has been cleaned up, leading
to spurious errors.

The logger read tests which interleave writing and reading have to
include explicit synchronization points to work reliably with this delay
in place. On the other hand, tests should not be required to sync the
logger explicitly before returning. Override the Close() method in the
test harness wrapper to wait for in-flight logs to be flushed to disk.

Signed-off-by: Cory Snider <csnider@mirantis.com>
If a reader has caught up to the logger and is waiting for the next
message, it should stop waiting when the logger is closed. Otherwise
the reader will unnecessarily wait the full closedDrainTimeout for no
log messages to arrive.

This case was overlooked when the journald reader was recently
overhauled to be compatible with systemd 255, and the reader tests only
failed when a logical race happened to settle in such a way to exercise
the bugged code path. It was only after implicit flushing on close was
added to the journald test harness that the Follow tests would
repeatably fail due to this bug. (No new regression tests are needed.)

Signed-off-by: Cory Snider <csnider@mirantis.com>
@cpuguy83 cpuguy83 merged commit 7c4828f into moby:master Feb 2, 2024
@corhere corhere deleted the journald/quit-when-youre-ahead branch February 3, 2024 00:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants