daemon/logger/journald: quit waiting when the logger closes#47256
Merged
cpuguy83 merged 5 commits intomoby:masterfrom Feb 2, 2024
Merged
daemon/logger/journald: quit waiting when the logger closes#47256cpuguy83 merged 5 commits intomoby:masterfrom
cpuguy83 merged 5 commits intomoby:masterfrom
Conversation
The Go race detector was detecting a data race when running the TestLogRead/Follow/Concurrent test against the journald logging driver. The race was in the test harness, specifically syncLogger. The waitOn field would be reassigned each time a log entry is sent to the journal, which is not concurrency-safe. Make it concurrency-safe using the same patterns that are used in the log follower implementation to synchronize with the logger. Signed-off-by: Cory Snider <csnider@mirantis.com>
Writing the systemd-journal-remote command output directly to os.Stdout and os.Stderr makes it nearly impossible to tell which test case the output is related to when the tests are not run in verbose mode. Extend the journald sender fake to redirect output to the test log so they interleave with the rest of the test output. Signed-off-by: Cory Snider <csnider@mirantis.com>
- Check the return value when logging messages - Log the stream (stdout/stderr) and list of messages that were not read - Wait until the logger is closed before returning early (panic/fatal) Signed-off-by: Cory Snider <csnider@mirantis.com>
The journald reader test harness injects an artificial asynchronous delay into the logging pipeline: a logged message won't be written to the journal until at least 150ms after the Log() call returns. If a test returns while log messages are still in flight to be written, the logs may attempt to be written after the TempDir has been cleaned up, leading to spurious errors. The logger read tests which interleave writing and reading have to include explicit synchronization points to work reliably with this delay in place. On the other hand, tests should not be required to sync the logger explicitly before returning. Override the Close() method in the test harness wrapper to wait for in-flight logs to be flushed to disk. Signed-off-by: Cory Snider <csnider@mirantis.com>
If a reader has caught up to the logger and is waiting for the next message, it should stop waiting when the logger is closed. Otherwise the reader will unnecessarily wait the full closedDrainTimeout for no log messages to arrive. This case was overlooked when the journald reader was recently overhauled to be compatible with systemd 255, and the reader tests only failed when a logical race happened to settle in such a way to exercise the bugged code path. It was only after implicit flushing on close was added to the journald test harness that the Follow tests would repeatably fail due to this bug. (No new regression tests are needed.) Signed-off-by: Cory Snider <csnider@mirantis.com>
vvoland
approved these changes
Jan 31, 2024
3 tasks
cpuguy83
approved these changes
Feb 2, 2024
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
- What I did
Fixed a regression in the journald log reader which manifested as rare test failures. The root cause was that it waited for nothing after the logger is closed. In the process I found and fixed some issues with the tests that were getting in the way of debugging. One of the fixes made the
Followtests consistently catch the bug without needing to write a new regression test.- How I did it
I ran the journald tests repeatedly (e.g. using
-count=100) in an attempt to replicate theTestLogRead/Follow/Tail=0failure seen once in CI. While that test did not fail locally,TestLogRead/Follow/Concurrentwould fail ~1% of the time with test output that provided almost no useful information. I ran the tests again with the race detector enabled and discovered that there was a data race in the journald reader harness. I fixed the data race to stop it from obscuring any data races in the journald code (there were no other data races detected) and to make sure that the race was not causing the tests to flake out. Fixing it had no effect on the TestConcurrent failure rate. So I made some improvements to TestConcurrent to make it easier to debug and to address deficiencies I had suspected of being the source of flakiness. One of the interim improvements was to add an explicit flush before closing the logger. It did fix the flakiness, but not in the way I had expected: the test was now consistently failing! That allowed me to run the tests under a debugger and identify the reader bug.- How to verify it
go test -race ./daemon/logger/journaldpasses- Description for the changelog
- A picture of a cute animal (not mandatory but encouraged)