Reopen journald reader when journal is corrupt #26116
kvch wants to merge 2 commits into elastic:master from
Conversation
Pinging @elastic/agent (Team:Agent)
💚 Build Succeeded

💚 Flaky test report: Tests succeeded.
```go
if errors.Is(err, syscall.EBADMSG) {
	reader.Close()
	goto OPEN_JOURNAL
}
```
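A runnable sketch of the reopen-on-`EBADMSG` pattern the diff uses. The `journalReader` interface, `readNext`, and `fakeReader` below are hypothetical stand-ins for illustration, not the actual Beats types (the real reader wraps go-systemd's sdjournal); only the retry logic mirrors the diff.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// journalReader is a minimal stand-in for the journald reader in this PR
// (hypothetical interface, not the real Beats type).
type journalReader interface {
	Next() (string, error)
	Close() error
}

// readNext mirrors the goto OPEN_JOURNAL pattern from the diff: on
// EBADMSG (corrupt journal) it closes the reader, reopens, and retries.
func readNext(open func() (journalReader, error)) (string, error) {
	r, err := open()
	if err != nil {
		return "", err
	}
	for {
		entry, err := r.Next()
		if errors.Is(err, syscall.EBADMSG) {
			// Journal is corrupt: close, reopen, retry.
			r.Close()
			if r, err = open(); err != nil {
				return "", err
			}
			continue
		}
		r.Close()
		return entry, err
	}
}

// fakeReader simulates a reader that reports corruption until reopened.
type fakeReader struct{ corrupt bool }

func (f *fakeReader) Next() (string, error) {
	if f.corrupt {
		return "", syscall.EBADMSG
	}
	return "entry-1", nil
}
func (f *fakeReader) Close() error { return nil }

func main() {
	opens := 0
	open := func() (journalReader, error) {
		opens++
		return &fakeReader{corrupt: opens == 1}, nil
	}
	entry, err := readNext(open)
	fmt.Println(entry, err, opens) // entry-1 <nil> 2
}
```

The first open yields a corrupt reader, the retry path reopens once, and the second reader delivers the entry.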
When reopening a corrupted journal, does that mean a new file is generated on disk? Is some content in the existing file skipped?
How does the reopen affect the Seek call right after reopening? The checkpoint still points to the last reported event. Do we need to reset the checkpoint, or set another seek mode on reopen?
> When reopening a corrupted journal does it mean that a new file is generated on disk? Is some contents to be skipped in the existing file?
I am not sure I understand your question. When a file becomes corrupted and journald realizes it, it rotates the file. Rotation does mean creating a new file and saving the corrupt one with a new name. The new file is going to be empty until new log lines are written to it.
> How does the reopen affect the Seek call right after reopening? The checkpoint still points to the last reported event. Do we need to reset the checkpoint, or set another seek mode on reopen?
I have added seeking to the head of the journal just to be sure. But after further investigation with more corrupted journals I am not sure if this is the right thing to do. It seems that the reader is already handling these errors and skipping bad messages.
I open a journal, start tailing it, and mess with the file to cause data corruption; I get 2-3 "bad message" errors and then Journalbeat reads events as if nothing had happened. Maybe we should instead log only one "bad message" error and suppress the rest until reading returns to normal?
your goto OPEN_JOURNAL triggers an open plus a seek, with the seek based on the statestore state and the input configuration. If the input is configured to always restart collection on seek, we will restart collection from the oldest available logs. If it is configured to always start collection from the latest available message, we have now lost logs. If it is configured to continue from the 'checkpoint', we will continue from the last 'offset', which might be what we always want.
What I wonder is: do we need to ensure that the seek after reopening does not use the user's initial configuration?
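One way to address this concern is to override the configured seek mode on a corruption-triggered reopen and continue from the saved cursor whenever one exists. The names below (`seekMode`, `seekAfterReopen`, the constants) are illustrative, not the identifiers Beats actually uses:

```go
package main

import "fmt"

// seekMode mirrors the journald input's seek options
// (names are illustrative, not the exact Beats identifiers).
type seekMode int

const (
	seekHead   seekMode = iota // oldest available entry
	seekTail                   // newest entry only
	seekCursor                 // saved checkpoint position
)

// seekAfterReopen sketches the reviewer's suggestion: after reopening a
// corrupt journal, ignore the user's configured seek mode and continue
// from the saved cursor when one exists, so no events are lost or
// re-read. Without a cursor, fall back to the configured mode.
func seekAfterReopen(configured seekMode, haveCursor bool) seekMode {
	if haveCursor {
		return seekCursor
	}
	return configured
}

func main() {
	fmt.Println(seekAfterReopen(seekTail, true))  // 2 (seekCursor)
	fmt.Println(seekAfterReopen(seekHead, false)) // 0 (seekHead)
}
```

This keeps the user's configuration meaningful for a first open while making a mid-stream reopen lossless.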
Closing in favour of #26224
What does this PR do?
When a Beat encounters a corrupt journal, it reopens the journal reader.
Why is it important?
When Journalbeat has been running for a while and encounters a corrupt journal, it fills up the log with error messages. Currently, manual intervention is required to clean up the state.
Checklist
- [ ] I have made corresponding changes to the documentation
- [ ] I have made corresponding change to the default configuration files
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc

Related issues
Closes #23627