Fix Registrar not removing files when clean_removed is configured#10747
Fix Registrar not removing files when clean_removed is configured#10747urso merged 10 commits intoelastic:masterfrom
Conversation
filebeat/input/log/harvester.go
Outdated
There was a problem hiding this comment.
Can you elaborate on this move? I would assume the harvester itself should be the first one to know about state updates?
There was a problem hiding this comment.
Will add my findings to the PR description.
The Update here is no to the harvester, but to the prospector. Thanks to shutdown racing with scan phase we sometimes send finish and remove events out of order if publishState is not called before states.Update.
2186da4 to
ee8c565
Compare
|
Windows is messing with me. The file can not be opened (access denied), which leads to the prospector never cleaning up it's state: |
libbeat/common/file/file_windows.go
Outdated
There was a problem hiding this comment.
exported var ErrDeletePending should have comment or be unexported
|
Investigating the windows case: I updated detection of file removal on windows by calling GetFileInformationByHandleEx. This call allows us to read the This change only slightly improves the validation of Note: I wonder if we can make close_removed more 'stable' by creating a hardlink on windows. But this might not work if network shares or different volumes are used. |
filebeat/input/log/stdin.go
Outdated
There was a problem hiding this comment.
exported method Pipe.Removed should have comment or be unexported
ruflin
left a comment
There was a problem hiding this comment.
LGTM. Thanks for the investigation and all the details around it.
I like the abstraction in the tests, this could be used in many other places to simplify the existing tests I'm thinking.
- Could you add a changelog entry?
- I assume this will also need to be backported to 7.0 and 6.6 / 6.7?
Still trying to fully understand the impact. close_, clean_removed were not working in the cases where the file was still opened by an other service, if that was not the case, it worked as intended?
On linux a file gets 'unlinked'. This means a file is not visible in the file system anymore, but the inode and data is cleared after the last process has open the file handle. After unlinking a new file with the very same name can be written to. On windows a file is not 'unlinked', but the 'Delete Pending' flag is set. The file continues to exist within the file system (can still be seen via Glob) as long as at least one process has an open file handle. The 'Delete Pending' flag is not accessible/accounted for when listing a directory or calling After stabilizing the tests, the Good news is: filebeat windows tests have been green after fixing |
|
Jenkins, test this. |
Still in progress. It has been a rabid hole so far. Let's see if I find more :)
+1 |
|
Jenkins, test this. |
22ae88e to
3358a74
Compare
|
Filebeat tests have been green 5 times in a row on all platforms. |
|
This pull request doesn't have a |
Closes: #7690, #9215, #10606
This PR includes some changes that also did help to clean/stabilize the
text_clean_removed*tests. Often times the tests have been stopped to early, due to races between the input file states, registrar states, and test code.After stabilizing the tests, I reliably (still needs a few runs) can trigger a data-synchronisation issue between the prospectors state and the registry. This is clearly a bug and results in state not being removed from the registry file at all.
After some more testing (yay, heisenbug :/) I found a race condition between harvester shutdown, prospector state cleaning (e.g. if clean_removed is set) and registry updates. The sequence of events goes like this:
The change proposed changes the order the harvester cleanup propagates updates by sending the 'finished event' first before updating the prospectors state. This way we postpone the file being cleaned by the prospector and guarantee the 'finished event' and 'remove event' order (The prospector can not remove the state yet, because Finished is not set until after the 'finished event' has been published).
Additions:
self.registry()). This class is used to check/read the registrywith open(...) as f:blocks.checkpointor is automatically advanced ifnextsucceeds. UsingLogStatewe can wait for a particular message to appear in the log, by ignoring old messages. This removes the need to have a count on actual messages.LogStateand some minor adjustments to timings I did shave of like 10s from test_registrar.py on my machine.FB success count: