Describe the bug
When using in_tail position file compaction via pos_file_compaction_interval, the position files become corrupted.
To Reproduce
Reproduced on a Kubernetes cluster with compaction running every 24 seconds and several pods in crash loop backoff, so there were entries to compact.
Expected behavior
No corruption of the position files.
Your Environment
Fluentd or td-agent version: fluentd 1.9.2, td-agent 3.6.0
Kernel version: uname -r
Your Configuration
Standard in_tail configuration with pos_file_compaction_interval 24.
Your Error Log
The corrupted position file looks like this. Note the ffffffffffffffff part on the second line, which runs directly into the next entry with no separator:
/var/log/containers/container-1c15005e5e6978f4652c9bcce679f416654fb6d9e304a0d5d3b476b3c4bfd734.log 0000000000031a96 00000000008f589b
ffffffffffffffff448/var/log/containers/container-3f06aac20b6a5b1eadee53f4a891fcfbc3f9c365fcf649c6a457b063bbb73671.log 0000000000016215 00000000008f58ca
/var/log/containers/container-319527df0aaf8a2c698c775340981126709f67e05e29566ab7689d72001a7b43.log 00000000000330ed 00000000008f5879
There are actually many null bytes in the problematic spot:
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00ffffffffffffffff448/var/log/containers/container-3f06aac20b6a5b1eadee53f4a891fcfbc3f9c365fcf649c6a457b063bbb73671.log\t0000000000016215\t00000000008f58ca\n
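To find such lines in an existing position file, a small check like the following might help. This is a minimal sketch and not part of Fluentd: `corrupted_pos_lines` and `POS_ENTRY` are hypothetical names, and the expected layout (path, tab, 16 hex digits, tab, 16 hex digits) is assumed from the sample entries above.

```ruby
# Hypothetical helper, not part of Fluentd.
# Assumed entry layout, taken from the sample above:
#   <path>\t<16 hex digits>\t<16 hex digits>
# The path must not contain tabs or NUL bytes.
POS_ENTRY = /\A[^\t\x00]+\t\h{16}\t\h{16}\z/

# Returns the 1-based numbers of lines that do not match the assumed layout.
def corrupted_pos_lines(content)
  bad = []
  content.each_line.with_index(1) do |line, lineno|
    entry = line.chomp
    next if entry.empty?
    bad << lineno unless entry.match?(POS_ENTRY)
  end
  bad
end
```

Calling it as `corrupted_pos_lines(File.binread(path))` would list the suspicious line numbers. It only catches NUL bytes and broken field layout, so a stray prefix without NUL bytes could still pass as part of the path.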
The problem is likely caused by a race condition in try_compact during the fetch_compacted_entries call:
https://github.com/fluent/fluentd/blob/master/lib/fluent/plugin/in_tail/position_file.rb#L90
That call is performed outside the lock, but it reads the file. It can therefore read a file that another thread is currently modifying, and position file writes are not atomic at the filesystem level.
This could be fixed either by moving the fetch inside the mutex lock or by making the position file writes atomic via `rename`.
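The two options could be combined roughly as follows. This is only a sketch under assumptions, not Fluentd's actual PositionFile code: the class `AtomicPosFile` and its methods are hypothetical, and entries are assumed to map a path to a position and an inode. Readers and writers share one mutex, and each write goes to a temporary file that is then renamed over the real one.

```ruby
# Hypothetical sketch, not Fluentd's PositionFile implementation.
class AtomicPosFile
  def initialize(path)
    @path = path
    @lock = Mutex.new
  end

  # entries: { "/path/to.log" => [pos, inode] } (assumed shape)
  def write_entries(entries)
    @lock.synchronize do
      tmp = "#{@path}.tmp.#{Process.pid}"
      File.open(tmp, "w") do |f|
        entries.each do |path, (pos, inode)|
          f.write(format("%s\t%016x\t%016x\n", path, pos, inode))
        end
        f.flush
        f.fsync
      end
      # rename replaces the target atomically on POSIX systems when both
      # paths are on the same filesystem, so readers see either the old or
      # the new file, never a partially written one.
      File.rename(tmp, @path)
    end
  end

  # Reads happen under the same lock as writes.
  def read_entries
    @lock.synchronize do
      return {} unless File.exist?(@path)
      File.readlines(@path, chomp: true).each_with_object({}) do |line, h|
        path, pos, inode = line.split("\t")
        h[path] = [pos.to_i(16), inode.to_i(16)]
      end
    end
  end
end
```

Either measure alone would probably address the race described here. The mutex only protects threads within one process, while the rename also protects against readers outside it.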