in_tail pos_file_compaction_interval corrupts position files

**Describe the bug**
When using in_tail position file compaction via `pos_file_compaction_interval`, the position files become corrupted.

**To Reproduce**
Reproduced on a Kubernetes cluster with compaction running every 24 seconds and several pods in crash loop backoff, so there were entries to compact.

**Expected behavior**
The position files are not corrupted by compaction.

**Your Environment**

    Fluentd or td-agent version: 1.9.2 td-agent 3.6.0
    Kernel version: uname -r

**Your Configuration**
Standard in_tail configuration with `pos_file_compaction_interval 24`.


**Your Error Log**
The corrupted position file looks like this. Note the `ffffffffffffffff` part with the missing space:

```
/var/log/containers/container-1c15005e5e6978f4652c9bcce679f416654fb6d9e304a0d5d3b476b3c4bfd734.log	0000000000031a96	00000000008f589b
ffffffffffffffff448/var/log/containers/container-3f06aac20b6a5b1eadee53f4a891fcfbc3f9c365fcf649c6a457b063bbb73671.log	0000000000016215	00000000008f58ca
/var/log/containers/container-319527df0aaf8a2c698c775340981126709f67e05e29566ab7689d72001a7b43.log	00000000000330ed	00000000008f5879
```

There are actually lots of null bytes added at the problematic spot:

```
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00ffffffffffffffff448/var/log/containers/container-3f06aac20b6a5b1eadee53f4a891fcfbc3f9c365fcf649c6a457b063bbb73671.log\t0000000000016215\t00000000008f58ca\n
```
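To spot entries like the one above in an existing position file, a small check along these lines might help. This is only a sketch: `malformed_entries` and `POS_ENTRY` are hypothetical names, not part of Fluentd, and the pattern assumes absolute paths and the `<path>\t<16 hex digits>\t<16 hex digits>` layout visible in the dump.

```ruby
# Hypothetical helper, not part of Fluentd.
# Assumes each valid entry is "<absolute path>\t<16 hex digits>\t<16 hex digits>",
# which is the layout of the healthy lines in the dump above.
POS_ENTRY = /\A\/[^\t\x00]*\t\h{16}\t\h{16}\z/

# Returns [line_number, line] pairs for every line that does not match.
def malformed_entries(content)
  bad = []
  content.each_line.with_index(1) do |line, lineno|
    entry = line.chomp
    bad << [lineno, entry] unless POS_ENTRY.match?(entry)
  end
  bad
end
```

Calling `malformed_entries(File.binread(pos_file_path))` would then list the suspicious lines, where `pos_file_path` is whatever `pos_file` points to in your configuration.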


The problem is likely caused by a race condition in `try_compact` during the `fetch_compacted_entries` call:
https://github.com/fluent/fluentd/blob/master/lib/fluent/plugin/in_tail/position_file.rb#L90

That call is performed outside of the lock, but it reads the file. This means it can read a file that is currently being modified by another thread, and writes to the position file are not atomic at the filesystem level.
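The general idea of keeping the read in the same critical section as the writes can be sketched as follows. This is not Fluentd's code: `LockedPosFile` is a hypothetical stand-in, and its compaction rule (keep only the last entry per path) is a simplification chosen for illustration.

```ruby
require "tempfile"

# Hypothetical stand-in for a position file; not Fluentd's PositionFile class.
class LockedPosFile
  def initialize(io)
    @io = io
    @lock = Mutex.new
  end

  # Append one "<path>\t<pos>\t<inode>" entry while holding the lock.
  def append(path, pos, inode)
    @lock.synchronize do
      @io.seek(0, IO::SEEK_END)
      @io.write(format("%s\t%016x\t%016x\n", path, pos, inode))
      @io.flush
    end
  end

  # Read, filter, and rewrite inside one critical section, so no writer
  # can touch the file between the read and the rewrite.
  def compact
    @lock.synchronize do
      @io.rewind
      latest = {}
      @io.each_line do |line|
        path, pos, inode = line.chomp.split("\t")
        latest[path] = [pos, inode] # a later entry replaces an earlier one
      end
      @io.rewind
      @io.truncate(0)
      latest.each { |path, (pos, inode)| @io.write("#{path}\t#{pos}\t#{inode}\n") }
      @io.flush
    end
  end
end
```

The trade-off is that writers are blocked for the full duration of a compaction, which is why the read was presumably placed outside the lock in the first place.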

This could be fixed either by moving the fetch inside the mutex lock or by making the position file writes atomic via `rename`.
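For the `rename` option, a minimal sketch of the usual write-to-temp-then-rename pattern is shown below. `atomic_write` is a hypothetical helper, not an existing Fluentd method, and it assumes the temporary file lives in the same directory (and therefore the same filesystem) as the target, which is what makes the rename atomic on POSIX systems.

```ruby
# Hypothetical helper, not Fluentd code: replace the file at `path` so that
# readers see either the complete old content or the complete new content.
def atomic_write(path, content)
  # Assumption: the temp file sits next to the target, on the same filesystem.
  tmp = "#{path}.#{Process.pid}.tmp"
  File.open(tmp, "w") do |f|
    f.write(content)
    f.flush
    f.fsync # push the data to disk before the new name becomes visible
  end
  File.rename(tmp, path) # atomically swaps the new file into place
end
```

One caveat with this approach: the rename replaces the file with a new inode, so any code that keeps an open handle to the old position file would need to reopen it afterwards.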
