1.12.0: tail plugin crashes worker upon ENOENT (cannot stat() file)

**Describe the bug**

The `in_tail` plugin might discover a new log file, and then crash the worker upon the `stat()` system call due to ENOENT:

```
[info]: gem 'fluentd' version '1.12.0'

...

[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/root_agent.rb:200:in `block in start'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:234:in `start'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:353:in `refresh_watchers'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:390:in `start_watchers'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:390:in `each_value'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:409:in `block in start_watchers'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:409:in `stat'
[error]: #0 unexpected error error_class=Errno::ENOENT error="No such file or directory @ rb_file_s_stat - /var/log/containers/foo-55fc795779-l66hb_foo_foo-ee7d580da4f6fac071b1e0fa8533fcada239a88404ed2fcde16480fbcd4a39fb.log"
```

I think this is a classical race condition. As this is part of the inotify-based watch system, of course there's the chance that between being delivered the signal that a new file is there and the code accessing it the file might already have disappeared again.

**To Reproduce**
Run in an environment that has log files popping up and quickly disappear thereafter (I guess, no clear repro on our side).


**Expected behavior**

At the system call boundary between application and OS the individual system calls like `stat()` / `open()` should always be wrapped with local error handling, expecting the system call to fail.

This can then be retried in graceful degradation fashion, w/o affecting the other operations performed by the worker process.

An appropriate action would be to log a warning/error showing how the system call failed.

**Your Environment**

Fluentd 1.12.0 on Linux

Not retried with 1.12.1 -- same problem seems to be there, based on commits and changelog.

**Your Configuration**

```...```

**Your Error Log**

see above

**Additional context**


- This could be related to https://github.com/fluent/fluentd/issues/3224.
- This might be a regression introduced via https://github.com/fluent/fluentd/pull/3155 / https://github.com/fluent/fluentd/pull/2992

The `stat()` call that crashes the worker is https://github.com/fluent/fluentd/blame/65a9edf4e05cf64c7ed0de56e12b3c9a0774e0d6/lib/fluent/plugin/in_tail.rb#L409

Looks like even before the last refactor this stat() wasn't protected: 
https://github.com/fluent/fluentd/pull/3196/files#diff-456fdeb51bc472beb48891caac0d063e0073655dba7ac2b72e6fdc67dc6ac802R409

The worker subsequently restarts. As this affects the super early worker startup phase and results in an immediate crash, potentially a crash loop, I am not sure if this might result in data loss. https://github.com/fluent/fluentd/issues/3224 has more detail about that.



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

1.12.0: tail plugin crashes worker upon ENOENT (cannot stat() file) #3274

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

1.12.0: tail plugin crashes worker upon ENOENT (cannot stat() file) #3274

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions