Describe the bug
The in_tail plugin might discover a new log file, and then crash the worker upon the stat() system call due to ENOENT:
[info]: gem 'fluentd' version '1.12.0'
...
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/root_agent.rb:200:in `block in start'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:234:in `start'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:353:in `refresh_watchers'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:390:in `start_watchers'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:390:in `each_value'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:409:in `block in start_watchers'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:409:in `stat'
[error]: #0 unexpected error error_class=Errno::ENOENT error="No such file or directory @ rb_file_s_stat - /var/log/containers/foo-55fc795779-l66hb_foo_foo-ee7d580da4f6fac071b1e0fa8533fcada239a88404ed2fcde16480fbcd4a39fb.log"
I think this is a classical race condition. As this is part of the inotify-based watch system, of course there's the chance that between being delivered the signal that a new file is there and the code accessing it the file might already have disappeared again.
To Reproduce
Run in an environment that has log files popping up and quickly disappear thereafter (I guess, no clear repro on our side).
Expected behavior
At the system call boundary between application and OS the individual system calls like stat() / open() should always be wrapped with local error handling, expecting the system call to fail.
This can then be retried in graceful degradation fashion, w/o affecting the other operations performed by the worker process.
An appropriate action would be to log a warning/error showing how the system call failed.
Your Environment
Fluentd 1.12.0 on Linux
Not retried with 1.12.1 -- same problem seems to be there, based on commits and changelog.
Your Configuration
...
Your Error Log
see above
Additional context
The stat() call that crashes the worker is https://github.com/fluent/fluentd/blame/65a9edf4e05cf64c7ed0de56e12b3c9a0774e0d6/lib/fluent/plugin/in_tail.rb#L409
Looks like even before the last refactor this stat() wasn't protected:
https://github.com/fluent/fluentd/pull/3196/files#diff-456fdeb51bc472beb48891caac0d063e0073655dba7ac2b72e6fdc67dc6ac802R409
The worker subsequently restarts. As this affects the super early worker startup phase and results in an immediate crash, potentially a crash loop, I am not sure if this might result in data loss. #3224 has more detail about that.
Describe the bug
The
in_tailplugin might discover a new log file, and then crash the worker upon thestat()system call due to ENOENT:I think this is a classical race condition. As this is part of the inotify-based watch system, of course there's the chance that between being delivered the signal that a new file is there and the code accessing it the file might already have disappeared again.
To Reproduce
Run in an environment that has log files popping up and quickly disappear thereafter (I guess, no clear repro on our side).
Expected behavior
At the system call boundary between application and OS the individual system calls like
stat()/open()should always be wrapped with local error handling, expecting the system call to fail.This can then be retried in graceful degradation fashion, w/o affecting the other operations performed by the worker process.
An appropriate action would be to log a warning/error showing how the system call failed.
Your Environment
Fluentd 1.12.0 on Linux
Not retried with 1.12.1 -- same problem seems to be there, based on commits and changelog.
Your Configuration
...Your Error Log
see above
Additional context
The
stat()call that crashes the worker is https://github.com/fluent/fluentd/blame/65a9edf4e05cf64c7ed0de56e12b3c9a0774e0d6/lib/fluent/plugin/in_tail.rb#L409Looks like even before the last refactor this stat() wasn't protected:
https://github.com/fluent/fluentd/pull/3196/files#diff-456fdeb51bc472beb48891caac0d063e0073655dba7ac2b72e6fdc67dc6ac802R409
The worker subsequently restarts. As this affects the super early worker startup phase and results in an immediate crash, potentially a crash loop, I am not sure if this might result in data loss. #3224 has more detail about that.