Skip to content

1.12.0: tail plugin crashes worker upon ENOENT (cannot stat() file) #3274

@jgehrcke

Description

@jgehrcke

Describe the bug

The in_tail plugin might discover a new log file, and then crash the worker upon the stat() system call due to ENOENT:

[info]: gem 'fluentd' version '1.12.0'

...

[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/root_agent.rb:200:in `block in start'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:234:in `start'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:353:in `refresh_watchers'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:390:in `start_watchers'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:390:in `each_value'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:409:in `block in start_watchers'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:409:in `stat'
[error]: #0 unexpected error error_class=Errno::ENOENT error="No such file or directory @ rb_file_s_stat - /var/log/containers/foo-55fc795779-l66hb_foo_foo-ee7d580da4f6fac071b1e0fa8533fcada239a88404ed2fcde16480fbcd4a39fb.log"

I think this is a classical race condition. As this is part of the inotify-based watch system, of course there's the chance that between being delivered the signal that a new file is there and the code accessing it the file might already have disappeared again.

To Reproduce
Run in an environment that has log files popping up and quickly disappear thereafter (I guess, no clear repro on our side).

Expected behavior

At the system call boundary between application and OS the individual system calls like stat() / open() should always be wrapped with local error handling, expecting the system call to fail.

This can then be retried in graceful degradation fashion, w/o affecting the other operations performed by the worker process.

An appropriate action would be to log a warning/error showing how the system call failed.

Your Environment

Fluentd 1.12.0 on Linux

Not retried with 1.12.1 -- same problem seems to be there, based on commits and changelog.

Your Configuration

...

Your Error Log

see above

Additional context

The stat() call that crashes the worker is https://github.com/fluent/fluentd/blame/65a9edf4e05cf64c7ed0de56e12b3c9a0774e0d6/lib/fluent/plugin/in_tail.rb#L409

Looks like even before the last refactor this stat() wasn't protected:
https://github.com/fluent/fluentd/pull/3196/files#diff-456fdeb51bc472beb48891caac0d063e0073655dba7ac2b72e6fdc67dc6ac802R409

The worker subsequently restarts. As this affects the super early worker startup phase and results in an immediate crash, potentially a crash loop, I am not sure if this might result in data loss. #3224 has more detail about that.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions