Skip to content

Journald position file got corrupted during multiple fluentd gracefulReloads #3332

@alex-vmw

Description

@alex-vmw

Describe the bug
We were performing tests that were issuing multiple /api/config.gracefulReload calls to fluentd. After a few reloads we noticed that on a single node the journald position file got corrupted.

Error that we saw:

2021-04-08 00:06:50 +0000 [error]: [in_systemd_docker] failed to read data from plugin storage file path="/var/log/-kube-fluentd-operator--fluentd-journald-cursor.json" error_class=Yajl::ParseError error="lexical error: invalid char in json text.\n                                     {}journal\":\"s=f32000cba2bf4de4964\n                     (right here) ------^\n"
2021-04-08 00:06:50 +0000 [error]: fluent/log.rb:369:error: config error file="/fluentd/etc/fluent.conf" error_class=Fluent::ConfigError error="Unexpected error: failed to read data from plugin storage file: '/var/log/-kube-fluentd-operator--fluentd-journald-cursor.json'"
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin/storage_local.rb:99:in `rescue in configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin/storage_local.rb:89:in `configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin.rb:173:in `configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin_helper/storage.rb:111:in `block in configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin_helper/storage.rb:103:in `each'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin_helper/storage.rb:103:in `configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluent-plugin-systemd-1.0.2/lib/fluent/plugin/in_systemd.rb:51:in `configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin.rb:173:in `configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/root_agent.rb:317:in `add_source'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/root_agent.rb:158:in `block in configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/root_agent.rb:152:in `each'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/root_agent.rb:152:in `configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/engine.rb:105:in `configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/engine.rb:80:in `run_configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/supervisor.rb:551:in `run_supervisor'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/command/fluentd.rb:341:in `<top (required)>'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:72:in `require'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:72:in `require'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/bin/fluentd:8:in `<top (required)>'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/bin/fluentd:23:in `load'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/bin/fluentd:23:in `<main>'

What we saw in the corrupted file was this:

$ cat /var/log/-kube-fluentd-operator--fluentd-journald-cursor.json
{}journal":"s=f32000cba2bf4de4964576140ffd3661;i=19b518;b=b80855c9fe2d410b8b5d6fff5326c16d;m=136787196bf6;t=5bf6a3ba60f5b;x=7c49636868ee757c"}

As you can see, the {} in front was supposed to be {" instead.

To Reproduce
Restart fluentd many times via /api/config.gracefulReload call.

Expected behavior
journald position file shouldn't get corrupted with an invalid character.

Your Environment

  • Fluentd or td-agent version: fluentd 1.11.2
  • Operating system: Ubuntu 20.04.1 LTS
  • Kernel version: 5.4.0-62-generic

If you hit the problem with older fluentd version, try latest version first.
We tested /api/config.gracefulReload calls with different fluentd versions, but only hit this issue once, so I am guessing it is a pretty rare race condition. We decided to report it anyways, just in case.

Your Configuration

<ROOT>
  <system>
    log_level info
    rpc_endpoint "127.0.0.1:24444"
  </system>
  <source>
    @type systemd
    @id in_systemd_docker
    path "/var/log/journal"
    tag "systemd.unit"
    read_from_head false
    <storage>
      @type "local"
      persistent true
      path "/var/log/-kube-fluentd-operator--fluentd-journald-cursor.json"
    </storage>
    <entry>
      field_map {"_SYSTEMD_UNIT":"unit","MESSAGE":"log","_PID":["pid"],"_PRIORITY":"priority","_COMM":"cmd","_HOSTNAME":"hostname"}
      field_map_strict true
      fields_lowercase true
    </entry>
  </source>
  <match systemd.unit>
    @type rewrite_tag_filter
    <rule>
      key "unit"
      pattern ^(.+)$
      tag "systemd.$1"
    </rule>
  </match>
  <filter systemd.kubelet.service>
    @type parser
    reserve_data true
    key_name "log"
    emit_invalid_record_to_error false
    <parse>
      @type "kubernetes"
      time_format "%m%d %H:%M:%S.%N"
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
    </parse>
  </filter>
  <source>
    @type tail
    @id in_tail_container_logs
    path "/var/log/containers/*.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-containers.log.pos"
    tag "kubernetes.*"
    read_from_head true
    <parse>
      @type "multiline"
      format1 /^(?<partials>([^\n]+ (stdout|stderr) P [^\n]+\n)*)/
      format2 /(?<time>[^\n]+) (?<stream>stdout|stderr) F (?<log>[^\n]*)/
      format3 /|(?<json>{.*})/
      time_format "%Y-%m-%dT%H:%M:%S.%N%:z"
    </parse>
  </source>
  <filter kubernetes.**>
    @type record_transformer
    @id filter_crio_container_logs
    enable_ruby true
    remove_keys partials
    <record>
      log ${record["partials"]&.gsub(/.+ (stdout|stderr) P (.+)\n/, '\\2')}${record["log"]}
    </record>
  </filter>
  <filter kubernetes.**>
    @type parser
    @id filter_docker_container_logs
    key_name "json"
    remove_key_name_field true
    reserve_data true
    emit_invalid_record_to_error false
    <parse>
      @type "json"
      time_format "%Y-%m-%dT%H:%M:%S.%NZ"
      time_type string
    </parse>
  </filter>
  <source>
    @type tail
    @id in_tail_startupscript
    path "/var/log/startupscript.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-startupscript.log.pos"
    tag "startupscript"
    <parse>
      @type "syslog"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_docker
    path "/var/log/docker.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-docker.log.pos"
    tag "docker"
    <parse>
      @type "regexp"
      expression /^time="(?<time>[^)]*)" level=(?<severity>[^ ]*) msg="(?<message>[^"]*)"( err="(?<error>[^"]*)")?( statusCode=($<status_code>\d+))?/
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_etcd
    path "/var/log/etcd.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-etcd.log.pos"
    tag "k8s.etcd"
    <parse>
      @type "none"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_kubelet
    multiline_flush_interval 5s
    path "/var/log/kubelet.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-kubelet.log.pos"
    tag "k8s.kubelet"
    <parse>
      @type "kubernetes"
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_kube_proxy
    multiline_flush_interval 5s
    path "/var/log/kube-proxy.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-kube-proxy.log.pos"
    tag "k8s.kube-proxy"
    <parse>
      @type "kubernetes"
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_kube_apiserver
    multiline_flush_interval 5s
    path "/var/log/kube-apiserver.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-kube-apiserver.log.pos"
    tag "k8s.kube-apiserver"
    <parse>
      @type "kubernetes"
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_kube_controller_manager
    multiline_flush_interval 5s
    path "/var/log/kube-controller-manager.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-kube-controller-manager.log.pos"
    tag "k8s.kube-controller-manager"
    <parse>
      @type "kubernetes"
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_kube_scheduler
    multiline_flush_interval 5s
    path "/var/log/kube-scheduler.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-kube-scheduler.log.pos"
    tag "k8s.kube-scheduler"
    <parse>
      @type "kubernetes"
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_rescheduler
    multiline_flush_interval 5s
    path "/var/log/rescheduler.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-rescheduler.log.pos"
    tag "k8s.rescheduler"
    <parse>
      @type "kubernetes"
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_glbc
    multiline_flush_interval 5s
    path "/var/log/glbc.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-glbc.log.pos"
    tag "k8s.glbc"
    <parse>
      @type "kubernetes"
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_cluster_autoscaler
    multiline_flush_interval 5s
    path "/var/log/cluster-autoscaler.log"
    pos_file "/var/log/-kube-fluentd-operator--fluentd-cluster-autoscaler.log.pos"
    tag "k8s.cluster-autoscaler"
    <parse>
      @type "kubernetes"
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
      time_format "%m%d %H:%M:%S.%N"
    </parse>
  </source>
  <source>
    @type tail
    @id in_tail_kube_apiserver_audit
    multiline_flush_interval 5s
    path "/var/log/kubernetes/kube-apiserver-audit.log"
    pos_file "/var/log/-kube-fluentd-operator--kube-apiserver-audit.log.pos"
    tag "k8s.kube-apiserver-audit"
    <parse>
      @type "multiline"
      format_firstline "/^\\S+\\s+AUDIT:/"
      format1 /^(?<time>\S+) AUDIT:(?: (?:id="(?<id>(?:[^"\\]|\\.)*)"|ip="(?<ip>(?:[^"\\]|\\.)*)"|method="(?<method>(?:[^"\\]|\\.)*)"|user="(?<user>(?:[^"\\]                                                                                                                                                                |\\.)*)"|groups="(?<groups>(?:[^"\\]|\\.)*)"|as="(?<as>(?:[^"\\]|\\.)*)"|asgroups="(?<asgroups>(?:[^"\\]|\\.)*)"|namespace="(?<namespace>(?:[^"\\]|\\.)*)"|ur                                                                                                                                                                i="(?<uri>(?:[^"\\]|\\.)*)"|response="(?<response>(?:[^"\\]|\\.)*)"|\w+="(?:[^"\\]|\\.)*"))*/
      time_format "%Y-%m-%dT%T.%L%Z"
    </parse>
  </source>
  <filter kubernetes.**>
    @type kubernetes_metadata
    @id filter_kube_metadata
  </filter>
  <filter kubernetes.**>
    @type kubernetes_metadata
  </filter>
  <filter kubernetes.**>
    @type record_transformer
    enable_ruby true
    <record>
      kubernetes_namespace_container_name ${record["kubernetes"]["namespace_name"]}.${record["kubernetes"]["pod_name"]}.${record["kubernetes"]["container_nam                                                                                                                                                                e"]}
      container_info ${record["docker"]["container_id"]}-${record["stream"]}
    </record>
  </filter>
  <match kubernetes.**>
    @type rewrite_tag_filter
    <rule>
      key "kubernetes_namespace_container_name"
      pattern ^(.+)$
      tag "kube.$1"
    </rule>
  </match>
  <filter kube.*.*.*>
    @type record_modifier
    remove_keys "dummy_"
    <record>
      dummy_ ${record["kubernetes"]&.delete("master_url"); record["kubernetes"]&.delete("namespace_id"); if record["kubernetes"]&.has_key?("labels"); record[                                                                                                                                                                "kubernetes"]["labels"].delete("pod-template-generation"); record["kubernetes"]["labels"].delete("controller-revision-hash");  record["kubernetes"]["labels"]                                                                                                                                                                .delete("pod-template-hash"); end; nil}
    </record>
  </filter>
  <filter kube.*.*.*>
    @type record_transformer
    remove_keys kubernetes_namespace_container_name
  </filter>
  <filter kube.kube-system.**>
    @type parser
    reserve_data true
    key_name "log"
    emit_invalid_record_to_error false
    <parse>
      @type "kubernetes"
      time_format "%m%d %H:%M:%S.%N"
      expression /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/m
    </parse>
  </filter>
  <source>
    @type prometheus
  </source>
  <source>
    @type prometheus_monitor
  </source>
  <source>
    @type prometheus_output_monitor
  </source>
  <match systemd.** kube.kube-system.**>
    @type copy
    <store>
      @id X
      @type "vmware_loginsight"
      host "X.X.X.X"
      include_tag_key true
      log_text_keys ["log","msg","message"]
      port 9000
      scheme "http"
      serializer "json"
      ssl_verify false
      tag_key "tag"
    </store>
  </match>
  <match **>
    @type null
  </match>
</ROOT>

Your Error Log

2021-04-08 00:06:50 +0000 [error]: [in_systemd_docker] failed to read data from plugin storage file path="/var/log/-kube-fluentd-operator--fluentd-journald-cursor.json" error_class=Yajl::ParseError error="lexical error: invalid char in json text.\n                                     {}journal\":\"s=f32000cba2bf4de4964\n                     (right here) ------^\n"
2021-04-08 00:06:50 +0000 [error]: fluent/log.rb:369:error: config error file="/fluentd/etc/fluent.conf" error_class=Fluent::ConfigError error="Unexpected error: failed to read data from plugin storage file: '/var/log/-kube-fluentd-operator--fluentd-journald-cursor.json'"
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin/storage_local.rb:99:in `rescue in configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin/storage_local.rb:89:in `configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin.rb:173:in `configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin_helper/storage.rb:111:in `block in configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin_helper/storage.rb:103:in `each'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin_helper/storage.rb:103:in `configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluent-plugin-systemd-1.0.2/lib/fluent/plugin/in_systemd.rb:51:in `configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/plugin.rb:173:in `configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/root_agent.rb:317:in `add_source'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/root_agent.rb:158:in `block in configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/root_agent.rb:152:in `each'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/root_agent.rb:152:in `configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/engine.rb:105:in `configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/engine.rb:80:in `run_configure'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/supervisor.rb:551:in `run_supervisor'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/lib/fluent/command/fluentd.rb:341:in `<top (required)>'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:72:in `require'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:72:in `require'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/lib/ruby/gems/2.7.0/gems/fluentd-1.11.2/bin/fluentd:8:in `<top (required)>'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/bin/fluentd:23:in `load'
  2021-04-08 00:06:50 +0000 [debug]: core_ext/kernel_require.rb:72:require: /usr/local/bin/fluentd:23:in `<main>'

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions