Fluentd keeps Windows k8s pods from shutting down #3340

@dotsuber

Describe the bug
I'm running Fluentd 1.11.2 on Windows nodes in EKS. Sometimes pods get stuck in the "Terminating" status for hours or even days, until I restart the VM. I checked the kubelet and Docker logs, and the cause appears to be a container log file that cannot be deleted. Once I delete the Fluentd pod, the stuck pod is deleted successfully. I believe the root cause is that Fluentd holds a lock on the file.
There is a thread HERE describing the same problem in Fluent Bit, and the last comment says it was fixed in the latest version.

To Reproduce
I haven't been able to reproduce the bug on demand; it happens once every few days.

Expected behavior
I expect a fix in Fluentd similar to the one in Fluent Bit, so that Fluentd no longer locks the log files.

Your Environment

  • Fluentd version: 1.11.2
  • Operating system: Windows server 2019

Your Configuration

<source>
  @type tail
  @log_level info
  path /var/log/containers/*.log
  exclude_path ["/var/log/containers/log-tailer-*", "/var/log/containers/geneva-logger-*"]
  pos_file /var/log/fluentd-docker.pos
  tag kubernetes.*
  format json
  time_key time
  time_format %Y-%m-%dT%H:%M:%S
  read_from_head true
</source>

# Don't care about fluentd logs
<match **fluentd**.log>
  @type null
</match>

<filter kubernetes.var.log.containers.**>
  @type kubernetes_metadata
</filter>

<match kubernetes.var.log.containers.**>
  @type rewrite_tag_filter
  # <rule>
  #   key log
  #   pattern .+
  #   tag kubernetes.prod
  # </rule>
  <rule>
    key $['kubernetes']['namespace_name']
    pattern ^(.+)$
    tag kubernetes.$1
  </rule>
</match>

<match kubernetes.**>
  @type forward
  require_ack_response true
  expire_dns_cache 300
  <buffer>
    @type file
    path /var/log/td-agent/buffer/kubernetes.{{ .Values.geneva.chartNamespace }}
    chunk_limit_size 4m
    queued_chunks_limit_size 4096
    flush_interval 10s
    flush_thread_count 8
    retry_max_times 8
    retry_timeout 5m
  </buffer>
  <server>
    host {{ printf "%s.%s.svc.cluster.local" .Values.geneva.service.name .Values.geneva.chartNamespace }}
    port {{ .Values.geneva.fluentd.port }}
  </server>
{{- if .Values.certificate }}
  transport tls
  tls_cert_path C:\tmp\fluentd\secrets\fluentd-cert.pem
  tls_allow_self_signed_cert true
  tls_version TLSv1_2
{{- end }}
</match>

E0325 00:32:14.667270 4400 remote_runtime.go:261] RemoveContainer "cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d" from runtime service failed: rpc error: code = Unknown desc = failed to remove container "cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d": Error response from daemon: unable to remove filesystem for cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d: CreateFile C:\ProgramData\docker\containers\cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d\cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d-json.log: Access is denied.

Additional context
Is there an update or a fix for this problem, like the one in Fluent Bit?
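
One setting that may be worth testing in the meantime is the `open_on_every_update` parameter of the `tail` input, which makes the plugin open and close the file on every update instead of keeping it open until rotation. The fragment below is only a sketch based on the source section above: it is an assumption that a shorter-lived file handle helps here, and whether it avoids the "Access is denied" failure on Windows is not confirmed.

```
<source>
  @type tail
  @log_level info
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-docker.pos
  tag kubernetes.*
  format json
  # Assumption, not a confirmed fix: reopen the file on each update
  # rather than holding the handle open until the file is rotated.
  open_on_every_update true
</source>
```

The remaining parameters from the original source section (exclude_path, time_key, time_format, read_from_head) are omitted for brevity and would stay unchanged.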

Labels

bug (Something isn't working), windows