Segfault with amqp plugin since 1.21.3 which also corrupts listening sockets #10634

@norg

Description

Relevant telegraf.conf

[global_tags]

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  logfile = "/var/log/telegraf/telegraf.log"
  hostname = "SECRET"
  omit_hostname = false

[[outputs.amqp]]
    brokers = ["amqp://SECRET:5672/"]
    exchange = "SECRET"
    
[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]

[[inputs.diskio]]

[[inputs.kernel]]

[[inputs.mem]]

[[inputs.processes]]

[[inputs.swap]]

[[inputs.system]]

[[inputs.net]]

[[inputs.suricata]]
    source = "/tmp/stats.sock"
    delimiter = "."

Logs from Telegraf

Feb 08 07:57:49 SECRET systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
Feb 08 07:57:49 SECRET telegraf[2600218]: 2022-02-08T07:57:49Z I! Starting Telegraf 1.21.3
Feb 10 18:17:45 SECRET telegraf[2600218]: panic: runtime error: invalid memory address or nil pointer dereference
Feb 10 18:17:45 SECRET telegraf[2600218]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x30d880f]
Feb 10 18:17:45 SECRET telegraf[2600218]: goroutine 1008013 [running]:
Feb 10 18:17:45 SECRET telegraf[2600218]: github.com/influxdata/telegraf/plugins/outputs/amqp.(*AMQP).Write(0xc000889200, {0xc00541a000, 0x3e8, 0x3e8})
Feb 10 18:17:45 SECRET telegraf[2600218]:         /go/src/github.com/influxdata/telegraf/plugins/outputs/amqp/amqp.go:263 +0x34f
Feb 10 18:17:45 SECRET telegraf[2600218]: github.com/influxdata/telegraf/models.(*RunningOutput).write(0xc0003d4c80, {0xc00541a000, 0x3e8, 0x3e8})
Feb 10 18:17:45 SECRET telegraf[2600218]:         /go/src/github.com/influxdata/telegraf/models/running_output.go:244 +0x118
Feb 10 18:17:45 SECRET telegraf[2600218]: github.com/influxdata/telegraf/models.(*RunningOutput).WriteBatch(0xc0003d4c80)
Feb 10 18:17:45 SECRET telegraf[2600218]:         /go/src/github.com/influxdata/telegraf/models/running_output.go:218 +0x58
Feb 10 18:17:45 SECRET telegraf[2600218]: github.com/influxdata/telegraf/agent.(*Agent).flushOnce.func1()
Feb 10 18:17:45 SECRET telegraf[2600218]:         /go/src/github.com/influxdata/telegraf/agent/agent.go:829 +0x29
Feb 10 18:17:45 SECRET telegraf[2600218]: created by github.com/influxdata/telegraf/agent.(*Agent).flushOnce
Feb 10 18:17:45 SECRET telegraf[2600218]:         /go/src/github.com/influxdata/telegraf/agent/agent.go:828 +0xb8
Feb 10 18:17:45 SECRET systemd[1]: telegraf.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Feb 10 18:17:45 SECRET systemd[1]: telegraf.service: Failed with result 'exit-code'.
Feb 10 18:17:45 SECRET systemd[1]: telegraf.service: Consumed 1h 42min 35.123s CPU time.
Feb 10 18:17:45 SECRET systemd[1]: telegraf.service: Scheduled restart job, restart counter is at 1.
Feb 10 18:17:45 SECRET systemd[1]: Stopped The plugin-driven server agent for reporting metrics into InfluxDB.
Feb 10 18:17:45 SECRET systemd[1]: telegraf.service: Consumed 1h 42min 35.123s CPU time.
Feb 10 18:17:45 SECRET systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
Feb 10 18:17:45 SECRET telegraf[4149180]: 2022-02-10T18:17:45Z I! Starting Telegraf 1.21.3
Feb 10 18:18:00 SECRET systemd[1]: telegraf.service: Main process exited, code=exited, status=1/FAILURE
Feb 10 18:18:00 SECRET systemd[1]: telegraf.service: Failed with result 'exit-code'.

System info

Telegraf 1.21.3, Debian 11 Bullseye

Docker

No response

Steps to reproduce

  1. Update from Telegraf 1.21.2 to 1.21.3 via the official Influx repo
  2. Wait some time, it happens randomly
  3. Wonder why telegraf can't restart, see that it says the /tmp/stats.sock is in use
    ...

Expected behavior

I would expect steps 2 and 3 not to happen.

Actual behavior

Telegraf crashes with the segfault and, as an aftereffect, the /tmp/stats.sock socket also seems to be broken.

If I remove the socket and restart Telegraf, it works again for some time, even days.

Additional info

Based on the crash output, I think PR #10360 has to do with it: we never saw this issue with 1.21.2, and it started to come up with 1.21.3. Line 263 of amqp.go is mentioned in the goroutine trace, and this matches #10360.

Metadata

Labels

area/system, bug (unexpected problem or unintended behavior)
