Skip to content

inputs.mqtt_consumer reconnects after network interruption but only yields timeouts thereafter #8722

@mkarlesky

Description

@mkarlesky

Relevant telegraf.conf:

Note: The MQTT Broker is AWS IoT Core.

[[inputs.mqtt_consumer]]
  servers = ["ssl://<snip>-ats.iot.us-east-1.amazonaws.com:8883"]

  ## Topics that will be subscribed to.
  topics = [
  "<snip>",
  "<snip>",
  ]

  qos = 0

  connection_timeout = "30s"

  client_id = "<snip>"

  tls_ca = "/etc/telegraf/ca.pem"
  tls_cert = "/etc/telegraf/cert.pem"
  tls_key = "/etc/telegraf/key.pem"

  name_override = "Relative Humidity"

  data_format = "json"

  fieldpass = ["value"]

  json_time_key = "timestamp"
  json_time_format = "unix_ms"

  [inputs.mqtt_consumer.tags]
    Units = "%"

System info:

Telegraf 1.17
Ubuntu 14.04.5

Docker

docker-compose.yaml:

  telegraf:
    image: telegraf:1.17-alpine
    restart: always
    environment:
      - INFLUXDB_V2_URL=https://us-west-2-1.aws.cloud2.influxdata.com
      - INFLUX_TOKEN=<snip>
      - OUR_ORGANIZATION=<snip>
      - INFLUXDB_BUCKET=<snip>
    volumes:
      - …/aws-mqtt-influxcloud-v2.conf:/etc/telegraf/telegraf.conf:ro
      - …/AmazonRootCA1.pem:/etc/telegraf/ca.pem
      - …/certificate.pem.crt:/etc/telegraf/cert.pem
      - …/private.pem.key:/etc/telegraf/key.pem

Steps to reproduce:

  1. Start Telegraf with MQTT consumer configuration for AWS IoT Core and allow to run successfully for a day.
  2. LAN device automatically reboots early in the morning, and Telegraf disconnects.
  3. Telegraf reconnects to MQTT broker after LAN device recovers.
  4. Telegraf MQTT collection timeouts take the place of metrics gathering thereafter.

Expected behavior:

Recovery of MQTT subscriptions after broker reconnection following network interruption.

Actual behavior:

After a connection interruption Telegraf successfully reconnects to the MQTT broker but collection results in timeouts. Only restarting Telegraf fixes the problem.

This problem was reliably reproducible each morning at 6am when a network device automatically rebooted and interrupted the connection to the AWS IoT Core MQTT Broker. Removing the automatic device reboot solved the problem.

Additional info:

 | 2021-01-13T12:23:16Z I! Starting Telegraf 1.17.0
 | 2021-01-13T12:23:16Z I! Using config file: /etc/telegraf/telegraf.conf
 | 2021-01-13T12:23:16Z I! Loaded inputs: mqtt_consumer (2x)
 | 2021-01-13T12:23:16Z I! Loaded aggregators:
 | 2021-01-13T12:23:16Z I! Loaded processors: rename
 | 2021-01-13T12:23:16Z I! Loaded outputs: influxdb_v2
 | 2021-01-13T12:23:16Z I! Tags enabled:
 | 2021-01-13T12:23:16Z I! [agent] Config: Interval:20s, Quiet:false, Hostname:"738565c8ccdc", Flush Interval:10s
 | 2021-01-13T12:23:16Z I! [inputs.mqtt_consumer] Connected [ssl://<snip>-ats.iot.us-east-1.amazonaws.com:8883]
 | 2021-01-13T12:23:16Z I! [inputs.mqtt_consumer] Connected [ssl://<snip>-ats.iot.us-east-1.amazonaws.com:8883]

 …

 | 2021-01-14T11:00:12Z E! [outputs.influxdb_v2] when writing to [https://us-west-2-1.aws.cloud2.influxdata.com]: Post "https://us-west-2-1.aws.cloud2.influxdata.com/api/v2/write?…": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
 | 2021-01-14T11:00:12Z E! [agent] Error writing to outputs.influxdb_v2: Post "https://us-west-2-1.aws.cloud2.influxdata.com/api/v2/write?…": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
 | 2021-01-14T11:00:22Z E! [outputs.influxdb_v2] when writing to [https://us-west-2-1.aws.cloud2.influxdata.com]: Post "https://us-west-2-1.aws.cloud2.influxdata.com/api/v2/write?…": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
 | 2021-01-14T11:00:22Z E! [agent] Error writing to outputs.influxdb_v2: Post "https://us-west-2-1.aws.cloud2.influxdata.com/api/v2/write?…": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
 | 2021-01-14T11:01:21Z E! [inputs.mqtt_consumer] Error in plugin: connection lost: pingresp not received, disconnecting

 …

 | 2021-01-14T11:02:40Z I! [inputs.mqtt_consumer] Connected [ssl://<snip>-ats.iot.us-east-1.amazonaws.com:8883]
 | 2021-01-14T11:02:41Z I! [inputs.mqtt_consumer] Connected [ssl://<snip>-ats.iot.us-east-1.amazonaws.com:8883]
 | 2021-01-14T11:04:00Z E! [inputs.mqtt_consumer] Error in plugin: connection lost: pingresp not received, disconnecting
 | 2021-01-14T11:04:01Z E! [inputs.mqtt_consumer] Error in plugin: connection lost: pingresp not received, disconnecting
 | 2021-01-14T11:04:20Z I! [inputs.mqtt_consumer] Connected [ssl://<snip>-ats.iot.us-east-1.amazonaws.com:8883]
 | 2021-01-14T11:04:20Z I! [inputs.mqtt_consumer] Connected [ssl://<snip>-ats.iot.us-east-1.amazonaws.com:8883]
 | 2021-01-14T11:04:40Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 20s
 | 2021-01-14T11:04:40Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 20s
 | 2021-01-14T11:05:00Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 20s
 | 2021-01-14T11:05:00Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 20s
 | 2021-01-14T11:05:20Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 20s
 | 2021-01-14T11:05:20Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 20s
 | 2021-01-14T11:05:40Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 20s

Metadata

Metadata

Assignees

Labels

area/mqttbugunexpected problem or unintended behavior

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions