Feature Request
Related: #7338
Proposal:
Users should be able to designate the interval and number of retries used when loading their config from a URL while the endpoint is down.
Current behavior:
Right now, when loading config from a URL fails because the remote endpoint is down, Telegraf retries three times at 10s intervals. These values cannot currently be changed through environment variables or command-line flags (based on #8803).
Desired behavior:
Users need some way to configure the interval and number of retries that govern loading the config from a URL.
Use case:
From @schmorgs:
Planning to use Telegraf in production across a large number of servers across the globe, and there are many points where breakages could happen, especially in countries where there is very low bandwidth and old infrastructure. Along with that comes many standards and versions of OS, etc, hence our approach to manage config centrally so that we don't have to navigate the variety of ways of reaching an endpoint.
So if Telegraf starts up and there happened to be a breakage somewhere (NW connectivity, Web Server down, etc), the agent will die. On RHEL7/8 and Windows, we can utilise systemd/SCM to configure infinite retries on the agent so that even if it does die, it will be restarted.
But RHEL6 doesn't have systemd and so we would end up writing some sort of watcher daemon as well which seems a bit overkill if the agent could handle (at least) this condition.
This matters because Telegraf will be our primary monitoring agent, so we want to make it as available and robust as possible. We would still implement external controls such as systemd restarts to provide an extra layer of resilience, but the more the agent can do in this area, the better.
In some cases, the window in which the agent is unable to get its config would be fairly small, as the agent only pulls config on startup. But we want the agent to pull its config periodically so that it can be configured centrally and pick up changes automatically. I understand this is part of a longer-term strategy for Telegraf, but in the meantime we HUP the agent periodically as a workaround, so the agent now has a constant reliance on the HTTP endpoint and is therefore more likely to encounter a problem.
Whether a switch, environment variable, config file on the server, etc, I'm happy to see whichever approach works best.