feature-7338 Retry failed remote HTTP Config on startup#7349
feature-7338 Retry failed remote HTTP Config on startup#7349clever-trevor wants to merge 2 commits intoinfluxdata:masterfrom clever-trevor:feature-7338
Conversation
|
Hi,
Thanks for the update.
I took a look at 3723 and I can’t tell if it’s the same area, but it does look like in that case, the agent has already received/parsed it’s config phase, whereas here, the agent still hasn’t started.
Either way, I’m also happy to not have the dedicated CLI and use a hard-coded (say) 10 minute retry if the HTTP server is down if that makes it easier.
Thanks
|
|
|
|
As an interim and quick solution, could we put the retry logic around the HTTP config request so that it loops / sleeps until the web server is up? Can use a fixed retry period as it's better than the agent failing completely. |
| log.Printf("Unable to connect to HTTP config server. Retry in %d seconds", configHttpRetry) | ||
| } else if resp.StatusCode != http.StatusOK { | ||
| log.Printf("Error getting HTTP config, retry in %d seconds. Status=%s", configHttpRetry, resp.Status) | ||
| } else { |
There was a problem hiding this comment.
remove the use of else blocks. It indents the main-line code path and makes it harder to read. Use continue instead to restart the for loop.
| return nil, fmt.Errorf("failed to process config-http-retry ") | ||
| } | ||
| // Loop to retry http config request to prevent agent dying on startup if endpoint is unavailable | ||
| for { |
There was a problem hiding this comment.
should probably support max-retries before permanently failing. Not a big fan of looping forever.
There was a problem hiding this comment.
Sure, I did propose a CLI flag to set max retries further up the chain, but understand it adds to the mass of options.
Personally, I would like it to return forever, otherwise someone it means some manual intervention is required, but others may not so think an option would be useful to drive desired behaviour.
Would using an environment variable setting be an alternative approach to a CLI switch here?
e.g. export TELEGRAF_HTTP_RETRIES=50
It not set, then we just use a default of (say) 10 iterations every 1 minute.
If so, then I'll take a look at building this in an resubmitting for review.
|
I think I misunderstood what you were trying to accomplish. I read it over again and left some feedback. I'd be in support of getting this in with a couple small changes. |
|
We have some longer term plans for configuration loading that will change things up with where the configuration is loaded. The loading method will be configured from inside a main local configuration file, along the lines of: [[config.http]]
url = "http://example.org"
retry = 5
timeout = "5s"The reason I don't want to add new API surface area is because when these plans are implemented it will all need to go away. Since we don't remove API to prevent breaking installs, we will have to continue to maintain the functionality. Switching to environment variables doesn't really change the fundamentals, it's just a different way of setting the options. How are you launching Telegraf right now? You may be able to just configure your init system to restart Telegraf on startup failure. |
|
Hi, all in favour of strategic, consistent ways of doing things, so supportive of that. I use SystemD on Linux and SCM on Windows, but it seems a bit clunky to rely on a failure to trigger a restart, "fail -> retry -> fail -> retry", etc. If the long term is not too long away, I'm happy with that approach, but if we're talking months or longer, could I ask this be included as a short term workaround for this particular issue? The only other comment is that the "[[config.input]]" section implies there is a local config file at startup so just wanted to check this would be a basic stub file which allows the agent to start, and on initiation, would then pull the remote config and layer that on top. i.e. merge local and remote config e.g. |
|
Yes, the plan is to have the configuration files merged, so you could have local and remote configs, as well as multiple remote configurations. You would need a base local config file, but in theory it could be as small as: [[config.http]]
url = "http://example.org"We are planning ways to have Telegraf reload configuration automatically either based on polling or using events. We are definitely talking months before I expect you will see this though. In the meantime, I'd be okay with adding a simple retry loop without any configuration, doing a reasonable but limited number of retries, perhaps ~3 retries at 10s each. |
|
Thanks for the update Daniel, the future state sounds a much better approach. Just to reiterate the need for retries. If other users of Telegraf want a shorter and fixed retry period, the pull request would support that also. Could you consider merging my code (or a form of it) with the option to specify retry intervals and attempts through the Env variables as it provides the flexibility to meet my own use case and others too. |
|
Not with the configurable options since these cannot be removed, and we wouldn't need them as part of the future design. Setting up the init system to restart should be a decent workaround for now, or you can always build a custom version with whatever you need. |
|
Ok, understand the approach and how this could build in some legacy which is difficult to remove. |
|
One tool that could be helpful is https://nssm.cc/, I know many Windows Telegraf users prefer to setup the service with this tool. |
|
Hey @schmorgs - based off our conversation I'm going to close this PR for our tracking purposes. I've kept #8854 open to keep track of this feature as we build out our larger configuration management capabilities. |
Required for all PRs:
Retry remote HTTP config pull on agent startup to prevent agent stopping if HTTP endpoint is down
#7338