Skip to content

feature-7338 Retry failed remote HTTP Config on startup#7349

Closed
clever-trevor wants to merge 2 commits intoinfluxdata:masterfrom
clever-trevor:feature-7338
Closed

feature-7338 Retry failed remote HTTP Config on startup#7349
clever-trevor wants to merge 2 commits intoinfluxdata:masterfrom
clever-trevor:feature-7338

Conversation

@clever-trevor
Copy link
Copy Markdown

Required for all PRs:

  • [Y] Signed CLA.
  • [Y] Associated README.md updated.
  • [Y] Has appropriate unit tests.

Retry remote HTTP config pull on agent startup to prevent agent stopping if HTTP endpoint is down
#7338

@danielnelson
Copy link
Copy Markdown
Contributor

I'd prefer to avoid adding a new cli option specifically for this, but perhaps this could be part of a solution to #3723. @ssoroka I know you working in this area of the code, do you think it would make sense to tackle this at the same time?

@clever-trevor
Copy link
Copy Markdown
Author

clever-trevor commented Apr 17, 2020 via email

@ssoroka
Copy link
Copy Markdown
Contributor

ssoroka commented Apr 20, 2020

I think the hope is to move the retry logic out of the plugins so that you can decide if they should retry at the agent level, and then get consistent behavior across all of them. I guess the one thing that would get a bit weird here is that the output plugin's Connect method would get called more than once, which it currently does not. That may expose some new bugs with plugins that don't expect this. That's probably still the way to go, though.

@clever-trevor
Copy link
Copy Markdown
Author

As an interim and quick solution, could we put the retry logic around the HTTP config request so that it loops / sleeps until the web server is up?

Can use a fixed retry period as it's better than the agent failing completely.

log.Printf("Unable to connect to HTTP config server. Retry in %d seconds", configHttpRetry)
} else if resp.StatusCode != http.StatusOK {
log.Printf("Error getting HTTP config, retry in %d seconds. Status=%s", configHttpRetry, resp.Status)
} else {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the use of else blocks. It indents the main-line code path and makes it harder to read. Use continue instead to restart the for loop.

return nil, fmt.Errorf("failed to process config-http-retry ")
}
// Loop to retry http config request to prevent agent dying on startup if endpoint is unavailable
for {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should probably support max-retries before permanently failing. Not a big fan of looping forever.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I did propose a CLI flag to set max retries further up the chain, but understand it adds to the mass of options.

Personally, I would like it to return forever, otherwise someone it means some manual intervention is required, but others may not so think an option would be useful to drive desired behaviour.

Would using an environment variable setting be an alternative approach to a CLI switch here?
e.g. export TELEGRAF_HTTP_RETRIES=50
It not set, then we just use a default of (say) 10 iterations every 1 minute.

If so, then I'll take a look at building this in an resubmitting for review.

@ssoroka
Copy link
Copy Markdown
Contributor

ssoroka commented Jun 9, 2020

I think I misunderstood what you were trying to accomplish. I read it over again and left some feedback. I'd be in support of getting this in with a couple small changes.

@danielnelson
Copy link
Copy Markdown
Contributor

We have some longer term plans for configuration loading that will change things up with where the configuration is loaded. The loading method will be configured from inside a main local configuration file, along the lines of:

[[config.http]]
  url = "http://example.org"
  retry = 5
  timeout = "5s"

The reason I don't want to add new API surface area is because when these plans are implemented it will all need to go away. Since we don't remove API to prevent breaking installs, we will have to continue to maintain the functionality.

Switching to environment variables doesn't really change the fundamentals, it's just a different way of setting the options.

How are you launching Telegraf right now? You may be able to just configure your init system to restart Telegraf on startup failure.

@clever-trevor
Copy link
Copy Markdown
Author

Hi, all in favour of strategic, consistent ways of doing things, so supportive of that.

I use SystemD on Linux and SCM on Windows, but it seems a bit clunky to rely on a failure to trigger a restart, "fail -> retry -> fail -> retry", etc.
Much neater for the agent to handle as many error conditions internally and keep systemd/SCM clean and handling real exceptions.

If the long term is not too long away, I'm happy with that approach, but if we're talking months or longer, could I ask this be included as a short term workaround for this particular issue?

The only other comment is that the "[[config.input]]" section implies there is a local config file at startup so just wanted to check this would be a basic stub file which allows the agent to start, and on initiation, would then pull the remote config and layer that on top. i.e. merge local and remote config
If so, a welcome addition to this would also be the ability to re-poll the HTTP server for new config on a schedule and to reload if a change was detected.

e.g.

[[config.http]]
  url = "http://example.org"
  retry = 5
  timeout = "5s"
  reload = "1h"

@danielnelson
Copy link
Copy Markdown
Contributor

Yes, the plan is to have the configuration files merged, so you could have local and remote configs, as well as multiple remote configurations.

You would need a base local config file, but in theory it could be as small as:

[[config.http]]
  url = "http://example.org"

We are planning ways to have Telegraf reload configuration automatically either based on polling or using events.

We are definitely talking months before I expect you will see this though. In the meantime, I'd be okay with adding a simple retry loop without any configuration, doing a reasonable but limited number of retries, perhaps ~3 retries at 10s each.

@clever-trevor
Copy link
Copy Markdown
Author

Thanks for the update Daniel, the future state sounds a much better approach.

Just to reiterate the need for retries.
If (as will happen) the HTTP server is down for some unplanned and prolonged reason, we don't want any Telegraf agents which start up during that time to fail as it means someone or something needs to intervene and take action,
So whilst a fixed x retries @ y seconds would help, in my environment, I'd want it to keep retrying and auto-recover itself when the endpoint is back online.

If other users of Telegraf want a shorter and fixed retry period, the pull request would support that also.

Could you consider merging my code (or a form of it) with the option to specify retry intervals and attempts through the Env variables as it provides the flexibility to meet my own use case and others too.

@danielnelson
Copy link
Copy Markdown
Contributor

Not with the configurable options since these cannot be removed, and we wouldn't need them as part of the future design. Setting up the init system to restart should be a decent workaround for now, or you can always build a custom version with whatever you need.

@clever-trevor
Copy link
Copy Markdown
Author

Ok, understand the approach and how this could build in some legacy which is difficult to remove.
Will look into the systemd restart approach. On Windows however, I guess SCM would be the workaround there but there are no options to set Recovery action when installed Telegraf as a service. I realise we could hack the registry after installing as a service, but wanted to check if there are any options in the built-in Telegraf service install option to set recovery actions?

@danielnelson
Copy link
Copy Markdown
Contributor

danielnelson commented Jun 18, 2020

One tool that could be helpful is https://nssm.cc/, I know many Windows Telegraf users prefer to setup the service with this tool.

@sjwang90 sjwang90 added the feature request Requests for new plugin and for new features to existing plugins label Nov 24, 2020
@helenosheaa helenosheaa self-assigned this Feb 2, 2021
@sjwang90
Copy link
Copy Markdown
Contributor

sjwang90 commented Mar 2, 2021

Hey @schmorgs - based off our conversation I'm going to close this PR for our tracking purposes. I've kept #8854 open to keep track of this feature as we build out our larger configuration management capabilities.

@sjwang90 sjwang90 closed this Mar 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature request Requests for new plugin and for new features to existing plugins

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants