feature-7338 Retry failed remote HTTP Config on startup by clever-trevor · Pull Request #7349 · influxdata/telegraf

clever-trevor · 2020-04-16T10:22:23Z

Required for all PRs:

[Y] Signed CLA.
[Y] Associated README.md updated.
[Y] Has appropriate unit tests.

Retry remote HTTP config pull on agent startup to prevent agent stopping if HTTP endpoint is down
#7338

danielnelson · 2020-04-16T21:13:13Z

I'd prefer to avoid adding a new cli option specifically for this, but perhaps this could be part of a solution to #3723. @ssoroka I know you working in this area of the code, do you think it would make sense to tackle this at the same time?

clever-trevor · 2020-04-17T09:11:18Z

Hi, Thanks for the update. I took a look at 3723 and I can’t tell if it’s the same area, but it does look like in that case, the agent has already received/parsed it’s config phase, whereas here, the agent still hasn’t started. Either way, I’m also happy to not have the dedicated CLI and use a hard-coded (say) 10 minute retry if the HTTP server is down if that makes it easier. Thanks

ssoroka · 2020-04-20T15:04:09Z

I think the hope is to move the retry logic out of the plugins so that you can decide if they should retry at the agent level, and then get consistent behavior across all of them. I guess the one thing that would get a bit weird here is that the output plugin's Connect method would get called more than once, which it currently does not. That may expose some new bugs with plugins that don't expect this. That's probably still the way to go, though.

clever-trevor · 2020-06-09T07:51:33Z

As an interim and quick solution, could we put the retry logic around the HTTP config request so that it loops / sleeps until the web server is up?

Can use a fixed retry period as it's better than the agent failing completely.

ssoroka · 2020-06-09T17:50:13Z

internal/config/config.go

+			log.Printf("Unable to connect to HTTP config server.  Retry in %d seconds", configHttpRetry)
+		} else if resp.StatusCode != http.StatusOK {
+			log.Printf("Error getting HTTP config, retry in %d seconds.  Status=%s", configHttpRetry, resp.Status)
+		} else {


remove the use of else blocks. It indents the main-line code path and makes it harder to read. Use continue instead to restart the for loop.

ssoroka · 2020-06-09T17:50:43Z

internal/config/config.go

+		return nil, fmt.Errorf("failed to process config-http-retry ")
 	}
+	// Loop to retry http config request to prevent agent dying on startup if endpoint is unavailable
+	for {


should probably support max-retries before permanently failing. Not a big fan of looping forever.

Sure, I did propose a CLI flag to set max retries further up the chain, but understand it adds to the mass of options.

Personally, I would like it to return forever, otherwise someone it means some manual intervention is required, but others may not so think an option would be useful to drive desired behaviour.

Would using an environment variable setting be an alternative approach to a CLI switch here?
e.g. export TELEGRAF_HTTP_RETRIES=50
It not set, then we just use a default of (say) 10 iterations every 1 minute.

If so, then I'll take a look at building this in an resubmitting for review.

ssoroka · 2020-06-09T17:51:52Z

I think I misunderstood what you were trying to accomplish. I read it over again and left some feedback. I'd be in support of getting this in with a couple small changes.

danielnelson · 2020-06-10T19:13:17Z

We have some longer term plans for configuration loading that will change things up with where the configuration is loaded. The loading method will be configured from inside a main local configuration file, along the lines of:

[[config.http]]
  url = "http://example.org"
  retry = 5
  timeout = "5s"

The reason I don't want to add new API surface area is because when these plans are implemented it will all need to go away. Since we don't remove API to prevent breaking installs, we will have to continue to maintain the functionality.

Switching to environment variables doesn't really change the fundamentals, it's just a different way of setting the options.

How are you launching Telegraf right now? You may be able to just configure your init system to restart Telegraf on startup failure.

clever-trevor · 2020-06-10T19:50:29Z

Hi, all in favour of strategic, consistent ways of doing things, so supportive of that.

I use SystemD on Linux and SCM on Windows, but it seems a bit clunky to rely on a failure to trigger a restart, "fail -> retry -> fail -> retry", etc.
Much neater for the agent to handle as many error conditions internally and keep systemd/SCM clean and handling real exceptions.

If the long term is not too long away, I'm happy with that approach, but if we're talking months or longer, could I ask this be included as a short term workaround for this particular issue?

The only other comment is that the "[[config.input]]" section implies there is a local config file at startup so just wanted to check this would be a basic stub file which allows the agent to start, and on initiation, would then pull the remote config and layer that on top. i.e. merge local and remote config
If so, a welcome addition to this would also be the ability to re-poll the HTTP server for new config on a schedule and to reload if a change was detected.

e.g.

[[config.http]]
  url = "http://example.org"
  retry = 5
  timeout = "5s"
  reload = "1h"

danielnelson · 2020-06-10T20:32:57Z

Yes, the plan is to have the configuration files merged, so you could have local and remote configs, as well as multiple remote configurations.

You would need a base local config file, but in theory it could be as small as:

[[config.http]]
  url = "http://example.org"

We are planning ways to have Telegraf reload configuration automatically either based on polling or using events.

We are definitely talking months before I expect you will see this though. In the meantime, I'd be okay with adding a simple retry loop without any configuration, doing a reasonable but limited number of retries, perhaps ~3 retries at 10s each.

clever-trevor · 2020-06-11T06:24:12Z

Thanks for the update Daniel, the future state sounds a much better approach.

Just to reiterate the need for retries.
If (as will happen) the HTTP server is down for some unplanned and prolonged reason, we don't want any Telegraf agents which start up during that time to fail as it means someone or something needs to intervene and take action,
So whilst a fixed x retries @ y seconds would help, in my environment, I'd want it to keep retrying and auto-recover itself when the endpoint is back online.

If other users of Telegraf want a shorter and fixed retry period, the pull request would support that also.

Could you consider merging my code (or a form of it) with the option to specify retry intervals and attempts through the Env variables as it provides the flexibility to meet my own use case and others too.

danielnelson · 2020-06-11T17:27:52Z

Not with the configurable options since these cannot be removed, and we wouldn't need them as part of the future design. Setting up the init system to restart should be a decent workaround for now, or you can always build a custom version with whatever you need.

clever-trevor · 2020-06-12T07:59:00Z

Ok, understand the approach and how this could build in some legacy which is difficult to remove.
Will look into the systemd restart approach. On Windows however, I guess SCM would be the workaround there but there are no options to set Recovery action when installed Telegraf as a service. I realise we could hack the registry after installing as a service, but wanted to check if there are any options in the built-in Telegraf service install option to set recovery actions?

danielnelson · 2020-06-18T19:14:52Z

One tool that could be helpful is https://nssm.cc/, I know many Windows Telegraf users prefer to setup the service with this tool.

sjwang90 · 2021-03-02T20:40:58Z

Hey @schmorgs - based off our conversation I'm going to close this PR for our tracking purposes. I've kept #8854 open to keep track of this feature as we build out our larger configuration management capabilities.

clever-trevor added 2 commits April 16, 2020 11:17

feature-7338 Retry failed remote HTTP Config on startup

9c56eb8

feature-7338 Retry failed remote HTTP Config on startup

7cb5e1d

clever-trevor mentioned this pull request Apr 16, 2020

Agent won't start if remote HTTP config endpoint is down #7338

Closed

ssoroka reviewed Jun 9, 2020

View reviewed changes

sjwang90 mentioned this pull request Nov 24, 2020

HTTP Config Retry on Failure Feature/7338 #7339

Closed

sjwang90 added the feature request Requests for new plugin and for new features to existing plugins label Nov 24, 2020

helenosheaa self-assigned this Feb 2, 2021

helenosheaa mentioned this pull request Feb 4, 2021

Add default retry for load config via url #8803

Merged

sjwang90 closed this Mar 2, 2021

Conversation

clever-trevor commented Apr 16, 2020

Required for all PRs:

Uh oh!

danielnelson commented Apr 16, 2020

Uh oh!

clever-trevor commented Apr 17, 2020 via email • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ssoroka commented Apr 20, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

clever-trevor commented Jun 9, 2020

Uh oh!

ssoroka Jun 9, 2020

Choose a reason for hiding this comment

Uh oh!

ssoroka Jun 9, 2020

Choose a reason for hiding this comment

Uh oh!

clever-trevor Jun 10, 2020

Choose a reason for hiding this comment

Uh oh!

ssoroka commented Jun 9, 2020

Uh oh!

danielnelson commented Jun 10, 2020

Uh oh!

clever-trevor commented Jun 10, 2020

Uh oh!

danielnelson commented Jun 10, 2020

Uh oh!

clever-trevor commented Jun 11, 2020

Uh oh!

danielnelson commented Jun 11, 2020

Uh oh!

clever-trevor commented Jun 12, 2020

Uh oh!

danielnelson commented Jun 18, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sjwang90 commented Mar 2, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

clever-trevor commented Apr 17, 2020 via email •

edited

Loading

ssoroka commented Apr 20, 2020 •

edited

Loading

danielnelson commented Jun 18, 2020 •

edited

Loading