fix: windows service - graceful shutdown of telegraf#9616
Merged
reimda merged 2 commits intoinfluxdata:masterfrom Dec 2, 2021
Merged
fix: windows service - graceful shutdown of telegraf#9616reimda merged 2 commits intoinfluxdata:masterfrom
reimda merged 2 commits intoinfluxdata:masterfrom
Conversation
Contributor
Author
|
Any maintainers able to review/comment on this draft PR? |
17d0f26 to
d1c4047
Compare
…e to end when running as Windows service
d1c4047 to
ee0fc31
Compare
Contributor
|
📦 Looks like new artifacts were built from this PR. Expand this list to get them here! 🐯Artifact URLs |
Contributor
Author
|
Bump - can we get this reviewed? |
powersj
approved these changes
Nov 22, 2021
Contributor
powersj
left a comment
There was a problem hiding this comment.
I've been trying to play with this on Windows 10 and it does seem to do a better job of cleaning up the Telegraf and stopping more gracefully.
reimda
approved these changes
Dec 2, 2021
MyaLongmire
pushed a commit
that referenced
this pull request
Dec 8, 2021
(cherry picked from commit b954e56)
powersj
pushed a commit
to powersj/telegraf
that referenced
this pull request
Jan 21, 2022
phemmer
added a commit
to phemmer/telegraf
that referenced
this pull request
Feb 18, 2022
* origin/master: (133 commits) chore: restart service if it is already running and upgraded via RPM (influxdata#9970) feat: update etc/telegraf.conf and etc/telegraf_windows.conf (influxdata#10237) fix: Handle duplicate registration of protocol-buffer files gracefully. (influxdata#10188) fix(http_listener_v2): fix panic on close (influxdata#10132) feat: add Vault input plugin (influxdata#10198) feat: support aws managed service for prometheus (influxdata#10202) fix: Make telegraf compile on Windows with golang 1.16.2 (influxdata#10246) Update changelog feat: Modbus add per-request tags (influxdata#10231) fix: Implement NaN and inf handling for elasticsearch output (influxdata#10196) feat: add nomad input plugin (influxdata#10106) fix: Print loaded plugins and deprecations for once and test (influxdata#10205) fix: eliminate MIB dependency for ifname processor (influxdata#10214) feat: Optimize locking for SNMP MIBs loading. (influxdata#10206) feat: Add SMART plugin concurrency configuration option, nvme-cli v1.14+ support and lint fixes. (influxdata#10150) feat: update configs (influxdata#10236) fix(inputs/kube_inventory): set TLS server name config properly (influxdata#9975) fix: Sudden close of Telegraf caused by OPC UA input plugin (influxdata#10230) fix: bump github.com/eclipse/paho.mqtt.golang from 1.3.0 to 1.3.5 (influxdata#9913) fix: json_v2 parser timestamp setting (influxdata#10221) fix: ensure graylog spec fields not prefixed with '_' (influxdata#10209) docs: remove duplicate links in CONTRIBUTING.md (influxdata#10218) fix: pool detection and metrics gathering for ZFS >= 2.1.x (influxdata#10099) fix: parallelism fix for ifname processor (influxdata#10007) chore: Forbids "log" package only for aggregators, inputs, outputs, parsers and processors (influxdata#10191) docs: address documentation gap when running telegraf in k8s (influxdata#10215) feat: update etc/telegraf.conf and etc/telegraf_windows.conf (influxdata#10211) fix: mqtt topic extracting no longer requires all three fields (influxdata#10208) fix: windows service - graceful shutdown of telegraf (influxdata#9616) feat: update etc/telegraf.conf and etc/telegraf_windows.conf (influxdata#10201) feat: Modbus support multiple slaves (gateway feature) (influxdata#9279) fix: Revert unintented corruption of the Makefile from influxdata#10200. (influxdata#10203) chore: remove triggering update-config bot in CI (influxdata#10195) Update changelog feat: Implement deprecation infrastructure (influxdata#10200) fix: extra lock on init for safety (influxdata#10199) fix: resolve influxdata#10027 (influxdata#10112) fix: register bigquery to output plugins influxdata#10177 (influxdata#10178) fix: sysstat use unique temp file vs hard-coded (influxdata#10165) refactor: snmp to use gosmi (influxdata#9518) ...
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Required for all PRs:
Resolves #7876, however the issue affects telegraf agent behaviour across all plugins when telegraf runs as service, not just the execd plugin.
The current telegraf service implementation does not appear to wait for the main telegraf agent loop to gracefully finish before the main thread exits. This causes an abrupt exit and plugins do not have a chance to shut down / clean up properly (ie. outputs are not flushed, inputs with external dependencies are not closed properly).
I'm submitting this PR initially as a review - I am not sure this is the right solution to cover all scenarios. However from testing, it does work as intended under ideal scenarios - the main goroutine thread is blocked until the agent thread runs to completion.
The code changes are fairly simple. When the service stop control signal is triggered in Windows, the service interface
Stopis called (see https://github.com/kardianos/service/blob/v1.0.0/service_windows.go#L282). TheStopinterface sends a struct to thestopchannel so that thereloadLoopruns the context cancel function to commence agent shutdown. While shutdown is in progress, theStopinterface is blocked waiting for thestopchannel to be closed. OncereloadLoopcompletes, we close thestopchannel to allow the main thread to complete.My hesitancy with this PR being a full fix is that I do not fully understand the windows service specific concurrency implementation, so I'm unsure if this PR opens up telegraf to other potential race condition or deadlocks.
@ssoroka