[Heartbeat] Log error on dupe monitor ID instead of strict req#29041
[Heartbeat] Log error on dupe monitor ID instead of strict req#29041andrewvc merged 14 commits intoelastic:masterfrom
Conversation
Currently Heartbeat will not start > 1 monitor with the same ID. This can create unintuitive scenarios in situations where a user changes a monitor by adding a new entry via some reload mechanism (say k8s autodiscover) before the old one is deleted. This PR changes the behavior to only log errors in this situation. Only one monitor will be active, but it will be the newest monitor.
|
Pinging @elastic/uptime (Team:Uptime) |
💚 Build Succeeded
Expand to view the summary
Build stats
Test stats 🧪
💚 Flaky test reportTests succeeded. 🤖 GitHub commentsTo re-run your PR in the CI, just comment with:
|
justinkambic
left a comment
There was a problem hiding this comment.
One thing I'm noticing in my smoke testing of this patch is that, while it does let me run a config with multiple monitors sharing an ID, it doesn't seem to respect the usual CTRL+C I'd
use to kill the process.
When attempting to run the same multi-shared-ID config on master I got the error message/no-op that is typical of HB up until now.
In the case of different monitors with the same ID it did seem to log them, per the included screenshot:
vigneshshanmugam
left a comment
There was a problem hiding this comment.
Code changes LGTM. Added small comments.
heartbeat/monitors/dedup.go
Outdated
|
|
||
| closed := um.stopUnsafe(m) | ||
| if closed { | ||
| logp.Warn("monitor ID %s is configured for multiple monitors! IDs should be unique values, last seen config will win", m.stdFields.ID) |
There was a problem hiding this comment.
should we log like stopping previous monitor with same monitor id?
There was a problem hiding this comment.
I don't know if that adds anything, because it's hard to tell them apart. Should I change the log message maybe?
heartbeat/monitors/monitor_test.go
Outdated
| m2.stdFields.Name = "MON2!!!" | ||
| // This used to trigger an error, but shouldn't any longer, we just log | ||
| // the error, and ensure the last monitor wins | ||
| require.NoError(t, m2Err) |
There was a problem hiding this comment.
can we also check if previous monitor is stopped?
There was a problem hiding this comment.
We do check for this at the end of the test by checking the number of stops invoked
jsoriano
left a comment
There was a problem hiding this comment.
Added some comments, mainly about avoiding the global objects if possible.
heartbeat/monitors/monitor.go
Outdated
| func (e ErrDuplicateMonitorID) Error() string { | ||
| return fmt.Sprintf("monitor ID %s is configured for multiple monitors! IDs must be unique values.", e.ID) | ||
| } | ||
| var globalDedup = newDedup() |
There was a problem hiding this comment.
Could this be part of the RunnerFactory to avoid needing a global object?
heartbeat/monitors/monitor.go
Outdated
|
|
||
| // De-duplicate monitors with identical IDs | ||
| // last write wins | ||
| globalDedup.register(m) |
There was a problem hiding this comment.
newMonitor is used by CheckConfig. Checking configs shouldn't stop monitors existing monitors.
There was a problem hiding this comment.
Fixed this, it is now the case that newMonitor is side effect free
|
@justinkambic there was a deadlock issue that I've now fixed, that was blocking clean shutdowns, ctrl-c should now work, that's important! |
| var runners []cfgfile.Runner | ||
| for _, cfg := range bt.config.Monitors { | ||
| created, err := factory.Create(b.Publisher, cfg) | ||
| created, err := bt.dynamicFactory.Create(b.Publisher, cfg) |
There was a problem hiding this comment.
Cleans everything up so that we only use a single factory instance in heartbeat
| internalsMtx: sync.Mutex{}, | ||
| config: config, | ||
| stats: pluginFactory.Stats, | ||
| state: MON_INIT, |
There was a problem hiding this comment.
Prevents weird situations with dual stops, makes stop() idempotent.
|
OK, this should be ready for review again, it's largely rewritten, now using the factory for all state, and generally cleaning up how all the lifecycle stuff is handled. Thanks for the great input @jsoriano @justinkambic @vigneshshanmugam Races and deadlocks should all be gone. |
jsoriano
left a comment
There was a problem hiding this comment.
LGTM if it looks good to uptime 🙂
|
|
||
| // Stop stops the Monitor's execution in its configured scheduler. | ||
| // This is safe to call even if the Monitor was never started. | ||
| // Stop stops the monitor without freeing it in global dedup |
There was a problem hiding this comment.
| // Stop stops the monitor without freeing it in global dedup | |
| // Stop the monitor without freeing it in global dedup |
I know the function name is a proper noun here but it feels kinda redundant.
(cherry picked from commit dbca099)
…ws-on-file-changes * upstream/master: override host on statsd metricset (elastic#29103) Skip config check in autodiscover for duplicated configurations (elastic#29048) Change "filebeat.config.modules.enabled" to "true" (elastic#28769) Remove deprecated spool queue from Beats (elastic#28869) Add `beat` field back to beat.stats (elastic#29094) Revert "Move labels and annotations under kubernetes.namespace. (elastic#27917)" (elastic#29069) heartbeat: remove w2008 in the CI (elastic#29093) Remove deprecated `--template` and `--index-policy` flags (elastic#28870) Fix parsing of apache trace log levels (elastic#28717) [Elastic-Agent] IUse itnernal port for local fleet server (elastic#28993) [Heartbeat] Log error on dupe monitor ID instead of strict req (elastic#29041) Enable pprof for elastic-agent and beats (elastic#28983)





Currently Heartbeat will not start > 1 monitor with the same ID. This can create unintuitive scenarios in situations where a user
changes a monitor by adding a new entry via some reload mechanism (say k8s autodiscover) before the old one is deleted.
This PR changes the behavior to only log errors in this situation. Only one monitor will be active, but it will be the newest monitor.
There may be other fixes we should do in other areas, but this is probably a better failure method than what we currently have.
Checklist
CHANGELOG.next.asciidocorCHANGELOG-developer.next.asciidoc.