monitoring: add observability.silenceAlerts #12087
Codecov Report
@@ Coverage Diff @@
## master #12087 +/- ##
==========================================
- Coverage 50.67% 50.66% -0.01%
==========================================
Files 1420 1420
Lines 80371 80360 -11
Branches 6831 6668 -163
==========================================
- Hits 40727 40718 -9
+ Misses 36064 36061 -3
- Partials 3580 3581 +1
e9e29c6 to 1d5410d
…ribution/alert-silencing
This sounds good to me. Bulk silencing seems like a really bad practice in general, so I'd rather not have this and find out we need it later than enable someone to shoot themselves in the foot, so to speak :)
Seems reasonable to me. I was just thinking that it is quite cool that, with this approach, we can un-silence alerts (or notify admins to un-silence them) in the future if we e.g. see they are flaky on many instances and fix them in an upgrade. That's a cool property of having this in site config :)
This is indeed unfortunate. I really like having the human-readable names, though. What about having admins first go to the solutions doc (as they would to consider whether it might be a real issue for them), and having one of the pre-generated solutions there be "Silence the alert in your site config"? This would also mean you could have the site config be this format instead:

{
  "observability.silenceAlerts": [
    "warning_gitserver_disk_space_remaining",
    "critical_gitserver_command_duration_test"
  ]
}

WDYT?
		return
	}

func changeSilences(ctx context.Context, log log15.Logger, change ChangeContext, newConfig *subscribedSiteConfig) (result ChangeResult) {
Add a docstring about when this is called and what it handles at a high level.
Sounds good!
Co-authored-by: Stephen Gutekanst <stephen.gutekanst@gmail.com>
Looks good!

{
  "observability.silenceAlerts": [
    "warning_gitserver_disk_space_remaining",
    "critical_gitserver_command_duration_test"
  ]
}

in the site config will be the typical way of enabling this.
@daxmc99 yep, should be, unless anyone has any other thoughts on that! Working on an update to this PR to enable that, but it shouldn't be a significant change implementation-wise.
…ribution/alert-silencing
I ran it through creating/deleting silences locally with the new format. @slimsag let me know if there are any follow-ups you want me to make!
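For readers following the create/delete flow mentioned above, here is a minimal sketch of how the set of silences to create and delete could be derived from a config change. It is not the code in this PR: `diffSilences` is a hypothetical helper, and it assumes each silence can be identified by its full alert name.

```go
package main

import "sort"

// diffSilences is a hypothetical helper (not part of this PR). Given the
// alert names listed in site config (desired) and the alert names that
// currently have a silence (existing), it returns which silences would need
// to be created and which would need to be deleted.
func diffSilences(desired, existing []string) (toCreate, toDelete []string) {
	want := make(map[string]struct{}, len(desired))
	for _, name := range desired {
		want[name] = struct{}{}
	}
	have := make(map[string]struct{}, len(existing))
	for _, name := range existing {
		have[name] = struct{}{}
	}

	// Anything wanted but not present must be created.
	for name := range want {
		if _, ok := have[name]; !ok {
			toCreate = append(toCreate, name)
		}
	}
	// Anything present but no longer wanted must be deleted.
	for name := range have {
		if _, ok := want[name]; !ok {
			toDelete = append(toDelete, name)
		}
	}

	// Sort for deterministic output, since map iteration order is random.
	sort.Strings(toCreate)
	sort.Strings(toDelete)
	return toCreate, toDelete
}
```

Entries present on both sides are left alone, so re-applying an unchanged config would be a no-op under this scheme.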

Adds observability.silenceAlerts, which deploys silences to sourcegraph/prometheus's built-in Alertmanager.

Updated:

{
  "observability.silenceAlerts": [
    "warning_gitserver_disk_space_remaining",
    "critical_gitserver_command_duration_test"
  ]
}

Closes #11210
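As an illustration of the format above, the following sketch validates entries before they are turned into silences. It assumes that every full alert name starts with a level prefix of `warning_` or `critical_`, which is inferred only from the two examples here; `splitAlertName` and `validateSilenceAlerts` are hypothetical names, not functions from this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// alertLevels lists the level prefixes seen in the examples above.
// Assumption: these are the only two levels.
var alertLevels = []string{"warning", "critical"}

// splitAlertName (hypothetical) splits a full alert name such as
// "warning_gitserver_disk_space_remaining" into its level prefix and the
// remainder. It does not try to separate service from name, because that
// boundary cannot be recovered from the string alone.
func splitAlertName(full string) (level, rest string, err error) {
	for _, l := range alertLevels {
		prefix := l + "_"
		if strings.HasPrefix(full, prefix) && len(full) > len(prefix) {
			return l, strings.TrimPrefix(full, prefix), nil
		}
	}
	return "", "", fmt.Errorf("alert name %q must start with one of %v followed by '_'", full, alertLevels)
}

// validateSilenceAlerts (hypothetical) returns one error per invalid entry.
func validateSilenceAlerts(entries []string) []error {
	var errs []error
	for _, e := range entries {
		if _, _, err := splitAlertName(e); err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}
```

A check like this would reject a bare service name such as `gitserver`, which is in line with the decision to disallow broad silences.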
Considerations
Specificity: I'm opting to disallow broad silences (i.e. all of name, level, and service are required, and a user cannot mute all "service": "gitserver" alerts). Regex is possible, but not allowed at the moment. Open to thoughts on this, though.
Implementation detail: Alertmanager requires a start and end date for each silence. Right now, I've set the end date to 10 years after the start... should be fine, right?
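To make the 10-year window concrete, here is a rough sketch of building a silence payload. The `silence` and `matcher` types are local stand-ins that mirror the JSON field names of an Alertmanager v2 silence; matching on the `alertname` label, the `CreatedBy` value, and the `newSilence` helper are all assumptions for illustration, not necessarily what this PR does.

```go
package main

import "time"

// matcher mirrors the JSON shape of an Alertmanager silence matcher.
type matcher struct {
	Name    string `json:"name"`
	Value   string `json:"value"`
	IsRegex bool   `json:"isRegex"`
}

// silence mirrors the JSON shape of an Alertmanager v2 silence.
// time.Time marshals to RFC 3339, which is the format the API uses.
type silence struct {
	Matchers  []matcher `json:"matchers"`
	StartsAt  time.Time `json:"startsAt"`
	EndsAt    time.Time `json:"endsAt"`
	CreatedBy string    `json:"createdBy"`
	Comment   string    `json:"comment"`
}

// silenceYears is the window length discussed above.
const silenceYears = 10

// newSilence (hypothetical) builds a silence for exactly one alert name,
// starting at now and ending silenceYears later. The matcher is an exact
// match (IsRegex is false), in keeping with disallowing broad silences.
func newSilence(alertName string, now time.Time) silence {
	return silence{
		Matchers:  []matcher{{Name: "alertname", Value: alertName, IsRegex: false}},
		StartsAt:  now,
		EndsAt:    now.AddDate(silenceYears, 0, 0),
		CreatedBy: "site-config", // assumed value
		Comment:   "silenced via observability.silenceAlerts",
	}
}
```

Using `AddDate` rather than a fixed `time.Duration` keeps the end date on the same calendar day regardless of leap years.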
Via a normal user flow of:
there does not seem to be a clear way to get the name of an alert at the moment. I'm not sure if this is within the scope of this PR to address, or if there's a good way to go about it (maybe replace the human-readable description in the main table with the alert name?) => resolved via generating full alert names in docs