Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

monitoring: add observability.silenceAlerts#12087

Merged
bobheadxi merged 22 commits into
masterfrom
distribution/alert-silencing
Jul 14, 2020
Merged

monitoring: add observability.silenceAlerts#12087
bobheadxi merged 22 commits into
masterfrom
distribution/alert-silencing

Conversation

@bobheadxi

@bobheadxi bobheadxi commented Jul 10, 2020

Copy link
Copy Markdown
Member

Adds observability.silenceAlerts, which deploys silences to sourcegraph/prometheus's built-in Alertmanager.

updated

{
  "observability.silenceAlerts": [
    "warning_gitserver_disk_space_remaining",
    "critical_gitserver_command_duration_test"
  ]
}

Closes #11210

Considerations

  • Specificity: I'm opting to disallow broad silences (ie all of name, level, service are required and a user cannot mute all "service": "gitserver" alerts). Regex is possible, but not allowed at the moment. Open to thoughts on this though

  • Implementation detail: Alertmanager requires a start and end date for each alert. Right now, I've set this to 10 years... should be fine right?

  • Via a normal user flow of:

    1. see site banner for critical alert
    2. go to grafana

    there does not seem to be a clear way to get the name of an alert at the moment - not sure if this is within the scope of this PR to address, and if there's a good way to go about it (maybe replace the human-readable description in the main table with alert name?) => resolved via generating full alert names in docs

@bobheadxi bobheadxi requested a review from emidoots as a code owner July 10, 2020 09:15
@bobheadxi bobheadxi requested review from a team July 10, 2020 09:15
@codecov

codecov Bot commented Jul 10, 2020

Copy link
Copy Markdown

Codecov Report

Merging #12087 into master will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master   #12087      +/-   ##
==========================================
- Coverage   50.67%   50.66%   -0.01%     
==========================================
  Files        1420     1420              
  Lines       80371    80360      -11     
  Branches     6831     6668     -163     
==========================================
- Hits        40727    40718       -9     
+ Misses      36064    36061       -3     
- Partials     3580     3581       +1     
Flag Coverage Δ
#go 52.07% <100.00%> (-0.01%) ⬇️
#integration 24.06% <ø> (-0.01%) ⬇️
#storybook 11.64% <ø> (ø)
#typescript 46.95% <ø> (-0.01%) ⬇️
#unit 47.38% <100.00%> (-0.01%) ⬇️
Impacted Files Coverage Δ
cmd/frontend/internal/app/debug.go 17.14% <100.00%> (-3.55%) ⬇️
.../internal/codeintel/resolvers/graphql/locations.go 85.45% <0.00%> (-1.82%) ⬇️
web/src/tree/TreeRoot.tsx 79.31% <0.00%> (-1.73%) ⬇️

@bobheadxi bobheadxi force-pushed the distribution/alert-silencing branch from e9e29c6 to 1d5410d Compare July 10, 2020 09:35
@bobheadxi bobheadxi mentioned this pull request Jul 11, 2020
56 tasks
@emidoots

emidoots commented Jul 14, 2020

Copy link
Copy Markdown
Member

Specificity: I'm opting to disallow broad silences (ie all of name, level, service are required and a user cannot mute all "service": "gitserver" alerts). Regex is possible, but not allowed at the moment. Open to thoughts on this though

This sounds good to me, bulk silencing seems like a really bad practice in general so I'd rather not have this and find out we need it later - than enable someone to shoot themselves in the foot so-to-speak :)

Implementation detail: Alertmanager requires a start and end date for each alert. Right now, I've set this to 10 years... should be fine right?

Seems reasonable to me, I was just thinking that it is quite cool that with this approach we can un-silence alerts or notify admins in the future to un-silence them if we e.g. see they are flaky on many instances and fix them in an upgrade. That's a cool property of having this in site config :)

there does not seem to be a clear way to get the name of an alert at the moment - not sure if this is within the scope of this PR to address, and if there's a good way to go about it (maybe replace the human-readable description in the main table with alert name?)

This is indeed unfortunate.. I really like having the human-readable names, though. What about having admins first go to the solutions doc (as they would to consider if it might be a real issue for them), and having one of the pre-generated solutions there being "Silence the alert in your site config: <name>"?

This would also mean you could have the site config be this format instead:

{
  "observability.silenceAlerts": [
    "warning_gitserver_disk_space_remaining",
    "critical_gitserver_command_duration_test"
  ]
}

WDYT?

Comment thread cmd/frontend/internal/app/debug_test.go Outdated
return
}

func changeSilences(ctx context.Context, log log15.Logger, change ChangeContext, newConfig *subscribedSiteConfig) (result ChangeResult) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add docstring about when this called and what it handles at a high-level.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread docker-images/prometheus/cmd/prom-wrapper/change.go Outdated
@bobheadxi

Copy link
Copy Markdown
Member Author

This is indeed unfortunate.. I really like having the human-readable names, though. What about having admins first go to the solutions doc (as they would to consider if it might be a real issue for them), and having one of the pre-generated solutions there being "Silence the alert in your site config: "?

This would also mean you could have the site config be this format instead

Sounds good!

bobheadxi and others added 3 commits July 14, 2020 08:51
Co-authored-by: Stephen Gutekanst <stephen.gutekanst@gmail.com>
@daxmc99

daxmc99 commented Jul 14, 2020

Copy link
Copy Markdown
Contributor

Looks good!
Just wanted to confirm that

{
  "observability.silenceAlerts": [
    "warning_gitserver_disk_space_remaining",
    "critical_gitserver_command_duration_test"
  ]
}

in the site config will be the typical way of enabling this.

@bobheadxi

Copy link
Copy Markdown
Member Author

@daxmc99 yep, should be unless anyone has any other thoughts on that! Working on an update to this PR to enable that, but it shouldn't be a significant change implementation-wise

@bobheadxi

bobheadxi commented Jul 14, 2020

Copy link
Copy Markdown
Member Author

2c3f7b9 , e5fa2a0 makes the change to specify silences by alertname and generates alertname in the alert solutions documentation for easier silencing :)

@bobheadxi bobheadxi requested a review from emidoots July 14, 2020 06:08
@bobheadxi

bobheadxi commented Jul 14, 2020

Copy link
Copy Markdown
Member Author

Ran it through creating/deleting silences locally with the new format

image

@slimsag let me know if there's any follow-ups you want me to make!

@bobheadxi bobheadxi merged commit 480af2d into master Jul 14, 2020
@bobheadxi bobheadxi deleted the distribution/alert-silencing branch July 14, 2020 07:59
@bobheadxi bobheadxi mentioned this pull request Jul 17, 2020
55 tasks
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

explore solutions for "impossible to silence monitoring alerts"

3 participants