Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

prometheus: bundle Alertmanager and siteConfig sync#11832

Merged
bobheadxi merged 58 commits into
masterfrom
distribution/prom-alerting
Jul 8, 2020
Merged

prometheus: bundle Alertmanager and siteConfig sync#11832
bobheadxi merged 58 commits into
masterfrom
distribution/prom-alerting

Conversation

@bobheadxi

@bobheadxi bobheadxi commented Jul 1, 2020

Copy link
Copy Markdown
Member

Context

For onlookers, some recent product-level context may help: this change is aiding in getting us to a point where we can push all site admins to turn on monitoring for their Sourcegraph instance by having them configure alerting through the Sourcegraph site configuration and having our monitoring stack automatically respond to that. This comment goes into a little more detail.

Changes

This PR completely removes the grafana-wrapper and functionality for deploying observability.alerts as Grafana alerts (effectively reverting it to vanilla Grafana), and replaces it with a new prom-wrapper that bundles Prometheus + Alertmanager in sourcegraph/prometheus alongside the same observability.alerts features added in #10641

The rationale for this change is covered in the comments of https://github.com/sourcegraph/sourcegraph/issues/11452, but in a nutshell:

Other advantages:

  • Alertmanager configuration just involves writing to a file and hitting a reload endpoint, which simplifies the implementation (monitoring: grafana email notifier support #11554 required adding some Grafana process management to reload some configuration)
  • The frontend is already configured to talk to the Prometheus container for information about alerts as part of https://github.com/sourcegraph/sourcegraph/pull/10704, which makes responsibilities a bit clearer than the Grafana approach where it seemed like getting information about alerting would start to involve talking to different services
  • Alertmanager has a nice UI which I've reverse-proxied on prometheus:9090/alertmanager - might come in handy

Caveats:

  • This PR introduces some breaking changes in configuration to align it with Alertmanager's options - after discussion with @slimsag, we decided the number of admins taking advantage of the alerting introduced in 3.17 should be small enough that we should be fine just noting the breaking changes in the changelog. Going forward, I think it would be best if we don't align configuration directly with the options of our alerting provider like I did in monitoring: configure alert notifications from site config #11427, and just add minimal notifier fields on a kind of need-to-have basis
  • Logs of 3 services in one container are difficult to read, but I'm unsure how often we would look at Prometheus logs in practice. I considered leveraging a library that can add prefixes to lines written to an io.Writer, but only managed to find one simple but seemingly abandoned library, go-prefix-writer, so I've opted not to introduce the dependency for now
  • Somewhat duplicated definitions - to trigger Alertmanager, we need "real" Prometheus alerts, so now we have alert_count and Prometheus alerts

Best reviewed as a whole and ignoring all the Grafana changes

Issues

TODOs

  • manually test each individual notifier - the documentation indicates I'm setting the fields correctly and the Alertmanager UI indicates that changes are applying as expected, but I've yet to successfully receive a notification. I'm currently doing this by manually triggering alerts via:
     curl -H "Content-Type: application/json" -d '[{"labels":{"alertname":"robertrobert","level":"critical"}}]' localhost:9090/alerts/api/v1/alerts
    
    update: the config structs provided by alertmanager never marshal any secrets, so they are not being written to disk correctly, which is very unfortunate (Feature request: override secret censoring during marshaling to YAML prometheus/alertmanager#1985) - trying to figure out a way around this short of copy-pasting the definitions opened config: add MarshalSecrets flag prometheus/alertmanager#2316
  • find a way to trigger a Sourcegraph alert to make sure the whole thing works (because the above is manual) => set low thresholds temporarily for some metrics. See samples in #alerts
  • changelog item communicating breaking changes

Non-goals

  • Linking alerts back to Grafana panels for the corresponding metric - this requires some work on our generator, and might not be very straightforward (seems only incremental IDs are available, and not UIDs, for panels)
  • Amazing templates for each notification type - the most lacking one right now is the email template, which is just a plain-text thing at the moment, but I think those are best addressed in follow-ups. The goal for this iteration is to have them be consistent (link back to solutions docs, indicate basic metadata)

bobheadxi added 14 commits June 29, 2020 09:58
commit bf833fe
Merge: f32b042 f142c9c
Author: Robert Lin <robert@bobheadxi.dev>
Date:   Tue Jun 23 10:59:00 2020 +0800

    Merge branch 'master' of github.com:sourcegraph/sourcegraph into distribution/grafana-smtp

commit f32b042
Merge: 2327cd3 34bd181
Author: Robert Lin <robert@bobheadxi.dev>
Date:   Tue Jun 23 09:53:35 2020 +0800

    Merge branch 'master' of github.com:sourcegraph/sourcegraph into distribution/grafana-smtp

commit 2327cd3
Author: Robert Lin <robert@bobheadxi.dev>
Date:   Tue Jun 23 09:53:27 2020 +0800

    remove todos

commit 3d48e18
Author: Robert Lin <robert@bobheadxi.dev>
Date:   Mon Jun 22 13:58:14 2020 +0800

    make problem format more consistent

commit c5ba7db
Author: Robert Lin <robert@bobheadxi.dev>
Date:   Mon Jun 22 13:54:49 2020 +0800

    report if grafana is down as part of config-subscriber problems

commit ef824fc
Author: Robert Lin <robert@bobheadxi.dev>
Date:   Mon Jun 22 10:32:58 2020 +0800

    add default for alerts

commit 180fd43
Author: Robert Lin <robert@bobheadxi.dev>
Date:   Mon Jun 22 10:32:50 2020 +0800

    disable external snapshots

commit 9626e47
Author: Robert Lin <robert@bobheadxi.dev>
Date:   Mon Jun 22 09:51:48 2020 +0800

    add docstring

commit c1c1545
Merge: ac5749e 716ceb7
Author: Robert Lin <robert@bobheadxi.dev>
Date:   Mon Jun 22 09:49:29 2020 +0800

    Merge branch 'master' of github.com:sourcegraph/sourcegraph into distribution/grafana-smtp

commit ac5749e
Merge: 9e819c4 c0c220b
Author: Robert Lin <robert@bobheadxi.dev>
Date:   Fri Jun 19 08:21:19 2020 +0800

    Merge branch 'master' of github.com:sourcegraph/sourcegraph into distribution/grafana-smtp

commit 9e819c4
Author: Robert Lin <robert@bobheadxi.dev>
Date:   Fri Jun 19 08:21:09 2020 +0800

    improve notifier change behaviour

commit 4c4a7d2
Author: Robert Lin <robert@bobheadxi.dev>
Date:   Thu Jun 18 11:47:06 2020 +0800

    fix return discard

commit 9a9be6b
Author: Robert Lin <robert@bobheadxi.dev>
Date:   Thu Jun 18 11:21:04 2020 +0800

    address linter warnings

commit 7242ff3
Merge: a2b07d5 ce911b0
Author: Robert Lin <robert@bobheadxi.dev>
Date:   Thu Jun 18 10:37:56 2020 +0800

    Merge branch 'master' of github.com:sourcegraph/sourcegraph into distribution/grafana-smtp

commit a2b07d5
Author: Robert Lin <robert@bobheadxi.dev>
Date:   Thu Jun 18 10:37:37 2020 +0800

    add tests for grafana smtp change

commit 7cc2fe0
Author: Robert Lin <robert@bobheadxi.dev>
Date:   Wed Jun 17 14:38:32 2020 +0800

    refactor to remove integration points

commit 2f60a4d
Author: Robert Lin <robert@bobheadxi.dev>
Date:   Wed Jun 17 13:23:01 2020 +0800

    improve docs

commit 2981a3d
Merge: 664ebe6 ed5a2ab
Author: Robert Lin <robert@bobheadxi.dev>
Date:   Wed Jun 17 13:21:06 2020 +0800

    Merge branch 'master' of github.com:sourcegraph/sourcegraph into distribution/grafana-smtp

commit 664ebe6
Author: Robert Lin <robert@bobheadxi.dev>
Date:   Wed Jun 17 13:20:59 2020 +0800

    implement smtp sync

commit 2c10270
Author: Robert Lin <robert@bobheadxi.dev>
Date:   Wed Jun 17 13:12:37 2020 +0800

    regenerate stringdata

commit 03da223
Author: Robert Lin <robert@bobheadxi.dev>
Date:   Wed Jun 17 13:03:42 2020 +0800

    dump grafana build output if build fails

commit 267bca0
Author: Robert Lin <robert@bobheadxi.dev>
Date:   Wed Jun 17 09:44:28 2020 +0800

    improve docs

commit 1d81518
Merge: 3ca4892 254e676
Author: Robert Lin <robert@bobheadxi.dev>
Date:   Wed Jun 17 09:23:34 2020 +0800

    Merge branch 'master' of github.com:sourcegraph/sourcegraph into distribution/grafana-smtp

commit 3ca4892
Author: Robert Lin <robert@bobheadxi.dev>
Date:   Tue Jun 16 15:31:24 2020 +0800

    scaffold improvements to change application

commit 7162a4c
Author: Robert Lin <robert@bobheadxi.dev>
Date:   Tue Jun 16 13:05:57 2020 +0800

    restore email schema support
The notice doesn't add a lot of information to anyone who might receive the alert. Moved the disclaimer to PossibleSolutions instead
@bobheadxi bobheadxi requested review from a team, davejrt, emidoots and uwedeportivo July 1, 2020 01:24
@emidoots

emidoots commented Jul 1, 2020

Copy link
Copy Markdown
Member

This is excellent, thanks for the write-up! I will take a look at the implementation and try this out tomorrow.

@bobheadxi

Copy link
Copy Markdown
Member Author

[...] but I've yet to successfully receive a notification. I'm going this by [???]

It looks like your message got cut off here

I've fixed the point now :)

@emidoots

emidoots commented Jul 1, 2020

Copy link
Copy Markdown
Member

find a way to trigger a Sourcegraph alert to make sure the whole thing works (because the above is manual)

It would be nice to have this happen automatically after saving changes to the alerting configuration - or even just a GraphQL query one can run in e.g. sourcegraph.com/api/console (but neither of this is needed for this PR and I don't want to increase its scope)

One easy way to do it would be to declare a a Prometheus metric like test_alert and define alerts on that through our regular monitoring stack, then simply increment that metric whenever you want it to fire

@uwedeportivo

Copy link
Copy Markdown
Contributor

this is very cool. i'll take a closer look tomorrow.

@unknwon

unknwon commented Jul 1, 2020

Copy link
Copy Markdown
Contributor

Do you think cloud teams should review this PR? (come here via code owner)

@emidoots emidoots removed the request for review from a team July 1, 2020 07:46
@emidoots

emidoots commented Jul 1, 2020

Copy link
Copy Markdown
Member

@unknwon I would say not necessary as Distribution owns this - but feel free to review if you like as usual of course. Will make sure CODEOWNERS file reflects this.

@tsenart

tsenart commented Jul 1, 2020

Copy link
Copy Markdown
Contributor

Can we dog-food this for Sourcegraph Cloud?

@bobheadxi

Copy link
Copy Markdown
Member Author

Can we dog-food this for Sourcegraph Cloud?

Yep! Once I finish this up, I'll update https://github.com/sourcegraph/deploy-sourcegraph-dot-com/pull/2839 for Cloud (Cloud == dot-com right?)

Comment thread doc/admin/observability/alerting.md Outdated
bobheadxi and others added 2 commits July 7, 2020 08:24
Co-authored-by: Stephen Gutekanst <stephen.gutekanst@gmail.com>
Co-authored-by: Keegan Carruthers-Smith <keegan.csmith@gmail.com>
@bobheadxi bobheadxi force-pushed the distribution/prom-alerting branch from 47c5a62 to 527eed6 Compare July 7, 2020 00:28
Comment thread dev/prometheus.sh
Co-authored-by: davejrt <2067825+davejrt@users.noreply.github.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

7 participants