Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

prometheus: alert dashboard links with fixed timestamps#17034

Merged
bobheadxi merged 6 commits into
mainfrom
prom/link-to-fixed-time
Jan 12, 2021
Merged

prometheus: alert dashboard links with fixed timestamps#17034
bobheadxi merged 6 commits into
mainfrom
prom/link-to-fixed-time

Conversation

@bobheadxi

@bobheadxi bobheadxi commented Jan 5, 2021

Copy link
Copy Markdown
Member

While working on https://github.com/sourcegraph/sourcegraph/pull/17014 I added a relative timestamp to the dashboard link in alerts, did a bit of fenangling to make the link completely fixed to timestamps associated with the delivered alert.

We can depend on (index .Alerts 0) because our grouping strategy ensures each group delivered only has one alert.

I'm still a bit unsure about this now that I've got it working, the experience is a bit less than ideal for alerts that e.g. are spikes lasting a second, since then we get a link to a panel that has a tiny window. An alternative is time and time.window, but that might not be great for alerts lasting longer. It is not possible to do arithmetic on these timestamps in alertmanager templates: prometheus/alertmanager#1188

=> update: see https://github.com/sourcegraph/sourcegraph/pull/17034#issuecomment-756598154

@bobheadxi bobheadxi requested review from a team and pecigonzalo January 5, 2021 08:34
@bobheadxi bobheadxi changed the title monitoring: make alertmanager delivery error alert more sensitive prometheus: alert dashboard links with fixed timestamps Jan 5, 2021
@sourcegraph-bot

sourcegraph-bot commented Jan 5, 2021

Copy link
Copy Markdown
Contributor

Notifying subscribers in CODENOTIFY files for diff 1851376...ffa11b5.

No notifications.

@codecov

codecov Bot commented Jan 5, 2021

Copy link
Copy Markdown

Codecov Report

Merging #17034 (ffa11b5) into main (1851376) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main   #17034      +/-   ##
==========================================
- Coverage   51.98%   51.98%   -0.01%     
==========================================
  Files        1703     1703              
  Lines       84786    84788       +2     
  Branches     7524     7666     +142     
==========================================
- Hits        44079    44077       -2     
- Misses      36806    36808       +2     
- Partials     3901     3903       +2     
Flag Coverage Δ
go 51.02% <100.00%> (-0.01%) ⬇️
integration 30.54% <ø> (ø)
storybook 30.03% <ø> (ø)
typescript 54.30% <ø> (ø)
unit 34.80% <ø> (ø)
Impacted Files Coverage Δ
...er-images/prometheus/cmd/prom-wrapper/receivers.go 66.84% <100.00%> (+0.35%) ⬆️
.../internal/codeintel/resolvers/graphql/locations.go 79.38% <0.00%> (-4.13%) ⬇️

@pecigonzalo

Copy link
Copy Markdown
Contributor

@bobheadxi You can use something like &time=1609931477000&time.window=3600000 instead, check https://grafana.com/docs/grafana/latest/dashboards/time-range-controls/#control-the-time-range-using-a-url

@bobheadxi

bobheadxi commented Jan 6, 2021

Copy link
Copy Markdown
Member Author

I considered that (see PR description):

An alternative is time and time.window, but that might not be great for alerts lasting longer.

It's not a great experience if the alert spans, say, 24 hours of problems (the link would only show a few minutes or whatever value we set there)

@pecigonzalo

Copy link
Copy Markdown
Contributor

I see, I did not notice that in the body my bad. I think defaulting to 1h or inferring time.window from something like the alert period * 1.5 should be ok.

Eg if CPU utilization alerts when its > 50 for 5m we link to a dashboard of time of alert and time.window of 7.5m

@bobheadxi

Copy link
Copy Markdown
Member Author

inferring time.window from something like the alert period * 1.5 should be ok.

Unfortunately you cannot do arithmetic in alertmanager templates :/ However, I think I found a nice middle-ground strategy with e3f166f:

  • If start and end available, link to a fixed timeframe on the start and end
  • If end is not available (alert still active), link to start and window of 1 hour

@pecigonzalo

Copy link
Copy Markdown
Contributor

Unfortunately you cannot do arithmetic in alertmanager templates

But couldnt we set that in the generated alert file when its generated by the generator?

@bobheadxi

Copy link
Copy Markdown
Member Author

But couldnt we set that in the generated alert file when its generated by the generator?

We can't infer period * 1.5 for example (assuming by period you mean the time the alert is active - or you mean the for parameter? these are often just 0 or a small number)

@pecigonzalo

Copy link
Copy Markdown
Contributor

We can't infer period * 1.5 for example (assuming by period you mean the time the alert is active - or you mean the for parameter? these are often just 0 or a small number)

I meant the for parameter. To be honest, I think just using 1h for the link is also fine. /shrug

@bobheadxi

Copy link
Copy Markdown
Member Author

Merging for now, can see how it works in practice :) this will be nice for looking at past alerts

@bobheadxi bobheadxi merged commit fa28c07 into main Jan 12, 2021
@bobheadxi bobheadxi deleted the prom/link-to-fixed-time branch January 12, 2021 10:20
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants