This repository was archived by the owner on Sep 30, 2024. It is now read-only.

Approved: Proposal: RFC-189: Support per-team alerts and on-call rotations #12010

@bobheadxi

Context

RFC 189 suggests that we might want engineering teams to own their own alerts. The RFC does not yet thoroughly detail this, but this would likely entail:

  • Some way to automatically send alerts to appropriate teams to handle
  • Some way to denote ownership of alerts (defined in package /monitoring)

Proposal

Simple alert routing

https://github.com/sourcegraph/sourcegraph/pull/11832 adds site-config-based notification definitions via Prometheus Alertmanager. We can extend observability.alerts to accommodate matching on a small set of labels, for example:

{
  "level": "critical",
  "notifier": {
    "type": "opsgenie",
    "apiKey": "xxx",
    "responders": [ ... ]
  },
+ "onLabels": {
+   "service": [ "git_server", "frontend" ]
+ }
}

This alone might be a sufficient (if tedious) way to help teams own their own alerts, but it might also leave some alerts unowned. We could also restrict the implied breadth of the onLabels field above and instead make specific routing fields top-level options (example in the next section).
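
To make the matching semantics concrete, here is a minimal sketch of how an onLabels filter could be evaluated against an alert's labels. The function name matchesOnLabels and the rule that an empty filter matches everything are assumptions for illustration, not behavior defined by the proposal or by Alertmanager.

```go
package main

import "fmt"

// matchesOnLabels reports whether an alert's labels satisfy a hypothetical
// onLabels filter: every key in the filter must be present on the alert with
// one of the allowed values. An empty filter matches every alert (assumption).
func matchesOnLabels(alertLabels map[string]string, onLabels map[string][]string) bool {
	for key, allowed := range onLabels {
		value, ok := alertLabels[key]
		if !ok {
			return false
		}
		found := false
		for _, candidate := range allowed {
			if candidate == value {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

func main() {
	onLabels := map[string][]string{"service": {"git_server", "frontend"}}

	frontendAlert := map[string]string{"service": "frontend", "level": "critical"}
	otherAlert := map[string]string{"service": "searcher", "level": "critical"}

	fmt.Println(matchesOnLabels(frontendAlert, onLabels)) // true
	fmt.Println(matchesOnLabels(otherAlert, onLabels))    // false
}
```

Under these semantics, an alert whose service is not listed by any notifier is never routed, which is one way alerts could end up unowned.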

Denoting ownership

Ideally, whatever we do to denote ownership should not be Cloud-specific, i.e. it would be unpleasant to have to generate different alerting for Cloud. An additional required field could be added to our monitoring Observables to give each panel a "product area" corresponding to the teams defined in sourcegraph/about#1150, for example:

{
	Name:            "disk_space_remaining",
	Description:     "disk space remaining by instance",
+	Owner:           OwnerSearch, // "search"
	Query:           `(src_gitserver_disk_space_available / src_gitserver_disk_space_total) * 100`,
	DataMayNotExist: true,
	Critical:        Alert{LessOrEqual: 5},
	PanelOptions:    PanelOptions().LegendFormat("{{instance}}").Unit(Percentage),
	PossibleSolutions: `
		- **Provision more disk space:** Sourcegraph will begin deleting least-used repository clones at 10% disk space remaining which may result in decreased performance, users having to wait for repositories to clone, etc.
	`,
},

This would:

  • give us useful information on who to contact about an alert when customers send us alerts (e.g. via the bug report page)
    • "a lot of alerts in group 'search' are firing, ask someone in the search team"
  • make notifiers easier to maintain, since we could just use site config and the functionality already developed to deploy and silence these notifications
    • enable us to better dogfood alerts (we'll use the same templates, alert timings, etc. as customers)

Alternative naming for this field: Team, ProductArea, Group
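
One way the required field could be enforced is with a small string type and a validation step, sketched below. The Observable struct here is a trimmed-down stand-in for the real monitoring type, and OwnerDistribution and the validate method are hypothetical names added only for illustration; only OwnerSearch appears in the example above.

```go
package main

import "fmt"

// Owner names the product area responsible for an Observable.
type Owner string

const (
	OwnerSearch       Owner = "search"       // from the example in this proposal
	OwnerDistribution Owner = "distribution" // hypothetical, for illustration only
)

// Observable is a simplified stand-in for the real monitoring definition.
type Observable struct {
	Name  string
	Owner Owner
}

// validate rejects observables with a missing or unknown owner, so that
// every generated alert carries an ownership label.
func (o Observable) validate() error {
	switch o.Owner {
	case OwnerSearch, OwnerDistribution:
		return nil
	case "":
		return fmt.Errorf("observable %q: Owner is required", o.Name)
	default:
		return fmt.Errorf("observable %q: unknown Owner %q", o.Name, o.Owner)
	}
}

func main() {
	owned := Observable{Name: "disk_space_remaining", Owner: OwnerSearch}
	unowned := Observable{Name: "disk_space_remaining"}

	fmt.Println(owned.validate())   // <nil>
	fmt.Println(unowned.validate()) // observable "disk_space_remaining": Owner is required
}
```

Failing validation when the dashboards and alert rules are generated would keep unowned alerts from shipping, at the cost of requiring every existing Observable to be assigned an owner up front.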

A notifier configuration using the above ownership label might then look like:

{
  "level": "critical",
  "notifier": {
    "type": "opsgenie",
    "apiKey": "xxx",
    "responders": [ ... ]
  },
+ "owners": [ "search" ]
}
