feat: add default endpoint probe and TLS expiry alerts#2530

Merged
jasonwashburn merged 10 commits into main from
feat/core-72-add-alertmanager-uptime-and-tls-alerts
Mar 30, 2026
Conversation

@jasonwashburn jasonwashburn commented Mar 24, 2026

Description

Adds default UDS Core probe alert rules for endpoint downtime and TLS certificate expiry, with Helm-configurable thresholds, durations, and severities.

Included changes

  • Adds default probe alerts via the uds-prometheus-config chart:
    • UDSProbeEndpointDown
    • UDSProbeTLSExpiryWarning
    • UDSProbeTLSExpiryCritical
  • Includes Helm unittest and Vitest coverage for default probe alerts
  • Updates monitoring docs across how-to, reference, concepts, and overview pages to document the new defaults and configuration paths
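As a rough illustration of how configurable thresholds, durations, and severities could map onto the three alert rules, here is a minimal TypeScript sketch. It is not the chart's template: the `ProbeAlertSettings` shape, the example values (30 and 14 days, the `for` durations), and the exact expressions are assumptions made for illustration. Only the alert names come from this PR, and `probe_success` and `probe_ssl_earliest_cert_expiry` are standard blackbox exporter metrics.

```typescript
type Severity = "warning" | "critical";

// One Prometheus alerting rule, reduced to the fields discussed in this PR.
interface AlertRule {
  alert: string;
  expr: string;
  for: string;
  labels: { severity: Severity };
}

// Hypothetical settings shape; it does NOT mirror the chart's actual values keys.
interface ProbeAlertSettings {
  endpointDown: { for: string; severity: Severity };
  tlsWarning: { days: number; for: string; severity: Severity };
  tlsCritical: { days: number; for: string; severity: Severity };
}

const SECONDS_PER_DAY = 86400;

// Fires when the earliest certificate in the chain expires in fewer than `days` days.
function tlsExpiryExpr(days: number): string {
  return `probe_ssl_earliest_cert_expiry - time() < ${days * SECONDS_PER_DAY}`;
}

// Build the three rules named in this PR from the (assumed) settings.
function buildProbeAlertRules(s: ProbeAlertSettings): AlertRule[] {
  return [
    {
      alert: "UDSProbeEndpointDown",
      expr: "probe_success == 0",
      for: s.endpointDown.for,
      labels: { severity: s.endpointDown.severity },
    },
    {
      alert: "UDSProbeTLSExpiryWarning",
      expr: tlsExpiryExpr(s.tlsWarning.days),
      for: s.tlsWarning.for,
      labels: { severity: s.tlsWarning.severity },
    },
    {
      alert: "UDSProbeTLSExpiryCritical",
      expr: tlsExpiryExpr(s.tlsCritical.days),
      for: s.tlsCritical.for,
      labels: { severity: s.tlsCritical.severity },
    },
  ];
}

// Illustrative values only; the chart's real defaults are not shown in this PR page.
const exampleSettings: ProbeAlertSettings = {
  endpointDown: { for: "5m", severity: "critical" },
  tlsWarning: { days: 30, for: "10m", severity: "warning" },
  tlsCritical: { days: 14, for: "10m", severity: "critical" },
};

const exampleRules = buildProbeAlertRules(exampleSettings);
```

Keeping the day-to-seconds conversion in one helper means the warning and critical rules differ only in their threshold, which is presumably why both can share a single configuration surface.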

Related Issue

Fixes # CORE-72

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Other (security config, docs update, etc)

Steps to Validate

  1. Run `uds run test-single-layer --set LAYER=monitoring` to deploy and run e2e tests
  2. Use Grafana's web UI to verify presence and intended configuration of alerts

Checklist before merging

@jasonwashburn jasonwashburn self-assigned this Mar 24, 2026
@jasonwashburn jasonwashburn force-pushed the feat/core-72-add-alertmanager-uptime-and-tls-alerts branch from 69212c6 to cb075f2 Compare March 25, 2026 17:34
@jasonwashburn jasonwashburn marked this pull request as ready for review March 25, 2026 17:35
@jasonwashburn jasonwashburn requested a review from a team as a code owner March 25, 2026 17:35
Copilot AI review requested due to automatic review settings March 25, 2026 17:35
Copilot AI left a comment

Pull request overview

Adds opinionated, configurable default probe alerting to UDS Core’s monitoring stack (via the uds-prometheus-config chart), with accompanying Helm unit tests, Vitest E2E coverage, and documentation updates describing the new defaults and tuning paths.

Changes:

  • Add default PrometheusRule probe alerts for endpoint downtime and TLS certificate expiry, gated by Helm values.
  • Add Helm-unittest suites plus a new Vitest E2E test that validates the alert rules are loaded in Prometheus.
  • Update monitoring docs (concepts, reference, and how-to guides) to document the shipped rules and configuration knobs.
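The E2E check described above (validating that the alert rules are loaded in Prometheus) could look roughly like the following TypeScript sketch. It is not the contents of `default-probe-alerts.spec.ts`. It assumes the standard Prometheus `GET /api/v1/rules` response shape, and the helper names, attempt count, and delay are hypothetical.

```typescript
// Subset of the Prometheus GET /api/v1/rules response used by this sketch.
interface RulesResponse {
  status: string;
  data: { groups: { name: string; rules: { name: string; type: string }[] }[] };
}

// Alert names taken from this PR's description.
const EXPECTED_ALERTS = [
  "UDSProbeEndpointDown",
  "UDSProbeTLSExpiryWarning",
  "UDSProbeTLSExpiryCritical",
];

// Return the expected alert names that are not among the loaded alerting rules.
function missingAlertRules(response: RulesResponse, expected: string[]): string[] {
  const loaded = new Set<string>();
  for (const group of response.data.groups) {
    for (const rule of group.rules) {
      if (rule.type === "alerting") loaded.add(rule.name);
    }
  }
  return expected.filter((name) => !loaded.has(name));
}

// Poll until every expected rule is loaded, or throw after `attempts` tries.
// `fetchRules` is injected so the caller decides how Prometheus is reached.
async function waitForAlertRules(
  fetchRules: () => Promise<RulesResponse>,
  expected: string[],
  attempts = 10,
  delayMs = 3000,
): Promise<void> {
  let missing = expected;
  for (let i = 0; i < attempts; i++) {
    missing = missingAlertRules(await fetchRules(), expected);
    if (missing.length === 0) return;
    if (i < attempts - 1) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error(`Alert rules not loaded: ${missing.join(", ")}`);
}
```

In a test, `fetchRules` might wrap `fetch` against a port-forwarded Prometheus address (for example a hypothetical `http://localhost:9090/api/v1/rules`) and parse the JSON body. Polling rather than checking once allows for the delay between the PrometheusRule being applied and Prometheus reloading its rule files.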

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated no comments.

Show a summary per file
| File | Description |
| --- | --- |
| `test/vitest/default-probe-alerts.spec.ts` | New Vitest E2E test that polls Prometheus for the shipped probe alert rule names. |
| `src/prometheus-stack/tasks.yaml` | Runs npm ci once and executes Vitest suites for Prometheus, default probe alerts, and blackbox exporter. |
| `src/prometheus-stack/chart/values.yaml` | Adds udsCoreDefaultAlerts values surface (enablement, severities, durations, TLS day thresholds). |
| `src/prometheus-stack/chart/tests/probe_alerting_rules_test.yaml` | Helm-unittest coverage for default rendering, toggles, and override behavior. |
| `src/prometheus-stack/chart/tests/probe_alerting_rules_no_crd_test.yaml` | Ensures probe alerts do not render when the PrometheusRule CRD API version is unavailable. |
| `src/prometheus-stack/chart/templates/probe-alerting-rules.yaml` | New PrometheusRule template implementing UDSProbeEndpointDown + TLS expiry warning/critical alerts. |
| `docs/reference/configuration/monitoring-and-observability.md` | Documents shipped default probe alert rules, labels, and Helm configuration surface with examples. |
| `docs/how-to-guides/monitoring-and-observability/set-up-uptime-monitoring.mdx` | Notes the existence of default probe alerts and points readers to tuning guidance. |
| `docs/how-to-guides/monitoring-and-observability/overview.mdx` | Updates guide card text to reflect tuning of built-in probe defaults. |
| `docs/how-to-guides/monitoring-and-observability/create-metric-alerting-rules.mdx` | Adds guidance and examples for tuning/disabling UDS Core probe defaults alongside upstream defaults. |
| `docs/concepts/core-features/monitoring-observability.mdx` | Updates concepts to include default probe alert rules as part of built-in uptime monitoring. |

@mjnagel mjnagel left a comment

Few smaller comments - nothing major overall.

Comment thread src/prometheus-stack/chart/values.yaml Outdated
Comment thread src/prometheus-stack/tasks.yaml Outdated
Comment thread src/prometheus-stack/chart/values.yaml
Comment thread docs/reference/configuration/monitoring-and-observability.md
Comment thread test/vitest/default-probe-alerts.spec.ts Outdated
@briantwatson briantwatson left a comment

Nice addition, some comments below for consideration.

Comment thread docs/how-to-guides/monitoring-and-observability/create-metric-alerting-rules.mdx Outdated
Comment thread docs/how-to-guides/monitoring-and-observability/create-metric-alerting-rules.mdx Outdated
Comment thread docs/how-to-guides/monitoring-and-observability/create-metric-alerting-rules.mdx Outdated
Comment thread src/prometheus-stack/chart/values.yaml
@jasonwashburn jasonwashburn force-pushed the feat/core-72-add-alertmanager-uptime-and-tls-alerts branch from f0b2c4b to c474b34 Compare March 27, 2026 21:03
@jasonwashburn jasonwashburn merged commit 625527c into main Mar 30, 2026
47 of 53 checks passed
@jasonwashburn jasonwashburn deleted the feat/core-72-add-alertmanager-uptime-and-tls-alerts branch March 30, 2026 14:38
chance-coleman pushed a commit that referenced this pull request Apr 1, 2026
🤖 I have created a release *beep* *boop*
---


## [1.1.0](v1.0.0...v1.1.0) (2026-03-31)


### Features

* add default endpoint probe and TLS expiry alerts
([#2530](#2530))
([625527c](625527c))
* add support for image volumes in policy
([#2552](#2552))
([46b653e](46b653e))
* default uptime probe overrides
([#2520](#2520))
([0c80295](0c80295))


### Bug Fixes

* **docs:** llm friendly docs
([#2535](#2535))
([107f181](107f181))
* remove aggressive whitespace trimming in keycloak statefulset template
([#2539](#2539))
([231fa5c](231fa5c))


### Miscellaneous

* **ci:** cleanup old cve workflow
([#2550](#2550))
([f67afa8](f67afa8))
* **ci:** ensure concurrency on all workflows
([#2527](#2527))
([3ccf9ef](3ccf9ef))
* **deps-dev:** bump picomatch from 4.0.3 to 4.0.4 in /scripts/renovate
([#2538](#2538))
([ba0ed10](ba0ed10))
* **deps-dev:** bump picomatch from 4.0.3 to 4.0.4 in
/scripts/root-ca-retriever
([#2537](#2537))
([7bfaaa0](7bfaaa0))
* **deps-dev:** bump rollup from 4.57.1 to 4.60.1 in /docs/.c4
([#2551](#2551))
([abcb422](abcb422))
* **deps:** bump brace-expansion
([#2544](#2544))
([9cdc76e](9cdc76e))
* **deps:** bump flatted from 3.4.1 to 3.4.2
([#2512](#2512))
([7c08658](7c08658))
* **deps:** bump picomatch
([#2536](#2536))
([38baaa8](38baaa8))
* **deps:** bump yaml from 2.8.2 to 2.8.3 in /scripts/renovate
([#2542](#2542))
([f78b6df](f78b6df))
* **deps:** update iac-support-deps
([#2534](#2534))
([5098a93](5098a93))
* **deps:** update keycloak to v26.5.6
([#2502](#2502))
([ba6a2c0](ba6a2c0))
* **deps:** update pepr to v1.1.5
([#2540](#2540))
([21bc575](21bc575))
* **deps:** update prometheus-stack
([#2518](#2518))
([c9dfd05](c9dfd05))
* **deps:** update setup-uv (support dep) to v8
([#2548](#2548))
([2b55d1c](2b55d1c))
* **deps:** update support dependencies to v4.35.1
([#2545](#2545))
([a964617](a964617))
* **deps:** update UDS CLI to 0.30.0, Zarf init to 0.74.0
([#2526](#2526))
([bb0fed5](bb0fed5))
* **deps:** update velero
([#2405](#2405))
([609947e](609947e))
* **docs:** add release notes for 1.1.0
([#2555](#2555))
([3cc107e](3cc107e))
* **docs:** remove old doc images and diagrams
([#2521](#2521))
([7ef96c8](7ef96c8))
* **renovate:** add minimumReleaseAge for npm support dependencies
([#2553](#2553))
([94ff4d6](94ff4d6))
* **renovate:** set min release age for pepr to null
([#2554](#2554))
([ee5dbf0](ee5dbf0))
* replace use of `uds` with `./uds` in uds tasks
([#2541](#2541))
([d165ec6](d165ec6))
* update contributing doc link in PR template
([#2532](#2532))
([0651180](0651180))


### Documentation

* cleanup old doc sites references;cleanup readme
([#2525](#2525))
([c980914](c980914))
* fix incorrect link to configuration overview on reference overview
([#2533](#2533))
([32fe181](32fe181))
* loki storage configuration reference
([#2529](#2529))
([25bd0e7](25bd0e7))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

4 participants