feat: add Prometheus /metrics endpoint #884

Merged
acouvreur merged 30 commits into sablierapp:main from jindrichskupa:feat/prometheus-metrics on May 10, 2026

@jindrichskupa

Contributor

Summary

Adds an opt-in Prometheus-compatible /metrics endpoint to the existing Sablier HTTP server.

Tracks group lock state (groups with at least one active session), per-instance warmup time (provider start duration + end-to-end not-ready→ready wall time), session request counters, instance start failures, instance stops, plus the standard Go runtime and process collectors.

The endpoint is disabled by default. Enable with server.metrics.enabled: true.

Motivation

Operators running Sablier in production currently have no built-in visibility into:

  • Which groups are actively held warm by user traffic right now?
  • How long does it take a group to warm up after the first request hits it? (User-perceived latency on the blocking strategy.)
  • Are provider start calls failing? How often?
  • Is the Sablier process itself healthy (heap, goroutines, GC, CPU)?

These are easy to answer once Sablier exposes Prometheus metrics that can be scraped, dashboarded, and alerted on with the rest of the user's observability stack.

Configuration

server:
  port: 10000
  base-path: /
  metrics:
    enabled: true   # default: false

Equivalents: --server.metrics.enabled / SERVER_METRICS_ENABLED=true.

Endpoint

GET <base-path>/metrics. Served by the same gin server alongside /health and /api/.... Registered only when enabled (returns 404 otherwise). Uses promhttp.HandlerFor against a per-process registry.
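The conditional registration described above can be sketched with the standard library alone. This is only an illustration: the PR uses a gin server and promhttp.HandlerFor, whereas the sketch below swaps in net/http's ServeMux, and `newMux`, `status`, and the placeholder handler body are hypothetical names that do not come from the Sablier code base.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// newMux builds a router in the spirit of the description: /health is always
// registered, /metrics only when metrics are enabled. Both live under basePath.
// (Hypothetical helper; the real server is gin-based.)
func newMux(basePath string, metricsEnabled bool) *http.ServeMux {
	base := strings.TrimSuffix(basePath, "/") // "/" becomes "", "/sablier/" becomes "/sablier"
	mux := http.NewServeMux()
	mux.HandleFunc(base+"/health", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	if metricsEnabled {
		mux.HandleFunc(base+"/metrics", func(w http.ResponseWriter, _ *http.Request) {
			// Placeholder body standing in for the Prometheus exposition output.
			fmt.Fprintln(w, "# placeholder exposition output")
		})
	}
	return mux
}

// status issues an in-memory GET against h and returns the response code.
func status(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	fmt.Println("enabled:", status(newMux("/", true), "/metrics"))
	fmt.Println("disabled:", status(newMux("/", false), "/metrics"))
	fmt.Println("with base path:", status(newMux("/sablier", true), "/sablier/metrics"))
}
```

Because the route is never registered when metrics are disabled, the 404 comes from the router's default not-found behaviour rather than from a check inside the handler.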

Metrics catalog

| Name | Type | Labels |
|---|---|---|
| sablier_group_locked | gauge | group |
| sablier_group_active_instances | gauge | group |
| sablier_instance_start_duration_seconds | histogram | instance |
| sablier_instance_ready_duration_seconds | histogram | instance |
| sablier_session_requests_total | counter | strategy, target |
| sablier_instance_start_failures_total | counter | instance |
| sablier_instance_stops_total | counter | instance, reason |
| Go runtime + process collectors | (default) | (default) |

Histogram buckets are sized for container start times: [0.1, 0.5, 1, 2, 5, 10, 20, 30, 60, 120, 300] seconds.
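To make the bucket layout concrete, here is a small standard-library sketch that finds the smallest upper bound a given start time falls under. The bounds are copied from the line above; `smallestBucket` is a hypothetical helper, not part of the PR. Prometheus histogram buckets are cumulative, so an observation is also counted in every larger bucket and in +Inf.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Upper bounds in seconds, copied from the PR description.
var buckets = []float64{0.1, 0.5, 1, 2, 5, 10, 20, 30, 60, 120, 300}

// smallestBucket returns the smallest configured upper bound that is >= seconds,
// or +Inf when the value exceeds every configured bound.
func smallestBucket(seconds float64) float64 {
	// SearchFloat64s returns the first index i with buckets[i] >= seconds.
	i := sort.SearchFloat64s(buckets, seconds)
	if i == len(buckets) {
		return math.Inf(1)
	}
	return buckets[i]
}

func main() {
	for _, d := range []float64{0.3, 2, 45, 400} {
		fmt.Printf("%.1fs falls under le=%v\n", d, smallestBucket(d))
	}
}
```

With these bounds, a container that takes longer than five minutes to start is only visible in the +Inf bucket.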

The two sablier_group_* gauges always emit one series per known group (zero-valued if no instances in the group are active) so alerting rules can use stable cardinality.
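A rough sketch of the zero-value behaviour, assuming groups are known as a map from group name to instance names and active instances as a set. The `groupSample` type and `collectGroups` function are hypothetical; the PR's collector reads Sablier.Groups() and the recorder's active-instance set instead.

```go
package main

import (
	"fmt"
	"sort"
)

// groupSample is one scrape-time reading for a group (hypothetical type).
type groupSample struct {
	Group  string
	Locked int // 1 if at least one instance in the group is active, else 0
	Active int // number of active instances in the group
}

// collectGroups emits one sample per known group, including groups without any
// active instance, so the set of series stays the same from scrape to scrape.
func collectGroups(groups map[string][]string, active map[string]bool) []groupSample {
	out := make([]groupSample, 0, len(groups))
	for group, instances := range groups {
		s := groupSample{Group: group}
		for _, name := range instances {
			if active[name] {
				s.Active++
			}
		}
		if s.Active > 0 {
			s.Locked = 1
		}
		out = append(out, s)
	}
	// Sort for deterministic output; map iteration order is random.
	sort.Slice(out, func(i, j int) bool { return out[i].Group < out[j].Group })
	return out
}

func main() {
	groups := map[string][]string{"demo": {"whoami"}, "idle": {"nginx"}}
	active := map[string]bool{"whoami": true}
	for _, s := range collectGroups(groups, active) {
		fmt.Printf("group=%s locked=%d active_instances=%d\n", s.Group, s.Locked, s.Active)
	}
}
```

Emitting the idle group with zeros, rather than omitting it, is what lets an alert compare against 0 without having to handle a missing series.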

Architecture

A new pkg/metrics package owns everything Prometheus-related:

  • Recorder interface — call surface for Sablier core and API handlers
  • Noop — zero-overhead default; used when metrics are disabled, so call sites are branch-free
  • PromRecorder — real implementation with all metric vectors and standard collectors
  • GroupLockCollector — custom prometheus.Collector emitting the two group gauges lazily at scrape time, reading Sablier.Groups() and the recorder's active-instance set
  • NewHandler — wraps promhttp.HandlerFor for the gin server
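The Recorder/Noop split can be illustrated with a trimmed-down interface. The method names RecordSessionRequest and RecordInstanceStop appear in the commit messages of this PR, but their signatures here are assumptions, and `countingRecorder` and `handleRequest` are hypothetical stand-ins for PromRecorder and an API handler.

```go
package main

import "fmt"

// Recorder is a reduced, assumed version of the call surface used by core
// and API handlers. The real interface has more methods.
type Recorder interface {
	RecordSessionRequest(strategy, target string)
	RecordInstanceStop(instance, reason string)
}

// Noop satisfies Recorder and does nothing, so call sites never need an
// "are metrics enabled?" branch or a nil check.
type Noop struct{}

func (Noop) RecordSessionRequest(string, string) {}
func (Noop) RecordInstanceStop(string, string)   {}

// countingRecorder is a hypothetical in-memory stand-in for the
// Prometheus-backed implementation.
type countingRecorder struct {
	requests map[string]int
	stops    map[string]int
}

func newCountingRecorder() *countingRecorder {
	return &countingRecorder{requests: map[string]int{}, stops: map[string]int{}}
}

func (c *countingRecorder) RecordSessionRequest(strategy, target string) {
	c.requests[strategy+"/"+target]++
}

func (c *countingRecorder) RecordInstanceStop(instance, reason string) {
	c.stops[instance+"/"+reason]++
}

// handleRequest is a hypothetical call site: identical code whether metrics
// are enabled or not.
func handleRequest(r Recorder, strategy, target string) {
	r.RecordSessionRequest(strategy, target)
}

func main() {
	handleRequest(Noop{}, "blocking", "group") // metrics disabled: nothing recorded

	rec := newCountingRecorder()
	handleRequest(rec, "blocking", "group")
	handleRequest(rec, "blocking", "group")
	fmt.Println(rec.requests["blocking/group"])
}
```

Defaulting the field to Noop{} rather than nil is also what keeps shared test fixtures from panicking when they do not care about metrics.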

Sablier core gets a metrics Recorder field (defaults to Noop{}) and a small number of well-defined call points: in requestStart (begin tracking, time the start call), in InstanceRequest (observe ready transition), in the store-expiration callback (record stop + clear active state + clear pending ready-wait), and in StopAllUnregisteredInstances (record stop with reason="unregistered"). The API handlers increment the session request counter.

Security

The endpoint exposes process internals, group/instance names, and counters. Documented as operator-level — restrict at the reverse proxy when Sablier is fronted on an untrusted network. This matches the existing posture of /health. Adding auth is out of scope.

Non-goals (deliberately left out)

  • Authentication on /metrics — operators restrict at the reverse proxy
  • OpenTelemetry / OTLP export — separate, larger discussion
  • Per-strategy histograms — instance label is enough; strategy-level views can be built in PromQL by joining with the request counter
  • Tracing spans
  • Persisting active-instance state across restarts (in-memory store correctly resets to 0; Valkey store repopulates as requests come in)
  • Splitting /metrics onto a separate listener — single listener for now

Compatibility

Purely additive. No existing config or HTTP contract changes. The endpoint is opt-in (disabled by default), so existing deployments see no change in behavior or response surface. prometheus/client_golang was already a transitive dependency and is now promoted to a direct one.

Test plan

  • Unit tests for pkg/metrics (recorder counters, histograms, ready-wait state machine, collector zero-value emission, HTTP handler exposition format) — 12 tests
  • Unit tests for pkg/sablier verifying metrics calls in the request flow (success, failure, ready transition) — uses fakeRecorder
  • Integration tests for internal/server: /metrics returns 200 when enabled, 404 when disabled, and respects base-path
  • go vet clean across all touched packages
  • Reviewer please verify CI passes on pkg/sabliercmd — couldn't build that package locally due to a missing system gpgme library

🤖 Generated with Claude Code

jindrichskupa and others added 23 commits May 5, 2026 15:39
Design spec for an opt-in /metrics endpoint exposing group lock state,
per-instance warmup time (provider start + end-to-end ready), session
counters, and Go runtime/process collectors. Endpoint served by the
existing gin server under base path; disabled by default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bite-sized TDD task plan corresponding to the design proposal in
docs/proposals/2026-05-05-prometheus-metrics.md. Working artifact —
may be omitted from the upstream PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wire metrics.Recorder into OnInstanceExpired so that every store expiry
records a stop counter and marks the instance inactive. Also scaffold
buildRecorder in sabliercmd/start.go so the recorder is created and
passed to the ServeStrategy, sablier.Sablier, and the expiry callback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Call s.metrics.RecordInstanceStop after successfully stopping an
unregistered instance in stopFunc so the metric is emitted alongside
the existing log line.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add metrics.Recorder to ServeStrategy and call RecordSessionRequest
in both the dynamic and blocking strategy handlers, immediately after
the names/group XOR validation. Update the shared NewApiTest fixture
to populate Metrics: metrics.Noop{} so tests don't nil-panic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wire up the flag that was documented in docs/configuration.md but never
registered with cobra/viper, so it can be set from the command line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Prevents stale timestamps from corrupting the ready-duration histogram
when an instance is stopped externally before becoming Ready, then later
re-requested with the same name.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec content is captured in the PR description; the plan was a working
TDD artifact for the implementer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jindrichskupa jindrichskupa requested a review from acouvreur as a code owner May 7, 2026 08:59
jindrichskupa and others added 2 commits May 7, 2026 11:46
Brings in slog-gin 1.21.1, gin 1.12.0, podman 5.8.2 dependency bumps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- pkg/metrics/prom_test.go: extract findMetric and assertNoHistogramSamples
  helpers, dedupe mustCounter/mustHistogramCount and the no-samples assertions
- internal/api: extract recordSessionRequest helper from start_dynamic.go and
  start_blocking.go

Drops new-code duplication below the 3% SonarCloud threshold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@acouvreur

Member

Hello @jindrichskupa

Thank you for this contribution! Would you be able to create an example in a folder: examples/metrics/prometheus with a runnable example ?

A small docker compose with a Makefile and a readme as a small runnable example to have a full setup.

Do not include any reverse proxy in this example, we just want to make API calls and see the registered metrics in the prometheus instance.

@codecov

codecov Bot commented May 7, 2026


Codecov Report

❌ Patch coverage is 86.08247% with 27 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/sablier/sablier.go | 21.42% | 10 Missing and 1 partial ⚠️ |
| pkg/sabliercmd/start.go | 0.00% | 10 Missing ⚠️ |
| pkg/sablier/instance_expired.go | 0.00% | 4 Missing ⚠️ |
| pkg/config/server.go | 0.00% | 1 Missing ⚠️ |
| pkg/metrics/recorder.go | 88.88% | 1 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| internal/api/api.go | 100.00% <100.00%> (ø) |
| internal/api/start_blocking.go | 100.00% <100.00%> (ø) |
| internal/api/start_dynamic.go | 76.19% <100.00%> (+0.28%) ⬆️ |
| internal/server/routes.go | 100.00% <100.00%> (+100.00%) ⬆️ |
| pkg/metrics/collector.go | 100.00% <100.00%> (ø) |
| pkg/metrics/handler.go | 100.00% <100.00%> (ø) |
| pkg/metrics/prom.go | 100.00% <100.00%> (ø) |
| pkg/sablier/autostop.go | 92.00% <100.00%> (+0.33%) ⬆️ |
| pkg/sablier/instance_request.go | 91.46% <100.00%> (+4.79%) ⬆️ |
| pkg/sabliercmd/root.go | 97.53% <100.00%> (+0.06%) ⬆️ |

... and 5 more

... and 3 files with indirect coverage changes

jindrichskupa and others added 2 commits May 7, 2026 15:46
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A minimal docker-compose stack with Sablier (metrics enabled), a whoami
target labelled into the "demo" group, and a Prometheus instance scraping
sablier:10000/metrics. No reverse proxy — drive the API with curl via the
provided Makefile and observe the metrics in Prometheus.

Addresses review request from @acouvreur on PR sablierapp#884.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jindrichskupa

Contributor Author

Hi @acouvreur — thanks for the quick review!

Added in c9f21f6. The new directory examples/metrics/prometheus/ contains:

  • docker-compose.yml — three services: sablier (metrics enabled), whoami (labelled into group demo), prometheus (scraping sablier:10000/metrics every 5s). No reverse proxy.
  • sablier.yaml — config with server.metrics.enabled: true and a short sessions.default-duration: 1m so the demo loop is quick.
  • prometheus.yml — single scrape job targeting Sablier.
  • Makefile — up / down / logs / ps / metrics plus request-dynamic, request-blocking, stop-target to drive the API.
  • README.md — quickstart, suggested demo loop, useful PromQL queries.

The image: tag uses 1.11.2 # x-release-please-version (matching the convention in docs/getting-started.md) so release-please will auto-bump it on the next release. The README notes that the /metrics endpoint requires this PR to be released, with a pointer to using a build: directive against the local repo for testing pre-release.

Happy to add a Grafana dashboard or split the example into multiple folders (e.g. one per provider) if you'd prefer — let me know.

@sonarqubecloud

sonarqubecloud Bot commented May 7, 2026


@acouvreur

Member

> Hi @acouvreur — thanks for the quick review!
>
> Added in c9f21f6. The new directory examples/metrics/prometheus/ contains:
>
> * `docker-compose.yml` — three services: `sablier` (metrics enabled), `whoami` (labelled into group `demo`), `prometheus` (scraping `sablier:10000/metrics` every 5s). No reverse proxy.
> * `sablier.yaml` — config with `server.metrics.enabled: true` and a short `sessions.default-duration: 1m` so the demo loop is quick.
> * `prometheus.yml` — single scrape job targeting Sablier.
> * `Makefile` — `up` / `down` / `logs` / `ps` / `metrics` plus `request-dynamic`, `request-blocking`, `stop-target` to drive the API.
> * `README.md` — quickstart, suggested demo loop, useful PromQL queries.
>
> The image: tag uses 1.11.2 # x-release-please-version (matching the convention in docs/getting-started.md) so release-please will auto-bump it on the next release. The README notes that the /metrics endpoint requires this PR to be released, with a pointer to using a build: directive against the local repo for testing pre-release.
>
> Happy to add a Grafana dashboard or split the example into multiple folders (e.g. one per provider) if you'd prefer — let me know.

I will try this out locally! If that "just works" I think we can merge it :)

@acouvreur

Member

Hello @jindrichskupa can you resolve merge conflicts please?

# Conflicts:
#	go.mod
#	pkg/sabliercmd/start.go
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label May 8, 2026
@acouvreur

Member

Hello @jindrichskupa can you rebase your changes again ?
Sorry, another contribution was merged before yours

@jindrichskupa

Contributor Author

@acouvreur rebased onto current main. Two merge commits in this push (59f957b brought in the moby/moby + raw InstanceInfo refactor, f3e41ee brought in the Proxmox LXC provider).

Conflicts resolved:

  • go.mod / go.sum — kept our prometheus deps and integrated the Proxmox additions; ran go mod tidy clean.
  • pkg/sabliercmd/start.go — kept both the metrics import and the new provpkg import.
  • pkg/sablier/instance_request_test.go — updated my three new tests for the renamed InstanceStatus constants and the now-removed NotReadyInstanceState/ReadyInstanceState factories. They now build InstanceInfo{} literals matching the pattern used by the other tests in that file.

Local go build ./..., go vet ./..., and tests on pkg/sablier, pkg/metrics, internal/api, internal/server, pkg/sabliercmd, pkg/config are all green. PR shows MERGEABLE. Happy to squash the three merge commits into one if you'd prefer a flatter history before merge.

@acouvreur acouvreur merged commit b0a0237 into sablierapp:main May 10, 2026
3 of 4 checks passed
@sablier-bot sablier-bot Bot mentioned this pull request May 9, 2026
@acouvreur acouvreur mentioned this pull request May 14, 2026
@matoy

matoy commented May 15, 2026


Hi. As a starting point, I made this Grafana dashboard to make use of the metrics.
grafana-dashboard.json
Maybe you can improve it and/or put it in the examples of a future release?

