feat: add Prometheus /metrics endpoint by jindrichskupa · Pull Request #884 · sablierapp/sablier

jindrichskupa · 2026-05-07T08:59:20Z

Summary

Adds an opt-in Prometheus-compatible /metrics endpoint to the existing Sablier HTTP server.

Tracks group lock state (groups with at least one active session), per-instance warmup time (provider start duration + end-to-end not-ready→ready wall time), session request counters, instance start failures, instance stops, plus the standard Go runtime and process collectors.

The endpoint is disabled by default. Enable with server.metrics.enabled: true.

Motivation

Operators running Sablier in production currently have no built-in visibility into:

Which groups are actively held warm by user traffic right now?
How long does it take a group to warm up after the first request hits it? (User-perceived latency on the blocking strategy.)
Are provider start calls failing? How often?
Is the Sablier process itself healthy (heap, goroutines, GC, CPU)?

These are easy to answer once Sablier exposes Prometheus metrics that can be scraped, dashboarded, and alerted on with the rest of the user's observability stack.

Configuration

server:
  port: 10000
  base-path: /
  metrics:
    enabled: true   # default: false

Equivalents: --server.metrics.enabled / SERVER_METRICS_ENABLED=true.

Endpoint

GET <base-path>/metrics. Served by the same gin server alongside /health and /api/.... Registered only when enabled (returns 404 otherwise). Uses promhttp.HandlerFor against a per-process registry.
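The conditional registration described above can be sketched with the standard library alone. This is an illustration, not the PR's code: the PR mounts the route on the gin server and serves it with promhttp.HandlerFor, whereas `newMux`, `status`, and `placeholderMetrics` below are hypothetical names and the handler is a stand-in that writes a fixed line.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"path"
)

// placeholderMetrics stands in for the real Prometheus handler (hypothetical).
var placeholderMetrics = http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprintln(w, "# metrics exposition would be written here")
})

// newMux always mounts /health and mounts /metrics only when enabled,
// both under basePath.
func newMux(basePath string, metricsEnabled bool, metrics http.Handler) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc(path.Join("/", basePath, "health"), func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	if metricsEnabled {
		mux.Handle(path.Join("/", basePath, "metrics"), metrics)
	}
	return mux
}

// status issues a GET against the mux and returns the response code.
func status(mux *http.ServeMux, target string) int {
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Code
}

func main() {
	fmt.Println(status(newMux("/", true, placeholderMetrics), "/metrics"))                // 200
	fmt.Println(status(newMux("/", false, placeholderMetrics), "/metrics"))               // 404
	fmt.Println(status(newMux("/sablier", true, placeholderMetrics), "/sablier/metrics")) // 200
}
```

Because the route is never registered when the flag is off, the 404 comes from the router itself rather than from a check inside the handler.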

Metrics catalog

Name	Type	Labels
`sablier_group_locked`	gauge	`group`
`sablier_group_active_instances`	gauge	`group`
`sablier_instance_start_duration_seconds`	histogram	`instance`
`sablier_instance_ready_duration_seconds`	histogram	`instance`
`sablier_session_requests_total`	counter	`strategy`, `target`
`sablier_instance_start_failures_total`	counter	`instance`
`sablier_instance_stops_total`	counter	`instance`, `reason`
Go runtime + process collectors	(default)	(default)

Histogram buckets are sized for container start times: [0.1, 0.5, 1, 2, 5, 10, 20, 30, 60, 120, 300] seconds.
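To make the bucket layout concrete, here is a small sketch of how observations land in cumulative Prometheus-style buckets (each bucket counts every observation less than or equal to its upper bound). The helper `cumulativeCounts` and the sample observations are hypothetical; only the bucket boundaries come from the PR.

```go
package main

import "fmt"

// startBuckets mirrors the upper bounds listed in the PR description, in seconds.
var startBuckets = []float64{0.1, 0.5, 1, 2, 5, 10, 20, 30, 60, 120, 300}

// cumulativeCounts returns, for each upper bound, how many observations
// are less than or equal to that bound.
func cumulativeCounts(bounds, observations []float64) []uint64 {
	counts := make([]uint64, len(bounds))
	for _, v := range observations {
		for i, le := range bounds {
			if v <= le {
				counts[i]++
			}
		}
	}
	return counts
}

func main() {
	// Illustrative start times: a fast start, a typical one, and a very slow one.
	obs := []float64{0.3, 7, 400}
	for i, c := range cumulativeCounts(startBuckets, obs) {
		fmt.Printf("le=%g count=%d\n", startBuckets[i], c)
	}
	// The implicit +Inf bucket always equals the total number of observations.
	fmt.Printf("le=+Inf count=%d\n", len(obs))
}
```

A start that takes longer than 300 seconds is only visible in the +Inf bucket, so quantile estimates above the last bound are capped at 300.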

The two sablier_group_* gauges always emit one series per known group (zero-valued if no instances in the group are active) so alerting rules can use stable cardinality.
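A rough sketch of the zero-value behavior, assuming the collector is given a group-to-instances map and a set of active instance names at scrape time. The types and the function `groupSamples` are hypothetical, and the rule "locked when at least one instance is active" is my reading of the description, not confirmed code.

```go
package main

import (
	"fmt"
	"sort"
)

// groupSample is one pair of gauge values for a single group label.
type groupSample struct {
	Group           string
	Locked          float64 // 1 if at least one instance is active, else 0
	ActiveInstances float64
}

// groupSamples emits one sample per known group, including groups with
// no active instances, so the series set does not change between scrapes.
func groupSamples(groups map[string][]string, active map[string]bool) []groupSample {
	out := make([]groupSample, 0, len(groups))
	for name, instances := range groups {
		s := groupSample{Group: name}
		for _, inst := range instances {
			if active[inst] {
				s.ActiveInstances++
			}
		}
		if s.ActiveInstances > 0 {
			s.Locked = 1
		}
		out = append(out, s)
	}
	// Sort for deterministic output; map iteration order is random in Go.
	sort.Slice(out, func(i, j int) bool { return out[i].Group < out[j].Group })
	return out
}

func main() {
	groups := map[string][]string{"demo": {"whoami"}, "idle": {"nginx", "redis"}}
	active := map[string]bool{"whoami": true}
	for _, s := range groupSamples(groups, active) {
		fmt.Printf("group=%s locked=%g active=%g\n", s.Group, s.Locked, s.ActiveInstances)
	}
}
```

Emitting the idle group with zeros lets an alert compare against 0 directly instead of having to handle an absent series.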

Architecture

A new pkg/metrics package owns everything Prometheus-related:

Recorder interface — call surface for Sablier core and API handlers
Noop — zero-overhead default; used when metrics are disabled, so call sites are branch-free
PromRecorder — real implementation with all metric vectors and standard collectors
GroupLockCollector — custom prometheus.Collector emitting the two group gauges lazily at scrape time, reading Sablier.Groups() and the recorder's active-instance set
NewHandler — wraps promhttp.HandlerFor for the gin server
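The Recorder/Noop split listed above might look roughly like the following. The names Recorder, Noop, RecordSessionRequest, RecordInstanceStop, and WithMetrics appear in the PR, but the method signatures, the `core` type, `stopUnregistered`, and `countingRecorder` are assumptions made for this sketch.

```go
package main

import "fmt"

// Recorder is a trimmed-down, assumed version of the call surface.
type Recorder interface {
	RecordSessionRequest(strategy, target string)
	RecordInstanceStop(instance, reason string)
}

// Noop satisfies Recorder and does nothing, so call sites never need
// an "is metrics enabled?" branch or a nil check.
type Noop struct{}

func (Noop) RecordSessionRequest(string, string) {}
func (Noop) RecordInstanceStop(string, string)   {}

// countingRecorder is a hypothetical test double that tallies calls.
type countingRecorder struct {
	requests int
	stops    map[string]int // keyed by reason
}

func newCountingRecorder() *countingRecorder {
	return &countingRecorder{stops: map[string]int{}}
}

func (r *countingRecorder) RecordSessionRequest(string, string) { r.requests++ }
func (r *countingRecorder) RecordInstanceStop(_ string, reason string) {
	r.stops[reason]++
}

// core stands in for the Sablier core type (hypothetical).
type core struct{ metrics Recorder }

func newCore() *core                     { return &core{metrics: Noop{}} }
func (c *core) WithMetrics(r Recorder)   { c.metrics = r }
func (c *core) stopUnregistered(name string) {
	// The call is unconditional: with Noop it costs an empty method call.
	c.metrics.RecordInstanceStop(name, "unregistered")
}

func main() {
	c := newCore()
	c.stopUnregistered("whoami") // safe with the default Noop recorder

	rec := newCountingRecorder()
	c.WithMetrics(rec)
	c.stopUnregistered("whoami")
	fmt.Println(rec.stops["unregistered"]) // 1
}
```

Defaulting the field to Noop{} also means existing tests that construct the core without metrics keep working unchanged.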

Sablier core gets a metrics Recorder field (defaults to Noop{}) and a small number of well-defined call points: in requestStart (begin tracking, time the start call), in InstanceRequest (observe ready transition), in the store-expiration callback (record stop + clear active state + clear pending ready-wait), and in StopAllUnregisteredInstances (record stop with reason="unregistered"). The API handlers increment the session request counter.

Security

The endpoint exposes process internals, group/instance names, and counters. Documented as operator-level — restrict at the reverse proxy when Sablier is fronted on an untrusted network. This matches the existing posture of /health. Adding auth is out of scope.

Non-goals (deliberately left out)

Authentication on /metrics — operators restrict at the reverse proxy
OpenTelemetry / OTLP export — separate, larger discussion
Per-strategy histograms — instance label is enough; strategy-level views can be built in PromQL by joining with the request counter
Tracing spans
Persisting active-instance state across restarts (in-memory store correctly resets to 0; Valkey store repopulates as requests come in)
Splitting /metrics onto a separate listener — single listener for now

Compatibility

Purely additive. No existing config or HTTP contract changes. The endpoint is opt-in (disabled by default), so existing deployments see no change in behavior or response surface. prometheus/client_golang was already a transitive dependency, now promoted to direct.

Test plan

Unit tests for pkg/metrics (recorder counters, histograms, ready-wait state machine, collector zero-value emission, HTTP handler exposition format) — 12 tests
Unit tests for pkg/sablier verifying metrics calls in the request flow (success, failure, ready transition) — uses fakeRecorder
Integration tests for internal/server — /metrics returns 200 when enabled, 404 when disabled, respects base-path
go vet clean across all touched packages
Reviewer please verify CI passes on pkg/sabliercmd — couldn't build that package locally due to a missing system gpgme library

🤖 Generated with Claude Code

Design spec for an opt-in /metrics endpoint exposing group lock state, per-instance warmup time (provider start + end-to-end ready), session counters, and Go runtime/process collectors. Endpoint served by the existing gin server under base path; disabled by default. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bite-sized TDD task plan corresponding to the design proposal in docs/proposals/2026-05-05-prometheus-metrics.md. Working artifact — may be omitted from the upstream PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Wire metrics.Recorder into OnInstanceExpired so that every store expiry records a stop counter and marks the instance inactive. Also scaffold buildRecorder in sabliercmd/start.go so the recorder is created and passed to the ServeStrategy, sablier.Sablier, and the expiry callback. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Call s.metrics.RecordInstanceStop after successfully stopping an unregistered instance in stopFunc so the metric is emitted alongside the existing log line. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add metrics.Recorder to ServeStrategy and call RecordSessionRequest in both the dynamic and blocking strategy handlers, immediately after the names/group XOR validation. Update the shared NewApiTest fixture to populate Metrics: metrics.Noop{} so tests don't nil-panic. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Wire up the flag that was documented in docs/configuration.md but never registered with cobra/viper, so it can be set from the command line. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Prevents stale timestamps from corrupting the ready-duration histogram when an instance is stopped externally before becoming Ready, then later re-requested with the same name. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Spec content is captured in the PR description; the plan was a working TDD artifact for the implementer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Brings in slog-gin 1.21.1, gin 1.12.0, podman 5.8.2 dependency bumps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- pkg/metrics/prom_test.go: extract findMetric and assertNoHistogramSamples helpers, dedupe mustCounter/mustHistogramCount and the no-samples assertions - internal/api: extract recordSessionRequest helper from start_dynamic.go and start_blocking.go Drops new-code duplication below the 3% SonarCloud threshold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

acouvreur · 2026-05-07T12:17:05Z

Hello @jindrichskupa

Thank you for this contribution! Would you be able to add a runnable example in a folder: examples/metrics/prometheus?

A small Docker Compose file with a Makefile and a README would be enough to have a full, runnable setup.

Do not include any reverse proxy in this example, we just want to make API calls and see the registered metrics in the prometheus instance.

codecov · 2026-05-07T12:20:26Z

Codecov Report

❌ Patch coverage is 86.08247% with 27 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
pkg/sablier/sablier.go	21.42%	10 Missing and 1 partial ⚠️
pkg/sabliercmd/start.go	0.00%	10 Missing ⚠️
pkg/sablier/instance_expired.go	0.00%	4 Missing ⚠️
pkg/config/server.go	0.00%	1 Missing ⚠️
pkg/metrics/recorder.go	88.88%	1 Missing ⚠️

Files with missing lines	Coverage Δ
internal/api/api.go	`100.00% <100.00%> (ø)`
internal/api/start_blocking.go	`100.00% <100.00%> (ø)`
internal/api/start_dynamic.go	`76.19% <100.00%> (+0.28%)`	⬆️
internal/server/routes.go	`100.00% <100.00%> (+100.00%)`	⬆️
pkg/metrics/collector.go	`100.00% <100.00%> (ø)`
pkg/metrics/handler.go	`100.00% <100.00%> (ø)`
pkg/metrics/prom.go	`100.00% <100.00%> (ø)`
pkg/sablier/autostop.go	`92.00% <100.00%> (+0.33%)`	⬆️
pkg/sablier/instance_request.go	`91.46% <100.00%> (+4.79%)`	⬆️
pkg/sabliercmd/root.go	`97.53% <100.00%> (+0.06%)`	⬆️
... and 5 more

... and 3 files with indirect coverage changes

@acouvreur

A minimal docker-compose stack with Sablier (metrics enabled), a whoami target labelled into the "demo" group, and a Prometheus instance scraping sablier:10000/metrics. No reverse proxy — drive the API with curl via the provided Makefile and observe the metrics in Prometheus. Addresses review request from @acouvreur on PR sablierapp#884. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jindrichskupa · 2026-05-07T14:11:52Z

Hi @acouvreur — thanks for the quick review!

Added in c9f21f6. The new directory examples/metrics/prometheus/ contains:

docker-compose.yml — three services: sablier (metrics enabled), whoami (labelled into group demo), prometheus (scraping sablier:10000/metrics every 5s). No reverse proxy.
sablier.yaml — config with server.metrics.enabled: true and a short sessions.default-duration: 1m so the demo loop is quick.
prometheus.yml — single scrape job targeting Sablier.
Makefile — up / down / logs / ps / metrics plus request-dynamic, request-blocking, stop-target to drive the API.
README.md — quickstart, suggested demo loop, useful PromQL queries.

The image: tag uses 1.11.2 # x-release-please-version (matching the convention in docs/getting-started.md) so release-please will auto-bump it on the next release. The README notes that the /metrics endpoint requires this PR to be released, with a pointer to using a build: directive against the local repo for testing pre-release.

Happy to add a Grafana dashboard or split the example into multiple folders (e.g. one per provider) if you'd prefer — let me know.

sonarqubecloud · 2026-05-07T14:12:10Z

Quality Gate passed

Issues
11 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.4% Duplication on New Code

See analysis details on SonarQube Cloud

acouvreur · 2026-05-07T14:35:00Z

I will try this out locally! If that "just works" I think we can merge it :)

acouvreur · 2026-05-07T22:52:25Z

Hello @jindrichskupa can you resolve merge conflicts please?

acouvreur · 2026-05-09T11:50:43Z

Hello @jindrichskupa can you rebase your changes again ?
Sorry, another contribution was merged before yours

jindrichskupa · 2026-05-10T04:58:15Z

@acouvreur rebased onto current main. Two merge commits in this push (59f957b brought in the moby/moby + raw InstanceInfo refactor, f3e41ee brought in the Proxmox LXC provider).

Conflicts resolved:

go.mod / go.sum — kept our prometheus deps and integrated the Proxmox additions; ran go mod tidy clean.
pkg/sabliercmd/start.go — kept both the metrics import and the new provpkg import.
pkg/sablier/instance_request_test.go — updated my three new tests for the renamed InstanceStatus constants and the now-removed NotReadyInstanceState/ReadyInstanceState factories. They now build InstanceInfo{} literals matching the pattern used by the other tests in that file.

Local go build ./..., go vet ./..., and tests on pkg/sablier, pkg/metrics, internal/api, internal/server, pkg/sabliercmd, pkg/config are all green. PR shows MERGEABLE. Happy to squash the three merge commits into one if you'd prefer a flatter history before merge.

matoy · 2026-05-15T11:05:57Z

Hi. As a starting point, I made this Grafana dashboard to make use of the metrics.
grafana-dashboard.json
Maybe you can improve it and/or put it in the examples of a future release?

jindrichskupa and others added 23 commits May 5, 2026 15:39

feat(config): add server.metrics.enabled (default false)

7e96b90

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(metrics): add Recorder interface and Noop implementation

4f79135

feat(metrics): add PromRecorder skeleton with Go and process collectors

9bddd7e

feat(metrics): record session requests, start failures, stops

b4e5ce1

feat(metrics): observe instance start duration histogram

1b8f383

feat(metrics): record end-to-end ready wait time per instance

8df47a6

feat(metrics): track active instances for group lock gauges

402c1fe

feat(metrics): GroupLockCollector emits per-group gauges

2067512

feat(metrics): HTTP handler factory for /metrics

a4a5937

feat(sablier): WithMetrics setter and Groups accessor

dcf5dcc

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(sablier): emit metrics for instance starts and ready transitions

f9658fa

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test(sablier): assert metrics calls in instance request flow

e7c5ccb

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(sablier): record stop counter for unregistered instances

c41dd18

Call s.metrics.RecordInstanceStop after successfully stopping an unregistered instance in stopFunc so the metric is emitted alongside the existing log line. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(server): expose /metrics endpoint when enabled

e16f77d

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(cmd): register GroupLockCollector on metrics registry

d32afbb

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs(metrics): document server.metrics.enabled and /metrics endpoint

68d95a7

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

fix(cmd): register --server.metrics.enabled CLI flag

850ea1c

Wire up the flag that was documented in docs/configuration.md but never registered with cobra/viper, so it can be set from the command line. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore: remove design and plan docs from upstream PR

ef71429

Spec content is captured in the PR description; the plan was a working TDD artifact for the implementer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jindrichskupa requested a review from acouvreur as a code owner May 7, 2026 08:59

jindrichskupa and others added 2 commits May 7, 2026 11:46

chore: merge origin/main and resolve go.mod conflict

466a142

Brings in slog-gin 1.21.1, gin 1.12.0, podman 5.8.2 dependency bumps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jindrichskupa and others added 2 commits May 7, 2026 15:46

fix(metrics): satisfy errcheck on resp.Body.Close in handler test

d7476e2

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge branch 'main' into feat/prometheus-metrics

f1a4859

Merge remote-tracking branch 'origin/main' into feat/prometheus-metrics

59f957b

# Conflicts: # go.mod # pkg/sabliercmd/start.go

github-actions Bot added the documentation Improvements or additions to documentation label May 8, 2026

Merge remote-tracking branch 'origin/main' into feat/prometheus-metrics

f3e41ee

# Conflicts: # go.mod # go.sum

acouvreur merged commit b0a0237 into sablierapp:main May 10, 2026
3 of 4 checks passed

sablier-bot Bot mentioned this pull request May 9, 2026

chore(main): release 1.12.0 #870

Merged

acouvreur mentioned this pull request May 14, 2026

Add prometheus metrics #427

Closed

acouvreur merged 30 commits into sablierapp:main from jindrichskupa:feat/prometheus-metrics

3 participants