feat: add Prometheus /metrics endpoint #884
Design spec for an opt-in /metrics endpoint exposing group lock state, per-instance warmup time (provider start + end-to-end ready), session counters, and Go runtime/process collectors. Endpoint served by the existing gin server under base path; disabled by default. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bite-sized TDD task plan corresponding to the design proposal in docs/proposals/2026-05-05-prometheus-metrics.md. Working artifact — may be omitted from the upstream PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wire metrics.Recorder into OnInstanceExpired so that every store expiry records a stop counter and marks the instance inactive. Also scaffold buildRecorder in sabliercmd/start.go so the recorder is created and passed to the ServeStrategy, sablier.Sablier, and the expiry callback. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
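As a rough illustration of why marking an instance inactive on expiry matters, the sketch below derives the two per-group gauges from the PR description out of a group map and an active-instance set. The `gauge` type, the `groupGauges` function, and the sample group and instance names are hypothetical and not taken from the PR. The real collector is built on `prometheus.Collector`, which is left out here so the example needs only the standard library.

```go
package main

import (
	"fmt"
	"sort"
)

// gauge holds the two per-group values for one group (hypothetical type).
type gauge struct {
	Group  string
	Locked int // 1 if at least one instance in the group is active, else 0
	Active int // number of active instances in the group
}

// groupGauges computes the values lazily, as a scrape would. Every known
// group gets an entry, even when none of its instances is active.
func groupGauges(groups map[string][]string, active map[string]bool) []gauge {
	out := make([]gauge, 0, len(groups))
	for g, instances := range groups {
		n := 0
		for _, inst := range instances {
			if active[inst] {
				n++
			}
		}
		locked := 0
		if n > 0 {
			locked = 1
		}
		out = append(out, gauge{g, locked, n})
	}
	// Sort so the output order is deterministic.
	sort.Slice(out, func(i, j int) bool { return out[i].Group < out[j].Group })
	return out
}

func main() {
	groups := map[string][]string{
		"demo":  {"whoami"},
		"batch": {"worker-1", "worker-2"},
	}
	active := map[string]bool{"whoami": true}

	fmt.Println(groupGauges(groups, active)) // [{batch 0 0} {demo 1 1}]

	// Store expiry marks the instance inactive; the group stays listed with zeros.
	delete(active, "whoami")
	fmt.Println(groupGauges(groups, active)) // [{batch 0 0} {demo 0 0}]
}
```

If the expiry callback did not clear the active state, the `demo` group in this sketch would keep reporting as locked after its last instance had stopped.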
Call s.metrics.RecordInstanceStop after successfully stopping an unregistered instance in stopFunc so the metric is emitted alongside the existing log line. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add metrics.Recorder to ServeStrategy and call RecordSessionRequest
in both the dynamic and blocking strategy handlers, immediately after
the names/group XOR validation. Update the shared NewApiTest fixture
to populate Metrics: metrics.Noop{} so tests don't nil-panic.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
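The no-op default mentioned in this commit can be sketched roughly as follows. Only the names `Recorder`, `Noop`, `ServeStrategy`, `Metrics`, `RecordSessionRequest`, and `RecordInstanceStop` appear in the PR; the method parameters, the `countingRecorder` test double, and the `handleDynamic` helper are assumptions made for this example.

```go
package main

import "fmt"

// Recorder is a reduced, hypothetical version of the metrics call surface.
// The parameter lists are assumed, not copied from the PR.
type Recorder interface {
	RecordSessionRequest(strategy, target string)
	RecordInstanceStop(instance, reason string)
}

// Noop satisfies Recorder and does nothing, so call sites need no nil check.
type Noop struct{}

func (Noop) RecordSessionRequest(strategy, target string) {}
func (Noop) RecordInstanceStop(instance, reason string)   {}

// countingRecorder is a test double that counts calls per label pair.
type countingRecorder struct {
	sessions map[string]int
	stops    map[string]int
}

func newCountingRecorder() *countingRecorder {
	return &countingRecorder{sessions: map[string]int{}, stops: map[string]int{}}
}

func (c *countingRecorder) RecordSessionRequest(strategy, target string) {
	c.sessions[strategy+"/"+target]++
}

func (c *countingRecorder) RecordInstanceStop(instance, reason string) {
	c.stops[instance+"/"+reason]++
}

// ServeStrategy stands in for the API handler struct; only Metrics matters here.
type ServeStrategy struct {
	Metrics Recorder
}

// handleDynamic mimics a handler that records the request after validation.
func (s ServeStrategy) handleDynamic(target string) {
	s.Metrics.RecordSessionRequest("dynamic", target)
}

func main() {
	// With Noop the handler runs without a metrics backend and without panicking.
	ServeStrategy{Metrics: Noop{}}.handleDynamic("group")

	rec := newCountingRecorder()
	ServeStrategy{Metrics: rec}.handleDynamic("group")
	fmt.Println(rec.sessions["dynamic/group"]) // prints 1
}
```

Leaving `Metrics` unset would make the interface value nil and the call in `handleDynamic` would panic, which is the failure the fixture change guards against.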
Wire up the flag that was documented in docs/configuration.md but never registered with cobra/viper, so it can be set from the command line. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Prevents stale timestamps from corrupting the ready-duration histogram when an instance is stopped externally before becoming Ready, then later re-requested with the same name. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec content is captured in the PR description; the plan was a working TDD artifact for the implementer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings in slog-gin 1.21.1, gin 1.12.0, podman 5.8.2 dependency bumps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- pkg/metrics/prom_test.go: extract findMetric and assertNoHistogramSamples helpers, dedupe mustCounter/mustHistogramCount and the no-samples assertions
- internal/api: extract recordSessionRequest helper from start_dynamic.go and start_blocking.go

Drops new-code duplication below the 3% SonarCloud threshold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hello @jindrichskupa, thank you for this contribution! Would you be able to create an example in a folder: a small docker compose setup with a Makefile and a readme, as a small runnable example of a full setup? Do not include any reverse proxy in this example; we just want to make API calls and see the registered metrics in the Prometheus instance.
A minimal docker-compose stack with Sablier (metrics enabled), a whoami target labelled into the "demo" group, and a Prometheus instance scraping sablier:10000/metrics. No reverse proxy — drive the API with curl via the provided Makefile and observe the metrics in Prometheus. Addresses review request from @acouvreur on PR sablierapp#884. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hi @acouvreur — thanks for the quick review! Added in
The
Happy to add a Grafana dashboard or split the example into multiple folders (e.g. one per provider) if you'd prefer. Let me know.
I will try this out locally! If that "just works" I think we can merge it :)
Hello @jindrichskupa, can you resolve the merge conflicts please?
# Conflicts: # go.mod # pkg/sabliercmd/start.go
Hello @jindrichskupa, can you rebase your changes again?
# Conflicts: # go.mod # go.sum
@acouvreur rebased onto current Conflicts resolved:
Local
Hi. As a starting point, I made this Grafana dashboard to make use of the metrics.



Summary
Adds an opt-in, Prometheus-compatible `/metrics` endpoint to the existing Sablier HTTP server.

Tracks group lock state (groups with at least one active session), per-instance warmup time (provider start duration plus end-to-end not-ready→ready wall time), session request counters, instance start failures, instance stops, and the standard Go runtime and process collectors.

The endpoint is disabled by default. Enable it with `server.metrics.enabled: true`.

Motivation
Operators running Sablier in production currently have no built-in visibility into:
These are easy to answer once Sablier exposes Prometheus metrics that can be scraped, dashboarded, and alerted on with the rest of the user's observability stack.
Configuration
Equivalents: `--server.metrics.enabled` / `SERVER_METRICS_ENABLED=true`.

Endpoint
`GET <base-path>/metrics`. Served by the same gin server alongside `/health` and `/api/...`. Registered only when enabled (returns 404 otherwise). Uses `promhttp.HandlerFor` against a per-process registry.

Metrics catalog
| Metric | Labels |
|---|---|
| `sablier_group_locked` | `group` |
| `sablier_group_active_instances` | `group` |
| `sablier_instance_start_duration_seconds` | `instance` |
| `sablier_instance_ready_duration_seconds` | `instance` |
| `sablier_session_requests_total` | `strategy`, `target` |
| `sablier_instance_start_failures_total` | `instance` |
| `sablier_instance_stops_total` | `instance`, `reason` |

Histogram buckets are sized for container start times: `[0.1, 0.5, 1, 2, 5, 10, 20, 30, 60, 120, 300]` seconds.

The two `sablier_group_*` gauges always emit one series per known group (zero-valued if no instances in the group are active), so alerting rules can rely on stable cardinality.

Architecture
A new `pkg/metrics` package owns everything Prometheus-related:

- `Recorder` interface: call surface for Sablier core and API handlers
- `Noop`: zero-overhead default; used when metrics are disabled, so call sites are branch-free
- `PromRecorder`: real implementation with all metric vectors and standard collectors
- `GroupLockCollector`: custom `prometheus.Collector` emitting the two group gauges lazily at scrape time, reading `Sablier.Groups()` and the recorder's active-instance set
- `NewHandler`: wraps `promhttp.HandlerFor` for the gin server

`Sablier` core gets a `metrics Recorder` field (defaults to `Noop{}`) and a small number of well-defined call points: in `requestStart` (begin tracking, time the start call), in `InstanceRequest` (observe ready transition), in the store-expiration callback (record stop, clear active state, clear pending ready-wait), and in `StopAllUnregisteredInstances` (record stop with `reason="unregistered"`). The API handlers increment the session request counter.

Security
The endpoint exposes process internals, group/instance names, and counters. It is documented as operator-level: restrict access at the reverse proxy when Sablier is fronted on an untrusted network. This matches the existing posture of `/health`. Adding auth is out of scope.

Non-goals (deliberately left out)
- Auth on `/metrics`: operators restrict at the reverse proxy
- The `instance` label is enough; strategy-level views can be built in PromQL by joining with the request counter
- `/metrics` on a separate listener: single listener for now

Compatibility
Purely additive. No existing config or HTTP contract changes. The endpoint is opt-in (disabled by default), so existing deployments see no change in behavior or response surface. `prometheus/client_golang` was already a transitive dependency and is now promoted to a direct one.

Test plan
- `pkg/metrics` (recorder counters, histograms, ready-wait state machine, collector zero-value emission, HTTP handler exposition format): 12 tests
- `pkg/sablier`: verifying metrics calls in the request flow (success, failure, ready transition), using a fake `Recorder`
- `internal/server`: `/metrics` returns 200 when enabled, 404 when disabled, respects `base-path`
- `go vet` clean across all touched packages
- `pkg/sabliercmd`: couldn't build that package locally due to a missing system `gpgme` library

🤖 Generated with Claude Code
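The `internal/server` item in the test plan (200 when enabled, 404 when disabled, base path respected) can be approximated with the standard library alone. This sketch swaps `net/http` in for gin, and `newServer`, `status`, the `/sablier` base path, and the placeholder response body are all assumptions for illustration. The point is only that conditional route registration produces the 404 without any extra handler logic.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// newServer builds a mux for this sketch. Sablier itself uses gin; net/http
// is used here so the example has no third-party dependencies.
func newServer(basePath string, metricsEnabled bool) http.Handler {
	mux := http.NewServeMux()
	prefix := strings.TrimSuffix(basePath, "/")

	mux.HandleFunc(prefix+"/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	if metricsEnabled {
		// The route is only registered when enabled, so a disabled server
		// answers with the mux's default 404.
		mux.HandleFunc(prefix+"/metrics", func(w http.ResponseWriter, r *http.Request) {
			w.Header().Set("Content-Type", "text/plain; version=0.0.4")
			fmt.Fprintln(w, "# placeholder exposition body")
		})
	}
	return mux
}

// status performs a GET against the handler and returns the response code.
func status(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	fmt.Println(status(newServer("/", true), "/metrics"))                // 200
	fmt.Println(status(newServer("/", false), "/metrics"))               // 404
	fmt.Println(status(newServer("/sablier", true), "/sablier/metrics")) // 200
	fmt.Println(status(newServer("/sablier", true), "/metrics"))         // 404
}
```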