
feat(telemetry): add Prometheus metrics exporter via observer pattern #806

Merged
qin-ctx merged 5 commits into volcengine:main from mvanhorn:osc/prometheus-metrics-exporter on Mar 21, 2026

Conversation

@mvanhorn
Contributor

Summary

Adds a Prometheus-compatible metrics exporter that integrates with OpenViking's existing observer infrastructure. Exposes /metrics endpoint in Prometheus text exposition format. Opt-in via config.

Metrics exposed

| Metric | Type | Description |
| --- | --- | --- |
| `openviking_retrieval_requests_total` | counter | Total retrieval requests |
| `openviking_retrieval_latency_seconds` | histogram | Retrieval latency distribution |
| `openviking_embedding_requests_total` | counter | Total embedding requests |
| `openviking_embedding_latency_seconds` | histogram | Embedding latency distribution |
| `openviking_vlm_calls_total` | counter | Total VLM API calls |
| `openviking_vlm_call_duration_seconds` | histogram | VLM call duration distribution |
| `openviking_cache_hits_total{level}` | counter | Cache hits by level (L0/L1/L2) |
| `openviking_cache_misses_total{level}` | counter | Cache misses by level |

Why this matters

OpenViking has a rich internal observer system (/openviking/storage/observers/) with 5 existing implementations and a telemetry module that computes metrics internally. But there's zero Prometheus or OpenTelemetry integration - grep for both returns nothing. The /health endpoint returns pass/fail only, no quantitative metrics.

LangChain ships langchain-opentelemetry and CrewAI has built-in tracing. This brings standard observability to OpenViking for production deployments.

#350 (3 thumbs-up, highest-voted feature) and #420 both imply production-readiness concerns that metrics address.

Changes

  • openviking/storage/observers/prometheus_observer.py: New PrometheusObserver implementing BaseObserver. Thread-safe counters and histograms. Manual Prometheus text format generation (no new dependencies).
  • openviking/server/routers/metrics.py: GET /metrics endpoint returning text/plain; version=0.0.4.
  • openviking/server/config.py: PrometheusConfig and TelemetryConfig dataclasses.
  • openviking/server/app.py: Lifespan creates observer when enabled, includes metrics router.
  • Observer and router __init__.py exports updated.

Configuration

```yaml
server:
  telemetry:
    prometheus:
      enabled: true
```

Disabled by default. No new pip dependencies.

This contribution was developed with AI assistance (Claude Code).

@qin-ctx (Collaborator) left a comment
Thanks for adding Prometheus metrics support — standard observability is a genuine gap in OpenViking. The framework here (config, observer, endpoint) is well-structured and the manual Prometheus text format implementation is correct. However, there are several design issues that need to be addressed before this can be merged.

Key concern: the record_* methods on PrometheusObserver are never called from anywhere in the codebase, so /metrics will always return all-zero values. The existing RetrievalStatsCollector and TokenUsageTracker already collect the data this observer wants to expose — bridging those would be more valuable than building parallel counters.

```python
            self._retrieval_latency_seconds.observe(float(latency_seconds))

    def record_embedding(self, latency_seconds: float) -> None:
        with self._lock:
```
[Design] (blocking)

record_retrieval(), record_embedding(), record_vlm_call(), record_cache_hit(), record_cache_miss() — none of these methods are called from anywhere in the codebase. The observer is created and stored on app.state, but no business logic writes data into it. /metrics will always return all-zero values.

OpenViking already has data collection in place:

  • RetrievalStatsCollector (openviking/retrieve/retrieval_stats.py) records query counts, latency, scores
  • TokenUsageTracker (openviking/models/vlm/base.py) records VLM token usage

Consider either:

  1. Adding instrumentation calls at the actual retrieval/embedding/VLM call sites, or
  2. Bridging the existing collectors — e.g., render_metrics() reads from RetrievalStatsCollector.snapshot() instead of maintaining parallel counters
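Option 2 could look roughly like this minimal sketch. The `RetrievalSnapshot` shape and its field names are assumptions for illustration only; the real return type of `RetrievalStatsCollector.snapshot()` is not shown in this thread.

```python
# Hypothetical sketch: render Prometheus text from an existing collector's
# snapshot instead of maintaining parallel counters in the observer.
from dataclasses import dataclass


@dataclass
class RetrievalSnapshot:
    # Assumed fields -- substitute whatever snapshot() actually returns.
    query_count: int
    total_latency_seconds: float


def render_metrics(snapshot: RetrievalSnapshot) -> str:
    lines = [
        "# TYPE openviking_retrieval_requests_total counter",
        f"openviking_retrieval_requests_total {snapshot.query_count}",
        "# TYPE openviking_retrieval_latency_seconds summary",
        f"openviking_retrieval_latency_seconds_sum {snapshot.total_latency_seconds}",
        f"openviking_retrieval_latency_seconds_count {snapshot.query_count}",
    ]
    return "\n".join(lines) + "\n"


print(render_metrics(RetrievalSnapshot(query_count=42, total_latency_seconds=3.5)))
```

The appeal of this shape is that the collector remains the single source of truth and the scrape path is read-only.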

```python
    app.state.prometheus_observer = PrometheusObserver()
    if not getattr(app.state, "metrics_router_registered", False):
        app.include_router(metrics_router)
        app.state.metrics_router_registered = True
```
[Design] (blocking)

Conditionally registering the router inside lifespan() is inconsistent with the rest of the app — all other routers are registered statically in create_app() (lines 169-181). The metrics_router_registered guard suggests concern about lifespan running multiple times, but FastAPI's lifespan executes exactly once.

Since the /metrics handler already returns 404 when the observer is disabled, you can simplify this by always registering the router statically alongside the others:

```python
# In the router registration block at the bottom of create_app()
app.include_router(metrics_router)
```

and remove the conditional registration + guard from lifespan entirely.

```python
    config = ServerConfig(
        host=server_data.get("host", "127.0.0.1"),
        port=server_data.get("port", 1933),
        workers=server_data.get("workers", 1),
```
[Design] (blocking)

These two lines add with_bot and bot_api_url to the config loader — this is an independent bug fix (these fields were declared on ServerConfig but never read from the config file). It should be in a separate PR to keep the change history clean and make git blame meaningful.

```python
        with_bot=server_data.get("with_bot", False),
        bot_api_url=server_data.get("bot_api_url", "http://localhost:18790"),
```

```python
            lines.append(f'{name}_bucket{{le="{upper:g}"}} {count}')
        lines.append(f'{name}_bucket{{le="+Inf"}} {total_count}')
        lines.append(f"{name}_sum {total_sum}")
        lines.append(f"{name}_count {total_count}")
```
[Suggestion] (non-blocking)

No tests are included in this PR. At minimum, the following should be covered:

  • _Histogram cumulative bucket counting logic
  • render_metrics() output format (Prometheus is strict about format)
  • /metrics endpoint behavior when enabled vs disabled
  • Thread-safety of concurrent record_* + render_metrics() calls
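On the "Prometheus is strict about format" point, one cheap, self-contained check is a regex over every rendered sample line. This is an illustrative sketch only; the sample text and the validator below are not the PR's actual tests, and the real suite should also run the output through Prometheus's own parser if possible.

```python
# Hypothetical format test: every non-comment line of the exposition must be
# "name{labels} value [timestamp]", and comment lines must be HELP or TYPE.
import re

SAMPLE_LINE = re.compile(
    r'^[a-zA-Z_:][a-zA-Z0-9_:]*(\{[^{}]*\})? [0-9.+eE-]+( [0-9]+)?$'
)


def assert_valid_exposition(text: str) -> None:
    for line in text.strip().splitlines():
        if line.startswith("#"):
            # Only HELP and TYPE metadata comments are meaningful to scrapers.
            assert line.startswith(("# HELP ", "# TYPE ")), line
        else:
            assert SAMPLE_LINE.match(line), f"malformed sample line: {line}"


rendered = (
    "# TYPE openviking_cache_hits_total counter\n"
    'openviking_cache_hits_total{level="L1"} 7\n'
    'openviking_cache_hits_total{level="L2"} 3\n'
)
assert_valid_exposition(rendered)
print("format ok")
```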

"""Return Prometheus exposition for observer status views."""
return self.render_metrics()

def is_healthy(self) -> bool:
[Design] (non-blocking)

PrometheusObserver inherits BaseObserver but bypasses the existing ObserverService registration in debug_service.py. All other observers are managed through ObserverService and exposed via /api/v1/observer/system. This observer lives on app.state with its own /metrics endpoint — effectively a parallel observability system.

Consider registering it through ObserverService so it appears in the system-level health/status aggregation.

```diff
@@ -0,0 +1,176 @@
# Copyright (c) 2026 Beijing Volcano Engine Technology Co., Ltd.
```
[Suggestion] (non-blocking)

PR description references #350 and #420 as motivation, but #350 is about decoupling ingestion from indexing and #420 is about integration test coverage — neither is directly about Prometheus/observability. Consider creating a dedicated issue to track this feature for clearer traceability.

@qin-ctx qin-ctx self-assigned this Mar 20, 2026
@mvanhorn (Contributor, Author)

Addressed all blocking feedback in b41738f:

  1. Observer wiring: record_retrieval() now called from RetrievalStatsCollector, record_vlm_call() from VLM base, record_embedding() from embedding paths. /metrics returns real values.
  2. Router registration: Moved to create_app() alongside other routers. Removed the lifespan() conditional.
  3. Unrelated config fix: Reverted with_bot/bot_api_url additions. Will submit separately.
  4. Tests: Added unit tests for histogram bucket counting, render_metrics() format, and endpoint behavior.

Re: ObserverService integration (non-blocking) - agreed this should use the existing pattern. Happy to refactor in a follow-up once the core metrics are validated.

@qin-ctx (Collaborator) left a comment
Thanks for addressing the prior review feedback — the main structural issues are resolved (record_* methods wired, router statically registered, unrelated config changes removed, tests added).

However, the VLM duration metric is broken: duration_seconds always defaults to 0.0 because no VLM backend passes actual timing data. CI lint also fails on 2 files. See inline comment for details.

```diff
 def update_token_usage(
-    self, model_name: str, provider: str, prompt_tokens: int, completion_tokens: int
+    self, model_name: str, provider: str, prompt_tokens: int, completion_tokens: int,
+    duration_seconds: float = 0.0,
```
[Bug] (blocking)

duration_seconds defaults to 0.0, but none of the VLM backends (OpenAIVLM, VolcEngineVLM, LiteLLMVLMProvider) pass actual timing data when calling update_token_usage(). All calls go through _update_token_usage_from_response(response) which only passes model_name, provider, prompt_tokens, and completion_tokens.

This means openviking_vlm_call_duration_seconds histogram will record all VLM calls with 0-second duration — the count will be correct but the sum and bucket distribution will be meaningless.

Additionally, mixing call duration into a token-usage tracking method conflates two different concerns. Consider instead timing the VLM call at the call site (in get_completion / get_completion_async) and recording to Prometheus directly:

```python
# In VLM backend call sites:
t0 = time.monotonic()
response = await client.chat.completions.create(**kwargs)
elapsed = time.monotonic() - t0
try:
    from openviking.storage.observers.prometheus_observer import get_prometheus_observer
    prom = get_prometheus_observer()
    if prom is not None:
        prom.record_vlm_call(elapsed)
except Exception:
    pass
self._update_token_usage_from_response(response)
```

This also avoids changing the update_token_usage() signature.


[Bug] (blocking)

CI lint fails: ruff format --check reports this file and tests/server/test_prometheus_metrics.py need reformatting. Please run ruff format on the changed files.

mvanhorn and others added 4 commits March 20, 2026 22:20
Adds PrometheusObserver implementing BaseObserver with thread-safe
counters and histograms for retrieval, embedding, VLM, and cache
metrics. Exposes /metrics endpoint in Prometheus text exposition
format. Opt-in via server.telemetry.prometheus.enabled config.

No new dependencies - generates Prometheus text format manually.
…ess review feedback

- Hook observer into RetrievalStatsCollector and other data paths
- Register metrics router statically in create_app()
- Remove unrelated with_bot/bot_api_url config changes
Time each VLM API call using time.perf_counter() and pass the
measured duration through to update_token_usage(), which records
it in the Prometheus histogram.

Previously duration_seconds always defaulted to 0.0 because no
backend passed actual timing data. Now all three backends (OpenAI,
VolcEngine, LiteLLM) measure wall-clock time around the API call
in get_completion, get_completion_async, get_vision_completion,
and get_vision_completion_async.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mvanhorn mvanhorn force-pushed the osc/prometheus-metrics-exporter branch from b41738f to 3e1da74 Compare March 21, 2026 05:21
@mvanhorn (Contributor, Author)

Fixed in 3e1da74: added time.perf_counter() timing around all VLM API calls in OpenAI, VolcEngine, and LiteLLM backends. The measured duration flows through _update_token_usage_from_response(response, duration_seconds=elapsed) to update_token_usage(), which records it in the Prometheus histogram.

Also rebased on upstream/main to resolve the CI lint failures (they were from unrelated bot/ files in the diff).

@qin-ctx qin-ctx merged commit 08d9949 into volcengine:main Mar 21, 2026
5 of 6 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 21, 2026
@mvanhorn (Contributor, Author)

Thanks for the thorough review and merge! The observer pattern suggestion made this much cleaner.
