feat(observability): add Prometheus metrics endpoint#2494
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds an optional in-process Prometheus-compatible metrics system (registry, exporter, and HTTP /metrics server), instruments blueprint execution, API validation, and sandbox lifecycle, conditionally registers a plugin service when enabled, and updates tests and docs to cover behavior and security caveats. (≤50 words) Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Client
participant Plugin as "NemoClaw Plugin"
participant Registry as "MetricsRegistry (in-process)"
participant Server as "Metrics HTTP Server"
Client->>Plugin: perform action (plan/apply) or call API
Plugin->>Registry: observeOperation("blueprint_execution", labels, operation)
Registry->>Registry: increment counters / observe histograms (duration)
Plugin->>Registry: validateEndpointForMetrics(...)\nrecord api_validation
Client->>Server: GET /metrics
Server->>Registry: registry.renderPrometheus()
Registry-->>Server: Prometheus text payload
Server-->>Client: 200 + text/plain body
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemoclaw/eslint.config.mjs (1)
1-4:⚠️ Potential issue | 🟠 MajorAdd the required SPDX header to this source file.
This file is missing the repository-required license header block.
✅ Suggested patch
+// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +// SPDX-License-Identifier: Apache-2.0 + import tseslint from "@typescript-eslint/eslint-plugin"; import tsparser from "@typescript-eslint/parser"; import prettier from "eslint-config-prettier";As per coding guidelines
**/*.{js,mjs,ts,tsx,sh,md}: Every source file must include an SPDX license header.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemoclaw/eslint.config.mjs` around lines 1 - 4, Add the required repository SPDX license header at the top of this module (above the import statements) so the file includes an SPDX-License-Identifier line and any required copyright/owner line; update the top of eslint.config.mjs (the module containing the import statements tseslint, tsparser, and prettier) to include the standard project header block exactly as used by other source files.
🧹 Nitpick comments (1)
nemoclaw/src/observability/server.ts (1)
38-53: Consider restricting to GET method for the /metrics endpoint.The handler currently serves
/metricsregardless of HTTP method. While Prometheus typically uses GET, other methods (POST, DELETE, etc.) would also receive the metrics response. This is minor since it's an internal metrics endpoint, but restricting to GET is more semantically correct.♻️ Optional: Restrict to GET method
export function createMetricsRequestHandler(registry: MetricsRegistry) { return (req: IncomingMessage, res: ServerResponse): void => { const url = new URL(req.url ?? "/", `http://${req.headers.host ?? "localhost"}`); - if (url.pathname !== "/metrics") { + if (req.method !== "GET" || url.pathname !== "/metrics") { res.writeHead(404, { "Content-Type": "text/plain; charset=utf-8" }); res.end("not found\n"); return; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemoclaw/src/observability/server.ts` around lines 38 - 53, The createMetricsRequestHandler currently serves metrics for any HTTP method; update the returned handler to only allow GET: inspect req.method at the top of the function and if it's not "GET" respond with 405 Method Not Allowed (include an "Allow: GET" header and a short plain-text body like "method not allowed\n"), otherwise proceed to validate url.pathname and call registry.renderPrometheus() as before; modify the handler returned by createMetricsRequestHandler (which receives IncomingMessage and ServerResponse) to enforce this method check.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/monitoring/monitor-sandbox-activity.md`:
- Around line 84-85: Add a MyST warning admonition (:::{warning}) immediately
before the sentence that recommends changing NEMOCLAW_METRICS_HOST to enable
remote scraping, stating that the /metrics endpoint is unauthenticated and
exposing it beyond localhost may leak operational metadata; include scope (which
services/hosts or networks may be affected), risk (what kind of data could be
exposed), and remediation guidance (prefer scraping over a secure network, use
firewall rules, or bind to loopback and proxy/secure the endpoint) and reference
the environment variables NEMOCLAW_METRICS_HOST and NEMOCLAW_METRICS_PORT and
the /metrics path so readers know exactly which setting and endpoint the warning
applies to.
---
Outside diff comments:
In `@nemoclaw/eslint.config.mjs`:
- Around line 1-4: Add the required repository SPDX license header at the top of
this module (above the import statements) so the file includes an
SPDX-License-Identifier line and any required copyright/owner line; update the
top of eslint.config.mjs (the module containing the import statements tseslint,
tsparser, and prettier) to include the standard project header block exactly as
used by other source files.
---
Nitpick comments:
In `@nemoclaw/src/observability/server.ts`:
- Around line 38-53: The createMetricsRequestHandler currently serves metrics
for any HTTP method; update the returned handler to only allow GET: inspect
req.method at the top of the function and if it's not "GET" respond with 405
Method Not Allowed (include an "Allow: GET" header and a short plain-text body
like "method not allowed\n"), otherwise proceed to validate url.pathname and
call registry.renderPrometheus() as before; modify the handler returned by
createMetricsRequestHandler (which receives IncomingMessage and ServerResponse)
to enforce this method check.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ff0e53b1-a33c-4999-95bb-ecdb116461cf
📒 Files selected for processing (15)
.agents/skills/nemoclaw-user-monitor-sandbox/SKILL.mddocs/monitoring/monitor-sandbox-activity.mdnemoclaw/eslint.config.mjsnemoclaw/src/blueprint/runner.test.tsnemoclaw/src/blueprint/runner.tsnemoclaw/src/index.tsnemoclaw/src/observability/metrics.test.tsnemoclaw/src/observability/metrics.tsnemoclaw/src/observability/server-start.test.tsnemoclaw/src/observability/server.test.tsnemoclaw/src/observability/server.tsnemoclaw/src/register.test.tssrc/lib/preflight.test.tstest/legacy-path-guard.test.tsvitest.config.ts
There was a problem hiding this comment.
Pull request overview
This PR adds an opt-in Prometheus-compatible metrics endpoint to the NemoClaw plugin to improve observability of blueprint execution, endpoint validation, and sandbox lifecycle operations, while keeping metrics disabled by default.
Changes:
- Introduces a zero-dependency
MetricsRegistry(counters + cumulative histograms) and Prometheus text exposition. - Adds an optional
/metricsHTTP server (host/port configurable via env vars) and wires it into plugin registration. - Instruments blueprint plan/apply, endpoint SSRF validation, and sandbox creation; updates tests and monitoring docs accordingly.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
vitest.config.ts |
Gates installer/Brev test projects behind CI/credentials to keep local test runs stable. |
test/legacy-path-guard.test.ts |
Runs the TS guard script via node --import with the TSX loader for portability. |
src/lib/preflight.test.ts |
Makes real port-occupation tests tolerant of restricted environments that block loopback listeners. |
nemoclaw/src/register.test.ts |
Adds coverage for conditional metrics service registration and ensures env/registry cleanup between tests. |
nemoclaw/src/observability/metrics.ts |
Implements metrics registry, operation timing helpers, and Prometheus rendering. |
nemoclaw/src/observability/server.ts |
Adds the /metrics HTTP server and env-based bind resolution. |
nemoclaw/src/observability/server*.test.ts |
Covers request handling, bind resolution, startup/shutdown, and error paths. |
nemoclaw/src/index.ts |
Registers the optional metrics service when enabled. |
nemoclaw/src/blueprint/runner.ts |
Instruments blueprint execution, endpoint validation, and sandbox create lifecycle with metrics. |
nemoclaw/src/blueprint/runner.test.ts |
Adds assertions that runner paths emit expected metrics when enabled. |
nemoclaw/eslint.config.mjs |
Adjusts TypeScript ESLint project-service settings to keep lint viable with colocated tests. |
docs/monitoring/monitor-sandbox-activity.md |
Documents enabling/scraping the Prometheus metrics endpoint and example metrics. |
.agents/skills/nemoclaw-user-monitor-sandbox/SKILL.md |
Regenerates agent skill content to include the new metrics workflow. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
nemoclaw/src/observability/server.ts (1)
21-31: Port validation allows privileged ports (0-1023).This implementation accepts ports 0-65535, while
nemoclaw/src/lib/ports.ts:parsePort()restricts to 1024-65535. The difference is likely intentional here: port 0 enables OS-assigned ephemeral ports (useful for testing), and the 9090 default is unprivileged. However, if consistency with the existing port utility is desired, consider documenting why this differs or aligning the ranges.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemoclaw/src/observability/server.ts` around lines 21 - 31, resolveMetricsPort currently validates ports 0–65535 which permits privileged ports; update its validation to accept either 0 (for OS-assigned ephemeral ports used in tests) or the same unprivileged range used by parsePort (1024–65535) to keep behavior consistent; modify the check in resolveMetricsPort to allow parsed === 0 || (Number.isInteger(parsed) && parsed >= 1024 && parsed <= 65535), update the thrown Error message to reflect the new allowed range, and add a brief comment referencing parsePort and the reason 0 is allowed for clarity.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@nemoclaw/src/observability/server.ts`:
- Around line 21-31: resolveMetricsPort currently validates ports 0–65535 which
permits privileged ports; update its validation to accept either 0 (for
OS-assigned ephemeral ports used in tests) or the same unprivileged range used
by parsePort (1024–65535) to keep behavior consistent; modify the check in
resolveMetricsPort to allow parsed === 0 || (Number.isInteger(parsed) && parsed
>= 1024 && parsed <= 65535), update the thrown Error message to reflect the new
allowed range, and add a brief comment referencing parsePort and the reason 0 is
allowed for clarity.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 6331d7f7-45f1-4989-be5e-24d5d6f38e27
📒 Files selected for processing (9)
.agents/skills/nemoclaw-user-monitor-sandbox/SKILL.mddocs/monitoring/monitor-sandbox-activity.mdnemoclaw/eslint.config.mjsnemoclaw/src/index.tsnemoclaw/src/observability/metrics.test.tsnemoclaw/src/observability/metrics.tsnemoclaw/src/observability/server.test.tsnemoclaw/src/observability/server.tsnemoclaw/src/register.test.ts
✅ Files skipped from review due to trivial changes (1)
- docs/monitoring/monitor-sandbox-activity.md
🚧 Files skipped from review as they are similar to previous changes (3)
- .agents/skills/nemoclaw-user-monitor-sandbox/SKILL.md
- nemoclaw/src/observability/server.test.ts
- nemoclaw/src/register.test.ts
0fa4868 to
eb8d3fd
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
nemoclaw/src/index.ts (1)
335-369: Extract the metrics-service branch out ofregister().This block is self-contained, but keeping it inline makes
register()even harder to follow alongside provider probing and thebefore_tool_callhook wiring. Moving it to a smallregisterMetricsService(api)helper would keep the entrypoint under the repo’s complexity budget and make future observability changes safer to evolve.As per coding guidelines, TypeScript files must maintain cyclomatic complexity limit of 20, ratcheting down to 15.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemoclaw/src/index.ts` around lines 335 - 369, The metrics-service block inside register() is self-contained and should be moved to a small helper to reduce cyclomatic complexity: create a function registerMetricsService(api: Api) (or similar) that checks isMetricsEnabled(), calls api.registerService with the same id "nemoclaw-metrics" and the existing start/stop logic that uses startMetricsServer, MetricsServer and the logger, and then call that helper from register(); preserve error handling and the metricsServer variable scoping by keeping it internal to the helper so behavior is unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemoclaw/src/observability/metrics.ts`:
- Around line 210-216: The HELP text is inserted verbatim in renderPrometheus
which can produce invalid Prometheus exposition when metadata.help contains
backslashes or newlines; update renderPrometheus to escape HELP text by
replacing backslashes with double-backslashes and newlines with '\n' (or
otherwise escaping sequences Prometheus expects) before pushing `# HELP ${name}
${metadata.help}` so that the HELP line uses the escaped help string; target the
renderPrometheus method and the usage of metadata.help when building the lines
array to apply the escaping transformation.
In `@nemoclaw/src/observability/server.ts`:
- Around line 62-66: Wrap the call to registry.renderPrometheus() in a try/catch
inside the request handler so a thrown error does not escape: call
registry.renderPrometheus() in the try, send 200 headers and the body on
success, and on catch log the error (use the project logger like processLogger
or console.error) then respond with a 500 status and a safe plain-text error
body; ensure you don't send headers twice (i.e., only call res.writeHead/res.end
in the try or in the catch branch).
In `@test/legacy-path-guard.test.ts`:
- Line 12: The constant TSX_LOADER currently points to an internal file path;
change its value to the documented public module specifier by replacing the
path.join(REPO_ROOT, "node_modules", "tsx", "dist", "loader.mjs") assignment so
TSX_LOADER is simply "tsx" (update any code that imports/uses TSX_LOADER
accordingly), ensuring the module is resolved via package exports rather than a
hard-coded internal file path.
---
Nitpick comments:
In `@nemoclaw/src/index.ts`:
- Around line 335-369: The metrics-service block inside register() is
self-contained and should be moved to a small helper to reduce cyclomatic
complexity: create a function registerMetricsService(api: Api) (or similar) that
checks isMetricsEnabled(), calls api.registerService with the same id
"nemoclaw-metrics" and the existing start/stop logic that uses
startMetricsServer, MetricsServer and the logger, and then call that helper from
register(); preserve error handling and the metricsServer variable scoping by
keeping it internal to the helper so behavior is unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1a13a60d-cc91-443f-bd41-2feccd6e91d5
📒 Files selected for processing (15)
.agents/skills/nemoclaw-user-monitor-sandbox/SKILL.mddocs/monitoring/monitor-sandbox-activity.mdnemoclaw/eslint.config.mjsnemoclaw/src/blueprint/runner.test.tsnemoclaw/src/blueprint/runner.tsnemoclaw/src/index.tsnemoclaw/src/observability/metrics.test.tsnemoclaw/src/observability/metrics.tsnemoclaw/src/observability/server-start.test.tsnemoclaw/src/observability/server.test.tsnemoclaw/src/observability/server.tsnemoclaw/src/register.test.tssrc/lib/preflight.test.tstest/legacy-path-guard.test.tsvitest.config.ts
✅ Files skipped from review due to trivial changes (4)
- nemoclaw/eslint.config.mjs
- nemoclaw/src/observability/server.test.ts
- docs/monitoring/monitor-sandbox-activity.md
- .agents/skills/nemoclaw-user-monitor-sandbox/SKILL.md
🚧 Files skipped from review as they are similar to previous changes (4)
- nemoclaw/src/observability/metrics.test.ts
- vitest.config.ts
- src/lib/preflight.test.ts
- nemoclaw/src/blueprint/runner.ts
eb8d3fd to
df0181b
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemoclaw/src/blueprint/runner.ts (1)
293-307:⚠️ Potential issue | 🟡 MinorDouble validation with misleading
sourcelabel when CLI override is used.When
--endpoint-urlis provided, the endpoint is validated twice:
- Line 294: with
source="override"✓- Line 304: with
source="blueprint"— but this validates the same overridden URL, not the blueprint's original endpointThis results in misleading metrics and redundant validation work.
♻️ Suggested fix: skip second validation when override was applied
let inferenceCfg = { ...inferenceProfiles[profile] }; if (endpointUrl) { const validated = await validateEndpointForMetrics(endpointUrl, "override"); // Use DNS-pinned URL for HTTP (full SSRF/rebinding protection). For HTTPS, // keep the original hostname — TLS certificate validation prevents rebinding // since the attacker cannot present a valid cert for the target. const safe = endpointUrl.startsWith("https:") ? validated.url : validated.pinnedUrl; inferenceCfg = { ...inferenceCfg, endpoint: safe }; - } - - // Validate the final endpoint (whether from CLI override or blueprint profile) - if (inferenceCfg.endpoint) { + } else if (inferenceCfg.endpoint) { + // Validate the blueprint endpoint (skip if CLI override already validated) const validated = await validateEndpointForMetrics(inferenceCfg.endpoint, "blueprint"); const safe = inferenceCfg.endpoint.startsWith("https:") ? validated.url : validated.pinnedUrl; inferenceCfg = { ...inferenceCfg, endpoint: safe }; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemoclaw/src/blueprint/runner.ts` around lines 293 - 307, The code performs validateEndpointForMetrics twice when a CLI override (endpointUrl) is supplied, causing redundant validation and a misleading "blueprint" source; modify the logic around validateEndpointForMetrics and inferenceCfg so that when endpointUrl is provided and used to set inferenceCfg.endpoint you skip the subsequent "blueprint" validation—either by changing the second conditional to only run when no CLI override was applied (e.g., if (!endpointUrl && inferenceCfg.endpoint)) or by using a boolean flag (e.g., overrideApplied) set after validating endpointUrl; ensure references to validateEndpointForMetrics, inferenceCfg, endpointUrl, and the "override"/"blueprint" source labels are adjusted so metrics reflect the true source and the endpoint is validated only once.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@vitest.config.ts`:
- Around line 10-11: The current boolean runBrevE2e var is too permissive;
change its logic to require both the secret and an explicit opt-in (or CI) so
cloud E2E won't run just because the token exists—replace the definition of
runBrevE2e with a compound check like: token present AND (explicit opt-in env
var e.g. BREV_E2E === '1' OR process.env.CI === 'true'), and apply the same
stricter gating to the other related checks around the e2e-brev suite referenced
in the block at lines 49-59 (update any uses of runBrevE2e or duplicate token
checks to use this new opt-in+token logic).
---
Outside diff comments:
In `@nemoclaw/src/blueprint/runner.ts`:
- Around line 293-307: The code performs validateEndpointForMetrics twice when a
CLI override (endpointUrl) is supplied, causing redundant validation and a
misleading "blueprint" source; modify the logic around
validateEndpointForMetrics and inferenceCfg so that when endpointUrl is provided
and used to set inferenceCfg.endpoint you skip the subsequent "blueprint"
validation—either by changing the second conditional to only run when no CLI
override was applied (e.g., if (!endpointUrl && inferenceCfg.endpoint)) or by
using a boolean flag (e.g., overrideApplied) set after validating endpointUrl;
ensure references to validateEndpointForMetrics, inferenceCfg, endpointUrl, and
the "override"/"blueprint" source labels are adjusted so metrics reflect the
true source and the endpoint is validated only once.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: dfc5eb79-5961-4739-8586-7ec4c528008c
📒 Files selected for processing (15)
.agents/skills/nemoclaw-user-monitor-sandbox/SKILL.mddocs/monitoring/monitor-sandbox-activity.mdnemoclaw/eslint.config.mjsnemoclaw/src/blueprint/runner.test.tsnemoclaw/src/blueprint/runner.tsnemoclaw/src/index.tsnemoclaw/src/observability/metrics.test.tsnemoclaw/src/observability/metrics.tsnemoclaw/src/observability/server-start.test.tsnemoclaw/src/observability/server.test.tsnemoclaw/src/observability/server.tsnemoclaw/src/register.test.tssrc/lib/preflight.test.tstest/legacy-path-guard.test.tsvitest.config.ts
✅ Files skipped from review due to trivial changes (3)
- docs/monitoring/monitor-sandbox-activity.md
- nemoclaw/src/register.test.ts
- nemoclaw/src/observability/metrics.ts
🚧 Files skipped from review as they are similar to previous changes (3)
- nemoclaw/src/observability/server.test.ts
- nemoclaw/src/blueprint/runner.test.ts
- nemoclaw/src/observability/server-start.test.ts
debff63 to
ba186cf
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemoclaw/src/blueprint/runner.ts (1)
293-305:⚠️ Potential issue | 🟠 MajorDon’t count an override endpoint as blueprint validation.
With
--endpoint-url, Line 294 recordssource="override", then Lines 303-305 immediately revalidate the same effective endpoint assource="blueprint". That doublesapi_validationand mislabels override traffic, which will skew the success-rate metrics this PR is adding.Suggested fix
let inferenceCfg = { ...inferenceProfiles[profile] }; if (endpointUrl) { const validated = await validateEndpointForMetrics(endpointUrl, "override"); // Use DNS-pinned URL for HTTP (full SSRF/rebinding protection). For HTTPS, // keep the original hostname — TLS certificate validation prevents rebinding // since the attacker cannot present a valid cert for the target. const safe = endpointUrl.startsWith("https:") ? validated.url : validated.pinnedUrl; inferenceCfg = { ...inferenceCfg, endpoint: safe }; - } - - // Validate the final endpoint (whether from CLI override or blueprint profile) - if (inferenceCfg.endpoint) { + } else if (inferenceCfg.endpoint) { const validated = await validateEndpointForMetrics(inferenceCfg.endpoint, "blueprint"); const safe = inferenceCfg.endpoint.startsWith("https:") ? validated.url : validated.pinnedUrl; inferenceCfg = { ...inferenceCfg, endpoint: safe }; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemoclaw/src/blueprint/runner.ts` around lines 293 - 305, The code is revalidating an endpoint override as if it came from a blueprint, causing duplicate/mislabelled metrics; update the block that calls validateEndpointForMetrics for the final endpoint so it only runs when there was no CLI override (i.e., only when endpointUrl is falsy). Concretely, in runner.ts adjust the condition around validateEndpointForMetrics/inferenceCfg.endpoint so that if endpointUrl is provided you skip the second call (or otherwise keep using the already-validated value), referencing validateEndpointForMetrics, endpointUrl, and inferenceCfg.endpoint to locate and change the logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemoclaw/src/blueprint/runner.ts`:
- Around line 339-340: The metrics call in metrics.observeOperation that creates
the "blueprint_execution" series is using the user-controlled label profile
which causes unbounded cardinality; remove profile from the label set passed to
metrics.observeOperation (keep only bounded labels like action and status) where
observeOperation("blueprint_execution", { action: "plan", profile }, () =>
actionPlanImpl(...)) is invoked, and instead ensure the profile value remains
recorded in logs or the plan output (e.g., add it to existing logging around
actionPlanImpl or the returned plan) so you do not lose the profile context.
In `@test/cli.test.ts`:
- Line 310: The per-test Jest timeout is still 15s while runWithEnv("debug
--quick 2>&1", { HOME: home }, 30000) (and the similar call at the other
location) allows the subprocess 30s; update the enclosing tests to increase
their Jest timeout to at least 30000ms (either by adding a timeout argument to
the test(...) call or calling jest.setTimeout(30000) for these tests) so the
test harness won't abort early when runWithEnv takes up to 30s.
In `@test/http-proxy-fix-e2e.test.ts`:
- Around line 274-275: Remove the invalid eslint-disable comments and eliminate
the non-null assertions by adding explicit null checks for the mock object
before accessing its properties (replace occurrences of "mock!.port" and other
"mock!" usages). For each place that currently uses mock! (referenced as mock
and its port property), remove the eslint comment and add a guard such as if
(!mock) throw new Error("expected mock to be initialized") or use a test
assertion like expect(mock).toBeDefined() before accessing mock.port so the code
no longer relies on the non-null assertion operator.
---
Outside diff comments:
In `@nemoclaw/src/blueprint/runner.ts`:
- Around line 293-305: The code is revalidating an endpoint override as if it
came from a blueprint, causing duplicate/mislabelled metrics; update the block
that calls validateEndpointForMetrics for the final endpoint so it only runs
when there was no CLI override (i.e., only when endpointUrl is falsy).
Concretely, in runner.ts adjust the condition around
validateEndpointForMetrics/inferenceCfg.endpoint so that if endpointUrl is
provided you skip the second call (or otherwise keep using the already-validated
value), referencing validateEndpointForMetrics, endpointUrl, and
inferenceCfg.endpoint to locate and change the logic.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 9898b24e-fbdb-461b-80b4-c9b663f06837
📒 Files selected for processing (18)
.agents/skills/nemoclaw-user-monitor-sandbox/SKILL.mddocs/monitoring/monitor-sandbox-activity.mdnemoclaw/eslint.config.mjsnemoclaw/src/blueprint/runner.test.tsnemoclaw/src/blueprint/runner.tsnemoclaw/src/index.tsnemoclaw/src/observability/metrics.test.tsnemoclaw/src/observability/metrics.tsnemoclaw/src/observability/server-start.test.tsnemoclaw/src/observability/server.test.tsnemoclaw/src/observability/server.tsnemoclaw/src/register.test.tssrc/lib/preflight.test.tstest/cli.test.tstest/http-proxy-fix-e2e.test.tstest/legacy-path-guard.test.tstest/runtime-shell.test.tsvitest.config.ts
✅ Files skipped from review due to trivial changes (3)
- nemoclaw/eslint.config.mjs
- docs/monitoring/monitor-sandbox-activity.md
- src/lib/preflight.test.ts
🚧 Files skipped from review as they are similar to previous changes (4)
- nemoclaw/src/observability/server.test.ts
- nemoclaw/src/observability/server-start.test.ts
- .agents/skills/nemoclaw-user-monitor-sandbox/SKILL.md
- vitest.config.ts
Signed-off-by: Ho Lim <subhoya@gmail.com>
ba186cf to
b0fa6ac
Compare
|
✨ Thanks for submitting this pull request that proposes a way to add observability metrics to NemoClaw. This identifies an enhancement to the plugin's capabilities and proposes a change to include a Prometheus metrics endpoint. Related open issues: |
|
Thanks for the implementation. We do not think we want NemoClaw core to accumulate one-off integrations/exporters like Prometheus directly. The direction we prefer is a generic observability plugin extension point where core emits stable lifecycle/operation events, and external plugins can provide adapters such as Prometheus, OpenTelemetry, Datadog, etc. We are going to close this PR in favor of tracking that architecture explicitly. A Prometheus adapter would be a good first consumer once that extension point exists, but we do not think the Prometheus HTTP endpoint should live directly in the main repo/core path. |
|
Closing in favor of #3915, which tracks the generic observability plugin support path. |
Summary
Adds an opt-in, zero-dependency Prometheus metrics endpoint for NemoClaw plugin observability. The metrics path is disabled by default and records blueprint execution, endpoint validation, and sandbox lifecycle timings only when
NEMOCLAW_METRICS_ENABLED=true.Related Issue
Closes #233
Changes
MetricsRegistrywith counters, cumulative histograms, Prometheus exposition formatting, andobserveOperationhelpers./metricsHTTP service from the plugin when metrics are enabled, withNEMOCLAW_METRICS_PORTandNEMOCLAW_METRICS_HOSToverrides.plan/apply, endpoint validation, and sandbox create lifecycle paths without changing default behavior.npm teststable by registering installer and Brev projects only when their CI/credential gates are present, and by making existing network-listener tests tolerate sandboxed environments that reject loopback listeners.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesmake docsbuilds without warnings (doc changes only)AI Disclosure
Signed-off-by: Ho Lim subhoya@gmail.com
Summary by CodeRabbit
New Features
Documentation
Tests