
feat(API): add /diagnostics endpoint for system-wide debug snapshot #1627

Merged
TheLastCicada merged 17 commits into v2-rc2 from feat/diagnostics-endpoint
May 15, 2026
Conversation

TheLastCicada (Contributor) commented May 12, 2026

Summary

Adds a new top-level GET /diagnostics endpoint that returns a single JSON object summarizing CADT, Chia, DataLayer, and host machine state — intended for sysadmins and users debugging a CADT install. Mounted as a sibling of /health (not under /v1 or /v2); branched off v2-rc2.

What the response includes

  • CADT: version, config directory + file, V1/V2 database paths, datalayer URL + file-server URL, simulator flag, per-version { enabled, readOnly, isGovernanceBody, apiKeyConfigured, governanceBodyId, homeOrgId }.
  • Chia network: configured (from CADT config) vs actual (from wallet get_network_info) and a matches flag.
  • Wallet: rpcUrl, reachable, connectionError (string when unreachable, including for AggregateErrors with empty messages), synced, balanceXch, pending-transactions block (reuses wallet-health.js shaping), trusted-peer cross-reference against wallet.trusted_peers from chia's config.yaml.
  • Full node: best-effort mTLS call to local full-node RPC (get_blockchain_state + get_connections) — degrades to reachable: false if the certs aren't present.
  • DataLayer: reachability + subscriptions list with per-store generation/target-generation and a synced flag (Number.isFinite guard so undefined === undefined doesn't falsely report synced); bounded-concurrency worker pool with a 30 s wall-clock budget that emits truncated: true if cut off.
  • Services flags: aggregate walletReachable / fullNodeReachable / datalayerReachable.
  • chia-tools probe: chia-tools --version (then version as fallback) with a 5 s timeout; distinguishes "not installed" from "installed but broken".
  • Process scan: POSIX ps -eo scan for chia binaries with multi-version detection; reports supported: false on Windows.
  • System: platform/arch, CPU model + cores (via os.cpus()), total/free RAM, free/total disk on the partition holding ${CHIA_ROOT} (via fs.promises.statfs).
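
The Number.isFinite guard on the per-store synced flag can be illustrated with a minimal sketch. The helper name `isStoreSynced` and the field names `generation` and `targetGeneration` are assumptions for illustration, not taken from the PR's code.

```javascript
// Hypothetical helper; the field names are assumed, not copied from the PR.
function isStoreSynced(store) {
  const { generation, targetGeneration } = store;
  // Without the guard, undefined === undefined would report a store as
  // synced even though neither value was ever fetched.
  return (
    Number.isFinite(generation) &&
    Number.isFinite(targetGeneration) &&
    generation === targetGeneration
  );
}

console.log(isStoreSynced({ generation: 7, targetGeneration: 7 })); // true
console.log(isStoreSynced({ generation: 6, targetGeneration: 7 })); // false
console.log(isStoreSynced({})); // false
```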

Design properties

  • Degrades gracefully: every external call is wrapped in settle(label, producer, timeoutMs) that races against a hard wall-clock deadline. Failures become { ok: false, error } rather than throwing; one wedged subsystem can't block the rest of the response. Worst-case wall-clock ~30 s; healthy responses come back in well under a second.
  • Survives "the wallet is down" / "migrations are slow": /diagnostics lives in HEALTH_ENDPOINTS, and isHealthEndpoint skips were added to the wallet-synced, home-org-synced, and all-data-synced header middlewares so the endpoint isn't blocked by the wallet RPC's 300 s socket timeout or by waitForMigrations() when the DB layer itself is the broken subsystem.
  • Auth: enforced by the existing global API-key middleware. No duplicate check needed.
  • READ_ONLY: response is reduced to non-sensitive public fields (CADT version, configured network, system info, chia-tools probe, process scan) and short-circuits before the wallet/datalayer/full-node RPC fan-out — matches the precedent in src/routes/wallet-health.js so a public observer node doesn't make per-request authenticated wallet RPC calls on unauthenticated public hits.
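
The settle(label, producer, timeoutMs) pattern described above might look roughly like the following. This is a sketch under assumptions, not the PR's implementation: the `{ ok: true, value }` success shape and the error-message formatting are guesses based on the description.

```javascript
// Sketch of a settle() wrapper: race a producer against a hard deadline and
// convert every failure into a plain result object instead of throwing.
async function settle(label, producer, timeoutMs) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${timeoutMs} ms`)),
      timeoutMs,
    );
  });
  try {
    // Promise.resolve().then(producer) turns a synchronous throw into a rejection.
    const value = await Promise.race([Promise.resolve().then(producer), deadline]);
    return { ok: true, value };
  } catch (error) {
    // Falls back to String(error) when the message is empty, as can happen
    // with an AggregateError.
    return { ok: false, error: error?.message || String(error) };
  } finally {
    clearTimeout(timer);
  }
}

// Usage: a slow or broken call cannot block its healthy siblings.
async function demo() {
  const [fast, slow, broken] = await Promise.all([
    settle('fast', async () => 42, 100),
    settle('slow', () => new Promise((resolve) => setTimeout(resolve, 200)), 20),
    settle('broken', () => { throw new Error('boom'); }, 100),
  ]);
  console.log(fast, slow, broken);
}
demo();
```

Note that a timed-out producer is not cancelled in this sketch; it is only ignored. Real cancellation would need something like an AbortSignal passed into the producer.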

Files

| File | Purpose |
| --- | --- |
| src/routes/diagnostics.js | Response builder, settle, collectSubscriptions, buildTrustedPeerView, normalizeNodeId |
| src/utils/system-info.js | CPU/RAM (os) + disk (fs.promises.statfs) |
| src/utils/chia-process-scan.js | POSIX ps scan + chia-binary regex |
| src/utils/chia-tools-probe.js | chia-tools --version / version fallback, extractVersion regex |
| src/datalayer/fullNodeRpc.js | mTLS client for get_blockchain_state / get_connections |
| src/datalayer/wallet.js | adds getWalletConnections() for trusted-peer cross-reference |
| src/middleware.js | mounts GET /diagnostics, adds it to HEALTH_ENDPOINTS, adds health-endpoint skip to three header middlewares |
| tests/v2/integration/diagnostics.spec.js | HTTP integration tests (22 cases) |
| tests/v2/integration/diagnostics-helpers.spec.js | Pure unit tests for the helpers (24 cases) |
| tests/v2/live-api/wallet-health.live.spec.js | Live-API smoke test against a real CADT install |

Test plan

  • npm run test:v1 — 146 passing, 0 failing
  • npm run test:v2 — 1619 passing, 0 failing (includes 46 new diagnostics tests)
  • Pre-push multi-persona review (10 reviewers in parallel: senior-engineer, simplicity/reuse, scope/tone/correctness, domain-context, root-cause/layer, concurrency/cancellation, protocol/schema, observability/diagnostics, test-quality/coverage, generic pre-push-diff-review). All critical findings addressed; warnings/info items either fixed or deliberately deferred with rationale.
  • Live-API smoke check via npm run test:v2:live:wallet-health against a real CADT + Chia install (runs in CI; the live test added a /diagnostics describe block that asserts response shape, types, and reasonable value ranges without pinning specific environment values).

Matrix coverage (READ_ONLY × API key configured × key provided)

| READ_ONLY | API key | Key in request | Expected | Test |
| --- | --- | --- | --- | --- |
| false | no | n/a | 200 full | basic-shape tests |
| false | yes | correct | 200 full | accepts requests with the correct x-api-key header |
| false | yes | wrong (length mismatch) | 403 | rejects requests with a wrong-length x-api-key header |
| false | yes | wrong (equal length) | 403 | rejects requests with a wrong-but-equal-length x-api-key header |
| false | yes | none | 403 | rejects unauthenticated requests when CADT_API_KEY is configured |
| true | no | n/a | 200 reduced | public observer node (READ_ONLY=true, no API key): returns 200 with reduced fields without auth |
| true | yes | correct | 200 reduced | READ_ONLY=true + API key configured + correct key: returns 200 with reduced fields |
| true | yes | wrong | 403 | READ_ONLY=true + API key configured + wrong key: rejects with 403 |
| true | yes | none | 403 | READ_ONLY=true + API key configured + no key provided: rejects with 403 |
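
The matrix tests wrong-length and equal-length keys separately. One plausible reason, assumed here and not confirmed anywhere in this PR, is that the key comparison is constant-time: Node's crypto.timingSafeEqual throws when its inputs differ in length, so the two cases exercise different code paths. A minimal sketch with a hypothetical `apiKeyMatches` helper:

```javascript
const { timingSafeEqual } = require('node:crypto');

// Hypothetical helper; CADT's actual API-key middleware may differ.
function apiKeyMatches(provided, expected) {
  if (typeof provided !== 'string' || typeof expected !== 'string') return false;
  const a = Buffer.from(provided);
  const b = Buffer.from(expected);
  // timingSafeEqual throws on a length mismatch, so reject that case first.
  if (a.length !== b.length) return false;
  return timingSafeEqual(a, b);
}

console.log(apiKeyMatches('secret', 'secret')); // true
console.log(apiKeyMatches('short', 'secret')); // false (length mismatch)
console.log(apiKeyMatches('secreT', 'secret')); // false (equal length)
```

In an ES module the same function would be imported with `import { timingSafeEqual } from 'node:crypto'`.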

Manual smoke check (suggested for reviewers)

# Healthy install
curl -s -H "x-api-key: $CADT_API_KEY" http://localhost:31310/diagnostics | jq .

# With wallet stopped (verify graceful degradation)
chia stop wallet
curl -s -H "x-api-key: $CADT_API_KEY" http://localhost:31310/diagnostics | jq '.chia.wallet'
# Expect: { reachable: false, connectionError: "Wallet RPC at ... did not respond", ... }
chia start wallet

Note

Medium Risk
Adds a new unauthenticated-by-default (unless API key configured) diagnostics surface that executes local process/CLI probes and multiple RPC calls; mistakes could leak sensitive operational data or add startup/runtime latency despite timeouts and read-only gating.

Overview
Adds a new top-level GET /diagnostics endpoint that returns a single JSON snapshot of CADT config/state plus Chia wallet/full-node/datalayer reachability and host system metrics, with per-section ok/warning/critical status aggregation and hard timeouts to degrade gracefully.

To support this, introduces new helpers for OS info (system-info.js), local Chia process scanning (chia-process-scan.js), chia-tools detection (chia-tools-probe.js), and a minimal mTLS full-node RPC client (fullNodeRpc.js), plus extends wallet.js with get_version and get_connections helpers.

Middleware now treats /diagnostics as a health-style endpoint (bypassing rate limits, startup gates, and synced-header probes) while explicitly blocking the route when READ_ONLY is enabled; startup additionally logs a fire-and-forget diagnostics snapshot after migrations complete. Tests add broad unit/integration/live-api coverage for response shape, auth gating, and helper edge cases.

Reviewed by Cursor Bugbot for commit 28b0a1e. Bugbot is set up for automated code reviews on this repo.

Adds GET /diagnostics, a single JSON endpoint that summarizes the running
CADT, Chia, DataLayer, and host machine state. The endpoint is intended for
sysadmins debugging a CADT install: it reports CADT version, configured /
actual chia network, wallet/full-node/datalayer reachability + sync status,
wallet balance, trusted-peer cross-reference, DataLayer subscriptions with
per-store sync status, governance body IDs, V1/V2 home org IDs, datalayer
URLs, CADT config + database paths, CPU/RAM/disk numbers, a chia process
scan, and a chia-tools probe.

The endpoint is mounted on the root app (not /v1 or /v2) and lives in
HEALTH_ENDPOINTS so it bypasses the rate limiter, startup gates, and the
chia/datalayer assertions -- the whole point of a diagnostics endpoint is
to be useful when those subsystems are broken. Every external call goes
through a settle() wrapper with a per-call timeout and Promise.all fan-out,
so one slow or wedged RPC can't block the rest of the response. Worst-case
wall-clock is ~30s (subscription enumeration budget); healthy responses
come back in well under a second.
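
The subscription enumeration budget mentioned above pairs a bounded-concurrency worker pool with a wall-clock deadline. The sketch below shows one way such a pool could set a truncated flag; `mapWithBudget` and its option names are hypothetical, and the PR's collectSubscriptions may be structured differently.

```javascript
// Sketch only: process items with at most `concurrency` in flight, and stop
// picking up new items once `budgetMs` of wall-clock time has elapsed.
async function mapWithBudget(items, worker, { concurrency = 4, budgetMs = 30_000 } = {}) {
  const deadline = Date.now() + budgetMs;
  const results = [];
  let next = 0;
  let truncated = false;

  async function runWorker() {
    while (next < items.length) {
      if (Date.now() >= deadline) {
        truncated = true; // budget spent with items still unprocessed
        return;
      }
      const index = next++;
      try {
        results[index] = { ok: true, value: await worker(items[index]) };
      } catch (error) {
        results[index] = { ok: false, error: String(error?.message || error) };
      }
    }
  }

  const poolSize = Math.min(concurrency, items.length);
  await Promise.all(Array.from({ length: poolSize }, runWorker));
  // filter(Boolean) drops the holes left by items that were never started.
  return { results: results.filter(Boolean), truncated };
}
```

The deadline is checked only between items, so an in-flight call is not interrupted; each worker call would still need its own timeout for the overall budget to hold.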

Authentication is enforced by the existing global API-key middleware (no
duplicate check needed). When V1 or V2 READ_ONLY is set, the response is
reduced to non-sensitive public data and short-circuits before the wallet
/ datalayer RPC fan-out, matching the precedent in wallet-health.js.

Also adds isHealthEndpoint() skips to the wallet-synced, home-org-synced,
and all-data-synced header middlewares so /diagnostics (and /health*)
don't hang on the wallet RPC's 300s socket timeout or on waitForMigrations
when the database layer is the broken subsystem.
Comment thread src/routes/diagnostics.js
Comment thread src/routes/diagnostics.js Outdated
…en network match

Two fixes from bugbot review of #1627:

1. Move dynamic imports of wallet / fullNodeRpc / persistance / V1+V2
   models / fullNode AFTER the read-only short-circuit. The PR description
   said the read-only path short-circuits before fetch, but the heavy
   imports were happening unconditionally, which (a) made public observer
   nodes load DB and wallet modules they don't need and (b) caused
   /diagnostics to fail in the exact scenarios it's designed to survive
   (e.g. V2 model module body failing to initialize).

2. Use exact equality (not substring containment) for the network match.
   `"testnet10".includes("testnet1")` is true, so the previous code
   reported a false match when the configured network was a prefix of
   the actual one. The diagnostics endpoint's job is to tell the truth;
   the existing assertChiaNetworkMatchInConfiguration still uses the
   substring rule, but the `actual` and `configured` fields in the
   response let operators spot whether the loose-match assertion would
   also have considered them equivalent.
The previous test only logged the response keys, which is useless for
actually inspecting what /diagnostics returns. Pretty-print the full JSON
body so the response is visible in every live-api workflow log -- there
is no other sanitized artifact to eyeball the endpoint's real-environment
output.
Comment thread src/utils/system-info.js Outdated
Defensive fix from bugbot review of #1627. Node's fs.statfs() currently
only exposes bsize (libuv's uv_statfs_t doesn't copy frsize from the
underlying statfs(2) syscall), and on Linux+ext4 'blocks * bsize'
matches 'df -B1' byte-exactly. But POSIX statvfs denominates blocks in
frsize, not bsize, and on exotic filesystems (e.g. VirtioFS on Docker)
the two can differ. Using 'stats.frsize ?? stats.bsize' is a zero-cost
hedge that automatically picks up frsize if a future Node version or a
polyfill exposes it, while keeping today's behavior unchanged on every
mainstream environment.
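
As a rough illustration of the hedge described in this commit, the byte arithmetic could be written as below. `summarizeDisk` is a hypothetical name, and using bavail (space available to unprivileged users) for the free figure is an assumption; the PR may use bfree instead.

```javascript
// Sketch: turn a StatFs-like object into byte counts.
// Input fields mirror what Node's fs.promises.statfs() returns (bsize, blocks, bavail).
function summarizeDisk(stats) {
  // Prefer frsize if a runtime or polyfill exposes it; otherwise use bsize.
  const blockSize = Number(stats.frsize ?? stats.bsize);
  const totalBytes = Number(stats.blocks) * blockSize;
  const freeBytes = Number(stats.bavail) * blockSize;
  return { totalBytes, freeBytes };
}

// Today's Node: no frsize, so bsize is used.
console.log(summarizeDisk({ bsize: 4096, blocks: 1000, bavail: 250 }));
// { totalBytes: 4096000, freeBytes: 1024000 }

// A hypothetical runtime that exposes a smaller frsize.
console.log(summarizeDisk({ frsize: 512, bsize: 4096, blocks: 1000, bavail: 250 }));
// { totalBytes: 512000, freeBytes: 128000 }
```

In real use the input would come from something like `await fs.promises.statfs(chiaRoot)`, where `chiaRoot` stands in for the resolved CHIA_ROOT directory.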
Comment thread src/routes/diagnostics.js
…tics

CADT's CHIA_NETWORK config is a binary mainnet-vs-testnet flag, not an
exact chia network name. Cross-referenced against every use in the
codebase:

  - defaultConfig.js sets it to 'mainnet'
  - config-loader.js forces it to the literal string 'testnet' when
    USE_SIMULATOR=true, regardless of the actual underlying network
  - coin-management.js branches on '=== mainnet ? XCH : TXCH'
  - data-assertions.js accepts any chia network whose name contains
    CHIA_NETWORK as a substring

The CI run on commit 286fcbb confirmed the previous strict-equality
check gave the wrong answer in the real world: chia reported 'testneta',
CADT config was 'testnet', diagnostics reported matches:false even
though CADT itself treats them as a match.

Normalize both sides to mainnet|testnet before comparing. This is both
strictly more correct than the original substring rule (no
testnet1/testnet10 false positive) and operationally aligned with how
the rest of CADT interprets CHIA_NETWORK.

The existing assertChiaNetworkMatchInConfiguration still uses the
substring rule -- harmonising the assertion with this normalised
comparison is a sensible follow-up but is left out of this PR to keep
the scope tight.
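
The three comparison rules discussed in these commits (substring, strict equality, normalized) can be contrasted in a short sketch. The normalization rule used here, where 'mainnet' maps to itself and any name starting with 'testnet' maps to 'testnet', is an assumption about how "normalize both sides" might be implemented; the function names are hypothetical.

```javascript
// Assumed rule: collapse a chia network name to 'mainnet', 'testnet', or null.
function normalizeNetwork(name) {
  if (typeof name !== 'string') return null;
  const lower = name.trim().toLowerCase();
  if (lower === 'mainnet') return 'mainnet';
  if (lower.startsWith('testnet')) return 'testnet';
  return null; // unknown names never match anything
}

function networksMatch(configured, actual) {
  const a = normalizeNetwork(configured);
  const b = normalizeNetwork(actual);
  return a !== null && a === b;
}

// Substring rule: a prefix counts as a match.
console.log('testnet10'.includes('testnet1')); // true
// Strict equality: rejects the case seen in CI.
console.log('testnet' === 'testneta'); // false
// Normalized comparison: treats the CI case as a match.
console.log(networksMatch('testnet', 'testneta')); // true
console.log(networksMatch('mainnet', 'testnet11')); // false
```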
Comment thread src/datalayer/fullNodeRpc.js Outdated
The syncMode ternary used a truthy check for state.sync?.synced while
the synced field on the same line used === true. A non-boolean truthy
value (e.g. 1) would produce synced: false alongside syncMode: 'synced'.
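
One way to avoid this kind of mismatch is to compute the boolean once and derive the label from it. The sketch below assumes a state shape of `{ sync: { synced, sync_mode } }`, and the label strings other than 'synced' are made up for illustration.

```javascript
// Sketch: derive syncMode from the same strict boolean as `synced`,
// so the two fields cannot disagree.
function describeSync(state) {
  const synced = state.sync?.synced === true;
  let syncMode = 'not_synced'; // hypothetical label
  if (synced) {
    syncMode = 'synced';
  } else if (state.sync?.sync_mode === true) {
    syncMode = 'syncing'; // hypothetical label
  }
  return { synced, syncMode };
}

console.log(describeSync({ sync: { synced: true } })); // { synced: true, syncMode: 'synced' }
console.log(describeSync({ sync: { synced: 1 } })); // { synced: false, syncMode: 'not_synced' }
```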
Comment thread src/middleware.js Outdated
Comment thread src/routes/diagnostics.js
- Promote network match to top-level "network" key with "chia"/"cadt"
  sub-keys instead of "chia.network.actual"/"configured"
- Rename "processes" to "runningProcesses", drop redundant "supported"
  and "platform" fields (already in system section)
- Add "percentUsed" to both memory and disk sections
- Rename disk "path" to "chiaRootPath" for clarity
- Add "runningLocally" to fullNode section; skip full-node RPC calls
  when process scan confirms no local chia_full_node (falls back to
  optimistic probe when scan is unsupported or fails)
The diagnostics route handler used strict === true while the rest of
middleware uses truthy checks (|| false). A non-boolean truthy config
value (e.g. 1, "true") would bypass the read-only protection and
serve the full response with sensitive fields.
Comment thread src/routes/diagnostics.js
Query the wallet RPC get_version endpoint and surface the installed
Chia version in chia.version (null when the wallet is unreachable).
Remove the internal try/catch so RPC failures surface in the settle()
debug log, consistent with every other wallet RPC function.
scanReliable only checked for processesValue.note (Windows) but not
processesValue.error (ps failure in containers). A caught ps failure
returns { matches: [], error } without a note, causing the scan to be
incorrectly treated as reliable and skipping full-node RPC probes.
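
A sketch of the corrected reliability check follows. The scan result shape `{ matches, note, error }` comes from the commit message; the `name` field on each match and both function names are assumptions for illustration.

```javascript
// A scan is reliable only if it neither reported "unsupported" (note)
// nor failed outright (error).
function isScanReliable(scan) {
  if (!scan || !Array.isArray(scan.matches)) return false;
  return !scan.note && !scan.error;
}

// Skip the full-node RPC only when a reliable scan found no local full node.
function shouldProbeFullNode(scan) {
  if (!isScanReliable(scan)) return true; // optimistic probe
  return scan.matches.some((match) => match.name === 'chia_full_node');
}

console.log(shouldProbeFullNode({ matches: [] })); // false
console.log(shouldProbeFullNode({ matches: [], error: 'ps failed' })); // true
console.log(shouldProbeFullNode({ matches: [], note: 'unsupported on win32' })); // true
```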
Comment thread src/routes/diagnostics.js
getWalletConnections catches errors internally and returns
{ success: false, error } instead of throwing, so settle() always
sees ok: true. Check value.success before treating connections as
valid, and surface the inner error when the wallet call failed.
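
The two-layer check described here could be sketched as follows. `unwrapConnections` is a hypothetical helper, and the `connections` field name and fallback error text are assumptions; only the `{ success: false, error }` shape is taken from the commit message.

```javascript
// Sketch: a settled result can be ok at the outer layer while the inner
// wallet call still reports failure, so both layers must be checked.
function unwrapConnections(settled) {
  if (!settled.ok) {
    return { connections: null, error: settled.error };
  }
  const inner = settled.value;
  if (!inner || inner.success !== true) {
    return { connections: null, error: inner?.error ?? 'wallet call failed' };
  }
  return { connections: inner.connections ?? [], error: null };
}

console.log(unwrapConnections({ ok: true, value: { success: false, error: 'ECONNREFUSED' } }));
// { connections: null, error: 'ECONNREFUSED' }
```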
Comment thread src/datalayer/wallet.js
Comment thread src/datalayer/fullNodeRpc.js
…eRpc

Restore try-catch in getChiaVersion so it is safe for callers outside
of settle(). Add explicit NODE_TLS_REJECT_UNAUTHORIZED to fullNodeRpc.js
so it does not rely on wallet.js import order as a side effect.
…pping fields

Return 403 when READ_ONLY is set rather than serving a reduced
response. Removes buildReadOnlyResponse and the readOnly parameter
from getDiagnosticsResponse since the middleware now gates access.
Comment thread src/routes/diagnostics.js
getWalletHealthResponse already calls walletIsSynced internally, so
the parallel direct call doubled RPC load and raced on the module-level
lastWalletSyncError variable (each call clears it to null). Derive
synced from the wallet health response instead.
Add ok/warning/critical status with messages to diagnostics sections:
disk, memory, cpu, chiaTools, datalayer, fullNode, wallet, network.
Remove redundant services section. Log full diagnostics JSON at startup
(fire-and-forget) so READ_ONLY nodes also have a baseline snapshot.
Comment thread src/middleware.js
getDiagnosticsResponse handles all subsystem failures internally via
settle(), so an exception reaching the outer catch is a genuine bug
that should be surfaced to monitoring tools as a 500.

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Comment thread src/routes/index.js
@TheLastCicada TheLastCicada merged commit 6e1636c into v2-rc2 May 15, 2026
29 checks passed
@TheLastCicada TheLastCicada deleted the feat/diagnostics-endpoint branch May 15, 2026 20:10