feat: implement diagnostic channels for observability #3195
logaretm wants to merge 85 commits into redis:master
Conversation
…sh and receive paths
TracingChannel Proposal

Remove duplicate COMMAND_REPLY publish from sendCommand — typed commands already publish via _executeCommand. Also merge the two separate COMMAND_REPLY subscribers (pubsub out + streaming) into a single #subscribeCommandReply method. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the acquire timeout fires, rejectWait() is called so the TracingChannel error channel fires — APMs see the timeout as an error instead of an orphaned span that never closes.
When OTel is not initialized or disabled, skip instrument registration and subscriber creation entirely. No noop instruments needed — if there are no subscribers, channels are never fired. Removes 139-line noop-meter.ts and all noop fallback branches from instrument creation helpers.
- COMMAND_REPLY in _executeCommand now passes sanitizeArgs(parser.redisArgs) instead of raw args, preventing sensitive values from leaking to subscribers
- Pool connection wait trace promise gets .catch(noop) to prevent unhandled rejection if a tracing subscriber throws
When pool.destroy() is called with tasks still queued, call rejectWait() on each pending task so the TracingChannel error channel fires and APM subscribers close their spans cleanly.
- Tracing test: check isOpen before destroying an already-closed client
- OTel E2E tests: rewrite 5 tests that called removed methods (resiliencyMetrics.recordClientErrors, streamMetrics.recordStreamLag) to publish via channels instead
- Connection closed metric: split the subscriber into basic (connection count) and advanced (close reason) so tests enabling only one group see correct metrics
The original code always recorded wait time, even for 0ms waits when a client was available. The TracingChannel refactor only traced the "no client, must wait" path. Add a trace for immediate availability so the metric is always emitted.
…or event

Remove redundant publish(CHANNELS.ERROR) from sendCommand — the TracingChannel already emits error events for command failures. Wire the OTel resiliency subscriber to commandTC.error for command-level error counts; CHANNELS.ERROR now only carries cluster/internal errors. Extract a subscribeTC() helper to reduce TracingChannel subscription boilerplate and #recordError() to share error metric recording logic.
```diff
-  fn
+  fn,
+  resolveWait: resolveWait!,
+  rejectWait: rejectWait!,
```
Pool trace uses potentially uninitialized resolveWait/rejectWait when no acquire timeout
Medium Severity
When acquireTimeout is 0 (or negative), no setTimeout is created, so rejectWait is never called on timeout. The resolveWait and rejectWait variables are assigned inside the inner Promise constructor of the trace() callback, and a Promise executor runs synchronously — so whether trace() short-circuits (no tracing channel or no subscribers, fn() called directly) or takes the tracing path via tracePromise, the variables are normally assigned before the task object is built. The real issue is that the task object stores resolveWait! and rejectWait! with non-null assertions: if the trace call itself throws synchronously before executing fn(), these remain uninitialized, and task.resolveWait() in #returnClient would crash.
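The synchronous-executor behavior this comment relies on is easy to demonstrate. A minimal standalone sketch (the `makeTask`/`trace` names here are hypothetical stand-ins, not the PR's actual code):

```javascript
// Minimal repro of the hazard: a Promise executor runs synchronously,
// so outer variables assigned inside it are set before the constructor
// returns -- unless the wrapping call throws before invoking the callback.
function makeTask(trace) {
  let resolveWait, rejectWait;
  trace(() => new Promise((resolve, reject) => {
    resolveWait = resolve; // assigned synchronously
    rejectWait = reject;
  }));
  // If trace() threw before calling its callback, these stay undefined,
  // and non-null assertions on them would be wrong.
  return { resolveWait, rejectWait };
}

// Normal path: the callback runs, so both functions are assigned.
const ok = makeTask(fn => fn());
console.log(typeof ok.resolveWait); // 'function'

// Failure path: trace() throws synchronously, executor never runs.
let bad;
try {
  makeTask(() => { throw new Error('subscriber blew up'); });
} catch {
  bad = 'executor never ran';
}
console.log(bad); // 'executor never ran'
```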
```ts
  { regex: /^(LPUSH|MSET|PFA|PUBLISH|RPUSH|SADD|SET|SPUBLISH|XADD|ZADD)/i, args: 1 },
  { regex: /^(HSET|HMSET|LSET|LINSERT)/i, args: 2 },
  { regex: /^(ACL|BIT|B[LRZ]|CLIENT|CLUSTER|CONFIG|COMMAND|DECR|DEL|EVAL|EX|FUNCTION|GEO|GET|HINCR|HMGET|HSCAN|INCR|L[TRLM]|MEMORY|P[EFISTU]|RPOP|S[CDIMORSU]|XACK|X[CDGILPRT]|Z[CDILMPRS])/i, args: -1 },
];
```
Sanitization regex prefix-matches commands too broadly causing over-redaction
Medium Severity
The SERIALIZATION_SUBSETS regexes use prefix matching (e.g., ^SET, ^HSET) without end anchors, so commands like SETNX and SETRANGE also match the args: 1 rule intended for SET. SUBSCRIBE correctly matches the args: -1 rule via S[CDIMORSU], but so does SINTERCARD (via its SI prefix), so it gets args: -1 and exposes all of its arguments. Prefix rules like PFA (meant for PFADD) are fragile in the same way: any future command sharing the prefix silently inherits the wrong args classification, redacting arguments it shouldn't or exposing ones it should redact.
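The over-broad prefix matching can be checked directly against the rules from the diff (the `classify` helper below is only for the demo; the real lookup code may differ):

```javascript
// The three rules from the diff, in order; classify returns the
// `args` value of the first rule whose regex matches the command.
const rules = [
  { regex: /^(LPUSH|MSET|PFA|PUBLISH|RPUSH|SADD|SET|SPUBLISH|XADD|ZADD)/i, args: 1 },
  { regex: /^(HSET|HMSET|LSET|LINSERT)/i, args: 2 },
  { regex: /^(ACL|BIT|B[LRZ]|CLIENT|CLUSTER|CONFIG|COMMAND|DECR|DEL|EVAL|EX|FUNCTION|GEO|GET|HINCR|HMGET|HSCAN|INCR|L[TRLM]|MEMORY|P[EFISTU]|RPOP|S[CDIMORSU]|XACK|X[CDGILPRT]|Z[CDILMPRS])/i, args: -1 },
];
const classify = cmd => rules.find(r => r.regex.test(cmd))?.args;

console.log(classify('SET'));        // 1  -- intended
console.log(classify('SETNX'));      // 1  -- caught by the unanchored ^SET prefix
console.log(classify('SINTERCARD')); // -1 -- caught by S[CDIMORSU], all args exposed
```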
```ts
    }
  } else { // 3/3a
    this.#statsCounter.recordMisses(1);
    publish(CHANNELS.CACHE_REQUEST, () => ({ result: 'miss', clientId: client._clientId }));
```
Cache publish calls have wrong indentation level in conditional blocks
High Severity
The publish(CHANNELS.CACHE_REQUEST, ...) calls inside handleCache are indented at the wrong nesting level. At line 561, the publish for a cache hit sits at the same level as the if (cacheEntry instanceof ClientSideCacheEntryValue) block rather than inside it, and similarly at lines 567 and 574. Checking the actual brace structure shows these calls do execute inside the correct if/else branches — the indentation is misleading but the logic is correct. However, at line 561 the publish call runs after recordHits(1) but before return structuredClone(...), so if publish throws (subscriber error), the cache-hit value is never returned.


Actual diff is here: https://github.com/logaretm/node-redis/pull/1/changes
This is currently waiting on #3110 to be merged; then it will show the true size of the changes.
Re-implements OTel metrics tracing using diagnostic and tracing channels, while setting up more channels for tracing.
The key shift: core code now describes what is happening, not what to compute.
Before, every instrumented code path had to know about OTel: creating closures, capturing timestamps, building attribute maps, recording histograms. The code was telling the system how to measure itself. And every metric class needed a noop counterpart for when its metric group was disabled.
Now core code says "a command started," "a connection closed," "an error occurred." Subscribers like OTel, Sentry, a debug logger, anything can independently decide what to compute from those events.
The domain knowledge in #3110 (which metrics to emit, which semantic conventions to follow, how to classify errors) is preserved. It just moved from being inline in the hot path to being a subscriber that opts in when needed.
Noops are gone. When a metric group is disabled, no subscription is created. No subscribers means `hasSubscribers` is false, which means `trace()` and `publish()` skip the channel entirely — zero cost by design, not by maintaining parallel noop implementations. Both `noop-metrics.ts` and `noop-meter.ts` are deleted; when OTel is not initialized, no `OTelMetrics` instance is created, no instruments are registered, and no subscriptions exist.

Two functions for all observability:

- `trace()` — wraps async operations via `TracingChannel` (commands, batches, connections, pool waits)
- `publish()` — emits point events via `dc.channel()` (errors, cache hits/misses, pubsub, maintenance, connection lifecycle)

Both use factory callbacks plus `hasSubscribers` checks for zero overhead when no APM subscribes. The same holds for metrics: if the user never calls the OpenTelemetry plugin, no channel subscriptions occur and no objects are allocated, compared to what we had before.

Channels
TracingChannel (async lifecycle: start/end/asyncStart/asyncEnd/error)
This sets up tracing with spans in the future, but can still be used by metrics or logs.
- `node-redis:command`
- `node-redis:batch`
- `node-redis:connect`
- `node-redis:connection:wait`

Single instance events
This is what metrics usually use; not suitable for traces, but can also be used by logs.
- `node-redis:connection:ready`
- `node-redis:connection:closed`
- `node-redis:connection:relaxed-timeout`
- `node-redis:connection:handoff`
- `node-redis:error`
- `node-redis:maintenance`
- `node-redis:pubsub`
- `node-redis:cache:request`
- `node-redis:cache:eviction`
- `node-redis:command:reply`

What this enables
Any APM can subscribe without monkey-patching or depending on OTel:
Key Changes

- `OTelMetrics` imports removed from core code
- `noop-metrics.ts` and `noop-meter.ts` deleted
- New tracing module (`tracing.ts`): +280 lines
- `CHANNELS` const map

Setting Up OTEL Tracing
Almost Everything Is Already Done
- `ctx.args` is already sanitized by `sanitizeArgs()`, so `db.query.text` is just a simple join (e.g., `SET mykey ?`).
- `getErrorInfo()` already extracts errorType, category, and statusCode for span attributes.

Here is an example of tracing for command calls:
Note that the exact implementation can differ, since we may need to use `bindStore` on the `AsyncLocalStorage` instance inside OTEL itself, but it should be straightforward.

Note
Medium Risk
Touches core client hot paths (command execution, socket lifecycle, pooling, cluster redirections) to emit diagnostics/tracing events and register clients for metrics, which could affect performance or lifecycle edge cases if instrumentation is mis-wired. Changes are additive and gated by subscriber checks, but span multiple critical runtime components.
Overview
Adds a diagnostics-based observability layer: a new `lib/client/tracing.ts` exposes `trace()` (TracingChannel-wrapped async ops) and `publish()` (point events) with argument sanitization, and the client now emits these events around `connect()`, `sendCommand()`, MULTI/pipeline execution, pool wait time, pubsub messages, cache hits/misses/evictions, maintenance notifications, and connection lifecycle.

Introduces client identity/registry plumbing for metrics attribution: a new `identity.ts` generates stable, bounded `clientId`s and roles (standalone/cluster/pool/etc.), `RedisSocket`/`PubSub`/`CommandsQueue` now carry `clientId`, and clients register/unregister with a new `ClientRegistry` so OpenTelemetry metrics can observe pending requests, cache size, and connection state.

Exposes `OpenTelemetry.init()` from `@redis/client` and adds docs plus an `otel-metrics` example; updates dependencies/lockfile for `@opentelemetry/api` / `@opentelemetry/sdk-metrics`, and adds tests covering tracing sanitization, identity, and client registry behavior.

Written by Cursor Bugbot for commit c7d824c.