kv,*: state inspection pages for a cluster node #66772

@sumeerbhola

(This is a tracking issue for discussion of specific ideas that can be spun off into separate issues)

We lack inspectz-style pages (Google terminology) on a node that show the current state of specific data structures within that node. These pages would be used when metrics or traces indicate that we need to look more closely at a particular node.

Possible examples: the state of explicit or implicit queues (e.g. the wait queues for latches and locks), including who is waiting and for how long; the current LSM state and ongoing compactions; etc. These pages don’t need to be fast to generate since they would be used sparingly: in the worst case, if the internal structure is large, generating one could take a few seconds and add a few ms of delay to running queries. Such pages can use filters to keep the inspected data manageable, e.g. filtered to a range, txnid, key range, etc.
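
To make the filtering idea concrete, here is a minimal sketch of a filtered snapshot of a wait queue. Every name in it (`Waiter`, `Filter`, `Inspect`, `demo`) is hypothetical and invented for illustration; none of it is existing CockroachDB code, and a real latch or lock queue would have a different shape.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Waiter is a hypothetical snapshot row for one request waiting in a queue.
type Waiter struct {
	RangeID  int64
	TxnID    string
	Key      string
	Enqueued time.Time
}

// Filter narrows a snapshot. A zero value means "do not filter on this field".
type Filter struct {
	RangeID  int64
	TxnID    string
	StartKey string // inclusive
	EndKey   string // exclusive; empty means unbounded
}

func (f Filter) matches(w Waiter) bool {
	if f.RangeID != 0 && w.RangeID != f.RangeID {
		return false
	}
	if f.TxnID != "" && w.TxnID != f.TxnID {
		return false
	}
	if f.StartKey != "" && w.Key < f.StartKey {
		return false
	}
	if f.EndKey != "" && w.Key >= f.EndKey {
		return false
	}
	return true
}

// WaiterView is one row of the inspection page: who is waiting and for how long.
type WaiterView struct {
	Waiter
	Waited time.Duration
}

// Inspect returns the waiters that match f, longest-waiting first.
func Inspect(queue []Waiter, f Filter, now time.Time) []WaiterView {
	var out []WaiterView
	for _, w := range queue {
		if f.matches(w) {
			out = append(out, WaiterView{Waiter: w, Waited: now.Sub(w.Enqueued)})
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Waited > out[j].Waited })
	return out
}

// demo builds a small made-up queue and inspects only range 7.
func demo() []WaiterView {
	now := time.Unix(1000, 0)
	queue := []Waiter{
		{RangeID: 7, TxnID: "txn-a", Key: "b", Enqueued: now.Add(-2 * time.Second)},
		{RangeID: 7, TxnID: "txn-b", Key: "m", Enqueued: now.Add(-5 * time.Second)},
		{RangeID: 9, TxnID: "txn-c", Key: "c", Enqueued: now.Add(-1 * time.Second)},
	}
	return Inspect(queue, Filter{RangeID: 7}, now)
}

func main() {
	for _, v := range demo() {
		fmt.Printf("r%d %s key=%q waited=%s\n", v.RangeID, v.TxnID, v.Key, v.Waited)
	}
}
```

The sketch copies matching rows out of the queue rather than holding references, which is one way to keep the time spent inside the real structure short; whether that is acceptable would depend on the structure being inspected.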

This was less important when debug.zip was the primary way to troubleshoot, but we now have direct access to clusters in CC and for important customers for whom extremely short remediation time is critical.

Needless to say, deciding what state needs such a page is critical and needs to be informed by actual troubleshooting experience. The tooling around this should make it very easy to create one (i.e., any complexity should be limited to how to construct the view of the internal structure, not to how filtering parameters are passed in or how the output is displayed/formatted).
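
One possible shape for such tooling, as a rough sketch only: the page author supplies a single function that builds the view, and a small registry takes care of parsing filter parameters from the query string and formatting the output as JSON. `ViewFunc`, `Registry`, the `/inspectz/` URL prefix, and `latchView` are all assumptions made up for this example; only the Go standard library calls are real.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// ViewFunc is the only thing a page author writes: build a serializable
// view of some internal structure, given the already-parsed filters.
type ViewFunc func(filters url.Values) (interface{}, error)

// Registry (hypothetical) owns parameter parsing and output formatting.
type Registry struct{ mux *http.ServeMux }

func NewRegistry() *Registry { return &Registry{mux: http.NewServeMux()} }

// Register exposes view under /inspectz/<name> (an assumed URL layout).
func (r *Registry) Register(name string, view ViewFunc) {
	r.mux.HandleFunc("/inspectz/"+name, func(w http.ResponseWriter, req *http.Request) {
		data, err := view(req.URL.Query())
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		enc := json.NewEncoder(w)
		enc.SetIndent("", "  ")
		_ = enc.Encode(data)
	})
}

func (r *Registry) ServeHTTP(w http.ResponseWriter, req *http.Request) {
	r.mux.ServeHTTP(w, req)
}

// latchRow and latchView stand in for a real view over made-up data.
type latchRow struct {
	RangeID string `json:"range_id"`
	Txn     string `json:"txn"`
}

func latchView(filters url.Values) (interface{}, error) {
	all := []latchRow{{RangeID: "7", Txn: "txn-a"}, {RangeID: "9", Txn: "txn-b"}}
	want := filters.Get("range_id")
	out := []latchRow{}
	for _, row := range all {
		if want == "" || row.RangeID == want {
			out = append(out, row)
		}
	}
	return out, nil
}

// fetch issues an in-memory GET and decodes the JSON rows on success.
func fetch(reg *Registry, target string) (int, []latchRow) {
	rec := httptest.NewRecorder()
	reg.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, target, nil))
	var rows []latchRow
	if rec.Code == http.StatusOK {
		_ = json.Unmarshal(rec.Body.Bytes(), &rows)
	}
	return rec.Code, rows
}

func main() {
	reg := NewRegistry()
	reg.Register("latches", latchView)
	code, rows := fetch(reg, "/inspectz/latches?range_id=7")
	fmt.Println(code, rows)
}
```

With this split, adding a new page is one `Register` call plus a view function; the view function never touches HTTP or encoding.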

@jbowens raised the following in the internal Slack thread:
is there some risk in having separate observability regimes for clusters that we have direct access to versus not? maybe we should think of a debug.zip as just a response format. if interacting directly with an inspectz-style UI, the UI requests a thin, filtered debug.zip of just the requested data. otherwise, we can instruct customers to generate a debug.zip with the same information, which may be loaded into the same UI
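
A rough sketch of what "debug.zip as a response format" could look like: the same named views are written into a small zip archive, and the consumer reads them back the same way regardless of whether the archive came from a live request or from a customer-generated file. `WriteBundle`, `ReadBundle`, and the one-JSON-file-per-view layout are assumptions for illustration and do not describe the actual debug.zip layout.

```go
package main

import (
	"archive/zip"
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"sort"
)

// WriteBundle serializes each named view as <name>.json inside a zip archive.
// The file layout here is hypothetical.
func WriteBundle(w io.Writer, views map[string]interface{}) error {
	names := make([]string, 0, len(views))
	for name := range views {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic file order

	zw := zip.NewWriter(w)
	for _, name := range names {
		f, err := zw.Create(name + ".json")
		if err != nil {
			return err
		}
		if err := json.NewEncoder(f).Encode(views[name]); err != nil {
			return err
		}
	}
	return zw.Close()
}

// ReadBundle loads every file in the archive, keyed by file name.
func ReadBundle(data []byte) (map[string]json.RawMessage, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, err
	}
	out := make(map[string]json.RawMessage)
	for _, f := range zr.File {
		rc, err := f.Open()
		if err != nil {
			return nil, err
		}
		b, err := io.ReadAll(rc)
		rc.Close()
		if err != nil {
			return nil, err
		}
		out[f.Name] = b
	}
	return out, nil
}

// roundTrip writes one made-up view into a thin bundle and reads it back.
func roundTrip() (map[string]json.RawMessage, error) {
	var buf bytes.Buffer
	views := map[string]interface{}{
		"latches": []map[string]interface{}{{"range_id": 7, "txn": "txn-a"}},
	}
	if err := WriteBundle(&buf, views); err != nil {
		return nil, err
	}
	return ReadBundle(buf.Bytes())
}

func main() {
	files, err := roundTrip()
	if err != nil {
		panic(err)
	}
	for name, body := range files {
		fmt.Printf("%s: %s", name, body)
	}
}
```

Under this approach the UI only ever needs a `ReadBundle`-style loader, so clusters with and without direct access would share one observability path.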

Jira issue: CRDB-8222

Labels: A-kv-observability; C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception); T-kv (KV Team)
