Skip to content

reporting: capture scrubbed information about queries and schemas #13968

@knz

Description

@knz

To gather information to guide work on optimizations, and to track which features are used, we want to receive more information about SQL queries from production clusters.

Here's how it looks like:

  • reporting is enabled by default, users can opt-out
  • server will push information to our registration server
  • what information is pushed: SQL query distribution & database schemas.
    • from every node in the cluster
    • at a predefined interval (say, once per day) this info is encoded, compressed, sent to our reg server, then cleared on the node
  • regarding "SQL query distribution": during execution, each node maintains internally (in memory) a map from "query structure" (see below for definition) to a tuple
    • tuple: (count, first/last latency quantiles, nr rows, error status)
    • count = how many times that structure was encountered
    • first latency = from the point the query is received on pgwire to the point the first result is returned
    • last latency = from the point the query is received on pgwire to the point the last result is returned
  • SQL query structure: either SQL query syntax or logical plan (or both) are used as a key in the distribution map defined above, with a twist:
    • all SQL Datums are replaced by placeholders;
    • all db/table/column names are replaced by their numeric IDs
      (We work under the assumption that applications always reuse a small set of query structure, but with a large variability on the constant parameters)
  • database schemas are also reported, alongside with the distribution map defined above
    • reports table IDs together with types, constraints, index definitions etc (names scrubbed away)
    • default/check expressions also scrubbed from datums & names; all names replaced by equivalent IDs
  • CLI utility or web UI allows users to inspect the data that is being collected.

Proposed approach by Peter:

Bonus features:

  • make a function that scrubs a schema
  • make a json endpoint in the debug interface
  • make a CLI utility that dumps it
  • submit data to reporting server + UI to opt-out

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions