Skip to content

Resolve discrepancy between query types included in latency and QPS metrics #35794

@celiala

Description

@celiala

Objective

Resolve discrepancy between our QPS and latency metrics (as well as resurface the P50 latency metric).

Problem Statement

For 19.1, while we didn't fully address all the concerns raised from #283 and #23967, PR #35693 at least provided a short-term fix for QPS metric inconsistencies by having all QPS metrics only use SELECT/UPDATE/INSERT/DELETE query types.

However, this leaves us with a discrepancy between our QPS and latency
metrics:

  • All QPS metrics on AdminUI now just include SELECT/UPDATE/INSERT/DELETEs.
  • All latency metrics on AdminUI include all queries (not just
    SELECT/UPDATE/INSERT/DELETEs, but also DDL & transaction statements).

This raises the primary concern of potentially misleading user:

  • By not surfacing information about the other query types, users can
    potentially be blinded to the database/cluster being overload.
  • Clients may see metrics on their application being higher then what we are
    reporting

Because of this discrepancy, we removed P50 latency from summary bar in PR to
avoid user questions/confusion:

  • A large number of complex transactions or SET commands could really skew the
    P50 latency, which could cause user confusion/questions.
  • Based on our observations from a support perspective (Tim, Ron), we don't
    see customers concerned about that number.
  • Talking with @rkruze , however, the P50 latency metric is nice for POC to
    quickly show the performance of the system. While we agreed that removing
    this for now would be best, Roko expressed wanting P50 latency put back
    for 19.2.

Exploring Potential Solutions

In order to resolve this discrepancy between our QPS and latency
metrics (as well as resurface the P50 latency metric), we would need to find
a solution that either 1) modifies the latency metrics to match the query types
used by the QPS metrics, or 2) surfaces the remaining query types for QPS
metrics.

Option 1: Modify latency metrics to only include query types used by the QPS
metrics.

Pros:

  • Latency and QPS metrics match
  • We could then resurface P50 latency metric, which is nice for POC to
    quickly show the performance of the system.

Cons:

  • This could midlead user/give wrong impression. Clients may compare our metrics
    with metrics from their application and see their average latency being much
    higher then what we are reporting
  • This would require a non-trivial amount of work. We can't easily modify the
    way we collect latency metrics (it's doing a fair amount of aggregation, so
    to pull out only certain query types would require work). (I think I've
    captured this right, but feel free to correct/elaborate as needed cc @raphael).

Option 2: Surface remaining query types for QPS metrics.

One idea for implementing this is to:

  • [Piece 1] Add an "Other" bucket to the "SQL Queries" time series chart, and
  • [Piece 2] Add this "Other" bucket to the Summary Bar somehow.
  • [We would/should also adjust the Node Map to match, which would be
    easy to do].

Pros:

  • (same as above)

Cons:

  • If user gets in a situation where the "other" category is extremely large,
    this could make it hard to see the other time series.
  • This would also require some design thought & dev work (see below)

[Piece 1] Add an "Other" bucket to the "SQL Queries" time series chart

What we probably want [on the UI-side] is a way to take multiple
time-series metrics, perform a custom aggregation, then output this
as a single time-series metric for a time series chart. However,
implementing this doesn't seem quick/straight-forward, based on the
way we currently fetch and expose this data [TODO - do more
investigation here].

[Piece 2] Add an "Other" bucket to the "SQL Queries" time series chart

This requires some design thought: how should we label this QPS metric
that just sums up "Other"; and should we change the label for
the QPS metric that currently sums up the SELECT/UPDATE/INSERT/DELETEs?
And what would that latter label be (see example image):

QPS - how to label

Jira issue: CRDB-4557

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-webui-timeseriesC-bugCode not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.X-staleno-issue-activity

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions