-
Notifications
You must be signed in to change notification settings - Fork 4.1k
Resolve discrepancy between query types included in latency and QPS metrics #35794
Description
Objective
Resolve discrepancy between our QPS and latency metrics (as well as resurface the P50 latency metric).
Problem Statement
For 19.1, while we didn't fully address all the concerns raised from #283 and #23967, PR #35693 at least provided a short-term fix for QPS metric inconsistencies by having all QPS metrics only use SELECT/UPDATE/INSERT/DELETE query types.
However, this leaves us with a discrepancy between our QPS and latency
metrics:
- All QPS metrics on AdminUI now just include SELECT/UPDATE/INSERT/DELETEs.
- All latency metrics on AdminUI include all queries (not just
SELECT/UPDATE/INSERT/DELETEs, but also DDL & transaction statements).
This raises the primary concern of potentially misleading user:
- By not surfacing information about the other query types, users can
potentially be blinded to the database/cluster being overload. - Clients may see metrics on their application being higher then what we are
reporting
Because of this discrepancy, we removed P50 latency from summary bar in PR to
avoid user questions/confusion:
- A large number of complex transactions or SET commands could really skew the
P50 latency, which could cause user confusion/questions. - Based on our observations from a support perspective (Tim, Ron), we don't
see customers concerned about that number. - Talking with @rkruze , however, the P50 latency metric is nice for POC to
quickly show the performance of the system. While we agreed that removing
this for now would be best, Roko expressed wanting P50 latency put back
for 19.2.
Exploring Potential Solutions
In order to resolve this discrepancy between our QPS and latency
metrics (as well as resurface the P50 latency metric), we would need to find
a solution that either 1) modifies the latency metrics to match the query types
used by the QPS metrics, or 2) surfaces the remaining query types for QPS
metrics.
Option 1: Modify latency metrics to only include query types used by the QPS
metrics.
Pros:
- Latency and QPS metrics match
- We could then resurface P50 latency metric, which is nice for POC to
quickly show the performance of the system.
Cons:
- This could midlead user/give wrong impression. Clients may compare our metrics
with metrics from their application and see their average latency being much
higher then what we are reporting - This would require a non-trivial amount of work. We can't easily modify the
way we collect latency metrics (it's doing a fair amount of aggregation, so
to pull out only certain query types would require work). (I think I've
captured this right, but feel free to correct/elaborate as needed cc @raphael).
Option 2: Surface remaining query types for QPS metrics.
One idea for implementing this is to:
- [Piece 1] Add an "Other" bucket to the "SQL Queries" time series chart, and
- [Piece 2] Add this "Other" bucket to the Summary Bar somehow.
- [We would/should also adjust the Node Map to match, which would be
easy to do].
Pros:
- (same as above)
Cons:
- If user gets in a situation where the "other" category is extremely large,
this could make it hard to see the other time series. - This would also require some design thought & dev work (see below)
[Piece 1] Add an "Other" bucket to the "SQL Queries" time series chart
What we probably want [on the UI-side] is a way to take multiple
time-series metrics, perform a custom aggregation, then output this
as a single time-series metric for a time series chart. However,
implementing this doesn't seem quick/straight-forward, based on the
way we currently fetch and expose this data [TODO - do more
investigation here].
[Piece 2] Add an "Other" bucket to the "SQL Queries" time series chart
This requires some design thought: how should we label this QPS metric
that just sums up "Other"; and should we change the label for
the QPS metric that currently sums up the SELECT/UPDATE/INSERT/DELETEs?
And what would that latter label be (see example image):
Jira issue: CRDB-4557
