ui: add an Overload dashboard by RaduBerinde · Pull Request #66595 · cockroachdb/cockroach

RaduBerinde · 2021-06-17T15:02:02Z

This commit adds an Overload dashboard in the metrics view. This is
intended to be a convenient way to monitor admission control.

The dashboard contains:

SQL Queries
Service latency
CPU Percent
Runnable Goroutines per CPU
L0 Sublevels and Files

Release note (ui change): a new Overload dashboard groups metrics that
are useful for admission control.

cockroach-teamcity · 2021-06-17T15:02:13Z

This change is

RaduBerinde · 2021-06-17T15:04:09Z

pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx, line 128 at r1 (raw file):

    >
      <Axis label="count">
        {nodeIDs.map((nid) => (

I couldn't figure out how to group these so I can show both metrics for n1, then both metrics for n2, etc. I tried putting both <Metric>s in the same => ( ), not sure what I'm doing wrong.

dhartunian

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @RaduBerinde and @sumeerbhola)

pkg/ts/catalog/chart_catalog.go, line 2271 at r1 (raw file):

					"rocksdb.compacted-bytes-read",
					"rocksdb.compacted-bytes-written",
					"rocksdb.flushed-bytes",

Minor: why is this repeated here?

pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx, line 128 at r1 (raw file):

Previously, RaduBerinde wrote…

I couldn't figure out how to group these so I can show both metrics for n1, then both metrics for n2, etc. I tried putting both <Metric>s in the same => ( ), not sure what I'm doing wrong.

You might have been missing one of: {} instead of (), return in the lambda, or <> to group the multiple <Metrics> into a single JSX return object. If your lambda looks like the following it should work:

(nid) => {
  ...
  return (
    <>
    <Metric ... />
    <Metric ... />
    </>
  ) 
}

If it doesn't maybe try flatmap. I think the code that renders the plots, iterates through the children of <Axis> so it might get confused if they're grouped in 2s:

nodeIDs.flatMap((nid) => {
...
return [
<Metric ... />,
<Metric ... />
]
})

RaduBerinde

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @sumeerbhola)

pkg/ts/catalog/chart_catalog.go, line 2271 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

Minor: why is this repeated here?

Oops, no idea. Fixed.

pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx, line 128 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

You might have been missing one of: {} instead of (), return in the lambda, or <> to group the multiple <Metrics> into a single JSX return object. If your lambda looks like the following it should work:
(nid) => {
  ...
  return (
    <>
    <Metric ... />
    <Metric ... />
    </>
  ) 
}
If it doesn't maybe try flatmap. I think the code that renders the plots, iterates through the children of <Axis> so it might get confused if they're grouped in 2s:

nodeIDs.flatMap((nid) => {
...
return [
<Metric ... />,
<Metric ... />
]
})
</blockquote></details>

Your `<>` sorcery worked! Thanks!

RaduBerinde · 2021-06-22T22:03:06Z

@sumeerbhola let me know if this is ok with you. This is not necessarily what I'm proposing we'll have in the release, but at least for now having all these on the same page would be useful to understand what admission control is doing.

sumeerbhola · 2021-06-23T02:54:58Z

@sumeerbhola let me know if this is ok with you

I meant to look at this on monday but am very behind on things because of oncall.

sumeerbhola

sorry about the delay

Reviewed 3 of 5 files at r1, 2 of 2 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @RaduBerinde)

pkg/kv/kvserver/metrics.go, line 410 at r2 (raw file):

	}
	metaRdbL0Sublevels = metric.Metadata{
		Name:        "rocksdb.l0-sublevels",

There is no concept of sub-levels in rocksdb. I think we should start adding new metrics like this as pebble.l0-sublevels even though it potentially hurts discoverability.
@petermattis @itsbilal @jbowens do you have an opinion?

pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx, line 36 at r2 (raw file):

  } = props;

  return [

I am wondering whether sql latency should be on this dashboard. I think of an overload dashboard as something that one goes to to see if there is a resource overload, after one has seen some bad externally visible behavior (like high sql latency) on other dashboards. So it would focus on "resources": cpu, LSM health, root-sql BytesMonitor capacity and usage, (and when we have it) contention metrics for latches and locks.

RaduBerinde

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @RaduBerinde, and @sumeerbhola)

pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx, line 36 at r2 (raw file):

Previously, sumeerbhola wrote…

I am wondering whether sql latency should be on this dashboard. I think of an overload dashboard as something that one goes to to see if there is a resource overload, after one has seen some bad externally visible behavior (like high sql latency) on other dashboards. So it would focus on "resources": cpu, LSM health, root-sql BytesMonitor capacity and usage, (and when we have it) contention metrics for latches and locks.

I agree with you in principle but I wanted to put them here for practical reasons - many times you are looking for correlations between the latency/throughput graphs and overload-related stats. It's hard to look for that if you have to keep switching between dashboards. It's even harder to extract a screenshot that shows a correlation.

If we had a way to see two dashboards one below the other, this wouldn't be necessary. Maybe I should file an issue about that?

itsbilal · 2021-06-23T18:34:40Z

pkg/kv/kvserver/metrics.go, line 410 at r2 (raw file):

Previously, sumeerbhola wrote…

There is no concept of sub-levels in rocksdb. I think we should start adding new metrics like this as pebble.l0-sublevels even though it potentially hurts discoverability.
@petermattis @itsbilal @jbowens do you have an opinion?

Yeah I agree, we shouldn't be naming more things rocksdb. at least. I'd also prefer calling it storage.l0-sublevels as I'm not sure how discoverable the pebble name is, but either of those two is better than rocksdb.

sumeerbhola

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @jbowens, @petermattis, and @RaduBerinde)

pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx, line 36 at r2 (raw file):

Maybe I should file an issue about that?

Yes, I would prefer that to be solved in a different manner. This issue occurs already, and I don't think we should pollute our curated dashboards to attempt to solve it.
One approach I've seen is a pin beside each graph which one can click. All the pinned graphs can then be seen on a separate pinned dashboard.

RaduBerinde

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @jbowens, @petermattis, and @sumeerbhola)

pkg/kv/kvserver/metrics.go, line 410 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Yeah I agree, we shouldn't be naming more things rocksdb. at least. I'd also prefer calling it storage.l0-sublevels as I'm not sure how discoverable the pebble name is, but either of those two is better than rocksdb.

What would be the consequences of renaming all existing rocksdb.* to pebble. or 'storage.'?

In any case, seems like I'd be setting a precedent here so I want to make sure I make the right choice for these new stats.

pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx, line 36 at r2 (raw file):

Previously, sumeerbhola wrote…

Maybe I should file an issue about that?

Yes, I would prefer that to be solved in a different manner. This issue occurs already, and I don't think we should pollute our curated dashboards to attempt to solve it.
One approach I've seen is a pin beside each graph which one can click. All the pinned graphs can then be seen on a separate pinned dashboard.

Filed #66798 and I'll remove the SQL Queries and SQL Latency graphs from this dashboard.

sumeerbhola

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @jbowens, @petermattis, and @sumeerbhola)

pkg/kv/kvserver/metrics.go, line 410 at r2 (raw file):

Previously, RaduBerinde wrote…

What would be the consequences of renaming all existing rocksdb.* to pebble. or 'storage.'?

In any case, seems like I'd be setting a precedent here so I want to make sure I make the right choice for these new stats.

I don't think we want to rename existing ones. Customers may have tooling or custom dashboard links that rely on the existing names. So we'll have a mix of prefixes: rocksdb and pebble/storage.

RaduBerinde

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @jbowens, @petermattis, and @sumeerbhola)

pkg/kv/kvserver/metrics.go, line 410 at r2 (raw file):

Previously, sumeerbhola wrote…

I don't think we want to rename existing ones. Customers may have tooling or custom dashboard links that rely on the existing names. So we'll have a mix of prefixes: rocksdb and pebble/storage.

Renamed the new ones to storage.*

pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx, line 36 at r2 (raw file):

Previously, RaduBerinde wrote…

Filed #66798 and I'll remove the SQL Queries and SQL Latency graphs from this dashboard.

Done.

sumeerbhola

Reviewed 4 of 4 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @dhartunian, @jbowens, @petermattis, and @sumeerbhola)

This commit adds an Overload dashboard in the metrics view. This is intended to be a convenient way to monitor admission control. The dashboard contains: - CPU Percent - Runnable Goroutines per CPU - L0 Sublevels and Files Release note (ui change): a new Overload dashboard groups metrics that are useful for admission control.

RaduBerinde · 2021-06-29T20:19:23Z

bors r+

craig · 2021-06-29T21:52:34Z

Build failed:

GitHub CI (Cockroach)

RaduBerinde · 2021-06-30T17:23:46Z

Some acceptance flake.

bors r+

craig · 2021-06-30T18:45:12Z

Build succeeded:

GitHub CI (Cockroach)

RaduBerinde requested review from dhartunian and sumeerbhola June 17, 2021 15:02

dhartunian reviewed Jun 21, 2021

View reviewed changes

RaduBerinde force-pushed the overload-graph-page branch from 7133832 to 45ffa45 Compare June 22, 2021 15:24

RaduBerinde commented Jun 22, 2021

View reviewed changes

RaduBerinde force-pushed the overload-graph-page branch from 45ffa45 to ed4b235 Compare June 22, 2021 22:01

sumeerbhola requested changes Jun 23, 2021

View reviewed changes

RaduBerinde commented Jun 23, 2021

View reviewed changes

sumeerbhola requested changes Jun 23, 2021

View reviewed changes

RaduBerinde commented Jun 23, 2021

View reviewed changes

sumeerbhola reviewed Jun 24, 2021

View reviewed changes

RaduBerinde force-pushed the overload-graph-page branch 2 times, most recently from f620f37 to 22a23fe Compare June 28, 2021 19:45

RaduBerinde commented Jun 28, 2021

View reviewed changes

sumeerbhola approved these changes Jun 28, 2021

View reviewed changes

RaduBerinde force-pushed the overload-graph-page branch 2 times, most recently from e6c79fc to d431e43 Compare June 29, 2021 13:56

RaduBerinde force-pushed the overload-graph-page branch from d431e43 to 92bc49c Compare June 29, 2021 16:29

craig bot merged commit 6fbe61b into cockroachdb:master Jun 30, 2021

RaduBerinde deleted the overload-graph-page branch July 28, 2021 20:32

jseldess mentioned this pull request Sep 8, 2021

ui: add an Overload dashboard cockroachdb/docs#11556

Closed

Conversation

RaduBerinde commented Jun 17, 2021

Uh oh!

cockroach-teamcity commented Jun 17, 2021

Uh oh!

RaduBerinde commented Jun 17, 2021

Uh oh!

dhartunian left a comment

Choose a reason for hiding this comment

Uh oh!

RaduBerinde left a comment

Choose a reason for hiding this comment

Uh oh!

RaduBerinde commented Jun 22, 2021

Uh oh!

sumeerbhola commented Jun 23, 2021

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

RaduBerinde left a comment

Choose a reason for hiding this comment

Uh oh!

itsbilal commented Jun 23, 2021

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

RaduBerinde left a comment

Choose a reason for hiding this comment

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

RaduBerinde left a comment

Choose a reason for hiding this comment

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

RaduBerinde commented Jun 29, 2021

Uh oh!

craig bot commented Jun 29, 2021

Uh oh!

RaduBerinde commented Jun 30, 2021

Uh oh!

craig bot commented Jun 30, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants