Skip to content

ui: add an Overload dashboard#66595

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
RaduBerinde:overload-graph-page
Jun 30, 2021
Merged

ui: add an Overload dashboard#66595
craig[bot] merged 1 commit intocockroachdb:masterfrom
RaduBerinde:overload-graph-page

Conversation

@RaduBerinde
Copy link
Copy Markdown
Member

This commit adds an Overload dashboard in the metrics view. This is
intended to be a convenient way to monitor admission control.

The dashboard contains:

  • SQL Queries
  • Service latency
  • CPU Percent
  • Runnable Goroutines per CPU
  • L0 Sublevels and Files

Release note (ui change): a new Overload dashboard groups metrics that
are useful for admission control.

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@RaduBerinde
Copy link
Copy Markdown
Member Author


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx, line 128 at r1 (raw file):

    >
      <Axis label="count">
        {nodeIDs.map((nid) => (

I couldn't figure out how to group these so I can show both metrics for n1, then both metrics for n2, etc. I tried putting both <Metric>s in the same => ( ), not sure what I'm doing wrong.

Copy link
Copy Markdown
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @RaduBerinde and @sumeerbhola)


pkg/ts/catalog/chart_catalog.go, line 2271 at r1 (raw file):

					"rocksdb.compacted-bytes-read",
					"rocksdb.compacted-bytes-written",
					"rocksdb.flushed-bytes",

Minor: why is this repeated here?


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx, line 128 at r1 (raw file):

Previously, RaduBerinde wrote…

I couldn't figure out how to group these so I can show both metrics for n1, then both metrics for n2, etc. I tried putting both <Metric>s in the same => ( ), not sure what I'm doing wrong.

You might have been missing one of: {} instead of (), return in the lambda, or <> to group the multiple <Metrics> into a single JSX return object. If your lambda looks like the following it should work:

(nid) => {
  ...
  return (
    <>
    <Metric ... />
    <Metric ... />
    </>
  ) 
}

If it doesn't maybe try flatmap. I think the code that renders the plots, iterates through the children of <Axis> so it might get confused if they're grouped in 2s:

nodeIDs.flatMap((nid) => {
...
return [
<Metric ... />,
<Metric ... />
]
})

@RaduBerinde RaduBerinde force-pushed the overload-graph-page branch from 7133832 to 45ffa45 Compare June 22, 2021 15:24
Copy link
Copy Markdown
Member Author

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @sumeerbhola)


pkg/ts/catalog/chart_catalog.go, line 2271 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

Minor: why is this repeated here?

Oops, no idea. Fixed.


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx, line 128 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

You might have been missing one of: {} instead of (), return in the lambda, or <> to group the multiple <Metrics> into a single JSX return object. If your lambda looks like the following it should work:

(nid) => {
  ...
  return (
    <>
    <Metric ... />
    <Metric ... />
    </>
  ) 
}

If it doesn't maybe try flatmap. I think the code that renders the plots, iterates through the children of <Axis> so it might get confused if they're grouped in 2s:

nodeIDs.flatMap((nid) => {
...
return [
<Metric ... />,
<Metric ... />
]
})

</blockquote></details>

Your `<>` sorcery worked! Thanks!


<!-- Sent from Reviewable.io -->

@RaduBerinde RaduBerinde force-pushed the overload-graph-page branch from 45ffa45 to ed4b235 Compare June 22, 2021 22:01
@RaduBerinde
Copy link
Copy Markdown
Member Author

@sumeerbhola let me know if this is ok with you. This is not necessarily what I'm proposing we'll have in the release, but at least for now having all these on the same page would be useful to understand what admission control is doing.

@sumeerbhola
Copy link
Copy Markdown
Collaborator

@sumeerbhola let me know if this is ok with you

I meant to look at this on monday but am very behind on things because of oncall.

Copy link
Copy Markdown
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry about the delay

Reviewed 3 of 5 files at r1, 2 of 2 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @RaduBerinde)


pkg/kv/kvserver/metrics.go, line 410 at r2 (raw file):

	}
	metaRdbL0Sublevels = metric.Metadata{
		Name:        "rocksdb.l0-sublevels",

There is no concept of sub-levels in rocksdb. I think we should start adding new metrics like this as pebble.l0-sublevels even though it potentially hurts discoverability.
@petermattis @itsbilal @jbowens do you have an opinion?


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx, line 36 at r2 (raw file):

  } = props;

  return [

I am wondering whether sql latency should be on this dashboard. I think of an overload dashboard as something that one goes to to see if there is a resource overload, after one has seen some bad externally visible behavior (like high sql latency) on other dashboards. So it would focus on "resources": cpu, LSM health, root-sql BytesMonitor capacity and usage, (and when we have it) contention metrics for latches and locks.

Copy link
Copy Markdown
Member Author

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @RaduBerinde, and @sumeerbhola)


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx, line 36 at r2 (raw file):

Previously, sumeerbhola wrote…

I am wondering whether sql latency should be on this dashboard. I think of an overload dashboard as something that one goes to to see if there is a resource overload, after one has seen some bad externally visible behavior (like high sql latency) on other dashboards. So it would focus on "resources": cpu, LSM health, root-sql BytesMonitor capacity and usage, (and when we have it) contention metrics for latches and locks.

I agree with you in principle but I wanted to put them here for practical reasons - many times you are looking for correlations between the latency/throughput graphs and overload-related stats. It's hard to look for that if you have to keep switching between dashboards. It's even harder to extract a screenshot that shows a correlation.

If we had a way to see two dashboards one below the other, this wouldn't be necessary. Maybe I should file an issue about that?

@itsbilal
Copy link
Copy Markdown
Contributor


pkg/kv/kvserver/metrics.go, line 410 at r2 (raw file):

Previously, sumeerbhola wrote…

There is no concept of sub-levels in rocksdb. I think we should start adding new metrics like this as pebble.l0-sublevels even though it potentially hurts discoverability.
@petermattis @itsbilal @jbowens do you have an opinion?

Yeah I agree, we shouldn't be naming more things rocksdb. at least. I'd also prefer calling it storage.l0-sublevels as I'm not sure how discoverable the pebble name is, but either of those two is better than rocksdb.

Copy link
Copy Markdown
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @jbowens, @petermattis, and @RaduBerinde)


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx, line 36 at r2 (raw file):

Maybe I should file an issue about that?

Yes, I would prefer that to be solved in a different manner. This issue occurs already, and I don't think we should pollute our curated dashboards to attempt to solve it.
One approach I've seen is a pin beside each graph which one can click. All the pinned graphs can then be seen on a separate pinned dashboard.

Copy link
Copy Markdown
Member Author

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @jbowens, @petermattis, and @sumeerbhola)


pkg/kv/kvserver/metrics.go, line 410 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Yeah I agree, we shouldn't be naming more things rocksdb. at least. I'd also prefer calling it storage.l0-sublevels as I'm not sure how discoverable the pebble name is, but either of those two is better than rocksdb.

What would be the consequences of renaming all existing rocksdb.* to pebble. or 'storage.'?

In any case, seems like I'd be setting a precedent here so I want to make sure I make the right choice for these new stats.


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx, line 36 at r2 (raw file):

Previously, sumeerbhola wrote…

Maybe I should file an issue about that?

Yes, I would prefer that to be solved in a different manner. This issue occurs already, and I don't think we should pollute our curated dashboards to attempt to solve it.
One approach I've seen is a pin beside each graph which one can click. All the pinned graphs can then be seen on a separate pinned dashboard.

Filed #66798 and I'll remove the SQL Queries and SQL Latency graphs from this dashboard.

Copy link
Copy Markdown
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @jbowens, @petermattis, and @sumeerbhola)


pkg/kv/kvserver/metrics.go, line 410 at r2 (raw file):

Previously, RaduBerinde wrote…

What would be the consequences of renaming all existing rocksdb.* to pebble. or 'storage.'?

In any case, seems like I'd be setting a precedent here so I want to make sure I make the right choice for these new stats.

I don't think we want to rename existing ones. Customers may have tooling or custom dashboard links that rely on the existing names. So we'll have a mix of prefixes: rocksdb and pebble/storage.

@RaduBerinde RaduBerinde force-pushed the overload-graph-page branch 2 times, most recently from f620f37 to 22a23fe Compare June 28, 2021 19:45
Copy link
Copy Markdown
Member Author

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @jbowens, @petermattis, and @sumeerbhola)


pkg/kv/kvserver/metrics.go, line 410 at r2 (raw file):

Previously, sumeerbhola wrote…

I don't think we want to rename existing ones. Customers may have tooling or custom dashboard links that rely on the existing names. So we'll have a mix of prefixes: rocksdb and pebble/storage.

Renamed the new ones to storage.*


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx, line 36 at r2 (raw file):

Previously, RaduBerinde wrote…

Filed #66798 and I'll remove the SQL Queries and SQL Latency graphs from this dashboard.

Done.

Copy link
Copy Markdown
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm_strong:

Reviewed 4 of 4 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @dhartunian, @jbowens, @petermattis, and @sumeerbhola)

@RaduBerinde RaduBerinde force-pushed the overload-graph-page branch 2 times, most recently from e6c79fc to d431e43 Compare June 29, 2021 13:56
This commit adds an Overload dashboard in the metrics view. This is
intended to be a convenient way to monitor admission control.

The dashboard contains:
 - CPU Percent
 - Runnable Goroutines per CPU
 - L0 Sublevels and Files

Release note (ui change): a new Overload dashboard groups metrics that
are useful for admission control.
@RaduBerinde RaduBerinde force-pushed the overload-graph-page branch from d431e43 to 92bc49c Compare June 29, 2021 16:29
@RaduBerinde
Copy link
Copy Markdown
Member Author

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jun 29, 2021

Build failed:

@RaduBerinde
Copy link
Copy Markdown
Member Author

Some acceptance flake.

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jun 30, 2021

Build succeeded:

@craig craig bot merged commit 6fbe61b into cockroachdb:master Jun 30, 2021
@RaduBerinde RaduBerinde deleted the overload-graph-page branch July 28, 2021 20:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants