Skip to content

server/status: add load related metrics#79023

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
darinpp:add-load-related-metrics
Apr 6, 2022
Merged

server/status: add load related metrics#79023
craig[bot] merged 1 commit intocockroachdb:masterfrom
darinpp:add-load-related-metrics

Conversation

@darinpp
Copy link
Copy Markdown
Contributor

@darinpp darinpp commented Mar 30, 2022

Previously there were only CPU related metrics available on the
_status/load endpoint. For serverless we need in addition to these, the
metrics which show the total number of current sql connections, the
number of sql queries executed and the number of jobs currently running
that are not idle. This PR adds the three new metrics by using selective
prometheus exporter and scraping the MetricsRecorder.

Release justification: Low risk, high reward changes to existing functionality

Release note: None

Will be re-based once #79021 and #79022 are merged.

@darinpp darinpp requested a review from a team March 30, 2022 04:35
@darinpp darinpp requested review from a team as code owners March 30, 2022 04:35
@darinpp darinpp requested review from samiskin and removed request for a team March 30, 2022 04:35
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@darinpp darinpp force-pushed the add-load-related-metrics branch from 3add6d5 to 651f39f Compare March 30, 2022 15:55
@abarganier
Copy link
Copy Markdown
Contributor

Similar to 79022, can we add any AlertingRule(s) and/or AggregationRule(s) for this?

@darinpp
Copy link
Copy Markdown
Contributor Author

darinpp commented Mar 30, 2022

Similar to 79022, can we add any AlertingRule(s) and/or AggregationRule(s) for this?

Can you be a bit more specific as to what the rules should be or why do we need to add them? The two of the metrics are not new, they are just being made available through a secondary endpoint. Does that imply a new alerting or aggregation rule?

Copy link
Copy Markdown
Contributor

@abarganier abarganier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in the 79022, we just ask those implementing new metrics to consider it, but it's not mandatory. Thanks for looking into it & considering!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @samiskin)

Copy link
Copy Markdown
Contributor

@andy-kimball andy-kimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @darinpp and @samiskin)


pkg/jobs/registry.go, line 1314 at r6 (raw file):

			if isIdle {
				r.metrics.RunningNonIdleJobs.Dec(1)
				jm.CurrentlyIdle.Inc(1)

Why do we bother with the CurrentlyIdle metric if what we really need is the RunningNonIdleJobs metric? Shouldn't we get rid of CurrentlyIdle, which was only added for the benefit of the autoscaler to begin with?

Copy link
Copy Markdown
Contributor Author

@darinpp darinpp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @samiskin)


pkg/jobs/registry.go, line 1314 at r6 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

Why do we bother with the CurrentlyIdle metric if what we really need is the RunningNonIdleJobs metric? Shouldn't we get rid of CurrentlyIdle, which was only added for the benefit of the autoscaler to begin with?

I wasn't aware that the only reason for these metrics is the autoscaler. I did find it however very useful for debugging purposes to have a breakdown of what actual jobs are running or idle. There isn't any other way to do that as far as I can see. Whatever we decide - we can remove the per job running/idle stats separately. Not as part of this PR

Previously there were only CPU related metrics available on the
_status/load endpoint. For serverless we need in addition to these, the
metrics which show the total number of current sql connections, the
number of sql queries executed and the number of jobs currently running
that are not idle. This PR adds the three new metrics by using selective
prometheus exporter and scraping the MetricsRecorder.

Release justification: Low risk, high reward changes to existing functionality
Release note: None
@darinpp darinpp force-pushed the add-load-related-metrics branch from 651f39f to 6fb0768 Compare April 5, 2022 21:29
Copy link
Copy Markdown
Contributor

@andy-kimball andy-kimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball, @darinpp, and @samiskin)


pkg/server/load_endpoint.go, line 47 at r7 (raw file):

	// Exporter for the selected metrics that also show in /_status/vars.
	exporter2 := metric.MakePrometheusExporterForSelectedMetrics(map[string]struct{}{

It's gotten strange to call these "load metrics", since they now include metrics like non-idle jobs that aren't really load metrics.

Copy link
Copy Markdown
Contributor Author

@darinpp darinpp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball and @samiskin)


pkg/server/load_endpoint.go, line 47 at r7 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

It's gotten strange to call these "load metrics", since they now include metrics like non-idle jobs that aren't really load metrics.

Agree. Will remove the load endpoint #79388

@darinpp
Copy link
Copy Markdown
Contributor Author

darinpp commented Apr 5, 2022

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Apr 6, 2022

Build succeeded:

@craig craig bot merged commit fc97617 into cockroachdb:master Apr 6, 2022
@darinpp darinpp deleted the add-load-related-metrics branch April 6, 2022 03:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants