obs: export metrics about Go GC Assist work#118875
Conversation
|
Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR. Before a member of our team reviews your PR, I have some potential action items for you:
I have added a few people who may be able to assist in reviewing: 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
|
Thank you for updating your pull request. Before a member of our team reviews your PR, I have some potential action items for you:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
|
checking out the test failures |
|
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? Thank you for updating your pull request. Before a member of our team reviews your PR, I have some potential action items for you:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
lyang24
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @nvanbenschoten)
pkg/server/status/runtime.go line 344 at r1 (raw file):
// in the metricSamples field. func (grm *GoRuntimeSampler) sampleRuntimeMetrics() { metrics.Read(grm.metricSamples)
The alternative here is to always construct the sample array here and read from metrics and then return a map of metric name to metric values. It may be inefficient (it should be ok since we are running this sampling loop every 10s) this approach guarantees 100% correctness.
I am a little concerned with RuntimeStatSampler maybe shared in tenants.go, not clear if the current approach could run into a data race.
pkg/server/status/runtime.go line 357 at r1 (raw file):
metricSamples := make([]metrics.Sample, len(metricNames)) metricIndexes := make(map[string]int, len(metricNames)) for i, n := range metricNames {
One idea is to construct sample array that contains all available metrics. In future if we want to get additional metrics it would be easier
lyang24
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @nvanbenschoten)
pkg/server/status/runtime.go line 704 at r1 (raw file):
gcPauseRatio := float64(uint64(gc.PauseTotal)-rsr.last.gcPauseTime) / dur runnableSum := goschedstats.CumulativeNormalizedRunnableGoroutines() gcAssitSeconds := rsr.goRuntimeSampler.getFloat64(GcAssist)
The GcAssist metrics is exposed as float64 in seconds, should we do a conversion to nano seconds here?
|
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? Thank you for updating your pull request. Before a member of our team reviews your PR, I have some potential action items for you:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
|
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? Thank you for updating your pull request. Before a member of our team reviews your PR, I have some potential action items for you:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
|
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? Thank you for updating your pull request. My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
|
cannot figure out the experimental test failures, seems like a directory on the github action host is missing |
nvb
left a comment
There was a problem hiding this comment.
Reviewed 1 of 5 files at r1, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @lyang24)
-- commits line 6 at r2:
GcAssistSecond
pkg/server/status/runtime.go line 344 at r1 (raw file):
Previously, lyang24 (Lanqing Yang) wrote…
The alternative here is to always construct the sample array here and read from metrics and then return a map of metric name to metric values. It may be inefficient (it should be ok since we are running this sampling loop every 10s) this approach guarantees 100% correctness.
I am a little concerned with RuntimeStatSampler maybe shared in tenants.go, not clear if the current approach could run into a data race.
RuntimeStatSampler can be shared across tenants, but only so that each of those tenants has access to the metric fields. What you have here seems appropriate.
pkg/server/status/runtime.go line 357 at r1 (raw file):
Previously, lyang24 (Lanqing Yang) wrote…
One idea is to construct sample array that contains all available metrics. In future if we want to get additional metrics it would be easier
I think what you have here makes sense, because we don't want to pay for metrics in the metrics.Read call that we aren't using.
pkg/server/status/runtime.go line 704 at r1 (raw file):
Previously, lyang24 (Lanqing Yang) wrote…
The GcAssist metrics is exposed as float64 in seconds, should we do a conversion to nano seconds here?
I think that would make sense, to more closely mirror metrics like sys.gc.pause.ns.
pkg/server/status/runtime.go line 98 at r2 (raw file):
Unit: metric.Unit_PERCENT, } metaGCAssitSeconds = metric.Metadata{
"Assist"
pkg/server/status/runtime.go line 99 at r2 (raw file):
} metaGCAssitSeconds = metric.Metadata{ Name: "sys.gc.assit.seconds",
"assist"
pkg/server/status/runtime.go line 100 at r2 (raw file):
metaGCAssitSeconds = metric.Metadata{ Name: "sys.gc.assit.seconds", Help: "Estimated total CPU time goroutines spent to assist the GC process",
"user goroutines"
pkg/server/status/runtime.go line 303 at r2 (raw file):
var getCgoMemStats func(context.Context) (uint, uint, error) // Estimated total CPU time goroutines spent performing GC tasks to assist the GC and prevent it from falling behind the
nit: wrap at 80 chars.
pkg/server/status/runtime.go line 306 at r2 (raw file):
// application. This metric is an overestimate, and not directly comparable to system CPU time measurements. Compare // only with other /cpu/classes metrics. const GcAssist = "/cpu/classes/gc/mark/assist:cpu-seconds"
This name is a little too broad to be in the global namespace of this package. And it also doesn't need to be exported. How about runtimeMetricGCAssist?
pkg/server/status/runtime.go line 314 at r2 (raw file):
// The collection of metrics we want to sample. metricSamples []metrics.Sample // The mapping to find metric slot in runtimeSamples by name.
s/runtimeSamples/metricSamples/
pkg/server/status/runtime.go line 318 at r2 (raw file):
} // getIndex find postion of metrics in the sample array by name.
"finds the position"
pkg/server/status/runtime.go line 327 at r2 (raw file):
} // getFloat64 get the sampled value by metrics name as getFloat64.
"gets"
pkg/server/status/runtime.go line 327 at r2 (raw file):
} // getFloat64 get the sampled value by metrics name as getFloat64.
s/getFloat64/as a float64/
pkg/server/status/runtime.go line 337 at r2 (raw file):
// in the metricSamples field. func (grm *GoRuntimeSampler) sampleRuntimeMetrics() { metrics.Read(grm.metricSamples)
Have you taken a look at how long this call takes?
pkg/server/status/runtime.go line 485 at r2 (raw file):
log.Ops.Errorf(ctx, "could not get initial network stat counters: %v", err) } runtimeMetrics := []string{GcAssist}
Let's move this to a global, right up below GcAssist.
pkg/server/status/runtime.go line 587 at r2 (raw file):
// The CGoMemStats should be provided via GetCGoMemStats(). func (rsr *RuntimeStatSampler) SampleEnvironment( ctx context.Context, ms *GoMemStats, cs *CGoMemStats,
Have we taken a look at which runtime metrics we'll need to use to eliminate the use of runtime.ReadMemStats, in order to address #88178 (comment)? Do all of those metrics exist in runtime/metrics?
pkg/server/status/runtime.go line 697 at r2 (raw file):
gcPauseRatio := float64(uint64(gc.PauseTotal)-rsr.last.gcPauseTime) / dur runnableSum := goschedstats.CumulativeNormalizedRunnableGoroutines() gcAssitSeconds := rsr.goRuntimeSampler.getFloat64(GcAssist)
"assist" here and everywhere else 😃
lyang24
left a comment
There was a problem hiding this comment.
Thank you for the detailed review. I will take one more look tmr and send it for another look.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @nvanbenschoten)
pkg/server/status/runtime.go line 337 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Have you taken a look at how long this call takes?
added benchmark in the comment
pkg/server/status/runtime.go line 587 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Have we taken a look at which runtime metrics we'll need to use to eliminate the use of
runtime.ReadMemStats, in order to address #88178 (comment)? Do all of those metrics exist inruntime/metrics?
yes I think i have found the metrics that have the same semantics as those fields used in memstats but i haven't yet verified the go runtime impl to see if they have exactly the same impl.
This is the mapping of memstats usage to runtime metrics in my mind
GoAllocBytes: ms.HeapAlloc => /gc/heap/allocs:bytes
HeapFragmentBytes: ms.HeapInuse - ms.HeapAlloc => /memory/classes/heap/unused:bytes
HeapReservedBytes: ms.HeapIdle - ms.HeapReleased => /memory/classes/heap/free:bytes
HeapReleasedBytes: ms.HeapReleased => /memory/classes/heap/released:bytes
MemStackSysBytes: ms.StackSys => /memory/classes/heap/stacks:bytes
goTotal := ms.Sys - ms.HeapReleased => /memory/classes/total:bytes
lyang24
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @nvanbenschoten)
pkg/util/log/eventpb/health_events.proto line 69 at r3 (raw file):
// Estimated total CPU time user goroutines spent performing GC tasks to // assist the GC. Expressed in nanoseconds. uint64 gc_assist_ns = 20 [(gogoproto.customname) = "GCAssistNs", (gogoproto.jsontag) = ",omitempty"];
does it make sense to add this metrics to health_events
nvb
left a comment
There was a problem hiding this comment.
Reviewed 5 of 5 files at r3, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @lyang24)
-- commits line 6 at r3:
"capture represent"
pkg/server/status/runtime.go line 337 at r2 (raw file):
Previously, lyang24 (Lanqing Yang) wrote…
added benchmark in the comment
Thanks for doing that!
pkg/util/log/eventpb/health_events.proto line 69 at r3 (raw file):
Previously, lyang24 (Lanqing Yang) wrote…
does it make sense to add this metrics to health_events
I don't think it needs to be added here. gc_pause_ns is not part of this either. It also looks like there are some compatibility concerns we'll need to contend with if we add the metric to health_events — see doc.go. So I'm 👍 on removing it from the PR.
This commit introduced functions to extract exposed metrics in go runtime metrics api. The runtime metrics is sampled along in SampleEnvironment call every 10 seconds. New metric GcAssistNS is added as an estimate to amount of effort of user go routines assist in gc activities in nanoseconds. Fixes: cockroachdb#88178 Release note: None
lyang24
left a comment
There was a problem hiding this comment.
TFTR! I will replace go memstats with runtime metrics in the following pr
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @nvanbenschoten)
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"capture represent"
done
nvb
left a comment
There was a problem hiding this comment.
Reviewed 4 of 4 files at r4, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @abarganier)
|
bors r+ |
|
Build succeeded: |
|
@lyang24 do you want to open a follow-up PR to add this new metric to the "Runtime" dashboard in the UI, right below the "GC Pause Time" graph? |
This commit adds a line chart contains gc assist duration on the runtime page. The goal is to present the estimated time user go routines spend on assisting gc tasks. Informs: cockroachdb#118875 Release note: None
119506: ui: add charts on gc assist metric r=lyang24 a=lyang24 This commit adds a line chart contains gc assist duration on the runtime page. The goal is to present the estimated time user go routines spend on assisting gc tasks. Informs: #118875 Release note: None gc assist on crdb single node with kv workload gcttl=5s for database in (kv, system)  119877: roachtest: make pg_regress weekly and its failures non-blocking r=yuzefovich a=yuzefovich This should reduce the amount of toil for us until we iron out most of the differences with postgres. Epic: None Release note: None Co-authored-by: lyang24 <lanqingy@usc.edu> Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
This commit adds a line chart contains gc assist duration on the runtime page. The goal is to present the estimated time user go routines spend on assisting gc tasks. Informs: cockroachdb#118875 Release note: None
This commit introduced functions to extract exposed metrics in go runtime
metrics api. The runtime metrics is sampled along in SampleEnvironment call
every 10 seconds. New metric GcAssistNS is added as an estimate to amount
of effort of user go routines assist in gc activities in nanoseconds.
Fixes: #88178
Relase note: None