obs: export metrics about Go GC Assist work #88178
Closed
Labels: A-kv-observability, A-observability-inf, C-enhancement, O-support, v23.2.12
CockroachDB currently exports several time-series metrics about the Go runtime's garbage collector (GC):

- `sys.gc.pause.percent`: "Current GC pause percentage"
- `sys.gc.pause.ns`: "Total GC pause"
- `sys.gc.count`: "Total number of GC runs"

These metrics are computed using `debug.ReadGCStats` (package `runtime/debug`). They are useful for understanding how long the Go GC stops the world ("pause" refers to stop-the-world (STW) work, not background work).

However, the Go GC has costs beyond its STW sweep termination and mark termination phases. Normally, the concurrent mark and scan phase runs without pushing back on foreground goroutines. But when goroutines allocate memory faster than the GC can reclaim it (because of heavy allocation, a slow GC, or both), the runtime shifts GC work onto those foreground goroutines in proportion to their heap allocations. This is known as "GC assist".
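The runtime's own `GODEBUG=gctrace=1` output already reports this breakdown: the runtime package documents each trace line's CPU field as `#+#/#/#+# ms cpu`, where the middle three numbers split mark/scan CPU time into assist, background, and idle GC time. The sketch below parses that field from a single trace line. `parseGCTraceCPU` and `markCPU` are hypothetical names, and the sample line is made up in the documented format rather than captured from a CockroachDB node.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// markCPU is a hypothetical type holding the CPU-time breakdown (in ms)
// that one GODEBUG=gctrace=1 line reports for a single GC cycle.
type markCPU struct {
	SweepTermSTW float64 // STW sweep termination
	Assist       float64 // mark work done in line with allocation (GC assist)
	Background   float64 // background GC mark time
	Idle         float64 // idle GC mark time
	MarkTermSTW  float64 // STW mark termination
}

// cpuRe matches the "#+#/#/#+# ms cpu" field of a gctrace line. The
// "ms clock" field does not match because it has no "/" separators.
var cpuRe = regexp.MustCompile(`([0-9.]+)\+([0-9.]+)/([0-9.]+)/([0-9.]+)\+([0-9.]+) ms cpu`)

// parseGCTraceCPU extracts the CPU breakdown from one gctrace line and
// returns false if the line does not contain the expected field.
func parseGCTraceCPU(line string) (markCPU, bool) {
	m := cpuRe.FindStringSubmatch(line)
	if m == nil {
		return markCPU{}, false
	}
	var vals [5]float64
	for i := range vals {
		v, err := strconv.ParseFloat(m[i+1], 64)
		if err != nil {
			return markCPU{}, false
		}
		vals[i] = v
	}
	return markCPU{vals[0], vals[1], vals[2], vals[3], vals[4]}, true
}

func main() {
	// Illustrative line in the documented format (not real output).
	line := "gc 7 @1.234s 3%: 0.031+2.1+0.045 ms clock, 0.25+1.8/3.9/0.12+0.36 ms cpu, 40->42->21 MB, 43 MB goal, 0 MB stacks, 0 MB globals, 8 P"
	cpu, ok := parseGCTraceCPU(line)
	if !ok {
		fmt.Println("no cpu field found")
		return
	}
	fmt.Printf("assist=%.2fms background=%.2fms idle=%.2fms\n", cpu.Assist, cpu.Background, cpu.Idle)
}
```

Note that the STW components (sweep termination and mark termination) are the only parts reflected in pause-based metrics; the assist component is the part that lands on foreground goroutines.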
We've seen in cases like this one that GC assist can lead to large spikes in latency that are difficult to understand using other observability tools.
We should find a way to expose this information. Unfortunately, the Go runtime does not export it, except through the `GODEBUG=gctrace` tooling. We may need to patch the runtime or upstream a fix to get at the information programmatically.

Jira issue: CRDB-19718
Epic CRDB-34227