obs: export metrics about Go GC Assist work #88178

@nvb

Description

CockroachDB currently exports some timeseries metrics about the Go runtime memory GC:

  • sys.gc.pause.percent: "Current GC pause percentage"
  • sys.gc.pause.ns: "Total GC pause"
  • sys.gc.count: "Total number of GC runs"

These metrics are computed using runtime.ReadGCStats. They are useful to understand how long the Go GC is stopping the world ("pause" refers to STW work, not background work).

However, the Go GC has other costs beyond its STW sweep termination and mark termination phases. In general, the concurrent mark and scan phase can run without pushing work back onto foreground goroutines. But when goroutines allocate memory faster than the GC can clean up (because of heavy allocation, slow GC, or both), GC work can be pushed back onto foreground goroutines in line with their heap allocations. This is known as "GC Assist".

We've seen in cases like this one that GC assist can lead to large spikes in latency that are difficult to understand using other observability tools.

We should find a way to expose this information. Unfortunately, the Go runtime does not export it, except through the GODEBUG=gctrace=1 trace output. We may need to patch the runtime or upstream a change to get at the information programmatically.
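One possible programmatic route, sketched here under assumptions: Go 1.20 and later publish a GC CPU-time breakdown through the `runtime/metrics` package, including `/cpu/classes/gc/mark/assist:cpu-seconds`. Whether the Go version CockroachDB builds with includes that metric is an assumption, so the hypothetical `readFloat64Metric` helper below reports whether the runtime supports it rather than assuming it does.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// Metric names from runtime/metrics (present in Go 1.20+; older runtimes
// report them as unsupported).
const (
	assistMetric  = "/cpu/classes/gc/mark/assist:cpu-seconds"
	gcTotalMetric = "/cpu/classes/gc/total:cpu-seconds"
)

// readFloat64Metric is a hypothetical helper. It returns the metric's value
// and true, or 0 and false if this runtime does not export the metric as a
// float64.
func readFloat64Metric(name string) (float64, bool) {
	sample := []metrics.Sample{{Name: name}}
	metrics.Read(sample)
	if sample[0].Value.Kind() != metrics.KindFloat64 {
		return 0, false // unknown names come back as metrics.KindBad
	}
	return sample[0].Value.Float64(), true
}

func main() {
	assist, ok := readFloat64Metric(assistMetric)
	if !ok {
		fmt.Println("GC assist CPU time is not exported by this Go runtime")
		return
	}
	fmt.Printf("GC assist CPU seconds: %.6f\n", assist)
	if total, ok := readFloat64Metric(gcTotalMetric); ok {
		fmt.Printf("GC total CPU seconds:  %.6f\n", total)
	}
}
```

The values are cumulative CPU-second estimates, so a timeseries metric would likely export the delta between successive samples, in the same way as the pause metrics.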

Jira issue: CRDB-19718

Epic CRDB-34227

Metadata

Assignees

No one assigned

    Labels

    A-kv-observability
    A-observability-inf
    C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
    O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
    v23.2.12