backup: elevated tail latencies in SQL workload while backing up to s3 #115190

@dt

We have observed that backups to S3 cause increased tail latencies in foreground traffic, sometimes significantly.

We currently see some cases where 600+ of heap is in use by the chunk buffers in the SDK, leading to increased rates of GC. This happens even absent memory pressure, simply because of the buffers' size relative to the live heap when a reasonable GOGC is not set (e.g. due to #115164).
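As a rough illustration of why buffer size relative to the live heap matters, the sketch below applies the simplified Go pacing rule (next GC starts at roughly live heap plus GOGC percent of it) to count how many chunk-sized buffers can be allocated between GC cycles. The helper names and the 1 GiB live heap are hypothetical, and recent Go versions also include GC roots in the heap goal, so treat the numbers as an approximation only.

```go
package main

import "fmt"

// heapGoal approximates the heap size at which the next GC cycle starts:
// the live heap plus GOGC percent of it. This is a simplification; newer Go
// runtimes also account for GC roots (stacks, globals) in the goal.
func heapGoal(liveBytes, gogc uint64) uint64 {
	return liveBytes + liveBytes*gogc/100
}

// chunksPerCycle estimates how many buffers of chunkBytes can be allocated
// before the heap goal is reached and another GC cycle is triggered.
func chunksPerCycle(liveBytes, gogc, chunkBytes uint64) uint64 {
	headroom := heapGoal(liveBytes, gogc) - liveBytes
	return headroom / chunkBytes
}

func main() {
	const (
		MiB   = 1 << 20
		live  = 1024 * MiB // hypothetical live heap
		chunk = 8 * MiB    // chunk size mentioned in this issue
	)
	// Lower GOGC values leave less headroom, so fewer chunk allocations
	// fit between consecutive GC cycles.
	for _, gogc := range []uint64{100, 50, 25} {
		fmt.Printf("GOGC=%d: goal=%d MiB, chunks per GC cycle=%d\n",
			gogc, heapGoal(live, gogc)/MiB, chunksPerCycle(live, gogc, chunk))
	}
}
```

Under these assumptions, the smaller the headroom is relative to the buffer churn, the more often GC runs, regardless of whether the process is actually short on memory.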

These more frequent GC runs also appear to have higher per-run pause times, sometimes much higher.

The S3 SDK hashes chunk-sized (currently 8 MB) blocks with both MD5 and SHA256, for content checksum and signing respectively. It appears that, due in large part to golang/go#64417, this causes the long GC pause times we observe: traces show STW pauses overlapping with block hashing.

This is a tracking issue for all related issues.

Jira issue: CRDB-33924

Labels: A-disaster-recovery, C-bug, O-support, P-3, T-disaster-recovery
