backup: elevated tail latencies in SQL workload while backing up to s3

We have observed that backups to S3 increase tail latencies in foreground traffic, sometimes significantly.

We currently see some cases where 600+ of heap is in use by the chunk buffers in the SDK. This increases the GC rate even absent memory pressure, simply because of the buffers' size relative to the live heap when a reasonable GOGC is not set, e.g. due to [#115164](https://github.com/cockroachdb/cockroach/issues/115164).
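To illustrate why buffer volume relative to the live heap drives GC frequency, here is a rough sketch. It assumes the simplified pacing model from the Go GC guide (target heap ≈ live heap × (1 + GOGC/100), ignoring GC roots) and the 8 MB chunk size mentioned below. The live-heap sizes and the helper names `nextGCTarget` and `chunksPerCycle` are hypothetical and not taken from CockroachDB or the SDK.

```go
package main

import "fmt"

// nextGCTarget approximates the heap size at which the next GC cycle starts,
// using the simplified model target = live + live*GOGC/100 (GC roots ignored).
func nextGCTarget(liveBytes, gogc uint64) uint64 {
	return liveBytes + liveBytes*gogc/100
}

// chunksPerCycle estimates how many chunk-sized allocations fit in the
// headroom between the live heap and the next GC target.
func chunksPerCycle(liveBytes, gogc, chunkBytes uint64) uint64 {
	return (nextGCTarget(liveBytes, gogc) - liveBytes) / chunkBytes
}

func main() {
	const mib = 1 << 20
	const chunk = 8 * mib // chunk size from this issue

	// Hypothetical live-heap sizes, chosen only to show the trend.
	for _, live := range []uint64{256 * mib, 1024 * mib, 4096 * mib} {
		fmt.Printf("live=%4d MiB, GOGC=100: ~%d chunk allocations per GC cycle\n",
			live/mib, chunksPerCycle(live, 100, chunk))
	}
}
```

Under this model, the smaller the live heap (or the lower the effective GOGC), the fewer chunk buffers it takes to trigger another cycle, which is consistent with the elevated GC rate described above.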

These more frequent GC runs also appear to have higher per-run pause times, sometimes much higher.

The S3 SDK hashes chunk-sized (currently 8 MB) blocks with both MD5 and SHA256, for content checksum and signing respectively. Due in large part to https://github.com/golang/go/issues/64417, this appears to cause the long GC pause times we observe: traces show STW pauses overlapping with block hashing.

This is a tracking issue for all related issues.
- [x] https://github.com/cockroachdb/cockroach/issues/115194
- [x] https://github.com/cockroachdb/cockroach/issues/115189
- [x] https://github.com/cockroachdb/cockroach/issues/115192
- [x] https://github.com/cockroachdb/cockroach/issues/115164
- [x] https://github.com/cockroachdb/cockroach/issues/115193
- [ ] https://github.com/cockroachdb/cockroach/issues/115196

Jira issue: CRDB-33924

backup: elevated tail latencies in SQL workload while backing up to s3 #115190

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

backup: elevated tail latencies in SQL workload while backing up to s3 #115190

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions