
backup: job fails with "could not find valid split key" error #25770

@nvb

Description

I've recently seen backups failing with errors like the following:

Error: creating backup for table customer: pq: could not mark job 350118039979982849 as failed: split failed while applying backpressure: could not find valid split key: split failed while applying backpressure: could not find valid split key

This is related to #25261 and specifically #24215. It looks like the jobs table row is growing to over 128MB, which triggers the new backpressure mechanism. I suspect we're updating the row too often, causing this huge amount of MVCC growth. Before the backpressure change, these single-row ranges were allowed to grow to arbitrary sizes, which is problematic for a number of reasons. We should adjust backup so that it doesn't create such a large row footprint. There are a few ways we could get around this issue:

  • Reduce how often we update the jobs table to a reasonable rate. Do we really need to checkpoint so often that the row grows to these sizes?
  • Move commonly changed columns to a different column family. This would help if most of the row (size-wise) is static and we're only changing a small column constantly.
  • Make jobs table rows inline (i.e., unversioned, so updates don't accumulate MVCC history). I don't think this is the approach we want to take, but it would fix the issue.

This is easily reproducible by running:

./workload fixtures make tpcc --warehouses=10000

cc @dt @benesch

Metadata

Labels

A-disaster-recovery, C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.)
