kv: txn giving up on refresh span collection causes closed ts to kick it out #44645

@knz

Found by user:

  1. txn starts
  2. txn performs so many operations that its refresh spans exceed max_refresh_span_bytes, and refresh span collection stops
  3. txn lasts for more than 30s
  4. closed ts "catches up", doesn't find refresh spans and "kicks the txn out" (pushes it, and the client receives an error)
  5. the error is not the usual retry error because it is not caused by contention, but the error message does not clarify what is happening
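
To make the sequence above concrete, here is a minimal toy model in Go. It is not CockroachDB's actual transaction coordinator code: `toyTxn`, `span`, `recordRead`, `pushTo`, and `errCannotRefresh` are all hypothetical names, timestamps are plain integers, and the refresh itself is assumed to find no conflicts. It only illustrates why a txn that has given up on refresh span collection cannot survive a push.

```go
package main

import (
	"errors"
	"fmt"
)

// span is a hypothetical stand-in for a key span touched by the txn.
type span struct{ start, end string }

// toyTxn is an illustrative model only, not CockroachDB's implementation.
type toyTxn struct {
	readTS      int64 // timestamp the txn currently operates at
	budgetBytes int   // stand-in for max_refresh_span_bytes
	usedBytes   int   // bytes used by the collected spans
	spans       []span
	invalidated bool // true once span collection has been abandoned
}

// errCannotRefresh is a hypothetical error for a push that cannot be refreshed.
var errCannotRefresh = errors.New("txn was pushed but has no refresh spans; cannot refresh")

// recordRead tracks a span until the byte budget is exceeded, after which
// all collected spans are dropped and collection stops for good.
func (t *toyTxn) recordRead(s span) {
	if t.invalidated {
		return
	}
	size := len(s.start) + len(s.end)
	if t.usedBytes+size > t.budgetBytes {
		t.spans = nil
		t.usedBytes = 0
		t.invalidated = true
		return
	}
	t.usedBytes += size
	t.spans = append(t.spans, s)
}

// pushTo models the txn being forced to a higher timestamp, for example
// because the closed ts has caught up with it.
func (t *toyTxn) pushTo(ts int64) error {
	if ts <= t.readTS {
		return nil // nothing to do
	}
	if t.invalidated {
		return errCannotRefresh
	}
	// A real refresh would re-check every span for conflicting writes;
	// this toy model assumes there are none.
	t.readTS = ts
	return nil
}

func main() {
	txn := &toyTxn{readTS: 10, budgetBytes: 4}
	txn.recordRead(span{"a", "b"})
	fmt.Println("push within budget:", txn.pushTo(20))
	txn.recordRead(span{"cc", "dd"}) // exceeds the budget, collection stops
	fmt.Println("push after giving up:", txn.pushTo(30))
}
```

Note that in this model the second push fails even though no other txn conflicts with it, which matches the observation that the error is not caused by contention.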

There are three separate issues here:

  • we want a larger default for max_refresh_span_bytes so that the scenario becomes less likely. This is predicated on better memory tracking in KV, a separate work item (planned for 20.1; see the work @tbg has started in #44341, "[dnm] kv: expose (and use) byte batch response size limit"). I think this is orthogonal and should be kept out of scope here.

  • when the scenario happens we want the error message to be clearer about what needs to happen: either decrease the duration of the txn, or decrease its number of refresh spans (fewer reads/writes), or increase max_refresh_span_bytes, or increase the closed ts delay

  • or we could avoid the situation entirely? For example, make the closed ts lag behind the long-running txn if it has disabled refresh span collection.
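
As a rough illustration of the second item, the following Go sketch builds an error message that lists the four remedies. The function name `refreshHint`, its parameters, and the exact wording are assumptions made for this example; none of it is taken from the CockroachDB code base.

```go
package main

import (
	"fmt"
	"time"
)

// refreshHint is a hypothetical helper that formats a more descriptive
// error for a txn that outlived the closed ts delay after it stopped
// collecting refresh spans. The wording is illustrative only.
func refreshHint(txnAge, closedTSDelay time.Duration, budgetBytes int64) string {
	return fmt.Sprintf(
		"transaction ran for %s, longer than the closed timestamp delay (%s), "+
			"after exceeding max_refresh_span_bytes (%d bytes); "+
			"consider shortening the transaction, performing fewer reads/writes, "+
			"increasing max_refresh_span_bytes, or increasing the closed timestamp delay",
		txnAge, closedTSDelay, budgetBytes)
}

func main() {
	// Example values only; they are not defaults from any release.
	fmt.Println(refreshHint(45*time.Second, 30*time.Second, 256000))
}
```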

cc @ajwerner @tbg for triage.

Jira issue: CRDB-5215

Metadata

Assignees

No one assigned

    Labels

    C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
    S-3-ux-surprise: Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.
    T-kv: KV Team
