storage: trigger GC based on SysCount/SysBytes #45573
craig[bot] merged 1 commit into cockroachdb:master
Conversation
ajwerner
left a comment
How do you feel about adding a check for the time since last GC? My specific concern is that we somehow dump megabytes of abort spans in some bug but then cannot GC them because they don't expire for an hour. I'd be okay with time since last GC just being that hour.
Reviewed 1 of 2 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @tbg)
pkg/storage/gc_queue.go, line 343 at r1 (raw file):
r.FinalScore = r.FuzzFactor * (valScore + r.IntentScore)
if probablyLargeAbortSpan(ms) {
nit: if probablyLargeAbortSpan(ms) && !r.ShouldQueue {?
Force-pushed from 1066058 to 7bff2f5
Good idea, done. PTAL @ajwerner
ajwerner
left a comment
Reviewed 2 of 2 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @tbg)
pkg/storage/gc_queue_test.go, line 150 at r2 (raw file):
r := makeGCQueueScoreImpl( context.Background(), seed, hlc.Timestamp{WallTime: thresh + 1},
I think this is off by 1. Either change your condition to r.LikelyLastGC >= storagebase.TxnCleanupThreshold, or make this + 2?
If there is "a lot" of data in Sys{Bytes,Count}, then we are likely
experiencing a large abort span. The abort span is not supposed to
become that large, but it does happen and causes stability fallout,
usually due to a combination of shortcomings:
1. there's no trigger for GC based on abort span size alone (before
this commit)
2. transaction aborts tended to create unnecessary abort span entries,
fixed (and 19.2-backported) in:
cockroachdb#42765 (storage: write to AbortSpan less, clean up AbortSpan more)
3. aborting transactions in a busy loop:
cockroachdb#38088 (schemachange: attempting to update succeeded job over and over)
(and we suspect this also happens in user apps occasionally)
4. large snapshots would never complete due to the queue time limits
(addressed in cockroachdb#44952: storage: make queue timeouts
controllable, snapshot sending queues dynamic).
In an ideal world, we would factor in the abort span into this method
directly, but until then the condition guarding this block will do.
At worst, there is some other reason for SysBytes becoming that large
while also incurring a large SysCount, but I'm not sure how this
would happen. The only other things in this span are the versioned
range descriptors (which follow regular GC, and each update only ever
adds a SysCount of one) and transaction records (which expire like
abort span records).
Release note (bug fix): Range garbage collection will now trigger
based on a large abort span, adding defense-in-depth against ranges
growing large (and eventually unstable).
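As a rough illustration of the heuristic described above, here is a minimal, self-contained sketch of what a `probablyLargeAbortSpan`-style check could look like. The `mvccStats` type and both thresholds are illustrative stand-ins, not the actual `enginepb.MVCCStats` type or the values used in this commit:

```go
package main

import "fmt"

// mvccStats is a hypothetical stand-in for CockroachDB's
// enginepb.MVCCStats, reduced to the system-key fields used here.
type mvccStats struct {
	SysBytes int64 // bytes held in system-local keys (abort span, range descriptors, txn records)
	SysCount int64 // number of system-local keys
}

// probablyLargeAbortSpan guesses that a range carries an oversized abort
// span when both SysBytes and SysCount are unusually high. Requiring both
// conditions guards against the (unlikely) case described in the commit
// message where SysBytes is large for some other reason. The thresholds
// are made up for this sketch.
func probablyLargeAbortSpan(ms mvccStats) bool {
	const sysBytesThreshold = 16 << 20 // 16 MiB of system-local data
	const sysCountThreshold = 10000    // spread over many individual entries
	return ms.SysBytes >= sysBytesThreshold && ms.SysCount >= sysCountThreshold
}

func main() {
	bloated := mvccStats{SysBytes: 64 << 20, SysCount: 50000}
	healthy := mvccStats{SysBytes: 4 << 10, SysCount: 12}
	fmt.Println(probablyLargeAbortSpan(bloated)) // a range that should be queued for GC
	fmt.Println(probablyLargeAbortSpan(healthy)) // a range with an ordinary system span
}
```

In the scoring code this check only forces queueing when the regular score would not already do so (the `&& !r.ShouldQueue` nit above), so it acts purely as a backstop rather than changing normal GC behavior.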
TFTR! I fixed the test (and cleaned it up while I was there). bors r=ajwerner
Build succeeded