storage,kvserver: Improve SST collision checking for wide SSTs by itsbilal · Pull Request #81062 · cockroachdb/cockroach

itsbilal · 2022-05-05T18:39:23Z

This change includes 4 changes to significantly improve
the performance of CheckSSTCollisions (and by extension,
AddSSTable) for cases where we add very wide sstables
relative to the engine:

Check if we're adding an sstable that has a small number of
keys, or has a 100x greater overlap with the engine relative
to its own size. In that case, switch to doing prefix seeks
inside CheckSSTCollisions, similar to what we did in storage: Use prefix iteration for CheckSSTConflicts #73514
and then reverted later).
Write bloom filters in L6 by default, to help speed up 1.
Also thread a new IterOption to only optionally read these
filter blocks in prefix iteration, defaulting to not reading
them. This iterator option hooks into the one added in
Pebble in *: Add IterOption to optionally read L6 filter blocks. pebble#1685 ).
Add a ParseHook to allow command-line pebble options to
allow configuring the writing of bloom filters.
Update intentInterleavingIter to not prefix-seek the
intent/lock table iterator if we are in prefix seek mode and
the MVCC iterator returned nothing.

Fixes #80980.

Release note (performance improvement): Significantly improve
performance of IMPORTs when the source is producing data
not sorted by the destination table's primary key, especially
if the destination table has a very large primary key with
lots of columns.

cockroach-teamcity · 2022-05-05T18:39:32Z

This change is

itsbilal · 2022-05-06T14:15:29Z

Actually, it'd be better for me to split this into multiple commits. At least the write-L6-bloom-filters-by-default piece can be moved into its own commit, as that'll be the only thing not backported to 22.1.

sumeerbhola

Reviewed 13 of 13 files at r1, 1 of 1 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal)

pkg/kv/kvserver/batcheval/cmd_add_sstable.go line 172 at r1 (raw file):

		bytes, err := cArgs.EvalCtx.GetApproximateDiskBytes(start.Key, end.Key)
		if err == nil {
			usePrefixSeek = bytes > 100*uint64(len(sst))

Can you add some code commentary about how these two constants were chosen. It will help someone in the future if they want to refine the heuristic.

pkg/server/config.go line 636 at r1 (raw file):

						case "none":
							return nil, nil
						case "rocksdb.BuiltinBloomFilter":

we already do this in DefaultPebbleOptions, so I am confused on why we are adding this.

pkg/storage/intent_interleaving_iter.go line 331 at r1 (raw file):

		intentSeekKey = nil
		i.intentKey = nil
	}

are we covering this in the datadriven test?

pkg/storage/pebble.go line 637 at r2 (raw file):

	// of the benefit of having bloom filters on every level for only 10% of the
	// memory cost.
	opts.Levels[6].FilterPolicy = nil

don't we need a way for someone to configure that they want L0-L5 bloomfilters, but not L6 bloomfilters?

pkg/storage/sst.go line 74 at r1 (raw file):

	}

	extIter := reader.NewMVCCIterator(MVCCKeyAndIntentsIterKind, IterOptions{

nit: can this extIter initialization be moved down after the sstIter initialization, so it is all in one place?

pkg/storage/sst.go line 102 at r1 (raw file):

		extOK, extErr = extIter.Valid()
	}
	for sstErr == nil && sstOK && (usePrefixSeek || (extOK && extErr == nil)) {

Can you try coding separate loops for the two cases and see if it looks simple and doesn't have too much code duplication. The loop here is confusing me -- large parts are gated by usePrefixSeek and the part that does the coordinated seeks isn't gated by !seekPrefixGE which is also confusing.

pkg/kv/kvserver/batcheval/cmd_add_sstable_test.go line 703 at r1 (raw file):

						// non-prefix Seek in the conflict check. Sending in nil stats
						// makes this easier as that forces the function to rely exclusively
						// on ApproxDiskBytes.

I didn't understand this comment. I thought approxDiskBytes==0 is when we want the SeekPrefixGE to be used. And why don't we always compute the MVCCStats for the sst?

pkg/storage/sst_test.go line 90 at r1 (raw file):

					startKey, endKey := MVCCKey{Key: roachpb.Key(start)}, MVCCKey{Key: roachpb.Key(end)}
					_, err := CheckSSTConflicts(ctx, sstFile.Bytes(), engine, startKey, endKey,
						false /*disallowShadowing*/, hlc.Timestamp{} /*disallowShadowingBelow*/, tc.maxIntents, false /* usePrefixSeek */)

are we not testing the true case?

jbowens

Reviewed 6 of 13 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal)

pkg/storage/engine.go line 295 at r1 (raw file):

	// is minimized if the probability of the key existing is not low or if
	// this is a one-time Seek (where loading the data block directly is better).
	UseL6Filters bool

Is this expected to be used outside of storage.CheckSSTConflicts? I think we should generally err towards fewer options and knobs exposed through the storage package interface. Can we unexport it?

pkg/storage/sst.go line 110 at r1 (raw file):

			// key in sstIter. Note that we can't just use an exhausted extIter
			// as a sign that we are done; extIter could be skipping keys, so it
			// must be re-seeked.

by "skipping keys" this comment is referring to the fact that exhaustion of a prefix iterator indicates the prefix is exhausted, not that there are no keys after the prefix? I think it could use some rephrasing.

nicktrav

Reviewed 13 of 13 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal and @sumeerbhola)

-- commits line 40 at r2:

mitigated

nit: is it worth mentioning the tradeoff of additional disk usage due to the writing of these L6 filter blocks?

Also - is this worth of a release note, given it is "visible" under the interpretation that an operator would observe a material disk usage increase after this change.

pkg/kv/kvserver/batcheval/cmd_add_sstable.go line 172 at r1 (raw file):

Previously, sumeerbhola wrote…

Can you add some code commentary about how these two constants were chosen. It will help someone in the future if they want to refine the heuristic.

In addition to the commentary, mind lifting to a const (either in this method, or outside of it). Will focus the commentary a bit more.

pkg/storage/sst.go line 67 at r1 (raw file):

		nonPrefixIter := reader.NewMVCCIterator(MVCCKeyAndIntentsIterKind, IterOptions{UpperBound: end.Key})
		nonPrefixIter.SeekGE(start)
		valid, _ := nonPrefixIter.Valid()

nit: store err here and return a few lines below? IIUC valid will be false if err != nil, so you don't need to check the err.

itsbilal

TFTRs!

I actually hadn't run the microbenchmark earlier for figuring out a good threshold to set usePrefixSeek. Here it is. I just adapted BenchmarkCheckSSTCollisions to test along the overlap bytes axis, not along the earlier number-of-versions axis. Old = usePrefixSeek false, new = usePrefixSeek true:

CheckSSTConflicts/keys=1000/sstKeys=10/overlap=true/prefixSeek=false-24         76.6µs ± 1%   69.7µs ± 2%    -9.01%  (p=0.000 n=10+10)
CheckSSTConflicts/keys=1000/sstKeys=200/overlap=true/prefixSeek=false-24         379µs ± 1%    181µs ± 1%   -52.07%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/sstKeys=1000/overlap=true/prefixSeek=false-24        887µs ± 2%    654µs ± 1%   -26.23%  (p=0.000 n=10+10)
CheckSSTConflicts/keys=1000/sstKeys=10000/overlap=true/prefixSeek=false-24      2.13ms ± 1%   5.90ms ± 1%  +177.30%  (p=0.000 n=10+10)
CheckSSTConflicts/keys=10000/sstKeys=10/overlap=true/prefixSeek=false-24        76.7µs ± 1%   67.9µs ± 1%   -11.44%  (p=0.000 n=10+10)
CheckSSTConflicts/keys=10000/sstKeys=200/overlap=true/prefixSeek=false-24        414µs ± 1%    181µs ± 1%   -56.20%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=10000/sstKeys=1000/overlap=true/prefixSeek=false-24      1.76ms ± 1%   0.66ms ± 1%   -62.53%  (p=0.000 n=9+10)
CheckSSTConflicts/keys=10000/sstKeys=10000/overlap=true/prefixSeek=false-24     7.96ms ± 2%   5.99ms ± 1%   -24.70%  (p=0.000 n=10+10)
CheckSSTConflicts/keys=100000/sstKeys=10/overlap=true/prefixSeek=false-24       73.8µs ± 1%   66.3µs ± 1%   -10.16%  (p=0.000 n=10+10)
CheckSSTConflicts/keys=100000/sstKeys=200/overlap=true/prefixSeek=false-24       472µs ± 0%    176µs ± 0%   -62.63%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=100000/sstKeys=1000/overlap=true/prefixSeek=false-24     2.09ms ± 1%   0.65ms ± 0%   -68.81%  (p=0.000 n=10+8)
CheckSSTConflicts/keys=100000/sstKeys=10000/overlap=true/prefixSeek=false-24    19.4ms ± 1%    6.1ms ± 1%   -68.84%  (p=0.000 n=10+9)
CheckSSTConflicts/keys=1000/sstKeys=100/overlap=true/prefixSeek=false-24         219µs ± 0%    123µs ± 1%   -44.12%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/sstKeys=100000/overlap=true/prefixSeek=false-24     6.97ms ± 0%  58.27ms ± 0%  +735.83%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=10000/sstKeys=100/overlap=true/prefixSeek=false-24        240µs ± 2%    123µs ± 1%   -48.57%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=10000/sstKeys=100000/overlap=true/prefixSeek=false-24    21.5ms ± 1%   59.2ms ± 2%  +174.97%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=100000/sstKeys=100/overlap=true/prefixSeek=false-24       257µs ± 2%    120µs ± 2%   -53.25%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=100000/sstKeys=100000/overlap=true/prefixSeek=false-24   81.3ms ± 1%   60.5ms ± 0%   -25.59%  (p=0.008 n=5+5)

I've split it up between overlap ratios between sst keys (numerator) and engine keys (denominator):

1/10000:
CheckSSTConflicts/keys=100000/sstKeys=10/overlap=true/prefixSeek=false-24       73.8µs ± 1%   66.3µs ± 1%   -10.16%  (p=0.000 n=10+10)

1/1000:
CheckSSTConflicts/keys=10000/sstKeys=10/overlap=true/prefixSeek=false-24        76.7µs ± 1%   67.9µs ± 1%   -11.44%  (p=0.000 n=10+10)
CheckSSTConflicts/keys=100000/sstKeys=100/overlap=true/prefixSeek=false-24       257µs ± 2%    120µs ± 2%   -53.25%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=100000/sstKeys=200/overlap=true/prefixSeek=false-24       472µs ± 0%    176µs ± 0%   -62.63%  (p=0.008 n=5+5)

1/100:
CheckSSTConflicts/keys=1000/sstKeys=10/overlap=true/prefixSeek=false-24         76.6µs ± 1%   69.7µs ± 2%    -9.01%  (p=0.000 n=10+10)
CheckSSTConflicts/keys=10000/sstKeys=100/overlap=true/prefixSeek=false-24        240µs ± 2%    123µs ± 1%   -48.57%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=10000/sstKeys=200/overlap=true/prefixSeek=false-24        414µs ± 1%    181µs ± 1%   -56.20%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=100000/sstKeys=1000/overlap=true/prefixSeek=false-24     2.09ms ± 1%   0.65ms ± 0%   -68.81%  (p=0.000 n=10+8)

1/10:
CheckSSTConflicts/keys=1000/sstKeys=100/overlap=true/prefixSeek=false-24         219µs ± 0%    123µs ± 1%   -44.12%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=1000/sstKeys=200/overlap=true/prefixSeek=false-24         379µs ± 1%    181µs ± 1%   -52.07%  (p=0.008 n=5+5)
CheckSSTConflicts/keys=10000/sstKeys=1000/overlap=true/prefixSeek=false-24      1.76ms ± 1%   0.66ms ± 1%   -62.53%  (p=0.000 n=9+10)
CheckSSTConflicts/keys=100000/sstKeys=10000/overlap=true/prefixSeek=false-24    19.4ms ± 1%    6.1ms ± 1%   -68.84%  (p=0.000 n=10+9)

1/1:
CheckSSTConflicts/keys=1000/sstKeys=1000/overlap=true/prefixSeek=false-24        887µs ± 2%    654µs ± 1%   -26.23%  (p=0.000 n=10+10)
CheckSSTConflicts/keys=10000/sstKeys=10000/overlap=true/prefixSeek=false-24     7.96ms ± 2%   5.99ms ± 1%   -24.70%  (p=0.000 n=10+10)
CheckSSTConflicts/keys=100000/sstKeys=100000/overlap=true/prefixSeek=false-24   81.3ms ± 1%   60.5ms ± 0%   -25.59%  (p=0.008 n=5+5)

10/1:
CheckSSTConflicts/keys=1000/sstKeys=10000/overlap=true/prefixSeek=false-24      2.13ms ± 1%   5.90ms ± 1%  +177.30%  (p=0.000 n=10+10)
CheckSSTConflicts/keys=10000/sstKeys=100000/overlap=true/prefixSeek=false-24    21.5ms ± 1%   59.2ms ± 2%  +174.97%  (p=0.008 n=5+5)

100/1:
CheckSSTConflicts/keys=1000/sstKeys=100000/overlap=true/prefixSeek=false-24     6.97ms ± 0%  58.27ms ± 0%  +735.83%  (p=0.008 n=5+5)

Excluding the tiny-sst cases (where very little time is spent iterating through keys), we see that cases where the sst:engine overlap is 1:10 or lower (i.e. more engine keys per sst key), then there's a clear >50% benefit to using prefix seeks. At 1:1 it's more of a wash, and when the ratio is more sst-heavy, it's definitely better to do the regular seek.

Maybe this calls for setting the ratio to 10 instead of the currently-conservative 100. I could rerun my end-to-end experiments with the const at 10, but I'm sure it won't change anything because all the overlap ratios (in bytes) that I saw there were in the 200-10,000 range.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @jbowens, @nicktrav, and @sumeerbhola)

-- commits line 40 at r2:

Previously, nicktrav (Nick Travers) wrote…

mitigated

nit: is it worth mentioning the tradeoff of additional disk usage due to the writing of these L6 filter blocks?

Also - is this worth of a release note, given it is "visible" under the interpretation that an operator would observe a material disk usage increase after this change.

I don't think that'd be necessary. We saw only a 1% increase in space amplification in L6 with filter blocks enabled. For the LSm as a whole, the % increase will be even lower. The memory usage of repeatedly reading that block is a bigger concern and that's why we mitigated it with that IterOption.

pkg/kv/kvserver/batcheval/cmd_add_sstable.go line 172 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote…

In addition to the commentary, mind lifting to a const (either in this method, or outside of it). Will focus the commentary a bit more.

Done.

pkg/server/config.go line 636 at r1 (raw file):

Previously, sumeerbhola wrote…

we already do this in DefaultPebbleOptions, so I am confused on why we are adding this.

The goal is to be able to opt into L6 bloom filters with a command line argument. Currently you can't do that because our parser ignores filter policies. A customer that wants to do a big import on 22.1 could opt into it by passing this flag, without us changing the default behaviour.

The DefaultPebbleOptions change is in the second commit which won't be backported.

pkg/storage/engine.go line 295 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Is this expected to be used outside of storage.CheckSSTConflicts? I think we should generally err towards fewer options and knobs exposed through the storage package interface. Can we unexport it?

Unexported. Feels a little weird to have an unexported field here but it's better than the alternative.

pkg/storage/intent_interleaving_iter.go line 331 at r1 (raw file):

Previously, sumeerbhola wrote…

are we covering this in the datadriven test?

yep, pkg/storage/testdata/intent_interleaving_iter/basic:143 tests this already.

pkg/storage/pebble.go line 637 at r2 (raw file):

Previously, sumeerbhola wrote…

don't we need a way for someone to configure that they want L0-L5 bloomfilters, but not L6 bloomfilters?

They can do that by passing in a [Level 6]TableFilter=none pebble option. But we're changing the default write-time behaviour on master (after 22.1), as the disk space amplification is minimal (1%). The bigger worry is read-time memory usage, which we'll mitigate with the IterOption defaulting to false except in CheckSSTCollisions.

pkg/storage/sst.go line 67 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote…

nit: store err here and return a few lines below? IIUC valid will be false if err != nil, so you don't need to check the err.

Done.

pkg/storage/sst.go line 74 at r1 (raw file):

Previously, sumeerbhola wrote…

nit: can this extIter initialization be moved down after the sstIter initialization, so it is all in one place?

Done.

pkg/storage/sst.go line 102 at r1 (raw file):

Previously, sumeerbhola wrote…

Can you try coding separate loops for the two cases and see if it looks simple and doesn't have too much code duplication. The loop here is confusing me -- large parts are gated by usePrefixSeek and the part that does the coordinated seeks isn't gated by !seekPrefixGE which is also confusing.

There are only two places in the loop where we use if usePrefixSeek, so the separate loop thing duplicates a lot of code. I did add more comments if that helps. What part are you referring to where we do the coordinated seeks? The one where cmp < 0 down below? In that case we would be seeking the engine iterator forward either way. I could maybe rewrite that if statement, but you'd still need a conditional for the sstIter.NextKey() part

Any of the cases near the bottom where we're calling a Next on the sst iter and then seeking the engine iterator, we don't need to do anything as those cases will work as-is for both kinds of seeks. The trySeekUsingNext optimization also helps a lot there.

pkg/storage/sst.go line 110 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

by "skipping keys" this comment is referring to the fact that exhaustion of a prefix iterator indicates the prefix is exhausted, not that there are no keys after the prefix? I think it could use some rephrasing.

Yep I meant exhaustion mostly, but also I just generally meant that we can't rely on prefix iteration to give us all keys. Clarifying.

pkg/kv/kvserver/batcheval/cmd_add_sstable_test.go line 703 at r1 (raw file):

Previously, sumeerbhola wrote…

I didn't understand this comment. I thought approxDiskBytes==0 is when we want the SeekPrefixGE to be used. And why don't we always compute the MVCCStats for the sst?

approxDiskBytes=0 is no overlap with the engine, so we want a regular seek to just skip past everything.

Not sending in stats is a weird test-only knob to get around the "if less than 100 keys then always prefix seek" check in cmd_add_sstable, because that relies on the stats to be passed in. We still calculate stats but further down in that function. That's what the comment is talking about. I'll clarify it a little bit.

pkg/storage/sst_test.go line 90 at r1 (raw file):

Previously, sumeerbhola wrote…

are we not testing the true case?

We were, in the EvalAddSSTable test. But now I fixed it here too, good catch.

nicktrav

All my feedback addressed.

Reviewed 12 of 12 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @itsbilal, @jbowens, and @sumeerbhola)

pkg/kv/kvserver/batcheval/cmd_add_sstable.go line 89 at r3 (raw file):

// i.e. the engine:sst byte/key ratio is high and skewed in favour of the engine.
// In those cases, since we are less likely to skip SST keys with regular seeks
// on the engine, it is more efficient to utilize bloom filters the engine's SSTs

nit: "utilize bloom filters on the engine's SSTs"?

jbowens

Reviewed 1 of 1 files at r4.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @itsbilal and @sumeerbhola)

pkg/storage/pebble_iterator.go line 129 at r4 (raw file):

	// through in pebble.
	//
	// p.options.useL6Filters = opts.useL6Filters

Pebble bump is merged now. 👍

pkg/storage/sst.go line 102 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

There are only two places in the loop where we use if usePrefixSeek, so the separate loop thing duplicates a lot of code. I did add more comments if that helps. What part are you referring to where we do the coordinated seeks? The one where cmp < 0 down below? In that case we would be seeking the engine iterator forward either way. I could maybe rewrite that if statement, but you'd still need a conditional for the sstIter.NextKey() part

Any of the cases near the bottom where we're calling a Next on the sst iter and then seeking the engine iterator, we don't need to do anything as those cases will work as-is for both kinds of seeks. The trySeekUsingNext optimization also helps a lot there.

Since the usePrefixSeek-gated sections are concentrated at the beginning, where we're positioning the iterators, maybe there's a split implementation that factors out the remainder of the loop into a compareForConflicts function?

pkg/kv/kvserver/batcheval/cmd_add_sstable_test.go line 703 at r1 (raw file):

approxDiskBytes=0 is no overlap with the engine, so we want a regular seek to just skip past everything.

Do we know how frequent this case is? Is it only relevant for AddSSTables appending to the very end of the keyspace? For the same reason snapshots tend to end up in L0 (#80589), I'd expect boundary overlap would produce small non-zero byte estimates for key ranges other than those at the very end of the keyspace (eg, during an ordered import into a single table with no indexes).

itsbilal

TFTRs!

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jbowens, @nicktrav, and @sumeerbhola)

pkg/kv/kvserver/batcheval/cmd_add_sstable.go line 89 at r3 (raw file):

Previously, nicktrav (Nick Travers) wrote…

nit: "utilize bloom filters on the engine's SSTs"?

Done.

pkg/storage/pebble_iterator.go line 129 at r4 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Pebble bump is merged now. 👍

Done.

pkg/storage/sst.go line 102 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Since the usePrefixSeek-gated sections are concentrated at the beginning, where we're positioning the iterators, maybe there's a split implementation that factors out the remainder of the loop into a compareForConflicts function?

Done.

pkg/kv/kvserver/batcheval/cmd_add_sstable_test.go line 703 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

approxDiskBytes=0 is no overlap with the engine, so we want a regular seek to just skip past everything.

Do we know how frequent this case is? Is it only relevant for AddSSTables appending to the very end of the keyspace? For the same reason snapshots tend to end up in L0 (#80589), I'd expect boundary overlap would produce small non-zero byte estimates for key ranges other than those at the very end of the keyspace (eg, during an ordered import into a single table with no indexes).

I saw it a lot for sorted imports, where the overlap was straight up 0 in the engine. More commonly than when the overlap was a small nonzero value. Either way this is just a special case in the test, the actual code in practice doesn't special-case == 0.

sumeerbhola

Reviewed 4 of 12 files at r3, 3 of 6 files at r5, 1 of 1 files at r6.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @itsbilal, @jbowens, @nicktrav, and @sumeerbhola)

pkg/storage/sst.go line 86 at r6 (raw file):

	}
	defer sstIter.Close()
	sstIter.SeekGE(start)

can you move this sstIter.SeekGE to after compareForCollision so it is in the same place as sstIter.Valid()?

pkg/storage/sst.go line 224 at r6 (raw file):

			if err := ctx.Err(); err != nil {
				return enginepb.MVCCStats{}, err
			}

Can we move the following here and not do it before the for loop?

extIter.SeekGE(MVCCKey{Key: sstIter.UnsafeKey().Key})
extOK, extErr = extIter.Valid()

pkg/storage/sst.go line 234 at r6 (raw file):

				sstOK, sstErr = sstIter.Valid()
				if sstOK {
					extIter.SeekGE(MVCCKey{Key: sstIter.UnsafeKey().Key})

this seek could happen in the next iteration instead of being coded here.

Also, we seem to be ignoring the case where extErr != nil.

pkg/storage/sst.go line 244 at r6 (raw file):

			// Keep seeking the iterators until both keys are equal.
			if cmp := bytes.Compare(extKey.Key, sstKey.Key); cmp < 0 {

aren't the cases cmp < 0 and cmp > 0 an internal error since we are doing prefix iteration and SeekPrefixGE guarantees that the prefix must match?

itsbilal

TFTR!

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jbowens, @nicktrav, and @sumeerbhola)

pkg/storage/sst.go line 86 at r6 (raw file):

Previously, sumeerbhola wrote…

can you move this sstIter.SeekGE to after compareForCollision so it is in the same place as sstIter.Valid()?

Done.

pkg/storage/sst.go line 224 at r6 (raw file):

Previously, sumeerbhola wrote…

Can we move the following here and not do it before the for loop?
extIter.SeekGE(MVCCKey{Key: sstIter.UnsafeKey().Key})
extOK, extErr = extIter.Valid()

Done.

pkg/storage/sst.go line 234 at r6 (raw file):

Previously, sumeerbhola wrote…

this seek could happen in the next iteration instead of being coded here.

Also, we seem to be ignoring the case where extErr != nil.

Done.

pkg/storage/sst.go line 244 at r6 (raw file):

Previously, sumeerbhola wrote…

aren't the cases cmp < 0 and cmp > 0 an internal error since we are doing prefix iteration and SeekPrefixGE guarantees that the prefix must match?

Not necessarily, since the prefix is defined as only the user key part of the MVCC key and we could still have cmp < 0 or cmp > 0 for cases where we find a different version of the other key.

sumeerbhola

Reviewed 1 of 2 files at r7.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @itsbilal, @jbowens, @nicktrav, and @sumeerbhola)

pkg/storage/sst.go line 244 at r6 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Not necessarily, since the prefix is defined as only the user key part of the MVCC key and we could still have cmp < 0 or cmp > 0 for cases where we find a different version of the other key.

But bytes.Compare(extKey.Key, sstKey.Key) is comparing the roachpb.Key, which is the prefix.

pkg/storage/sst.go line 212 at r8 (raw file):

		if sstOK {
			extIter.SeekGE(MVCCKey{Key: sstIter.UnsafeKey().Key})
		}

I was suggesting removing this SeekGE since we now do it at the beginning of the for loop.

pkg/storage/sst.go line 213 at r8 (raw file):

			extIter.SeekGE(MVCCKey{Key: sstIter.UnsafeKey().Key})
		}
		// In the case of prefix seeks, only look at sst iterator exhaustion/errors.

... seeks, do not look at engine prefix iter exhaustion. This is because ...

(since we do want to look at engine prefix errors)

itsbilal

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jbowens, @nicktrav, and @sumeerbhola)

pkg/storage/sst.go line 244 at r6 (raw file):

Previously, sumeerbhola wrote…

But bytes.Compare(extKey.Key, sstKey.Key) is comparing the roachpb.Key, which is the prefix.

Done. Good point.

pkg/storage/sst.go line 212 at r8 (raw file):

Previously, sumeerbhola wrote…

I was suggesting removing this SeekGE since we now do it at the beginning of the for loop.

Oops, yeah my bad. Done.

pkg/storage/sst.go line 213 at r8 (raw file):

Previously, sumeerbhola wrote…

... seeks, do not look at engine prefix iter exhaustion. This is because ...

(since we do want to look at engine prefix errors)

Done.

sumeerbhola

Reviewed 1 of 12 files at r3, 1 of 2 files at r9.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @itsbilal, @jbowens, @nicktrav, and @sumeerbhola)

pkg/storage/sst.go line 242 at r10 (raw file):

			// We just seeked the engine iter. If it has a mismatching prefix,
			// the iterator is not obeying its contract.
			if cmp := bytes.Compare(extKey.Key, sstKey.Key); cmp != 0 {

nit: could use bytes.Equal.

This change includes 4 changes to significantly improve the performance of CheckSSTCollisions (and by extension, AddSSTable) for cases where we add very wide sstables relative to the engine: 1) Check if we're adding an sstable that has a small number of keys, or has a 100x greater overlap with the engine relative to its own size. In that case, switch to doing prefix seeks inside CheckSSTCollisions, similar to what we did in cockroachdb#73514 and then reverted later). 2) Thread a new IterOption to only optionally read L6 filter blocks in prefix iteration, defaulting to not reading them. This iterator option hooks into the one added in Pebble in cockroachdb/pebble#1685 ). 3) Add a ParseHook to allow command-line pebble options to allow configuring the writing of bloom filters. This allows L6 filter block writing to be turned on through a command line argument. 4) Update intentInterleavingIter to not prefix-seek the intent/lock table iterator if we are in prefix seek mode and the MVCC iterator returned nothing. Fixes cockroachdb#80980. Release note (performance improvement): Significantly improve performance of IMPORTs when the source is producing data not sorted by the destination table's primary key, especially if the destination table has a very large primary key with lots of columns.

Previously, we did not write bloom filters on L6 sstables. The main worry here was the higher memory usage and the overhead of loading a large filter block for the large L6 SSTs in memory. This is now mitigated thanks to the optional reading of L6 filter blocks only in narrow use cases where they're most helpful. Release note: None.

itsbilal · 2022-05-26T02:20:12Z

TFTR!

bors r=sumeerbhola,nicktrav

craig · 2022-05-26T03:17:22Z

Build succeeded:

GitHub CI (Cockroach)

Skip TestStoreMetrics assertions about bloom filter lookups. Pebble now only consults L6 bloom filters if explicitly requested, and CockroachDB only requests their use when checking for ingested sstable collisions (see cockroachdb#81062). Release note: none Fix cockroachdb#81823.

81859: util/mon: augment "budget exceeded" errors from root monitor with a hint r=yuzefovich a=yuzefovich This commit adjusts the root SQL memory monitor to return "budget exceeded" errors with a hint about considering increasing the `--max-sql-memory` startup argument. This should hopefully help the users to be a bit more self-sufficient. Here is an example error from SQL shell: ``` ERROR: scan with start key /Table/105/2/0: root: memory budget exceeded: 4198400 bytes requested, 4689920 currently allocated, 8388608 bytes in budget SQLSTATE: 53200 HINT: Consider increasing --max-sql-memory startup parameter. ``` Release note: None 81923: kv/kvserver: fix flaky TestStoreMetrics r=nicktrav,nvanbenschoten a=jbowens Skip TestStoreMetrics assertions about bloom filter lookups. Pebble now only consults L6 bloom filters if explicitly requested, and CockroachDB only requests their use when checking for ingested sstable collisions (see #81062). Release note: none Fix #81823. 81927: sql: audit json.AsText for nil return value r=yuzefovich a=yuzefovich This commit audits all calls of `json.AsText` to make sure that we always do nil-pointer check. I think all of the callsites where this commit adds the check are impossible to hit the nil-pointer panic, but it's still good to have the checks explicitly. Informs: #81097. Release note: None Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com> Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>

itsbilal requested review from a team as code owners May 5, 2022 18:39

itsbilal self-assigned this May 5, 2022

itsbilal force-pushed the improve-checksstcollisions branch from 66132fd to 0e1fa35 Compare May 5, 2022 19:54

itsbilal force-pushed the improve-checksstcollisions branch from 0e1fa35 to 195485f Compare May 6, 2022 14:29

itsbilal force-pushed the improve-checksstcollisions branch 2 times, most recently from 91340f7 to a4093cc Compare May 16, 2022 14:58

sumeerbhola requested changes May 18, 2022

View reviewed changes

jbowens reviewed May 18, 2022

View reviewed changes

nicktrav reviewed May 18, 2022

View reviewed changes

itsbilal force-pushed the improve-checksstcollisions branch 3 times, most recently from d459f4f to 40a9a92 Compare May 20, 2022 20:31

itsbilal commented May 20, 2022

View reviewed changes

nicktrav approved these changes May 20, 2022

View reviewed changes

nicktrav mentioned this pull request May 23, 2022

Improve IMPORT performance for larger unsorted data sets of #79615

Closed

4 tasks

jbowens reviewed May 23, 2022

View reviewed changes

itsbilal force-pushed the improve-checksstcollisions branch from 40a9a92 to 823ba67 Compare May 24, 2022 16:23

itsbilal commented May 24, 2022

View reviewed changes

itsbilal force-pushed the improve-checksstcollisions branch from 823ba67 to fbcf2e0 Compare May 24, 2022 16:56

nicktrav mentioned this pull request May 24, 2022

kv/kvserver: consider bumping (or removing) server side rate limiters in favor of admission control #81772

Closed

sumeerbhola requested changes May 25, 2022

View reviewed changes

itsbilal force-pushed the improve-checksstcollisions branch from fbcf2e0 to dc37cce Compare May 25, 2022 19:14

itsbilal commented May 25, 2022

View reviewed changes

sumeerbhola requested changes May 25, 2022

View reviewed changes

itsbilal force-pushed the improve-checksstcollisions branch from dc37cce to 555e92c Compare May 25, 2022 23:56

itsbilal commented May 25, 2022

View reviewed changes

sumeerbhola approved these changes May 26, 2022

View reviewed changes

itsbilal added 2 commits May 25, 2022 20:28

itsbilal force-pushed the improve-checksstcollisions branch from 555e92c to 7531327 Compare May 26, 2022 00:28

craig bot merged commit 7a17498 into cockroachdb:master May 26, 2022

cockroach-teamcity mentioned this pull request May 26, 2022

storage,kvserver: Improve SST collision checking for wide SSTs cockroachdb/docs#13953

Closed

jbowens mentioned this pull request May 26, 2022

kv/kvserver: fix flaky TestStoreMetrics #81923

Merged

itsbilal mentioned this pull request Jun 10, 2022

[release-22.1] storage,kvserver: Improve SST collision checking for wide SSTs #82746

Merged

nicktrav mentioned this pull request Jun 23, 2022

storage: metrics / tracing visibility into Pebble iteration mode (prefix vs. non-prefix seeks) #83305

Closed

Conversation

itsbilal commented May 5, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented May 5, 2022

Uh oh!

itsbilal commented May 6, 2022

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

jbowens left a comment

Choose a reason for hiding this comment

Uh oh!

nicktrav left a comment

Choose a reason for hiding this comment

Uh oh!

itsbilal left a comment

Choose a reason for hiding this comment

Uh oh!

nicktrav left a comment

Choose a reason for hiding this comment

Uh oh!

jbowens left a comment

Choose a reason for hiding this comment

Uh oh!

itsbilal left a comment

Choose a reason for hiding this comment

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

itsbilal left a comment

Choose a reason for hiding this comment

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

itsbilal left a comment

Choose a reason for hiding this comment

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

itsbilal commented May 26, 2022

Uh oh!

craig bot commented May 26, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

itsbilal commented May 5, 2022 •

edited

Loading