
db: implement min-overlapping ratio heuristic #707

Merged
jbowens merged 1 commit into cockroachdb:master from jbowens:jackson/minoverlap-rangedel
Jun 17, 2020

Conversation

@jbowens
Contributor

@jbowens jbowens commented May 21, 2020

Implement a variant of RocksDB's min-overlapping ratio compaction heuristic.
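For orientation, the min-overlapping ratio heuristic prefers the input file whose overlap with the output level is smallest relative to its own compensated size, so files whose tombstones will reclaim a lot of space get picked sooner. Below is a minimal Go sketch; the types and names are simplified stand-ins, not Pebble's actual picker, which also weighs level scores, internal-key comparisons, and compacting state:

```go
package main

import "fmt"

// fileMeta is a simplified stand-in for Pebble's fileMetadata.
type fileMeta struct {
	Size                        uint64
	RangeDeletionsBytesEstimate uint64 // zero until table stats are loaded
}

// compensatedSize inflates a file's size by the disk space its range
// tombstones are estimated to reclaim when compacted.
func compensatedSize(f fileMeta) uint64 {
	return f.Size + f.RangeDeletionsBytesEstimate
}

// pickFile returns the index of the candidate whose ratio of overlapping
// bytes in the output level to compensated size is smallest -- the
// "min-overlapping ratio". overlap[i] holds the output-level bytes
// overlapped by inputs[i].
func pickFile(inputs []fileMeta, overlap []uint64) int {
	best, bestRatio := -1, 0.0
	for i, f := range inputs {
		ratio := float64(overlap[i]) / float64(compensatedSize(f))
		if best == -1 || ratio < bestRatio {
			best, bestRatio = i, ratio
		}
	}
	return best
}

func main() {
	inputs := []fileMeta{
		{Size: 100},                                   // no tombstones
		{Size: 100, RangeDeletionsBytesEstimate: 400}, // reclaims ~400 bytes below
	}
	fmt.Println(pickFile(inputs, []uint64{50, 50})) // prints 1
}
```

Compensation makes the second file appear five times larger, so its overlap-to-size ratio is lower and it is chosen first.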

@petermattis
Collaborator

This change is Reviewable

@jbowens jbowens force-pushed the jackson/minoverlap-rangedel branch 10 times, most recently from 420da88 to 28de527 on May 26, 2020 12:51
@jbowens jbowens changed the title from "[WIP] db: implement min-overlapping ratio heuristic" to "db: implement min-overlapping ratio heuristic" on May 26, 2020
@jbowens jbowens requested a review from petermattis May 26, 2020 12:56
@jbowens jbowens force-pushed the jackson/minoverlap-rangedel branch 2 times, most recently from 39e3b82 to 75b3493 on May 26, 2020 18:46
@jbowens
Contributor Author

jbowens commented May 26, 2020

Here are some numbers collected via a handful of pebble bench compact workloads:

                                master           75b3493e
replicate-rebalance-3to5
  test time                     20.72m             16.20m  ( -21.81%)
  read amp                           4                  4       -
  write amp                       3.50               3.36  (  -4.00%)
  space amp                       1.24               1.08  ( -12.90%)
  avg db size                    8.1 G              7.1 G  ( -12.35%)

tpccbench
  test time                     47.63m             50.09m  (  +5.16%)
  read amp                           6                  6       -
  write amp                       2.08               2.04  (  -1.92%)
  space amp                       1.07               1.07       -
  avg db size                     39 G               39 G       -

clearrange
  test time                      6.84m              7.01m  (  +2.49%)
  read amp                           4                  2  (-100.00%)
  write amp                       1.06               1.06       -
  space amp                   55961.61               1.11  (- 99.99%)
  avg db size                     32 G               32 G       -

tpcc-headroom-n4cpu16
  test time                     62.44m            71.60m  ( +14.67%)
  read amp                           6                 6       -
  write amp                       2.61              2.73  (  +4.60%)
  space amp                       1.09              1.09       -
  avg db size                     41 G              41 G       -

composite*
  test time                     20.72m            16.73m  (- 19.26%)
  read amp                           4                 3  (- 25.00%)
  write amp                       5.96              5.40  (-  9.40%)
  space amp                      33.68              1.96  (- 94.18%)
  avg db size                    2.8 G             2.8 G       -

* composite was simultaneous tpcc, bank, queue, movr and ledger
  workloads run against the same cluster, with the databases
  dropped at varying times to create range tombstones.

The tpcc workloads suffer some under the new heuristic. It makes sense that the workloads with range deletions throughout would fare better, since freeing space earlier can save you from performing unnecessary compactions. I'm a little surprised at the tpcc numbers, though. An earlier prototype of just the min-overlapping ratio change saw a marked decrease in both time and write amplification on a tpcc workload. I'm going to experiment a little more.

@jbowens
Contributor Author

jbowens commented May 26, 2020

roachtest tpccbench/nodes=3/cpu=4 using this branch for pebble on GCP got 736 max warehouses

Collaborator

@petermattis petermattis left a comment

How are you measuring read amp? If it is just measured at the end of the run then I think we're not getting a good picture. I think we'd want a read-amp-integrated over time metric. Or average-read-amp over the lifetime of the workload.

roachtest tpccbench/nodes=3/cpu=4 using this branch for pebble on GCP got 736 max warehouses

How does this compare against master? Are you relying on https://roachperf.crdb.dev/ for comparison? tpccbench isn't particularly stable in my experience.

Reviewable status: 0 of 17 files reviewed, all discussions resolved (waiting on @petermattis)

Contributor Author

@jbowens jbowens left a comment

How are you measuring read amp? If it is just measured at the end of the run then I think we're not getting a good picture. I think we'd want a read-amp-integrated over time metric. Or average-read-amp over the lifetime of the workload.

Yeah, it's just the read amp at the end of the run. I'll update that today.

How does this compare against master? Are you relying on https://roachperf.crdb.dev/ for comparison? tpccbench isn't particularly stable in my experience.

I ran it overnight against master and got 667 warehouses, but not sure if it's just noise. Is there another tpcc roachtest that's more stable?

Reviewable status: 0 of 17 files reviewed, all discussions resolved

Collaborator

@petermattis petermattis left a comment

The integration of compensated size into compaction picking looks good. I'm less confident about the current structure of updating table stats. See details in the comments below.

Reviewable status: 0 of 17 files reviewed, 11 unresolved discussions (waiting on @jbowens)


compaction_picker.go, line 90 at r1 (raw file):

	// that may be reclaimed by compacting the file's range tombstones.
	if f.Stats.Valid {
		sz += f.Stats.RangeDeletionsBytesEstimate

f.Stats.RangeDeletionsBytesEstimate will be 0 if f.Stats.Valid == false, right? Could compensatedSize() instead be return f.Size + f.Stats.RangeDeletionsBytesEstimate?


compaction_picker.go, line 427 at r1 (raw file):

	cmp := p.opts.Comparer.Compare
	targetFiles := p.vers.Files[outputLevel]

Nit: s/targetFiles/outputLevelFiles/g or something like that.


compaction_picker.go, line 439 at r1 (raw file):

		for targetIndex < len(targetFiles) && base.InternalCompare(cmp, targetFiles[targetIndex].Largest, f.Smallest) < 0 {
			targetIndex++
		}

Do we need to keep a separate targetIndex variable? I think you could accomplish the same by stripping off a prefix of targetFiles:

for len(targetFiles) > 0 && base.InternalCompare(cmp, targetFiles[0].Largest, f.Smallest) < 0 {
  targetFiles = targetFiles[1:]
}

compaction_picker.go, line 442 at r1 (raw file):

		for targetIndex < len(targetFiles) && base.InternalCompare(cmp, targetFiles[targetIndex].Smallest, f.Largest) < 0 {
			overlappingBytes += targetFiles[targetIndex].Size

I wonder if we should look at targetFiles[targetIndex].Compacting here. If targetFiles[targetIndex].Compacting is true, we're not going to be able to do the compaction, but will only determine that later.
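Combining the prefix-stripping suggestion above with this Compacting check, the overlap scan might look roughly like the following. This is a sketch with integer keys standing in for internal keys; the type and field names are illustrative, not Pebble's:

```go
package main

import "fmt"

// file is a pared-down stand-in for fileMetadata; keys are plain ints so
// the example stays self-contained.
type file struct {
	Smallest, Largest int
	Size              uint64
	Compacting        bool
}

// overlappingBytes sums the sizes of output-level files overlapping f,
// skipping files already marked as compacting (per the review
// suggestion, a compaction against them couldn't be scheduled anyway).
// It returns the total and whether any overlapping file was compacting.
func overlappingBytes(f file, outputFiles []file) (uint64, bool) {
	// Strip the prefix of output files entirely before f.
	for len(outputFiles) > 0 && outputFiles[0].Largest < f.Smallest {
		outputFiles = outputFiles[1:]
	}
	var bytes uint64
	var sawCompacting bool
	for len(outputFiles) > 0 && outputFiles[0].Smallest <= f.Largest {
		if outputFiles[0].Compacting {
			sawCompacting = true
		} else {
			bytes += outputFiles[0].Size
		}
		outputFiles = outputFiles[1:]
	}
	return bytes, sawCompacting
}

func main() {
	out := []file{
		{0, 9, 100, false},  // before f: stripped
		{10, 19, 100, true}, // mid-compaction: flagged, not counted
		{20, 29, 100, false},
	}
	b, c := overlappingBytes(file{Smallest: 15, Largest: 25}, out)
	fmt.Println(b, c) // prints 100 true
}
```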


open.go, line 323 at r1 (raw file):

	d.mu.tableStats.cond.L = &d.mu.Mutex
	if !d.opts.ReadOnly && !d.opts.disableTableStats {
		d.mu.versions.newVersionHook = d.updateTableStats

Should versionSet.load invoke newVersionHook? It seems a bit asymmetrical that it doesn't.


table_stats.go, line 13 at r1 (raw file):

	"github.com/cockroachdb/pebble/sstable"
)

This code is nice and clear, yet I think there could be a file-level comment explaining how stats collection works. In particular, that stats collection is done incrementally at open, operates on a snapshot of the version via a readState.


table_stats.go, line 92 at r1 (raw file):

	moreRemain := false
	for l, ff := range rs.current.Files {

This isn't accounting for L0 sublevels. I'm not sure if it needs to, only pointing out that is true. I think we should call this out in the TableStats doc comments.


table_stats.go, line 148 at r1 (raw file):

			// When we estimate the disk usage, we ignore sequence numbers,
			// so we don't double count a key range.
			if d.cmp(startUserKey, start.UserKey) == 0 {

I think you could also continue if the previous tombstone's end user-key equals this tombstone's start user-key. This would effectively merge abutting range tombstones and reduce the number of loops below. I'm not sure how important this is in practice, or just something that occurs in tests.
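The merging described here can be sketched in isolation. Spans below use integer user keys and ignore seqnums, so this is illustrative rather than Pebble's actual tombstone handling:

```go
package main

import "fmt"

// span is a simplified range tombstone over integer user keys
// [Start, End). Real tombstones carry seqnums and byte-slice keys.
type span struct{ Start, End int }

// mergeAbutting coalesces a sorted slice of tombstones, combining spans
// that abut or overlap so the size-estimation loop runs once per merged
// range rather than once per fragment.
func mergeAbutting(spans []span) []span {
	var out []span
	for _, s := range spans {
		if n := len(out); n > 0 && s.Start <= out[n-1].End {
			if s.End > out[n-1].End {
				out[n-1].End = s.End
			}
			continue
		}
		out = append(out, s)
	}
	return out
}

func main() {
	// {1,3} and {3,5} abut, so they merge; {7,9} stands alone.
	fmt.Println(mergeAbutting([]span{{1, 3}, {3, 5}, {7, 9}})) // prints [{1 5} {7 9}]
}
```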


version_set.go, line 454 at r1 (raw file):

	}

	vs.newVersionHook(bve.Added)

It is unfortunate that the new version hook has to loop over all of the sstables. An alternate approach would be to ensure that the table stats are properly populated for an sstable generated by a ingest/flush/compaction. Then the only sstables which we don't know stats for are when we start up. Are there downsides to this alternative approach? I'm generally somewhat wary of O(num-sstable) loops and want to minimize them.

Thinking about this more, I think there are a few structural challenges. During flush/compaction, you could either keep track of the range tombstones as they are being added to an sstable and estimate the overlap with lower levels, or there could be an asynchronous update of the stats after the flush/compaction is finished if the new sstables contain range tombstones.


internal/manifest/version.go, line 49 at r1 (raw file):

	// this table's range deletions to the bottom of the LSM. This estimate is
	// at data-block granularity and is not updated if compactions beneath the
	// table reduce the amount of reclaimable disk space.

Per comment elsewhere, note that this does not account for overlapped data in L0, but this error is expected to be small.


sstable/writer.go, line 33 at r1 (raw file):

	SmallestSeqNum      uint64
	LargestSeqNum       uint64
	Statistics          Statistics

I wonder if this should be Properties. There was no strong reason to not expose the written Properties, we've just never had an impetus before now.

Contributor Author

@jbowens jbowens left a comment

Reviewable status: 0 of 17 files reviewed, 11 unresolved discussions (waiting on @jbowens and @petermattis)


version_set.go, line 454 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

It is unfortunate that the new version hook has to loop over all of the sstables. An alternate approach would be to ensure that the table stats are properly populated for an sstable generated by a ingest/flush/compaction. Then the only sstables which we don't know stats for are when we start up. Are there downsides to this alternative approach? I'm generally somewhat wary of O(num-sstable) loops and want to minimize them.

Thinking about this more, I think there are a few structural challenges. During flush/compaction, you could either keep track of the range tombstones as they are being added to an sstable and estimate the overlap with lower levels, or there could be an asynchronous update of the stats after the flush/compaction is finished if the new sstables contain range tombstones.

One thing to note here (and that I should probably document more clearly in code) is that newVersionHook is only iterating over files added in the new version, rather than the version's full file set. If it finds new files missing stats, the separate tableStats goroutine will still asynchronously iterate over all of the current readState's files.

The O(num-sstable) loop in scanMissingTables for every new version adding files with range tombstones is unfortunate. One idea I entertained for removing scanMissingTableStats's O(num-sstable) loop was to append a (*version, [numLevels][]*fileMetadata) tuple to d.mu.tableStats from the newVersionHook. That would allow the async loadMissingTableStats goroutine to calculate stats of new files always with respect to the version they were added in.

Collaborator

@petermattis petermattis left a comment

Reviewable status: 0 of 17 files reviewed, 11 unresolved discussions (waiting on @jbowens)


version_set.go, line 454 at r1 (raw file):

One thing to note here (and that I should probably document more clearly in code) is that newVersionHook is only iterating over files added in the new version, rather than the version's full file set. If it finds new files missing stats, the separate tableStats goroutine will still asynchronously iterate over all of the current readState's files.

Ah, I missed that. I thought it was taking the entire new version.

The O(num-sstable) loop in scanMissingTables for every new version adding files with range tombstones is unfortunate. One idea I entertained for removing scanMissingTableStats's O(num-sstable) loop was to append a (*version, [numLevels][]*fileMetadata) tuple to d.mu.tableStats from the newVersionHook. That would allow the async loadMissingTableStats goroutine to calculate stats of new files always with respect to the version they were added in.

That would certainly help, though I keep returning to the idea that we should calculate table stats for flushed/compacted/ingested tables either during flush/compaction/ingestion, or immediately afterwards. You'll have the list of added sstables there.

Contributor Author

@jbowens jbowens left a comment

Reviewable status: 0 of 17 files reviewed, 11 unresolved discussions (waiting on @jbowens and @petermattis)


version_set.go, line 454 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

One thing to note here (and that I should probably document more clearly in code) is that newVersionHook is only iterating over files added in the new version, rather than the version's full file set. If it finds new files missing stats, the separate tableStats goroutine will still asynchronously iterate over all of the current readState's files.

Ah, I missed that. I thought it was taking the entire new version.

The O(num-sstable) loop in scanMissingTables for every new version adding files with range tombstones is unfortunate. One idea I entertained for removing scanMissingTableStats's O(num-sstable) loop was to append a (*version, [numLevels][]*fileMetadata) tuple to d.mu.tableStats from the newVersionHook. That would allow the async loadMissingTableStats goroutine to calculate stats of new files always with respect to the version they were added in.

That would certainly help, though I keep returning to the idea that we should calculate table stats for flushed/compacted/ingested tables either during flush/compaction/ingestion, or immediately afterwards. You'll have the list of added sstables there.

Ok, I'll try something out. If we calculate table stats immediately after a flush/compaction/ingestion, it would still be asynchronous right? Something like we finish the compaction, update the read state, append the new files for which stats must be calculated and then start a separate goroutine to load them if one's not yet running?

Collaborator

@petermattis petermattis left a comment

Reviewable status: 0 of 17 files reviewed, 11 unresolved discussions (waiting on @jbowens)


version_set.go, line 454 at r1 (raw file):

Ok, I'll try something out. If we calculate table stats immediately after a flush/compaction/ingestion, it would still be asynchronous right? Something like we finish the compaction, update the read state, append the new files for which stats must be calculated and then start a separate goroutine to load them if one's not yet running?

Yes, the loading could still be asynchronous, though it could also be synchronous if done before DB.mu is reacquired to install the compaction.

Another thought: when doing a compaction, if the inputs already have stats computed, can't we compute the stats for the new table(s) using the inputs? This might not be stable, just curious if it is possible.

@jbowens jbowens force-pushed the jackson/minoverlap-rangedel branch 2 times, most recently from f1d81c7 to 403dc06 on May 29, 2020 16:07
Contributor Author

@jbowens jbowens left a comment

Reviewable status: 0 of 17 files reviewed, 10 unresolved discussions (waiting on @jbowens and @petermattis)


compaction_picker.go, line 90 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

f.Stats.RangeDeletionsBytesEstimate will be 0 if f.Stats.Valid == false, right? Could compensatedSize() instead be return f.Size + f.Stats.RangeDeletionsBytesEstimate?

Done.


compaction_picker.go, line 439 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Do we need to keep a separate targetIndex variable? I think you could accomplish the same by stripping off a prefix of targetFiles:

for len(targetFiles) > 0 && base.InternalCompare(cmp, targetFiles[0].Largest, f.Smallest) < 0 {
  targetFiles = targetFiles[1:]
}

Done.


compaction_picker.go, line 442 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I wonder if we should look at targetFiles[targetIndex].Compacting here. If targetFiles[targetIndex].Compacting is true, we're not going to be able to do the compaction, but will only determine that later.

That's a good idea. Done.


open.go, line 323 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Should versionSet.load invoke newVersionHook? It seems a bit asymmetrical that it doesn't.

Resolved by reorganizing updateTableStats calls into compaction / flush / ingest code paths.


table_stats.go, line 13 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This code is nice and clear, yet I think there could be a file-level comment explaining how stats collection works. In particular, that stats collection is done incrementally at open, operates on a snapshot of the version via a readState.

Done.


version_set.go, line 454 at r1 (raw file):

Yes, the loading could still be asynchronous, though it could also be synchronous if done before DB.mu is reacquired to install the compaction.

I'm not sure if it matters in practice, but I was wary of adding synchronous work to a compaction since active compactions block additional compactions if at the concurrent compaction limit.

Another thought: when doing a compaction, if the inputs already have stats computed, can't we compute the stats for the new table(s) using the inputs? This might not be stable, just curious if it is possible.

I think the fact that an input file might be split across multiple output files makes it difficult (impossible?). It's also an obstacle to tracking dropped keys, because whenever you have a file-level property, how do you split it across multiple outputs?


internal/manifest/version.go, line 49 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Per comment elsewhere, note that this does not account for overlapped data in L0, but this error is expected to be small.

Done.


sstable/writer.go, line 33 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I wonder if this should be Properties. There was no strong reason to not expose the written Properties, we've just never had an impetus before now.

Done.

Contributor Author

@jbowens jbowens left a comment

I refactored the table stats so that new files are appended to a pending slice alongside a readState. In steady state, the stats collection job doesn't need to do any O(num-sstables) scans. When a database is first opened, it does perform those scans, incrementally loading existing table stats until it's sure that all of the initial tables' stats have been populated. It does have the disadvantage of being more complex. I couldn't think of any easy way to merge these two codepaths.

Here's benchstat output for some roachtests of the original SHA: https://gist.github.com/jbowens/6a35612fcdf545d2b2ce835820e7f60f

Not great.

I wonder if the lack of point tombstone compensation is part of the problem. When I was looking at point tombstones in a few sample databases, most point tombstones in the database had already deleted the keys they were intended to delete, so I expected their contribution to space amplification to be minor. I thought it might be a good idea to look at their contribution to read scan latency separately, maybe through triggering compactions based on statistics collected during iteration.

But since historically the RocksDB kMinOverlappingRatio has been a better heuristic for us, maybe that's not true. It's pretty easy to implement RocksDB-style point-tombstone compensation, so I think I'll give that a go just to get some numbers.
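For reference, RocksDB-style point-tombstone compensation charges each deletion entry a multiple of the average value size when sizing files for compaction picking. Roughly, with the constant factor and signature being illustrative rather than either engine's actual API:

```go
package main

import "fmt"

// pointCompensatedSize sketches RocksDB-style point-tombstone
// compensation: each deletion entry is charged an estimate of the bytes
// it will eventually reclaim, here twice the average value size observed
// in the database. The factor of 2 mirrors RocksDB's convention but is
// illustrative.
func pointCompensatedSize(fileSize, numDeletions, avgValueSize uint64) uint64 {
	return fileSize + numDeletions*2*avgValueSize
}

func main() {
	// A 1000-byte file with 10 point deletions and ~50-byte values is
	// treated as if it were twice its physical size.
	fmt.Println(pointCompensatedSize(1000, 10, 50)) // prints 2000
}
```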

Reviewable status: 0 of 17 files reviewed, 10 unresolved discussions (waiting on @jbowens and @petermattis)

Collaborator

@petermattis petermattis left a comment

Here's benchstat output for some roachtests of the original SHA: https://gist.github.com/jbowens/6a35612fcdf545d2b2ce835820e7f60f

Not great.

Unfortunately, run-to-run variance can be significant. I often run each benchmark 20 times. 😬

Reviewable status: 0 of 17 files reviewed, 7 unresolved discussions (waiting on @jbowens and @petermattis)


table_stats.go, line 36 at r2 (raw file):

// When an existing database is opened, all files lack in-memory statistics.
// These files' stats are loaded incrementally whenever the pending list is
// empty by scanning a readState for files missing statistics. Once a job

Which readState is scanned? Hopefully we load a new readState on each increment, otherwise we could pin a single readState for too long.


table_stats.go, line 45 at r2 (raw file):

// version the files appeared in.
type newFilesVersion struct {
	rs    *readState

I'm a little bit wary of this usage of readState. A readState pins the associated memtable and sstables and prevents them from getting freed. If the stats update is falling behind for some reason, this could lead to excessive memory or disk usage. Doing the stats update synchronously after flush/compaction/ingestion would avoid this worry, but your concern about how this affects compaction speed is valid. So which is the bigger concern? Alternately, we could have stats collection run on whatever the current read state is, instead of the read state at the time the compaction occurred.


table_stats.go, line 62 at r2 (raw file):

		return
	}

Is it worth checking to see if any of the new file entries need stats collection?


version_set.go, line 454 at r1 (raw file):

I think the fact that an input file might be split across multiple output files makes it difficult (impossible?). It's also an obstacle to tracking dropped keys, because whenever you have a file-level property, how do you split it across multiple outputs?

Yeah, this suggestion wasn't thought through.

@jbowens jbowens force-pushed the jackson/minoverlap-rangedel branch from 403dc06 to b2f8ce0 on June 9, 2020 16:30
Contributor Author

@jbowens jbowens left a comment

Reviewable status: 0 of 17 files reviewed, 7 unresolved discussions (waiting on @petermattis)


table_stats.go, line 92 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This isn't accounting for L0 sublevels. I'm not sure if it needs to, only pointing out that is true. I think we should call this out in the TableStats doc comments.

Done.


table_stats.go, line 148 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I think you could also continue if the previous tombstone's end user-key equals this tombstone's start user-key. This would effectively merge abutting range tombstones and reduce the number of loops below. I'm not sure how important this is in practice, or just something that occurs in tests.

Good idea. Done.


table_stats.go, line 36 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Which readState is scanned? Hopefully we load a new readState on each increment, otherwise we could pin a single readState for too long.

Yeah, currently it's a new readState every time.


table_stats.go, line 45 at r2 (raw file):
Yeah, I hear you. If we go this asynchronous route, it seems like it might be more appropriate to ref the underlying version instead of a readState so that we don't pin memtables, just sstables.

Alternately, we could have stats collection run on whatever the current read state is, instead of the read state at the time the compaction occurred.

The tricky bit here is that the pending sstable might conceivably have been moved by a move compaction. If the table moved, we no longer know its level in the LSM. We might accidentally count the table itself in the size estimate, or worse, tables above it if the table moved multiple levels down and the level above filled in. Maybe incorporating the table's largest seqnum into the size estimate scan could solve that: when estimating size, we skip the table itself and any tables with smallest seqnums strictly larger than the table's largest seqnum.


table_stats.go, line 62 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Is it worth checking to see if any of the new file entries need stats collection?

Yeah. It's currently structured like this so that I can play with RocksDB-style cumulative point tombstone statistics. I was thinking that if we accumulate running totals like RocksDB does, we'll want to record those in the stats collection job even if the tables already had their stats populated at creation. On second thought though, moved files would be double-counted that way, so maybe something else is necessary. I'll fix it up.

@jbowens jbowens force-pushed the jackson/minoverlap-rangedel branch from b2f8ce0 to fd14db8 on June 9, 2020 22:05
Contributor Author

@jbowens jbowens left a comment

Reviewable status: 0 of 17 files reviewed, 7 unresolved discussions (waiting on @petermattis)


table_stats.go, line 45 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Yeah, I hear you. If we go this asynchronous route, it seems like it might be more appropriate to ref the underlying version instead of a readState so that we don't pin memtables, just sstables.

Alternately, we could have stats collection run on whatever the current read state is, instead of the read state at the time the compaction occurred.

The tricky bit here is that the pending sstable might conceivably have been moved by a move compaction. If the table moved, we no longer know its level in the LSM. We might accidentally count the table itself in the size estimate, or worse, tables above it if the table moved multiple levels down and the level above filled in. Maybe incorporating the table's largest seqnum into the size estimate scan could solve that: when estimating size, we skip the table itself and any tables with smallest seqnums strictly larger than the table's largest seqnum.

I realized the case of the move actually isn't too tricky. If the file was moved, we will encounter it in one of the lower levels. We can clear our current estimate, and continue in any remaining even-lower levels. I added this behavior so we no longer hold readStates except for the current one while a stats collection job is running.
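The move-handling described here might be sketched as follows. The types and flat level representation are simplified stand-ins, not Pebble's actual estimateSizeBeneath, and key overlap is ignored for brevity:

```go
package main

import "fmt"

// levelFile identifies a file by number; Size stands in for the bytes a
// tombstone would reclaim from that file.
type levelFile struct {
	FileNum uint64
	Size    uint64
}

// estimateBeneath sums file sizes in the levels below startLevel. If the
// file whose stats we're computing shows up in a lower level (it was
// moved by a move compaction), everything accumulated so far sits at or
// above its true position, so we discard it and keep counting only the
// levels strictly beneath where we found it.
func estimateBeneath(levels [][]levelFile, startLevel int, fileNum uint64) uint64 {
	var estimate uint64
	for l := startLevel + 1; l < len(levels); l++ {
		foundSelf := false
		for _, f := range levels[l] {
			if f.FileNum == fileNum {
				foundSelf = true
				continue
			}
			estimate += f.Size
		}
		if foundSelf {
			estimate = 0 // the file moved to this level; restart below it
		}
	}
	return estimate
}

func main() {
	levels := [][]levelFile{
		{},                            // the level the file was thought to be in
		{{FileNum: 7, Size: 10}},      // the file itself, moved down a level
		{{FileNum: 8, Size: 40}},      // only this level is truly beneath it
	}
	fmt.Println(estimateBeneath(levels, 0, 7)) // prints 40
}
```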

@jbowens jbowens force-pushed the jackson/minoverlap-rangedel branch 2 times, most recently from eabb95f to dc13f65 on June 9, 2020 22:55
Collaborator

@petermattis petermattis left a comment

This looks really good, but I'd like to see more testing. In particular, estimateSizeBeneath seems to have some cases which aren't tested, such as a file being moved and a file being deleted. (I also just noticed that DB.EstimateDiskUsage is not tested, which is an oversight that should be corrected, though not in this PR.)

The range tombstone merging logic in loadTableStats deserves testing. I think it would be worthwhile to restructure how loadTableStats is implemented so that we can test the range tombstone iteration/merging logic in isolation. I'm imagining a helper function which takes a rangedel iterator and calls a closure for each range that we need to estimate the size for. In the actual implementation the closure would call estimateSizeBeneath.

Are the existing tests verifying that table stats are updated after flush, compaction, and ingestion? I think yes for flush, maybe for compaction, and no for ingestion.

Reviewable status: 0 of 17 files reviewed, 3 unresolved discussions (waiting on @jbowens and @petermattis)


table_stats.go, line 114 at r3 (raw file):

	defer d.mu.Unlock()
	d.mu.tableStats.loading = false
	d.mu.tableStats.loadedInitial = loadedInitial

I wonder if we should add an EventListener event for when all of the table stats have been loaded. Something like EventListener.TableStatsLoaded. I mention this because it seems like an interesting event, particularly on very large DBs.


table_stats.go, line 250 at r3 (raw file):

				return err
			}
			stats.RangeDeletionsBytesEstimate += estimate

Is there a reason to be using += here? Assignment seems safer if we accidentally try to update the stats for a table twice.

@jbowens jbowens force-pushed the jackson/minoverlap-rangedel branch 5 times, most recently from a79ca7a to 5075f79 on June 12, 2020 19:59
Contributor Author

@jbowens jbowens left a comment

I updated the handling of moved and deleted files to detect these cases in loadNewFileStats instead, because these cases are only possible for files appended to the pending list and handling deleted files within loadTableStats felt precarious. I added a (*Version).Contains method for determining whether a level contains a file. This PR ballooned a bit, so I'd be happy to pull that out into a separate PR or commit if it'd be helpful.

I think it would be worthwhile to restructure how loadTableStats is implemented so that we can test the range tombstone iteration/merging logic in isolation.

Good suggestion. Done.

Reviewable status: 0 of 24 files reviewed, 3 unresolved discussions (waiting on @petermattis)


table_stats.go, line 114 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I wonder if we should add an EventListener event for when all of the table stats have been loaded. Something like EventListener.TableStatsLoaded. I mention this because it seems like an interesting event, particularly on very large DBs.

Yeah, that's true. We don't really have any info to report alongside it, just the fact that it happened. I added a TableStatsLoaded event in a separate commit.


table_stats.go, line 250 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Is there a reason to be using += here? Assignment seems safer if we accidentally try to update the stats for a table twice.

stats was a variable local to loadTableStats, but I updated it to be unambiguous by aggregating into a totalRangeDeletionEstimate local variable.
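The fix described here can be sketched as follows. The names below (`computeStats`, `tableStats`) are hypothetical stand-ins; the real `loadTableStats` iterates range tombstones on pebble's internal types. The point is that `+=` only touches a local accumulator, and the table's stats field is set exactly once:

```go
package main

import "fmt"

type tableStats struct {
	RangeDeletionsBytesEstimate uint64
}

// computeStats sums per-tombstone disk-space estimates into a local
// accumulator, then assigns the total to the stats once. Because the
// accumulator is local, recomputing stats for a table can never
// double-count: the final assignment overwrites rather than adds.
func computeStats(estimates []uint64) tableStats {
	var totalRangeDeletionEstimate uint64
	for _, e := range estimates {
		totalRangeDeletionEstimate += e
	}
	var stats tableStats
	stats.RangeDeletionsBytesEstimate = totalRangeDeletionEstimate
	return stats
}

func main() {
	fmt.Println(computeStats([]uint64{10, 20, 30}).RangeDeletionsBytesEstimate)
}
```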

@petermattis petermattis (Collaborator) left a comment

I updated the handling of moved and deleted files to detect these cases in loadNewFileStats instead, because these cases are only possible for files appended to the pending list and handling deleted files within loadTableStats felt precarious. I added a (*Version).Contains method for determining whether a level contains a file. This PR ballooned a bit, so I'd be happy to pull that out into a separate PR or commit if it'd be helpful.

It is fine to leave this in this PR.

The additional testing looks good modulo a few small comments.

:lgtm:

Reviewable status: 0 of 24 files reviewed, 9 unresolved discussions (waiting on @jbowens)


event.go, line 221 at r5 (raw file):

// TableStatsInfo contains the info for a table stats loaded event.
type TableStatsInfo struct {
	// JobID is the ID of the job the finished loading the initial tables'

Nit: s/the job the/the job that/g


table_stats.go, line 250 at r3 (raw file):

Previously, jbowens (Jackson Owens) wrote…

stats was a variable local to loadTableStats, but I updated it to be unambiguous by aggregating into a totalRangeDeletionEstimate local variable.

Ah, got it. I was misunderstanding the code, but this looks good now.


table_stats.go, line 220 at r5 (raw file):

func (d *DB) loadTableStats(v *version, level int, meta *fileMetadata) (manifest.TableStats, error) {
	var stats manifest.TableStats

Is stats used anywhere in the closures below? I think it could be moved down closer to the return from this method.


table_stats_test.go, line 1 at r4 (raw file):

package pebble

Nit: Missing copyright header.


table_stats_test.go, line 92 at r4 (raw file):

}

func TestForeachDefragmentedTombstone(t *testing.T) {

I could see this being a datadriven test instead of a table-driven test. That could make the test cases somewhat more readable, though I'm not seeing any cases I'd want added so perhaps it isn't worth the effort.


table_stats_test.go, line 156 at r4 (raw file):

}

type testRangeDelIter struct {

I think you could use rangedel.Iter here instead. See internal/rangedel.NewIter.


internal/manifest/version.go, line 383 at r4 (raw file):

func (v *Version) Contains(level int, cmp Compare, m *FileMetadata) bool {
	if level == 0 {
		for _, f := range v.Files[level] {

Might be faster to do the binary search thing in the sublevel, though this is more obviously correct.


internal/manifest/version.go, line 394 at r4 (raw file):

	i := sort.Search(len(files), func(i int) bool {
		return cmp(files[i].Smallest.UserKey, m.Smallest.UserKey) >= 0
	})

I think this would be clearer if you used Version.Overlaps. That will return a slice of *FileMetadata which you can iterate over just as you're doing below.
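The two suggestions above (binary search for L0, iterating the overlapping files for higher levels) amount to the following lookup. This is a simplified, self-contained sketch: real pebble keys are `InternalKey`s compared via the configured `Comparer`, and string keys plus an integer `FileNum` stand in for them here:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

type fileMetadata struct {
	Smallest, Largest string
	FileNum           int
}

// contains reports whether files (sorted by smallest key, as in levels
// above L0) includes m. A binary search on the smallest key skips
// straight to the candidates, and the scan stops once a file's smallest
// key passes m's largest key.
func contains(files []fileMetadata, m fileMetadata) bool {
	cmp := strings.Compare
	i := sort.Search(len(files), func(i int) bool {
		return cmp(files[i].Smallest, m.Smallest) >= 0
	})
	for ; i < len(files) && cmp(files[i].Smallest, m.Largest) <= 0; i++ {
		if files[i].FileNum == m.FileNum {
			return true
		}
	}
	return false
}

func main() {
	files := []fileMetadata{{"a", "c", 1}, {"d", "f", 2}, {"g", "k", 3}}
	fmt.Println(contains(files, files[1]), contains(files, fileMetadata{"d", "f", 9}))
}
```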


testdata/table_stats, line 81 at r5 (raw file):

----

table-stats

It wasn't obvious to me looking at the test case that table-stats would wait for the stats to be loaded. Perhaps this should be renamed wait-table-stats.

@jbowens jbowens force-pushed the jackson/minoverlap-rangedel branch from 5075f79 to 578f225 on June 13, 2020 17:23
@jbowens jbowens (Contributor, Author) left a comment

Reviewable status: 0 of 24 files reviewed, 7 unresolved discussions (waiting on @jbowens and @petermattis)


table_stats.go, line 220 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Is stats used anywhere in the closures below? I think it could be moved down closer to the return from this method.

Good call. Done.


table_stats_test.go, line 156 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I think you could use rangedel.Iter here instead. See internal/rangedel.NewIter.

Oh, nice, missed that. Done.


internal/manifest/version.go, line 394 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I think this would be clearer if you used Version.Overlaps. That will return a slice of *FileMetadata which you can iterate over just as you're doing below.

Done.


testdata/table_stats, line 81 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

It wasn't obvious to me looking at the test case that table-stats would wait for the stats to be loaded. Perhaps this should be renamed wait-table-stats.

Done.

@jbowens jbowens force-pushed the jackson/minoverlap-rangedel branch 4 times, most recently from 9a62ed1 to 5ecc584 on June 15, 2020 14:21
@jbowens jbowens (Contributor, Author) left a comment

@petermattis I added a couple things that came out of test flakes that it'd be good to get your opinion on before merging:

1. TestCloseCleanerRace was occasionally failing. When I reproduced it locally, the table stats collector would unref its version, causing a delete obsolete files job to run. Then the test would close its iterator, which would see an active cleaning job and not schedule a new one, leaving an obsolete file un-cleaned. I added a call at the end of Close to clean up any remaining obsolete files before returning.

2. TestRangeDel was occasionally failing when table stats collection would trigger an automatic compaction of a table with a compensated size. I added a private option to disable automatic compactions specifically. I'm not a fan of accumulating a lot of these private options, but it allows us to continue testing table stats and have deterministic test output.

Reviewable status: 0 of 24 files reviewed, 7 unresolved discussions (waiting on @jbowens and @petermattis)

@jbowens jbowens force-pushed the jackson/minoverlap-rangedel branch from 5ecc584 to f810c6c on June 16, 2020 18:18
@petermattis petermattis (Collaborator) left a comment

:lgtm:

Reviewable status: 0 of 24 files reviewed, 3 unresolved discussions (waiting on @jbowens and @petermattis)


options.go, line 438 at r7 (raw file):

	// A private option to disable automatic compactions. Only used by tests.
	disableAutomaticCompactions bool

I wonder if we should group these private options in a private struct { ... } field.

Implement a variant of RocksDB's min-overlapping ratio compaction heuristic,
but compute compensated file sizes using estimates of recoverable disk
space from range deletions. Introduce a table stats collection job that
asynchronously loads and calculates table statistics, preserving them on
the FileMetadata.
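Conceptually, the heuristic scores each candidate by the bytes it overlaps in the output level divided by its compensated size, and prefers the minimum. The sketch below uses made-up names and one plausible formulation (compensated size as size plus the range-deletion estimate); the real accounting lives on pebble's FileMetadata and compaction picker:

```go
package main

import "fmt"

type tableMeta struct {
	Size                        uint64
	RangeDeletionsBytesEstimate uint64
}

// compensatedSize inflates a table's size by the disk space its range
// deletions are estimated to reclaim, making tombstone-heavy tables
// more attractive compaction picks. (One plausible formulation; not
// pebble's exact accounting.)
func compensatedSize(m tableMeta) uint64 {
	return m.Size + m.RangeDeletionsBytesEstimate
}

// overlappingRatio is the min-overlapping-ratio score: bytes the
// candidate overlaps in the output level divided by its compensated
// bytes. Lower is better, so compaction picks the minimum.
func overlappingRatio(candidate tableMeta, overlappingBytes uint64) float64 {
	return float64(overlappingBytes) / float64(compensatedSize(candidate))
}

func main() {
	a := tableMeta{Size: 100}
	b := tableMeta{Size: 100, RangeDeletionsBytesEstimate: 100}
	// Same overlap, but b's range deletions halve its score, so b wins.
	fmt.Println(overlappingRatio(a, 50), overlappingRatio(b, 50))
}
```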
@jbowens jbowens force-pushed the jackson/minoverlap-rangedel branch from f810c6c to 3b7f3ba on June 17, 2020 14:04
@jbowens jbowens merged commit 3b241b7 into cockroachdb:master Jun 17, 2020
@jbowens jbowens (Contributor, Author) commented Jun 17, 2020

TFTR!

@jbowens jbowens deleted the jackson/minoverlap-rangedel branch June 17, 2020 14:15
