Skip to content

storage/engine: add pebbleReadOnly#41859

Merged
petermattis merged 2 commits intocockroachdb:masterfrom
petermattis:pmattis/pebble-readonly
Oct 24, 2019
Merged

storage/engine: add pebbleReadOnly#41859
petermattis merged 2 commits intocockroachdb:masterfrom
petermattis:pmattis/pebble-readonly

Conversation

@petermattis
Copy link
Copy Markdown
Collaborator

Reimplement Pebble.ReadOnly with a specific pebbleReadOnly
implementation. This removes a lot of if p.readOnly checks from
Pebble, and also allows for pebbleReadOnly.NewIterator to use a
reusable iterator. This brings the behavior in line with
rocksDBReadOnly.

Added pebbleIterator.{reusable,inuse,setOptions}. Replaced
pebbleBatchIterator with a pebbleIterator marked as reusable.

Release note: None

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@petermattis petermattis force-pushed the pmattis/pebble-readonly branch from 3803e62 to af02589 Compare October 23, 2019 15:39
Copy link
Copy Markdown
Collaborator Author

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajkr, @itsbilal, and @sumeerbhola)


pkg/storage/engine/batch_test.go, line 1015 at r2 (raw file):

					// With Pebble, writes to the distinct batch are readable by the
					// distinct batch. This semantic difference is due to not buffering
					// writes in a builder.

I don't think this difference between Pebble and RocksDB distinct batches will matter, but we all should be aware of it.


pkg/storage/engine/batch_test.go, line 1203 at r2 (raw file):

			switch engineImpl.name {
			case "pebble":
				// Reverse iteration in batches works on Pebble.

A less important difference between Pebble and RocksDB. We don't support reverse iteration on RocksDB batches due to implementation limitations with iteration on rocksdb::WriteBatchWithIndex. No such limitation with Pebble, but this isn't something CRDB uses.

@petermattis petermattis force-pushed the pmattis/pebble-readonly branch 3 times, most recently from 4878a5d to 7d17a26 Compare October 23, 2019 18:11
Copy link
Copy Markdown
Contributor

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, just some questions / concerns.

Reviewed 2 of 5 files at r1, 2 of 4 files at r3, 2 of 2 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajkr, @petermattis, and @sumeerbhola)


pkg/storage/engine/pebble_batch.go, line 114 at r4 (raw file):

		}
	} else if !p.batch.Indexed() {
		r = p.db

Is this just to mimic RocksDB behaviour, or do we actually do reads on distinct batches and expect them to fall through to the underlying db?


pkg/storage/engine/pebble_batch.go, line 146 at r4 (raw file):

	// Use the cached iterator.
	//
	// TODO(itsbilal): Investigate if it's equally or more efficient to just call

Are we confident enough in ditching this TODO? Both performance-wise and correctness wise?

For eg. I just uncovered a minor issue in this area: I noticed that we sometimes end up running into the "iterator already in use" case where we expect a batch to be able to spawn two iterators at once. As a particular example, resolveLocalIntents in batchEval spins up one non-prefix iterator on a batch (the associated Close is in a defer), then calls SetAbortSpan which requests a prefix iterator on the same batch. This is fine in the RocksDB batch case since we cache a prefix and a non-prefix iterator. We could either mimic that setup here and cache two iterators, or just do no caching if the performance impact of that is negligible or none.


pkg/storage/engine/pebble_iterator.go, line 519 at r4 (raw file):

	p.reusable = false
	p.inuse = false
	p.Close()

Calling Close() here with reusable = false is pretty unsafe, since it will put the iterator back in the pebbleIterPool, while the batch it is in will end up in pebbleBatchPool, resulting in cases where two threads could concurrently be using the same pebbleIterator. I actually made the same mistake a couple weeks ago, which is probably a testament to how managing reusable iterators can be confusing and why it's a good idea to consider ditching them altogether.

For now, you can probably just copy the relevant parts of Close (i.e. *p = pebbleIterator{...}) here and that's good enough.

@petermattis petermattis force-pushed the pmattis/pebble-readonly branch from 7d17a26 to cff7cd5 Compare October 23, 2019 21:29
Copy link
Copy Markdown
Collaborator Author

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajkr, @itsbilal, and @sumeerbhola)


pkg/storage/engine/pebble_batch.go, line 114 at r4 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Is this just to mimic RocksDB behaviour, or do we actually do reads on distinct batches and expect them to fall through to the underlying db?

We definitely use this for iteration. The change here was made to mimic the RocksDB behavior for tests in batch_test.go. I don't think we actually use this in any production code.


pkg/storage/engine/pebble_batch.go, line 146 at r4 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Are we confident enough in ditching this TODO? Both performance-wise and correctness wise?

For eg. I just uncovered a minor issue in this area: I noticed that we sometimes end up running into the "iterator already in use" case where we expect a batch to be able to spawn two iterators at once. As a particular example, resolveLocalIntents in batchEval spins up one non-prefix iterator on a batch (the associated Close is in a defer), then calls SetAbortSpan which requests a prefix iterator on the same batch. This is fine in the RocksDB batch case since we cache a prefix and a non-prefix iterator. We could either mimic that setup here and cache two iterators, or just do no caching if the performance impact of that is negligible or none.

Yeah, I'm confident of this performance-wise. Correctness is actually part of the motivation here. Creation of an iterator grabs an implicit snapshot. I'd prefer to mimic the behavior of RocksDB that we're getting by caching the iterator rather than having subtly different semantics.

Can you elaborate on the resolveLocalIntents problem? I haven't seen any test failures due to this. And the change here doesn't actually change the iterator already in use checking.


pkg/storage/engine/pebble_iterator.go, line 519 at r4 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Calling Close() here with reusable = false is pretty unsafe, since it will put the iterator back in the pebbleIterPool, while the batch it is in will end up in pebbleBatchPool, resulting in cases where two threads could concurrently be using the same pebbleIterator. I actually made the same mistake a couple weeks ago, which is probably a testament to how managing reusable iterators can be confusing and why it's a good idea to consider ditching them altogether.

For now, you can probably just copy the relevant parts of Close (i.e. *p = pebbleIterator{...}) here and that's good enough.

Good catch. I've reorganized the code so that Close() now calls destroy() instead. PTAL.

Copy link
Copy Markdown
Contributor

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r6.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajkr and @sumeerbhola)


pkg/storage/engine/pebble_batch.go, line 146 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Yeah, I'm confident of this performance-wise. Correctness is actually part of the motivation here. Creation of an iterator grabs an implicit snapshot. I'd prefer to mimic the behavior of RocksDB that we're getting by caching the iterator rather than having subtly different semantics.

Can you elaborate on the resolveLocalIntents problem? I haven't seen any test failures due to this. And the change here doesn't actually change the iterator already in use checking.

That makes sense, then - let's keep it this way and remove the TODO.

The issue I mentioned would fall under the umbrella of that TODO, but I could address it in a separate PR since it's unrelated to this change. It's not very well tested, but pops up every once in a while on a long running cluster

Just skim through resolveLocalIntents in batcheval/cmd_end_transaction.go, and you'll notice that there's an open batch iterator for the lifetime of that function. There's one case inside it where it could also call SetAbortSpan, which does an MVCCGetProto, which does an MVCCGet, which creates a short-lived prefix iterator on the batch.

In the RocksDB case that's fine since the two are cached independently and we don't end up reusing the same iterator, but in our current implementatino we only cache one iterator, and flip prefix to true as necessary. I guess if preserving semantics is our priority, we probably should just cache two iterators and call it a day.


pkg/storage/engine/pebble_iterator.go, line 519 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Good catch. I've reorganized the code so that Close() now calls destroy() instead. PTAL.

LGTM. There's a subtlety currently where calling destroy then Close could still leak a reusable iterator into the iterator pool, but looking at the places where we call destroy directly, I can't imagine any cases where it would happen. Plus destroy is unexported, so this seems safe.

Copy link
Copy Markdown
Collaborator Author

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajkr and @sumeerbhola)


pkg/storage/engine/pebble_batch.go, line 146 at r4 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

That makes sense, then - let's keep it this way and remove the TODO.

The issue I mentioned would fall under the umbrella of that TODO, but I could address it in a separate PR since it's unrelated to this change. It's not very well tested, but pops up every once in a while on a long running cluster

Just skim through resolveLocalIntents in batcheval/cmd_end_transaction.go, and you'll notice that there's an open batch iterator for the lifetime of that function. There's one case inside it where it could also call SetAbortSpan, which does an MVCCGetProto, which does an MVCCGet, which creates a short-lived prefix iterator on the batch.

In the RocksDB case that's fine since the two are cached independently and we don't end up reusing the same iterator, but in our current implementatino we only cache one iterator, and flip prefix to true as necessary. I guess if preserving semantics is our priority, we probably should just cache two iterators and call it a day.

Hmm. Interesting. It might be worth separating the cached prefix and normal iterators like we do for RocksDB. I'll leave this to a follow-on PR.


pkg/storage/engine/pebble_iterator.go, line 519 at r4 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

LGTM. There's a subtlety currently where calling destroy then Close could still leak a reusable iterator into the iterator pool, but looking at the places where we call destroy directly, I can't imagine any cases where it would happen. Plus destroy is unexported, so this seems safe.

True. I've add a small mitigation which is to maintain the state of p.reusable. If Close() is subsequently called on a reusable iterator, we'll fall into the if p.reusable block.

Reimplement `Pebble.ReadOnly` with a specific `pebbleReadOnly`
implementation. This removes a lot of `if p.readOnly` checks from
`Pebble`, and also allows for `pebbleReadOnly.NewIterator` to use a
reusable iterator. This brings the behavior in line with
`rocksDBReadOnly`.

Added `pebbleIterator.{reusable,inuse,setOptions}`. Replaced
`pebbleBatchIterator` with a `pebbleIterator` marked as reusable.

Release note: None
@petermattis petermattis force-pushed the pmattis/pebble-readonly branch from cff7cd5 to 8b01e76 Compare October 23, 2019 23:37
@petermattis petermattis merged commit f9a1028 into cockroachdb:master Oct 24, 2019
@petermattis petermattis deleted the pmattis/pebble-readonly branch October 24, 2019 00:27
petermattis added a commit to petermattis/cockroach that referenced this pull request Oct 24, 2019
`pebbleBatch.NewIterator` was setting `pebbleBatch.iter.inuse = true`,
and then calling `pebbleIterator.init` which was clearing that
field. This was broken by cockroachdb#41859 which refactored how `pebbleBatch`
iterator reuse works.

Fixes cockroachdb#41899
Copy link
Copy Markdown
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajkr, @itsbilal, and @petermattis)


pkg/storage/engine/pebble.go, line 578 at r8 (raw file):

	}

	if opts.MinTimestampHint != (hlc.Timestamp{}) {

is hlc.Timestamp{} the zero timestamp? Is it possible for the above to be false but the MaxTimestampHint to be provided?

Copy link
Copy Markdown
Collaborator Author

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajkr, @itsbilal, and @sumeerbhola)


pkg/storage/engine/pebble.go, line 578 at r8 (raw file):

Previously, sumeerbhola wrote…

is hlc.Timestamp{} the zero timestamp? Is it possible for the above to be false but the MaxTimestampHint to be provided?

Yes, hlc.Timestamp{} is the zero timestamp. That's a Go-ism.

The logic here is copied from the RocksDB. I'm not sure why only MinTimestampHint is checked. That seems fragile, but we always set both timestamps whenever time-bound iteration is requested.

petermattis added a commit to petermattis/cockroach that referenced this pull request Oct 24, 2019
`pebbleBatch.NewIterator` was setting `pebbleBatch.iter.inuse = true`,
and then calling `pebbleIterator.init` which was clearing that
field. This was broken by cockroachdb#41859 which refactored how `pebbleBatch`
iterator reuse works.

Added separate cached prefix and normal iterators to
`pebble{Batch,ReadOnly}`. Various bits of higher-level code expect to be
able to have a prefix and normal iterator open at the same time. In
particularly, this comes up during intent resolution. This also mimics
our usage of RocksDB which seems desirable in the short term even though
the semantics of having two cached iterators is slightly odd.

Fixes cockroachdb#41899
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants