
store: order-preserving varint key encoding#5771

Merged
cmwaters merged 39 commits into master from callum/key-encoding
Jan 5, 2021

Conversation

@cmwaters
Contributor

@cmwaters cmwaters commented Dec 8, 2020

Description

This PR:

  • Uses the orderedcode library to lexicographically encode the keys in the state store, block store, evidence pool, and light store.

  • Implements an iterator for quicker pruning of data in both the state store and block store.

  • Removes the blockstore state data type (an in-memory record of the block store's base and height) in favor of Base() and Height() methods that use an iterator to retrieve the values.

Closes: #4567

@cmwaters cmwaters added T:perf Type: Performance T:breaking Type: Breaking Change labels Dec 8, 2020
@cmwaters cmwaters changed the title from "store: oder preserving varint encoding" to "store: oder preserving varint key encoding" Dec 8, 2020
@cmwaters cmwaters changed the title from "store: oder preserving varint key encoding" to "store: order-preserving varint key encoding" Dec 8, 2020
@erikgrinaker
Contributor

erikgrinaker commented Dec 8, 2020

I have reordered the keys so that the height is the prefix and the data type is the suffix

This is neither efficient nor convenient. For example, if there are five different key types using heights as part of the key, we now have to iterate over five times as many entries (which is five times as slow and causes five times as much IO), and we have to filter out all the keys we don't want. It can also have consequences for the IO characteristics of LSM-tree datastores such as LevelDB that try to lay out on-disk data in key order.

Use prefixes, but make sure they are non-overlapping to avoid different key types conflicting with each other (either by using fixed-size prefixes or by using a key segment separator that is guaranteed to not be part of actual segments, which typically involves some sort of escaping scheme for segment encoding).

We may want to consider just using an existing encoding instead of inventing one ourselves and getting it wrong. Here's an example: https://github.com/google/orderedcode

does this still mean order is preserved?

I don't know whether varints are generally lexicographically ordered; I'd have to research it. Little-endian integers, at least, are not lexicographically ordered for large values.

@cmwaters
Contributor Author

cmwaters commented Dec 9, 2020

I did think about this. My reasoning for using the height as the prefix instead of the suffix (granted, I lack a deep understanding of exactly how these database engines work, which is even harder when we are trying to support several) is that we want to order keys to improve the performance of range operations. The only range operation where we can use an iterator is pruning (as far as I understand it). With pruning we remove the entire block for that height, so having the different types (commit, part set, block meta) together means we can just iterate through the heights.

I guess alternatively we could run an iterator for each of the types and make sure we step through them together when pruning. I'm also of the opinion that we shouldn't optimise these things for pruning, but I'm not sure how ordering can improve the other operations the blockstore or statestore performs (can it lead to faster reads?).

I would also be for standardising our encoding system across all databases. When I had a glance, they all had slightly different ways of representing the key.

@erikgrinaker
Contributor

erikgrinaker commented Dec 9, 2020

The only range operation we do where we can use an iterator is with pruning (as far as I understand it). With pruning we are removing the entire block for that height so having the different types (commit, partset, block meta) together means we can just iterate through the heights.

I get your reasoning, but the reason why we're only using iterators when pruning is precisely because we don't have an ordered key encoding, so we can't use iterators for anything (because we would just get things in the wrong order).

There's a bunch of places we should be using iterators but aren't currently. Every time we fetch a range of something, or the first or last item of a sequence, we should most likely be using an iterator. This includes e.g. RPC endpoints (such as /blockchain), base/height metadata for the block store, validator set checkpointing, and many other cases.

In most cases, we are usually only interested in one type of data (for example, in the /blockchain RPC we are only interested in blocks). Having to iterate over and filter out all of the unrelated keys is inefficient and inconvenient. It is more natural to use prefixes, since this allows us to iterate over each type in isolation. Efficiency-wise this is almost always a win. Convenience-wise this is usually a win (but as you say, pruning is a counterexample).

You should also consider how we would store data that doesn't have a height with the suffix scheme -- since integers span the entire byte space, it would be impossible to add a different key of more than 8 bytes and not have it conflict with integers. These random keys would then also turn up when iterating. As a contrived example, the string metadata is 6d65746164617461 as hex, which is equivalent to the integer 7882834684426679393 -- in principle it's impossible to know whether this is height 7882834684426679393 or the metadata key, so we wouldn't know to stop iterating before we reach it.

I'm not sure how ordering can improve the other operations that the blockstore or statestore does (can it lead to faster reads?)

Yes, it's always faster to iterate over 100 keys than it is to fetch each key as a separate operation. This depends on the underlying database, but usually key lookup is O(log n) while iterating is O(1) per step. It also allows the database to optimize disk access by e.g. fetching neighboring keys stored in the same page or prefetching keys.

I would also be for standardising our encoding system across all databases. When I had a glimpse they all had slightly different ways of representing the key.

I agree. In particular, we must take care that keys will not overlap (so that e.g. validator sets suddenly turn up in the middle of block keys).

@erikgrinaker
Contributor

erikgrinaker commented Dec 9, 2020

Btw, we should also keep #2907 in mind (although I don't know if it would make sense to use the same encoding there). We should use a key-encoding that allows key segments to include arbitrary bytes (which needs some sort of escaping scheme for segment separators). For example, if we choose the separator to be / we would have to convert / in segments into \/ and convert \ into \\, and take this into account when parsing the key -- however, this escaping scheme does not preserve order and could not be used. These things are tricky to get right.

@cmwaters
Contributor Author

cmwaters commented Dec 9, 2020

There's a bunch of places we should be using iterators but aren't currently.

I had a look at the light client, which uses an iterator for the first and last light block, but we don't currently have these methods in the blockstore (they could be useful for something like the RPC or for fast sync). We just keep an in-memory record of Base() and Height(). Do we want to add these methods to the blockstore?

You should also consider how we would store data that doesn't have a height

Yeah, this also crossed my mind. With the blockstore we also map block hashes to heights (where the hash is the key), so I thought a hash could find itself between two heights and get accidentally removed in the process. Hence I actually added one prefix for height-related data types and another for the block hashes.

Having to iterate over and filter out all of the unrelated keys is inefficient and inconvenient. It is more natural to use prefixes,

Okay, I think I was designing too specifically for pruning. What I can do is group them back by data type (which can be a single-byte prefix). I can also use binary.BigEndian.PutUint64(buf, height) so that the heights are stored in the "correct" order.

I don't think this will be varint, however (I would have to play around a little more to see). My understanding is that varints reduce the total size of the keys, which allows DBs to keep more of them in memory. However, another part of my brain says that fixed key sizes would make it easier to iterate through the keys (because you know where the next key starts).

We should use a key-encoding that allows key segments to include arbitrary bytes

But with Tendermint's DBs we know in advance what the keys are going to be, so I don't think allowing for arbitrary bytes is necessary. To be honest, I haven't looked into the TxIndexer, so I'm not sure what's really required for it.

@erikgrinaker
Contributor

erikgrinaker commented Dec 9, 2020

I had a look at the light client which uses an iterator for the first and last light block, but we don't currently have these methods in the blockstore (they could be useful for something like the RPC or for fast sync). We just keep an in memory record of the base() and height(). Do we want to add these methods to the blockstore?

We'd have to look into what makes sense in each case. But for the blockstore base and height case, we currently store this metadata explicitly (via e.g. SaveBlockStoreState) and keep updating it, which is unnecessary and brittle - we could instead just check what the actual first and last blocks are directly using an iterator (iterating over one block), and remove all of the block store state stuff.
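That replacement can be sketched against a toy ordered store standing in for the database; the store type, prefix constant, and function names below are hypothetical, not the real tm-db API:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"sort"
)

// orderedStore is a toy key-ordered KV: a sorted list of keys standing in
// for a real database iterator.
type orderedStore struct{ keys [][]byte }

func (s *orderedStore) put(key []byte) {
	s.keys = append(s.keys, key)
	sort.Slice(s.keys, func(i, j int) bool { return bytes.Compare(s.keys[i], s.keys[j]) < 0 })
}

const prefixBlockMeta = 0x00 // illustrative prefix, not the real constant

func metaKey(height uint64) []byte {
	key := make([]byte, 9)
	key[0] = prefixBlockMeta
	binary.BigEndian.PutUint64(key[1:], height)
	return key
}

// base finds the first block-meta key: one step of a forward iterator,
// instead of separately stored (and separately updated) metadata.
func base(s *orderedStore) uint64 {
	for _, k := range s.keys { // forward iteration in key order
		if k[0] == prefixBlockMeta {
			return binary.BigEndian.Uint64(k[1:])
		}
	}
	return 0
}

// height does the same with a reverse iterator.
func height(s *orderedStore) uint64 {
	for i := len(s.keys) - 1; i >= 0; i-- {
		if s.keys[i][0] == prefixBlockMeta {
			return binary.BigEndian.Uint64(s.keys[i][1:])
		}
	}
	return 0
}

func main() {
	s := &orderedStore{}
	for h := uint64(5); h <= 42; h++ {
		s.put(metaKey(h))
	}
	fmt.Println(base(s), height(s)) // 5 42
}
```

With an order-preserving encoding, both answers come straight from the key order, so there is no stored state to keep in sync.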

What I can do is group them back to their data types (which can be a single byte prefix). I can also use binary.BigEndian.PutUint64(buf, height) so that the heights are stored in the "correct" order. I don't think this will be varint however.

No, this isn't varint, it's just the big endian layout. It is ordered, but just takes up a lot of space. The orderedcode project I mentioned earlier compresses integers and preserves order.

However another part of my brain says that having fixed key sizes would make it easier to iterate through the keys (cause you know where to start reading the next key).

That's true, fixed-size segments are easier because you know where they start and end, so when using variable-length segments there needs to be some (order-preserving!) way of knowing where the segment ends.

but with tendermint's dbs we know in advance what the keys are going to be so I don't think allowing for arbitrary bytes is necessary. To be honest, I haven't looked into the TxIndexer so I'm not sure what's really required for it

With the txindexer the key will include arbitrary byte slices produced by the application (specifically tag.Key and tag.Value), and we have no idea what those will contain.

As you can probably tell, this is the sort of problem that initially appears simple, but gets complicated real fast. Which is why I think we should consider just adopting some existing encoding rather than reinventing one.

@cmwaters
Contributor Author

cmwaters commented Dec 9, 2020

As you can probably tell, this is the sort of problem that initially appears simple, but gets complicated real fast. Which is why I think we should consider just adopting some existing encoding rather than reinventing one.

I agree. I skimmed through the orderedcode library; this seems sufficient for what we're looking for, right? Would we want to consider other encoding libraries?

Outside of this, the other action item would be, as you said, to replace the "base" and "height" stuff in Blockstore with methods that use the iterator. I had a look at the other DBs and they seem to be fine in this regard.

@erikgrinaker
Contributor

I agree. I skimmed through the orderedcode library; this seems sufficient for what we're looking for, right? Would we want to consider other encoding libraries?

I don't know, I've only looked at it for like 3 minutes, but it seems reasonable.

Something to consider is whether the encoding will ever be exposed to other systems. In that case, it would make sense to use an encoding that has support in many languages, is well-documented, and/or is trivial to implement. This might happen e.g. with the txindexer (I don't know much about it or what the use-cases are), so it might be a good idea to look into that first. As long as all use of the encoding is internal, I'd say something like orderedcode makes sense.

Outside of this, the other action item would be, as you said, to replace the "base" and "height" stuff in Blockstore with methods that use the iterator. I had a look at the other DBs and they seem to be fine in this regard.

There's probably a ton of stuff that can be improved by using iterators, but we'd have to review the codebase to find everything. I'd just change the encoding for now, and submit separate PRs for any other improvements -- either as a concerted effort, or just opportunistically as we find them.

@cmwaters
Contributor Author

cmwaters commented Dec 9, 2020

Something to consider is whether the encoding will ever be exposed to other systems.

In a way, isn't it already exposed? There's nothing stopping me from running a separate process that queries blockstore.db, right (I just have to make sure I encode the key in the same way)? This could be an option some take, for example, if they want to implement their own custom querier for the txs in a blockchain.

@erikgrinaker
Contributor

In a way, isn't it already exposed? There's nothing stopping me from running a separate process that queries blockstore.db, right (I just have to make sure I encode the key in the same way)? This could be an option some take, for example, if they want to implement their own custom querier for the txs in a blockchain.

Yeah, but I'd consider this an unofficial interface, and as such people are on their own if they want to interact with it. The official interface would be RPC.

However, with e.g. txindexer I don't know if we officially support the use-case of exporting this data into some external datastore (e.g. PostgreSQL) for the explicit purpose of having some other application consume that data.

@cmwaters
Contributor Author

cmwaters commented Dec 9, 2020

I would, at a guess, say no. Do you think it would be a service that would be worth offering?

@alexanderbez
Contributor

alexanderbez commented Dec 9, 2020

However, with e.g. txindexer I don't know if we officially support the use-case of exporting this data into some external datastore (e.g. PostgreSQL) for the explicit purpose of having some other application consume that data.

I think this is totally plausible, and I've seen proposals of various sorts (from VulcanizeDB, I think?) that aim to consume indexed data and pipe it to some external set of services.

Note: in the SDK we use unique key prefixes for store keys with / as a separator; however, we don't escape separators or ensure all prefixes are the same length 😮

@erikgrinaker
Contributor

Yeah, this has bitten people before: tendermint/tm-db#6

@github-actions

github-actions bot commented Jan 3, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale for use by stalebot label Jan 3, 2021
@cmwaters cmwaters removed the stale for use by stalebot label Jan 4, 2021
@cmwaters cmwaters marked this pull request as ready for review January 4, 2021 12:51
Contributor

@erikgrinaker erikgrinaker left a comment

Looks good overall, left some minor comments.

One unresolved issue from before is the prefix handling. You're currently both using the raw prefix, e.g. here:

orderedcode.Append([]byte{prefixSeenCommit}, height)

But also using the encoded prefixes, e.g. here:

orderedcode.Append(buf, s.prefix, string(prefixLightBlock), height)

I think we should pick one approach and stick with it. I'd probably choose the latter approach, since this allows decoding the entire key slice with orderedcode without chopping off the prefix(es) ourselves. However, in that case I think we should use uints rather than strings for the constant prefixes, since these encode more compactly.

store/store.go Outdated

-func calcBlockCommitKey(height int64) []byte {
-	return []byte(fmt.Sprintf("C:%v", height))
+func blockCommitKey(height int64) []byte {
+	key, err := orderedcode.Append([]byte{prefixBlockCommit}, height)
Contributor

Do int64 and uint64 encode differently? If so, we may want to pre-emptively use uint64 for heights, since I think we'll want to migrate to that in the future.

Contributor Author

@cmwaters cmwaters Jan 4, 2021

Yeah, they are encoded differently. Strangely, int64 is more efficient than uint64 in terms of size (in most cases).

Contributor Author

@cmwaters cmwaters Jan 4, 2021

height: 1000000 int64: [11101111 1000010 1000000] (len: 3), uint64: [11 1111 1000010 1000000] (len: 4)

Contributor

Probably because int is used more often than uint, so they're optimizing for the typical case. Also, 1000000 is a fairly large value, I'm guessing there's no overhead for smaller values.

Contributor Author

uint64 always has an extra byte for the length, so there is always overhead. Signed int64 uses a different strategy where the length is built into the first byte (a run of n 1s followed by a 0 indicates how many bytes follow), making it generally more efficient. I don't know why uint64 couldn't have followed the same strategy.

Contributor

Seems so, that's unfortunate. 😞

evidence/pool.go Outdated

-func keySuffix(evidence types.Evidence) []byte {
-	return []byte(fmt.Sprintf("%s/%X", bE(evidence.Height()), evidence.Hash()))
+key, err := orderedcode.Append([]byte{prefixPending}, evidence.Height(), string(evidence.Hash()))
Contributor

We need to be careful with int64 and uint64 here, since changing the return type of evidence.Height() may silently break the on-disk format. We might want to cast to uint64 to make sure it's stable, unless orderedcode can decode them interchangeably.

Contributor Author

What would you think about doing the opposite: leaving it as is, and when we change to uint64, making sure we cast the height to int64? Int64 encoding uses a different strategy from uint64 encoding, which makes it almost always more efficient, or at least as efficient.

If you want, I could deliberately cast it to int64 now so that whoever makes the change to uint64 recognises that this needs to remain int64.

Contributor

int64 can't store the full range of uint64. While overflow is unlikely, these sorts of impedance mismatches are a fertile source of bugs. I'd say either stick with int64 while we're using int64 for heights and then change it later, or go with uint64 if that's the long-term plan.

Contributor Author

@cmwaters cmwaters Jan 5, 2021

If/when we change from int64 to uint64, that will be a breaking change to the DB regardless, so would it not make sense to change it all then? (I'm not particularly attached to either approach.)

Contributor

And yes, if we stick with int64 we should make sure that we pass int64 to the key (either by a type assertion or checked cast), so that it won't suddenly break if someone changes the evidence.Height() return type.

@cmwaters
Contributor Author

cmwaters commented Jan 4, 2021

I can include the prefix within orderedcode. After fiddling around a little, it seems to be more efficient to use int64 instead of uint64

@cmwaters cmwaters requested a review from erikgrinaker January 5, 2021 11:01

Labels

T:breaking Type: Breaking Change T:perf Type: Performance


Development

Successfully merging this pull request may close these issues.

use an order-preserving, scan-friendly database key encoding

5 participants