store: order-preserving varint key encoding#5771
Conversation
This is neither efficient nor convenient. For example, if there are five different key types using heights as part of the key, we now have to iterate over five times as many entries (which is five times as slow and causes five times as much IO), and we have to filter out all the keys we don't want. It can also have consequences for the IO characteristics of LSM-tree datastores such as LevelDB that try to lay out on-disk data in key order. Use prefixes, but make sure they are non-overlapping to avoid different key types conflicting with each other (either by using fixed-size prefixes or by using a key segment separator that is guaranteed to not be part of actual segments, which typically involves some sort of escaping scheme for segment encoding). We may want to consider just using an existing encoding instead of inventing one ourselves and getting it wrong. Here's an example: https://github.com/google/orderedcode
I don't know if varints are generally lexicographically ordered; I would have to research it. Little-endian integers are not lexicographically ordered, at least for large values.
I did think about this. My thought process around using the height as the prefix instead of the suffix (and of course I lack a profound understanding of exactly how these database engines work, which is even harder when we are trying to support several) is that we want to order keys to improve performance with range operations. The only range operation we do where we can use an iterator is pruning (as far as I understand it). With pruning we are removing the entire block for that height, so having the different types (commit, partset, block meta) together means we can just iterate through the heights. Alternatively, I guess we could run an iterator for each of the types and make sure we step through them together when pruning. I'm also of the opinion that we shouldn't optimise these things for pruning, but I'm not sure how ordering can improve the other operations that the blockstore or statestore does (can it lead to faster reads?). I would also be for standardising our encoding system across all databases. When I had a glimpse, they all had slightly different ways of representing the key.
I get your reasoning, but the reason why we're only using iterators when pruning is precisely because we don't have an ordered key encoding, so we can't use iterators for anything (because we would just get things in the wrong order). There's a bunch of places we should be using iterators but aren't currently. Every time we fetch a range of something, or the first or last item of a sequence, we should most likely be using an iterator. This includes e.g. RPC endpoints (such as /blockchain), base/height metadata for the block store, validator set checkpointing, and many other cases. In most cases, we are usually only interested in one type of data (for example, in the /blockchain RPC we are only interested in blocks). Having to iterate over and filter out all of the unrelated keys is inefficient and inconvenient. It is more natural to use prefixes, since this allows us to iterate over each type in isolation. Efficiency-wise this is almost always a win. Convenience-wise this is usually a win (but as you say, pruning is a counterexample). You should also consider how we would store data that doesn't have a height with the suffix scheme -- since integers span the entire byte space, it would be impossible to add a different key of more than 8 bytes and not have it conflict with integers. These random keys would then also turn up when iterating. As a contrived example, the string
Yes, it's always faster to iterate over 100 keys than it is to fetch each key as a separate operation. This depends on the underlying database, but usually key lookup is O(log n) while iterating is O(1) per step. It also allows the database to optimize disk access by e.g. fetching neighboring keys stored in the same page or prefetching keys.
I agree. In particular, we must take care that keys will not overlap (so that e.g. validator sets suddenly turn up in the middle of block keys).
Btw, we should also keep #2907 in mind (although I don't know if it would make sense to use the same encoding there). We should use a key-encoding that allows key segments to include arbitrary bytes (which needs some sort of escaping scheme for segment separators). For example, if we choose the separator to be |
I had a look at the light client, which uses an iterator for the first and last light block, but we don't currently have these methods in the blockstore (they could be useful for something like the RPC or for fast sync). We just keep an in-memory record of the
Yeah, this also crossed my mind. With the blockstore we also map block hashes to heights (where the hash is the key). So I thought it could be possible that a hash finds itself between two heights and gets accidentally removed in the process, hence I did actually add prefixes: one for height-related data types and another for the block hashes.
Okay, I think I was designing too specifically for pruning. What I can do is group them back into their data types (which can be a single-byte prefix). I don't think this will be varint, however (I would have to play around a little more to see). My understanding is that varint reduces the total size of all the keys, which allows DBs to store more of them in memory. However, another part of my brain says that fixed key sizes would make it easier to iterate through the keys (because you know where to start reading the next key).
But with Tendermint's DBs we know in advance what the keys are going to be, so I don't think allowing for arbitrary bytes is necessary. To be honest, I haven't looked into the TxIndexer, so I'm not sure what's really required for it.
We'd have to look into what makes sense in each case. But for the blockstore base and height case, we currently store this metadata explicitly (via e.g. SaveBlockStoreState) and keep updating it, which is unnecessary and brittle. We could instead just check what the actual first and last blocks are directly using an iterator (iterating over one block), and remove all of the block store state stuff.
No, this isn't varint, it's just the big-endian layout. It is ordered, but takes up a lot of space.
That's true, fixed-size segments are easier because you know where they start and end, so when using variable-length segments there needs to be some (order-preserving!) way of knowing where the segment ends.
With the txindexer, the key will include arbitrary byte slices produced by the application. As you can probably tell, this is the sort of problem that initially appears simple but gets complicated real fast, which is why I think we should consider just adopting some existing encoding rather than reinventing one.
I agree. I skimmed through it. Outside of this, the other actionable item would be, as you said, to replace the "base" and "height" stuff in Blockstore with methods that use an iterator. I had a look at the other DBs and they seem to be fine in this regard.
I don't know, I've only looked at it for like 3 minutes, but it seems reasonable. Something to consider is whether the encoding will ever be exposed to other systems. In that case, it would make sense to use an encoding that has support in many languages, is well-documented, and/or is trivial to implement. This might happen e.g. with the txindexer (I don't know much about it or what the use-cases are), so it might be a good idea to look into that first. As long as all use of the encoding is internal, I'd say something like
There's probably a ton of stuff that can be improved by using iterators, but we'd have to review the codebase to find everything. I'd just change the encoding for now, and submit separate PRs for any other improvements -- either as a concerted effort, or just opportunistically as we find them.
In a way, isn't it already exposed? There's nothing stopping me from running a separate process that queries

Yeah, but I'd consider this an unofficial interface, and as such people are on their own if they want to interact with it. The official interface would be RPC. However, with e.g. txindexer I don't know if we officially support the use-case of exporting this data into some external datastore (e.g. PostgreSQL) for the explicit purpose of having some other application consume that data.
I would, at a guess, say no. Do you think it would be a service worth offering?

I think this is totally plausible, and I've seen proposals of various sorts (from VulcanizeDB, I think?) that aim to consume indexed data and pipe it to some external set of services. Note, in the SDK we use unique key prefixes for store keys with a
Yeah, this has bitten people before: tendermint/tm-db#6
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
erikgrinaker left a comment:
Looks good overall, left some minor comments.
One unresolved issue from before is the prefix handling. You're currently both using the raw prefix, e.g. here:
```go
orderedcode.Append([]byte{prefixSeenCommit}, height)
```

But also using the encoded prefixes, e.g. here:

```go
orderedcode.Append(buf, s.prefix, string(prefixLightBlock), height)
```

I think we should pick one approach and stick with it. I'd probably choose the latter approach, since this allows decoding the entire key slice with orderedcode without chopping off the prefix(es) ourselves. However, in that case I think we should use uints rather than strings for the constant prefixes, since these encode more compactly.
store/store.go (outdated):

```diff
-func calcBlockCommitKey(height int64) []byte {
-	return []byte(fmt.Sprintf("C:%v", height))
+func blockCommitKey(height int64) []byte {
+	key, err := orderedcode.Append([]byte{prefixBlockCommit}, height)
```
Does int64 and uint64 encode differently? If so, we may want to pre-emptively use uint64 for heights, since I think we'll want to migrate to that in the future.
Yeah, they are encoded differently. Strangely, int64 is more efficient than uint64 in terms of size (in most cases).
```
height: 1000000
int64:  [11101111 1000010 1000000] (len: 3)
uint64: [11 1111 1000010 1000000]  (len: 4)
```
Probably because int is used more often than uint, so they're optimizing for the typical case. Also, 1000000 is a fairly large value; I'm guessing there's no overhead for smaller values.
uint always has an extra byte for the length, so there is always overhead. Signed int has a different strategy where the length is built into the first byte (n 1-bits followed by a 0, indicating the number of bytes to come), making it generally more efficient. I don't know why uint couldn't have followed the same strategy.
Seems so, that's unfortunate. 😞
evidence/pool.go (outdated):

```diff
-func keySuffix(evidence types.Evidence) []byte {
-	return []byte(fmt.Sprintf("%s/%X", bE(evidence.Height()), evidence.Hash()))
+	key, err := orderedcode.Append([]byte{prefixPending}, evidence.Height(), string(evidence.Hash()))
```
We need to be careful with int64 and uint64 here, since changing the return value of evidence.Height() may cause the on-disk format to break. Might want to cast to uint64 to make sure it's stable, unless orderedcode can decode them interchangeably.
What would you think about doing the opposite: leaving it as is, and when we change to uint64 we make sure we cast the height to int64? int64 encoding has a different strategy to uint64 encoding, which makes it almost always more efficient than, or at least as efficient as, uint64.

If you want, I could deliberately cast it to int64 now, so whoever makes the change to uint64 will recognise that this needs to remain as int64.
int64 can't store the full range of uint64. While overflow is unlikely, these sorts of impedance mismatches are a fertile source of bugs. I'd say either stick with int64 while we're using int64 for heights and then change it later, or go with uint64 if that's the long-term plan.
If/when we change from int64 to uint64, this will be a breaking change to the DB regardless, so would it not make sense to change it all then? (I'm not particularly attached to either approach.)
And yes, if we stick with int64 we should make sure that we pass int64 to the key (either by a type assertion or checked cast), so that it won't suddenly break if someone changes the evidence.Height() return type.
I can include the prefix within orderedcode. After fiddling around a little, it seems to be more efficient to use int64 instead of uint64.
Description
This PR:

- uses the orderedcode library to lexicographically encode the keys in the state store, block store, evidence pool, and light store
- implements an iterator for quicker pruning of data in both the state store and block store
- removes the blockstore state data type (an in-memory record of the base and height of the block store) in favor of `Base()` and `Height()` methods that use an iterator to retrieve the values

Closes: #4567