store: order-preserving varint key encoding#5771
Conversation
This is neither efficient nor convenient. For example, if there are five different key types using heights as part of the key, we now have to iterate over five times as many entries (which is five times as slow and causes five times as much IO), and we have to filter out all the keys we don't want. It can also have consequences for the IO characteristics of LSM-tree datastores such as LevelDB that try to lay out on-disk data in key order. Use prefixes, but make sure they are non-overlapping to avoid different key types conflicting with each other (either by using fixed-size prefixes or by using a key segment separator that is guaranteed to not be part of actual segments, which typically involves some sort of escaping scheme for segment encoding). We may want to consider just using an existing encoding instead of inventing one ourselves and getting it wrong. Here's an example: https://github.com/google/orderedcode
I don't know if varints are generally lexicographically ordered; I would have to research it. Little-endian integers are not lexicographically ordered, at least for large values.
I did think about this. My thought process around using the height as the prefix instead of the suffix (and of course I lack a profound understanding of exactly how these database engines work, which is even harder when we are trying to support several) is that we want to order keys to improve performance with range operations. The only range operation we do where we can use an iterator is pruning (as far as I understand it). With pruning we are removing the entire block for that height, so having the different types (commit, partset, block meta) together means we can just iterate through the heights. Alternatively, I guess we could run an iterator for each of the types and make sure we step through them together when pruning. I'm also of the opinion that we shouldn't optimise these things for pruning, but I'm not sure how ordering can improve the other operations that the blockstore or statestore does (can it lead to faster reads?). I would also be for standardising our encoding system across all databases. When I had a glimpse, they all had slightly different ways of representing the key.
I get your reasoning, but the reason why we're only using iterators when pruning is precisely because we don't have an ordered key encoding, so we can't use iterators for anything (because we would just get things in the wrong order). There's a bunch of places we should be using iterators but aren't currently. Every time we fetch a range of something, or the first or last item of a sequence, we should most likely be using an iterator. This includes e.g. RPC endpoints (such as /blockchain), base/height metadata for the block store, validator set checkpointing, and many other cases. In most cases, we are usually only interested in one type of data (for example, in the /blockchain RPC we are only interested in blocks). Having to iterate over and filter out all of the unrelated keys is inefficient and inconvenient. It is more natural to use prefixes, since this allows us to iterate over each type in isolation. Efficiency-wise this is almost always a win. Convenience-wise this is usually a win (but as you say, pruning is a counterexample). You should also consider how we would store data that doesn't have a height with the suffix scheme -- since integers span the entire byte space, it would be impossible to add a different key of more than 8 bytes and not have it conflict with integers. These random keys would then also turn up when iterating. As a contrived example, the string
Yes, it's always faster to iterate over 100 keys than it is to fetch each key as a separate operation. This depends on the underlying database, but usually key lookup is O(log n) while iterating is O(1) per step. It also allows the database to optimize disk access by e.g. fetching neighboring keys stored in the same page or prefetching keys.
I agree. In particular, we must take care that keys will not overlap (so that e.g. validator sets suddenly turn up in the middle of block keys).
Btw, we should also keep #2907 in mind (although I don't know if it would make sense to use the same encoding there). We should use a key-encoding that allows key segments to include arbitrary bytes (which needs some sort of escaping scheme for segment separators). For example, if we choose the separator to be |
I had a look at the light client, which uses an iterator for the first and last light block, but we don't currently have these methods in the blockstore (they could be useful for something like the RPC or for fast sync). We just keep an in-memory record of the
Yeah, this also crossed my mind. With the blockstore we also map block hashes to heights (where the hash is the key). So I thought it could be possible that a hash finds itself between two heights and gets accidentally removed in the process, hence I did actually add prefixes: one for height-related data types and another for the block hashes.
Okay, I think I was designing too specifically for pruning. What I can do is group them back into their data types (which can be a single-byte prefix). I don't think this will be varint, however (I would have to play around a little more to see). My understanding is that varint reduces the total size of all the keys, which allows DBs to store more of them in memory. However, another part of my brain says that fixed key sizes would make it easier to iterate through the keys (because you know where to start reading the next key).
But with Tendermint's DBs we know in advance what the keys are going to be, so I don't think allowing for arbitrary bytes is necessary. To be honest, I haven't looked into the TxIndexer, so I'm not sure what's really required for it.
We'd have to look into what makes sense in each case. But for the blockstore base and height case, we currently store this metadata explicitly (via e.g. SaveBlockStoreState) and keep updating it, which is unnecessary and brittle. We could instead just check what the actual first and last blocks are directly using an iterator (iterating over one block), and remove all of the block store state stuff.
No, this isn't varint, it's just the big-endian layout. It is ordered, but takes up a lot of space.
That's true, fixed-size segments are easier because you know where they start and end, so when using variable-length segments there needs to be some (order-preserving!) way of knowing where the segment ends.
With the txindexer, the key will include arbitrary byte slices produced by the application. As you can probably tell, this is the sort of problem that initially appears simple but gets complicated real fast, which is why I think we should consider just adopting some existing encoding rather than reinventing one.
I agree. I skimmed through it. Outside of this, the other actionable item would be, as you said, to replace the "base" and "height" stuff in Blockstore with methods that use an iterator. I had a look at the other DBs and they seem to be fine in this regard.
I don't know, I've only looked at it for like 3 minutes, but it seems reasonable. Something to consider is whether the encoding will ever be exposed to other systems. In that case, it would make sense to use an encoding that has support in many languages, is well-documented, and/or is trivial to implement. This might happen e.g. with the txindexer (I don't know much about it or what the use-cases are), so it might be a good idea to look into that first. As long as all use of the encoding is internal, I'd say something like
There's probably a ton of stuff that can be improved by using iterators, but we'd have to review the codebase to find everything. I'd just change the encoding for now, and submit separate PRs for any other improvements -- either as a concerted effort, or just opportunistically as we find them.
In a way, isn't it already exposed? There's nothing stopping me from running a separate process that queries

Yeah, but I'd consider this an unofficial interface, and as such people are on their own if they want to interact with it. The official interface would be RPC. However, with e.g. txindexer I don't know if we officially support the use-case of exporting this data into some external datastore (e.g. PostgreSQL) for the explicit purpose of having some other application consume that data.
I would, at a guess, say no. Do you think it would be a service worth offering?

I think this is totally plausible, and I've seen proposals of various sorts (from VulcanizeDB, I think?) that aim to consume indexed data and pipe it to some external set of services. Note, in the SDK we use unique key prefixes for store keys with a
Yeah, this has bitten people before: tendermint/tm-db#6
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
erikgrinaker left a comment:
Looks good overall, left some minor comments.
One unresolved issue from before is the prefix handling. You're currently both using the raw prefix, e.g. here:
```go
orderedcode.Append([]byte{prefixSeenCommit}, height)
```

But also using the encoded prefixes, e.g. here:

```go
orderedcode.Append(buf, s.prefix, string(prefixLightBlock), height)
```

I think we should pick one approach and stick with it. I'd probably choose the latter approach, since this allows decoding the entire key slice with orderedcode without chopping off the prefix(es) ourselves. However, in that case I think we should use uints rather than strings for the constant prefixes, since these encode more compactly.
store/store.go (outdated):

```diff
-func calcBlockCommitKey(height int64) []byte {
-	return []byte(fmt.Sprintf("C:%v", height))
+func blockCommitKey(height int64) []byte {
+	key, err := orderedcode.Append([]byte{prefixBlockCommit}, height)
```
Does int64 and uint64 encode differently? If so, we may want to pre-emptively use uint64 for heights, since I think we'll want to migrate to that in the future.
Yeah, they are encoded differently. Strangely, int64 is more efficient than uint64 in terms of size (in most cases).
```
height: 1000000
int64:  [11101111 1000010 1000000] (len: 3)
uint64: [11 1111 1000010 1000000]  (len: 4)
```
Probably because int is used more often than uint, so they're optimizing for the typical case. Also, 1000000 is a fairly large value; I'm guessing there's no overhead for smaller values.
uint always has an extra byte for the length, so there is always overhead. Signed int has a different strategy where the length is built into the first byte (n 1-bits followed by a 0, indicating the number of bytes to come), making it generally more efficient. I don't know why uint couldn't have followed the same strategy.
Seems so, that's unfortunate. 😞
evidence/pool.go (outdated):

```diff
-func keySuffix(evidence types.Evidence) []byte {
-	return []byte(fmt.Sprintf("%s/%X", bE(evidence.Height()), evidence.Hash()))
+	key, err := orderedcode.Append([]byte{prefixPending}, evidence.Height(), string(evidence.Hash()))
```
We need to be careful with int64 and uint64 here, since changing the return value of evidence.Height() may cause the on-disk format to break. Might want to cast to uint64 to make sure it's stable, unless orderedcode can decode them interchangeably.
What would you think about doing the opposite: leaving it as is, and when we change to uint64 we make sure we cast the height to int64? int64 encoding has a different strategy to uint64 encoding, which makes it almost always more efficient than, or at least as efficient as, uint64.

If you want, I could deliberately cast it to int64 now, so whoever makes the change to uint64 will recognise that this needs to remain as int64.
int64 can't store the full range of uint64. While overflow is unlikely, these sorts of impedance mismatches are a fertile source of bugs. I'd say either stick with int64 while we're using int64 for heights and then change it later, or go with uint64 if that's the long-term plan.
If/when we change from int64 to uint64, this will be a breaking change to the DB regardless, so would it not make sense to change it all then? (I'm not particularly attached to either approach.)
And yes, if we stick with int64 we should make sure that we pass int64 to the key (either by a type assertion or checked cast), so that it won't suddenly break if someone changes the evidence.Height() return type.
I can include the prefix within orderedcode. After fiddling around a little, it seems to be more efficient to use int64 instead of uint64.
Description
This PR:

- uses the orderedcode library to lexicographically encode the keys in the state store, block store, evidence pool, and light store
- implements an iterator for quicker pruning of data in both the state store and block store
- removes the blockstore state data type (an in-memory record of the base and height of the block store) in favor of `Base()` and `Height()` methods that use an iterator to retrieve the values

Closes: #4567