Move read queue to a more general "command" queue. #81
Conversation
Over the course of testing distributed transactions, I noticed many unnecessary restarts of SERIALIZABLE transactions because we were previously updating the range's timestamp cache as soon as a read or write command arrived. For reads or writes which conflict with an ongoing transaction, the update to the timestamp cache happened regardless of whether the read or write succeeded. In many cases the read or write would fail on the conflict, so the restart caused by advancing the timestamp cache was unnecessary.

This change blends both read/write and read-only commands into a single command queue, which operates similarly to the old read queue, except that commands now always wait on preceding commands which overlap the same key(s). Read-only commands don't need to wait on other read-only commands, so we exempt this case for better read concurrency. This lets us wait until after the command is executed to update the timestamp cache, so we can avoid updating the cache on failures.
@cockroachdb/developers
There are far fewer lines actually changed in this PR than GitHub is showing. I did a "git mv" on these two files from read_queue.go and read_queue_test.go, but the diff isn't properly showing the rename. I'll highlight the actual non-naming and comment changes below for easier reviewing.
Another problem I discovered with SERIALIZABLE transactions was unnecessary
restarts with a pattern like: read("a"), write("a"). The issue was that
the read would put a timestamp in the timestamp cache for "a". Then the
write would arrive for the same transaction and be forced to increment
the timestamp it found there (you're not supposed to change past history).
However, there's no reason to do the increment if the most recent entry
in the timestamp cache was the result of the current transaction. The
addition of the txn ID (well, the MD5 of it, anyway) to the timestamp
cache lets us ignore max entries which our own transaction was responsible for.
Also, changed "high water timestamp" in the timestamp cache to the more
accurate "low water timestamp".
proto/data.go
Outdated
LGTM
Why not just use the ID as-is? What happens if there is a collision?
The ID as-is is largish, since we suffix a UUID to the first key affected by the transaction. It's probably going to be about 100 bytes, vs. the 16 bytes this adds. That means saving a multiple (maybe 2-3x) on the size of this cache, which I expect to be relatively large when summed over all ranges: perhaps 120M with the full txn ID (assuming the system is using transactions heavily) vs. 40-50M with the MD5 of the txn ID.
Collisions would be fairly bad. A collision could mean writing to an earlier timestamp after a reader had read the previous value at a later timestamp. Rewriting history, in other words. But the odds of a collision with a 16-byte MD5 digest are infinitesimal: something like 1 in a billion trillion, given the size of the timestamp cache.
The odds of a random md5 collision are infinitesimal; I'm worried about the ability of a client to construct deliberate collisions. If the UUID is generated server-side then we should be fine, but if it's client-generated then it's possible for the client to select two keys with colliding md5s (and a common prefix so they end up in the same range) and then append the same UUID.
The UUID is generated server-side, but the client could ignore it and supply its own txn ID. There are lots of other ways an adversarial client could muck things up in the keyspace it can write to. The exact same broken guarantees resulting from a duplicate txn ID could be more simply achieved by not advancing the timestamp between requests, etc. Badly behaved clients won't munge the system; they'll just munge their own guarantees.
storage/timestamp_cache.go
Outdated
If memory is a concern it would be a bit better to leave this as the original [md5.Size]byte instead of converting to a string (which takes up a little extra space and a separate allocation for the GC to consider)
Cool. That's what I'll do.
Move read queue to a more general "command" queue.
We have been seeing long startup times which disappear spontaneously. During a
restart of the beta cluster, the following goroutine was observed, which suggests
that we were spending a lot of time GCing replicas on startup.
engine ??:0 _Cfunc_DBIterNext(#324, #323, 0, 0, 0, 0, 0, 0, 0)
engine rocksdb.go:1135 (*rocksDBIterator).Next(#235)
storage replica_data_iter.go:104 (*ReplicaDataIterator).Next(#316)
storage store.go:1748 (*Store).destroyReplicaData(#109, #317, 0, 0)
storage store.go:841 (*Store).Start.func2(0x101b, #300, 0x36, 0x40, #301, 0x36, 0x40, #315, 0x3, 0x4, ...)
storage store.go:734 IterateRangeDescriptors.func1(#306, 0x40, 0x41, #307, 0x92, 0x92, #341, 0, 0x186c, 0x4000, ...)
engine mvcc.go:1593 MVCCIterate(#329, #68, #47, #81, #232, 0x9, 0x10, #233, 0xb, 0x10, ...)
storage store.go:738 IterateRangeDescriptors(#330, #196, #47, #81, #195, #179, #110)
storage store.go:867 (*Store).Start(#109, #330, #196, #179, #185, 0x1)
server node.go:405 (*Node).initStores(#78, #330, #196, #98, 0x1, 0x1, #179, 0, #55)
server node.go:330 (*Node).start(#78, #330, #196, #42, #129, #98, 0x1, 0x1, 0, 0, ...)
server server.go:431 (*Server).Start(#5, #330, #196, #95, 0x1)
cli start.go:368 runStart(#34, #178, 0, 0x9, 0, 0)
cobra command.go:599 (*Command).execute(#34, #177, 0x9, 0x9, #34, #177)
cobra command.go:689 (*Command).ExecuteC(#33, #70, #343, #72)
cobra command.go:648 (*Command).Execute(#33, #71, #343)
cli cli.go:96 Run(#64, 0xa, 0xa, 0, 0)
main main.go:37 main()
…tion_change logging location change