
Move read queue to a more general "command" queue.#81

Merged
spencerkimball merged 4 commits into master from spencerkimball/command-queue on Sep 26, 2014

Conversation

@spencerkimball
Member

Over the course of testing distributed transactions, I noticed many
unnecessary restarts of SERIALIZABLE transactions because we were
previously updating the range's timestamp cache as soon as a read or
write command arrived. For reads or writes which conflict with an
ongoing transaction, the timestamp cache was updated regardless of
whether the read or write succeeded. In many cases the read or write
would fail on the conflict, so the restart caused by advancing the
timestamp cache was unnecessary.

This change blends read/write and read-only commands into a single
command queue, which operates similarly to the old read queue, except
that commands now always wait on preceding commands which overlap the
same key(s). Read-only commands don't need to wait on other read-only
commands, so we exempt that case for better read concurrency.

This lets us wait until after a command has executed to update the
timestamp cache, so we can avoid updating the cache on failures.
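The queueing semantics described above can be sketched roughly as follows. This is a minimal illustration of the idea, not the PR's actual implementation; the names (`commandQueue`, `cmd`) and the slice-based bookkeeping are assumptions made for the sketch.

```go
package main

import "fmt"

// cmd represents a pending command over a key range [start, end).
// NOTE: illustrative sketch only, not the actual CockroachDB types.
type cmd struct {
	start, end string
	readOnly   bool
	pending    []chan struct{} // closed when this command finishes
}

// commandQueue holds in-flight commands. A real implementation would use
// an interval structure; a slice is enough to show the waiting rules.
type commandQueue struct {
	cmds []*cmd
}

func overlaps(a, b *cmd) bool {
	return a.start < b.end && b.start < a.end
}

// add registers a new command and returns one channel per overlapping
// prior command it must wait on. Read-only commands skip waiting on
// other read-only commands, for better read concurrency.
func (q *commandQueue) add(c *cmd) []<-chan struct{} {
	var wait []<-chan struct{}
	for _, prev := range q.cmds {
		if c.readOnly && prev.readOnly {
			continue // reads never wait on reads
		}
		if overlaps(c, prev) {
			ch := make(chan struct{})
			prev.pending = append(prev.pending, ch)
			wait = append(wait, ch)
		}
	}
	q.cmds = append(q.cmds, c)
	return wait
}

// remove drops a finished command and wakes its waiters. Only now, after
// execution succeeded or failed, would the timestamp cache be updated.
func (q *commandQueue) remove(c *cmd) {
	for _, ch := range c.pending {
		close(ch)
	}
	for i, prev := range q.cmds {
		if prev == c {
			q.cmds = append(q.cmds[:i], q.cmds[i+1:]...)
			break
		}
	}
}

func main() {
	q := &commandQueue{}
	w := &cmd{start: "a", end: "b"}
	q.add(w)
	r := &cmd{start: "a", end: "b", readOnly: true}
	wait := q.add(r) // the read overlaps the write, so it must wait
	fmt.Println(len(wait))
	q.remove(w)
	<-wait[0] // channel closed: the write has finished
	fmt.Println("read may proceed")
}
```

The key property is that writes wait on any overlapping command, while reads wait only on overlapping writes.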

@spencerkimball
Member Author

@cockroachdb/developers

Member Author


There are far fewer lines actually changed in this PR than GitHub is showing. I did a "git mv" of these two files from read_queue.go and read_queue_test.go, but the diff isn't properly showing that. I'll highlight the actual non-naming and comment changes below for easier reviewing.

Another problem I discovered with SERIALIZABLE transactions was unnecessary
restarts with a pattern like: read("a"), write("a"). The issue was that
the read would put a timestamp in the timestamp cache for "a". Then the
write would arrive for the same transaction and be forced to increment
the timestamp it found there (you're not supposed to change past history).

However, there's no reason to do the increment if the most recent entry
in the timestamp cache was the result of the current transaction. Adding
the txn ID (well, the MD5 of it, anyway) to the timestamp cache lets us
ignore max-timestamp entries which our own transaction was responsible for.

Also, renamed the "high water timestamp" in the timestamp cache to the
more accurate "low water timestamp".
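The own-transaction exemption described above can be sketched like so. This is an illustrative model of the behavior, not the actual timestamp cache API; the entry layout and method names are invented for the sketch.

```go
package main

import "fmt"

// tsEntry is a sketch of a timestamp cache entry: the highest read
// timestamp seen for a key, plus the ID of the transaction that set it.
// NOTE: names and types here are illustrative, not the real API.
type tsEntry struct {
	ts    int64  // timestamp of the most recent read
	txnID string // transaction responsible for that read
}

type tsCache struct {
	lowWater int64 // floor timestamp assumed for evicted/unknown keys
	entries  map[string]tsEntry
}

// pushTimestamp returns the timestamp a write by txnID must use for key.
// If the cached entry was left by the same transaction, the write need
// not advance past it: a txn may write under its own earlier read.
func (c *tsCache) pushTimestamp(key, txnID string, writeTS int64) int64 {
	e, ok := c.entries[key]
	if !ok {
		e = tsEntry{ts: c.lowWater}
	}
	if txnID != "" && e.txnID == txnID {
		return writeTS // our own read; no increment, no restart
	}
	if writeTS <= e.ts {
		return e.ts + 1 // can't rewrite history read by others
	}
	return writeTS
}

func main() {
	c := &tsCache{entries: map[string]tsEntry{}}
	// read("a") by txn1 at ts=10 populates the cache...
	c.entries["a"] = tsEntry{ts: 10, txnID: "txn1"}
	// ...then write("a") by the same txn keeps its timestamp,
	// while a different txn gets pushed past the cached read.
	fmt.Println(c.pushTimestamp("a", "txn1", 10))
	fmt.Println(c.pushTimestamp("a", "txn2", 10))
}
```

This is exactly the read("a"), write("a") pattern from the comment: without the txn ID in the entry, the second call's behavior would apply to the transaction's own write too, forcing an unnecessary restart.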
proto/data.go Outdated
Member


txnMD5[:md5.Size]

Member Author


Done

@tbg
Member

tbg commented Sep 25, 2014

LGTM

Contributor


Why not just use the ID as-is? What happens if there is a collision?

Member Author


The ID as-is is large-ish, since we suffix a UUID to the first key affected by the transaction, so it's probably going to be about 100 bytes, versus only 16 bytes for the digest. That means saving a multiple (maybe 2-3x) on the size of this cache, which I expect to be relatively large when summed over all ranges: perhaps 120M with the full txn ID (assuming the system is using transactions heavily) vs. 40-50M with the MD5 of the txn ID.

Collisions would be fairly bad. A collision could mean writing to an earlier timestamp after a reader had already read the previous value at a later timestamp. Rewriting history, in other words. But the odds of a collision with a 16-byte MD5 digest are infinitesimal, like 1 in a billion trillion or something, given the size of the timestamp cache.

Contributor


The odds of a random md5 collision are infinitesimal; I'm worried about the ability of a client to construct deliberate collisions. If the UUID is generated server-side then we should be fine, but if it's client-generated then it's possible for the client to select two keys with colliding md5s (and a common prefix so they end up in the same range) and then append the same UUID.

Member Author


The UUID is generated server-side, but the client could ignore it and supply its own txn ID. There are lots of other ways an adversarial client could muck things up in the keyspace it can write to. The exact same broken guarantees resulting from a duplicate txn ID could be achieved more simply by not advancing the timestamp between requests, etc. Badly behaved clients won't munge the system; they'll just munge their own guarantees.

Contributor


If memory is a concern, it would be a bit better to leave this as the original [md5.Size]byte instead of converting to a string, which takes up a little extra space and costs a separate allocation for the GC to consider.

Member Author


Cool. That's what I'll do.

spencerkimball added a commit that referenced this pull request Sep 26, 2014
Move read queue to a more general "command" queue.
@spencerkimball spencerkimball merged commit 9de5f19 into master Sep 26, 2014
@spencerkimball spencerkimball deleted the spencerkimball/command-queue branch September 26, 2014 18:40
tbg referenced this pull request in tbg/cockroach Aug 31, 2016
We have been seeing long startup times which disappear spontaneously. During a
restart of the beta cluster, the following goroutine was observed, which suggests
that we were spending a lot of time GCing replicas on startup.

    engine              ??:0                     _Cfunc_DBIterNext(#324, #323, 0, 0, 0, 0, 0, 0, 0)
    engine              rocksdb.go:1135          (*rocksDBIterator).Next(#235)
    storage             replica_data_iter.go:104 (*ReplicaDataIterator).Next(#316)
    storage             store.go:1748            (*Store).destroyReplicaData(#109, #317, 0, 0)
    storage             store.go:841             (*Store).Start.func2(0x101b, #300, 0x36, 0x40, #301, 0x36, 0x40, #315, 0x3, 0x4, ...)
    storage             store.go:734             IterateRangeDescriptors.func1(#306, 0x40, 0x41, #307, 0x92, 0x92, #341, 0, 0x186c, 0x4000, ...)
    engine              mvcc.go:1593             MVCCIterate(#329, #68, #47, #81, #232, 0x9, 0x10, #233, 0xb, 0x10, ...)
    storage             store.go:738             IterateRangeDescriptors(#330, #196, #47, #81, #195, #179, #110)
    storage             store.go:867             (*Store).Start(#109, #330, #196, #179, #185, 0x1)
    server              node.go:405              (*Node).initStores(#78, #330, #196, #98, 0x1, 0x1, #179, 0, #55)
    server              node.go:330              (*Node).start(#78, #330, #196, #42, #129, #98, 0x1, 0x1, 0, 0, ...)
    server              server.go:431            (*Server).Start(#5, #330, #196, #95, 0x1)
    cli                 start.go:368             runStart(#34, #178, 0, 0x9, 0, 0)
    cobra               command.go:599           (*Command).execute(#34, #177, 0x9, 0x9, #34, #177)
    cobra               command.go:689           (*Command).ExecuteC(#33, #70, #343, #72)
    cobra               command.go:648           (*Command).Execute(#33, #71, #343)
    cli                 cli.go:96                Run(#64, 0xa, 0xa, 0, 0)
    main                main.go:37               main()
soniabhishek pushed a commit to soniabhishek/cockroach that referenced this pull request Feb 15, 2017
pav-kv pushed a commit to pav-kv/cockroach that referenced this pull request Mar 5, 2024

3 participants