
RFC: parallel commits #24194

Merged
craig[bot] merged 1 commit into cockroachdb:master from tbg:technotes/parallelcommit
Aug 15, 2018

Conversation

@tbg
Member

tbg commented Mar 25, 2018

The parallel commits tech note describes a proposal for reducing commit
latencies. It comes in a basic and an extended flavor, the latter of which
requires more engineering work. I believe we should pursue the former
sooner rather than later, and make the latter contingent on the fate of
the required second tech note, which suggests a new format for
transaction IDs and explores some of the expected benefits.

Release note: None

@cockroach-teamcity
Member

This change is Reviewable

@bdarnell
Contributor

I'll have more comments on the substance of these proposals later, but first let me say that these should be RFCs instead of tech notes. RFCs are for proposals; tech notes are for describing the way things are.

Could the new STAGED transaction status be shared with XA prepare support (#22359) or is it orthogonal?

@tbg
Member Author

tbg commented Mar 26, 2018

> I'll have more comments on the substance of these proposals later, but first let me say that these should be RFCs instead of tech notes. RFCs are for proposals; tech notes are for describing the way things are.

I'm not opposed to turning this into an RFC when the time comes. I assume the same holds for follower reads then?

@tbg
Member Author

tbg commented Mar 26, 2018

From a casual skim of the description, the XA support seems very similar to the STAGED status, except that no attempt is made to make it correct (there's a 1h timeout instead).

@tbg
Member Author

tbg commented Mar 26, 2018

(and the commit latencies are not improved, but get worse -- come to think of it, the only similarity is really the introduction of a new transaction status). When a transaction is STAGED in my proposal, in 99.99% of cases it's really COMMITTED. So this is fundamentally different.

@danhhz
Contributor

danhhz commented Mar 26, 2018

Review status: 0 of 2 files reviewed at latest revision, all discussions resolved, all commit checks successful.


docs/tech-notes/parallel-commit.md, line 116 at r1 (raw file):

or perhaps SQL can stop using DeleteRange

Is this the same as #23258?


Comments from Reviewable

@bdarnell
Contributor

Reviewed 1 of 2 files at r1.
Review status: 1 of 2 files reviewed at latest revision, 7 unresolved discussions, all commit checks successful.


docs/tech-notes/node-txn-ids.md, line 18 at r1 (raw file):

1. the NodeID
1. the transaction's original timestamp (which is unique per node even across restarts, thanks to our HLC guarantees and sleeping out the MaxOffset)

A 4-byte node id plus a 12-byte timestamp is the same size as a UUID, so this is only more compact than the status quo when we're able to elide the timestamp (or use varint encoding).


docs/tech-notes/node-txn-ids.md, line 20 at r1 (raw file):

1. the transaction's original timestamp (which is unique per node even across restarts, thanks to our HLC guarantees and sleeping out the MaxOffset)

Note that this is well-suited for storage in MVCC versioned keys: in the common case, transactions commit at their original timestamp, in which case the full transaction ID is recoverable from the MVCC version timestamp and the NodeID. Consequently, only the latter needs to be stored in that case. When the transaction *was* pushed and thus did change its commit timestamp before committing, we additionally store the delta to the base timestamp (i.e. if the value is at 1000 but the transaction was originally at 200, we store a delta of 800).

So you're proposing that MVCC keys change from (user key, timestamp) to (user key, timestamp, gateway node ID, delta from orig timestamp)? Spell this out, including the byte-level encoding.

How would you manage the migration? If we're going to make a change at this level we may want to consider it in conjunction with a change to the way historical versions are stored.
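For illustration, here is one possible value-suffix encoding under the varint assumption discussed above. This is a sketch only; the note does not spell out an actual byte layout, and all names here are invented. The point is that the common case (delta == 0) costs two bytes rather than the 16 of a UUID:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeTxnIDSuffix is a hypothetical encoding of the per-value
// transaction metadata: the gateway NodeID plus the (usually zero)
// delta between the MVCC version timestamp and the txn's original
// timestamp. Both are varint-encoded, so the common case costs a
// single zero byte beyond the NodeID.
func encodeTxnIDSuffix(nodeID int32, deltaNanos int64) []byte {
	buf := make([]byte, 0, 2*binary.MaxVarintLen64)
	buf = binary.AppendVarint(buf, int64(nodeID))
	buf = binary.AppendVarint(buf, deltaNanos)
	return buf
}

func decodeTxnIDSuffix(b []byte) (nodeID int32, deltaNanos int64) {
	id, n := binary.Varint(b)
	delta, _ := binary.Varint(b[n:])
	return int32(id), delta
}

func main() {
	// Common case: txn committed at its original timestamp (delta 0).
	enc := encodeTxnIDSuffix(7, 0)
	fmt.Println(len(enc)) // 2 bytes vs. 16 for a UUID
	id, delta := decodeTxnIDSuffix(enc)
	fmt.Println(id, delta)
}
```

The pushed-transaction case (say, a value at 1000 originally written at 200) would store a varint delta of 800 in the same suffix.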


docs/tech-notes/node-txn-ids.md, line 26 at r1 (raw file):

Relying on the max offset to keep transaction IDs unique can be undesirable. Instead, each node can keep a local counter which is incremented every time the node starts, and is part of the transaction ID. For clusters in which this feature is not enabled, the counter would remain at zero and would hence be stored "for free" (with proto3).

## Communicating with the txn coordinator directly

In some ways, this feels like a step backwards - it assumes that there is a single unchanging coordinator for the transaction. We're moving towards more parallelism of transactions and we aspire to one day handle coordinator failover (may not be practical until we swap out the pgwire protocol for something custom).


docs/tech-notes/node-txn-ids.md, line 30 at r1 (raw file):

Having the `NodeID` in the transaction ID unlocks an alternative mechanism to handling transactional conflicts by contacting the coordinator directly.

For example, when finding a conflicting intent, the txn can send a (streaming) RPC to the remote coordinator, compare priorities, and either wait for the coordinator to signal completion or prompt it to abort the transaction.

As an alternative to this, we could have the coordinator and every node that encounters a conflict maintain a streaming RPC to the leaseholder of the txn record's range. The heartbeat status could be maintained in memory on that range instead of going through consensus writes (on lease transfer, we could simply reset the clock on all pending txn's heartbeats). This gives us some of the benefits here without major low-level changes.


docs/tech-notes/node-txn-ids.md, line 32 at r1 (raw file):

For example, when finding a conflicting intent, the txn can send a (streaming) RPC to the remote coordinator, compare priorities, and either wait for the coordinator to signal completion or prompt it to abort the transaction.

This will often be faster (depending on latencies between the involved nodes), thanks to the absence of consensus and polling in this path, and could replace the txn wait queue, though deadlock detection needs to be taken care of.

Once we stop sending no-op writes through raft (#23942), the consensus latency will mostly go away. I think the things that still go through consensus at that point will still need to.


docs/tech-notes/node-txn-ids.md, line 55 at r1 (raw file):

If it did apply successfully, the intent at k1 would have been resolved automatically and the transaction record removed, so when retrying there is no way to tell whether it was our txn that actually wrote the value; we have to return an ambiguous result.

Now that we embed a transaction ID in each write (committed or not), we can do better if we also embed the sequence number used for the write (i.e. retain it from the intent) as we're now able to see that this is our prior write, and can execute the `CPut` as a no-op.

We could also address this issue of idempotency by storing information about committed transactions outside of the values themselves (a CommitSpan, similar to the AbortSpan. We used to have something like this in the ResponseCache, but removed it because the extra I/O and storage was deemed unnecessary).
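The (txn ID, sequence number) idempotency check from the quoted passage could be sketched as follows; the types and names are assumed, not the actual roachpb definitions:

```go
package main

import "fmt"

// valueMeta is a hypothetical stamp stored with each committed value:
// the writing txn's ID and the sequence number of the request that
// wrote it.
type valueMeta struct {
	txnID string
	seq   int32
}

// isOwnPriorWrite reports whether an existing value was written by this
// very request in a prior attempt. If so, a retried CPut can execute as
// a no-op instead of returning an ambiguous result.
func isOwnPriorWrite(existing valueMeta, txnID string, seq int32) bool {
	return existing.txnID == txnID && existing.seq == seq
}

func main() {
	v := valueMeta{txnID: "n1@t200", seq: 3}
	fmt.Println(isOwnPriorWrite(v, "n1@t200", 3)) // retry of our own write: no-op
	fmt.Println(isOwnPriorWrite(v, "n2@t150", 3)) // someone else's write: conflict
}
```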



@bdarnell
Contributor

Reviewed 1 of 2 files at r1.
Review status: all files reviewed at latest revision, 14 unresolved discussions, all commit checks successful.


docs/tech-notes/parallel-commit.md, line 94 at r1 (raw file):

> A transaction is committed if and only if there is a transaction record with status COMMITTED, or if it has status STAGED and all of the writes of that transaction are present (either in intent or resolved form).

The idea is that the txn coordinator sends the commit in parallel with status `STAGED` and then, after returning success to the client, updates the status to `COMMITTED`. This means that in the common case, transactions are not encountered in `STAGED` state.

All intents must be written to the transaction record when setting its status to STAGED, right?
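The quoted commit condition can be restated as a toy check; all types and names here are invented for illustration, and `present` stands in for the intent-discovery lookup:

```go
package main

import "fmt"

type txnStatus int

const (
	pending txnStatus = iota
	staged
	committed
	aborted
)

type txnRecord struct {
	status         txnStatus
	intendedWrites []string // keys the txn promised to write
}

// isImplicitlyCommitted encodes the commit condition: COMMITTED is
// always committed; STAGED counts as committed iff every intended
// write is present (as an intent or an already-resolved value).
func isImplicitlyCommitted(rec txnRecord, present func(key string) bool) bool {
	switch rec.status {
	case committed:
		return true
	case staged:
		for _, k := range rec.intendedWrites {
			if !present(k) {
				return false
			}
		}
		return true
	default:
		return false
	}
}

func main() {
	rec := txnRecord{status: staged, intendedWrites: []string{"k1", "k2", "k3"}}
	written := map[string]bool{"k1": true, "k2": true, "k3": true}
	fmt.Println(isImplicitlyCommitted(rec, func(k string) bool { return written[k] }))
	delete(written, "k3") // a missing (prevented) write means not committed
	fmt.Println(isImplicitlyCommitted(rec, func(k string) bool { return written[k] }))
}
```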


docs/tech-notes/parallel-commit.md, line 96 at r1 (raw file):

The idea is that the txn coordinator sends the commit in parallel with status `STAGED` and then, after returning success to the client, updates the status to `COMMITTED`. This means that in the common case, transactions are not encountered in `STAGED` state.

The basic and extended proposal differ in their handling of intents. The basic proposal allows intent resolution only when the transaction record is `COMMITTED`. This in turn guarantees that when trying to recover the true transaction status from a `STAGED` transaction, a client may assume that a missing intent (which won't be writable in the future) implies that the transaction is aborted. In the extended proposal, intent resolution begins as soon as `DistSender` knows the transaction has committed (i.e. after having received successful responses from all requests forming the last batch). In this case, when checking intended writes, they may already be committed and must be linkable to the transaction which wrote them.

What exactly is a "missing intent"? What ensures that it won't be writeable in the future?

So the difference between "basic" and "extended" is that the extended proposal sends ResolveIntents in parallel with changing the txn's status from STAGED to COMMITTED?


docs/tech-notes/parallel-commit.md, line 136 at r1 (raw file):

Write(k1) ok Resolve(k1,k2,k3) ok

This improves latency but could reduce throughput because Resolve(k1) is now an additional raft write (when it was previously bundled into the commit). There's also some increased write amplification as the set of keys modified in the transaction are written twice (at Stage and Commit)


docs/tech-notes/parallel-commit.md, line 147 at r1 (raw file):

Write(k1) ok Resolve(k1,k2,k3) ok
Stage(k1) ok Commit(k1) ok

Resolve(k1) should go in the same batch as Commit(k1). If these are not in the same batch, they'll end up getting serialized by the command queue.


docs/tech-notes/parallel-commit.md, line 191 at r1 (raw file):

Write(k1) ok
Stage(k1) ok
Write(k2)  ok crash

What is the significance of this crash?


docs/tech-notes/parallel-commit.md, line 194 at r1 (raw file):

Write(k3)     fail
conflicting client:
Read(k1)           discover-txn Check(k1,k2,k3) Abort(k1) Resolve(k1,k2,k3)

How does this determine that k3 is missing and not simply in flight?


docs/tech-notes/parallel-commit.md, line 203 at r1 (raw file):

## Aborting a transaction

The canonical way of aborting a transaction remains writing the transaction record, which is possible unless it has state `COMMITTED`. Otherwise, it's possible that the transaction managed to write that txn record, but missed one of its intended writes, which we then must discover and prevent; for this we introduce a read-write command `CheckIntent` which checks the intent or versioned key at the commit timestamp (if any). If the write was not found and the reader intends to abort, the command populates the timestamp cache and returns with an error (to short-circuit fanout of the remainder of the batch).

In the basic protocol, a pusher can change the txn record from STAGED to ABORTED. But in the extended protocol, this must be disallowed, because the coordinator may already be in the process of resolving the intents. The extended protocol must make one of the intents unresolvable before it can change the txn status.
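A toy model of the CheckIntent interlock described above — all names here are invented, and the map stands in for the replica's timestamp cache:

```go
package main

import (
	"errors"
	"fmt"
)

type tsCache map[string]int64 // key -> low-water timestamp

type store struct {
	writes map[string]int64 // key -> timestamp of intent/value, if any
	ts     tsCache
}

var errWriteMissing = errors.New("intended write missing; txn can be aborted")

// checkIntent sketches the proposed read-write command: it looks for
// the txn's write at commitTS and, when it is absent and the caller
// intends to abort, populates the timestamp cache at commitTS so a
// delayed write at or below that timestamp will be rejected.
func (s *store) checkIntent(key string, commitTS int64, intendToAbort bool) error {
	if ts, ok := s.writes[key]; ok && ts <= commitTS {
		return nil // write present: cannot be used to abort
	}
	if intendToAbort {
		if s.ts[key] < commitTS {
			s.ts[key] = commitTS
		}
		return errWriteMissing
	}
	return nil
}

// canWrite models a late write attempt being prevented by the cache.
func (s *store) canWrite(key string, writeTS int64) bool {
	return writeTS > s.ts[key]
}

func main() {
	s := &store{writes: map[string]int64{}, ts: tsCache{}}
	err := s.checkIntent("k3", 100, true)
	fmt.Println(err != nil)            // write missing, abort path taken
	fmt.Println(s.canWrite("k3", 100)) // the delayed Write(k3) is now prevented
}
```

Note that a write at a *higher* timestamp would still pass `canWrite` in this model, which is exactly the epoch/restart subtlety discussed further down the thread.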



@tbg
Member Author

tbg commented Mar 26, 2018

Thanks for the comments, @bdarnell! I've responded to them but not updated the document yet. BTW, just in case you want to trim your efforts, I think the reasonable proposal to discuss is the basic DistSender one. The node timestamp proposal was required to make the extended DistSender proposal work, but I think this is a bit further out and accordingly it's a lot less thought out. Going from the basic to the extended DistSender proposal should be pretty straightforward when the discovery problem is solved, so going with basic first seems natural.


Review status: all files reviewed at latest revision, 14 unresolved discussions, all commit checks successful.


docs/tech-notes/node-txn-ids.md, line 18 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

A 4-byte node id plus a 12-byte timestamp is the same size as a UUID, so this is only more compact than the status quo when we're able to elide the timestamp (or use varint encoding).

Yes, the reduction would be to 4 bytes in the common case since the timestamp is identical. For the case in which it's not, I was thinking we would use varint encoding. Note that the new data lives in the values, not the keys.


docs/tech-notes/node-txn-ids.md, line 20 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

So you're proposing that MVCC keys change from (user key, timestamp) to (user key, timestamp, gateway node ID, delta from orig timestamp)? Spell this out, including the byte-level encoding.

How would you manage the migration? If we're going to make a change at this level we may want to consider it in conjunction with a change to the way historical versions are stored.

I haven't fully thought out what I'm proposing (but something like this is necessary for the extended parallel commit proposal). I thought the new fields would be added to the values. There appears to be no reason to add them to the keys. All you need to be able to do is to check the value at a timestamped version and retrieve the ID. That also makes the migration straightforward if you populate the new fields only after a cluster version bump. In fact it seems so straightforward that I'm worried I'm missing something.

Backup/restore would definitely need to pick up these new fields too, so there would be a concern about restoring data that was backed up before the migration fired, though the concern would be somewhat theoretical. I'm actually not sure how restore deals with intents. Isn't there a correctness problem anyway?

  1. backup backs up an intent for a committed txn (but that record lives elsewhere)
  2. intent gets resolved, transaction record gets GC'ed
  3. backup is restored, so now the intent is back
  4. intent gets deleted since the txn record is long gone.

Maybe backup/restore does something smarter with intents. Just checking here.


docs/tech-notes/node-txn-ids.md, line 26 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

In some ways, this feels like a step backwards - it assumes that there is a single unchanging coordinator for the transaction. We're moving towards more parallelism of transactions and we aspire to one day handle coordinator failover (may not be practical until we swap out the pgwire protocol for something custom).

Yeah, in that regard, it does feel like a step backwards. But it doesn't close the door to communicating via the txn record in general, it just makes that not the default. Also, parallel writes and failing over coordinators seem to me like one of the things we're better off postponing basically forever (I might be wrong here).


docs/tech-notes/node-txn-ids.md, line 30 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

As an alternative to this, we could have the coordinator and every node that encounters a conflict maintain a streaming RPC to the leaseholder of the txn record's range. The heartbeat status could be maintained in memory on that range instead of going through consensus writes (on lease transfer, we could simply reset the clock on all pending txn's heartbeats). This gives us some of the benefits here without major low-level changes.

Yeah, that could be an alternative. I'll include it as such.


docs/tech-notes/node-txn-ids.md, line 32 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Once we stop sending no-op writes through raft (#23942), the consensus latency will mostly go away. I think the things that still go through consensus at that point will still need to.

Right. A general reworking of the txnwaitqueue which currently isn't very smart addresses some of these issues. I do like the perceived simplicity of calling into the coordinator, but I agree that there are ways to get all that without putting a NodeID everywhere.


docs/tech-notes/node-txn-ids.md, line 55 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We could also address this issue of idempotency by storing information about committed transactions outside of the values themselves (a CommitSpan, similar to the AbortSpan. We used to have something like this in the ResponseCache, but removed it because the extra I/O and storage was deemed unnecessary).

The CommitSpan (and the response cache, which was wildly inefficient) had the problem that you were never sure when you could remove items from them. With the CommitSpan, we would have to introduce reference counting. It should work, too, but would shift weight from storage space to complexity. Not sure which one is better but probably the complexity.


docs/tech-notes/parallel-commit.md, line 94 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

All intents must be written to the transaction record when setting its status to STAGED, right?

No, you only need the intents that belong to the last batch. But practically I think we would write all of the intents into the record at this time (but making sure to split them in two sets, "last batch" vs "all previous batches").


docs/tech-notes/parallel-commit.md, line 96 at r1 (raw file):

What exactly is a "missing intent"? What ensures that it won't be writeable in the future?

a missing intent is one which is mentioned in the intended writes set in the transaction record, and is found as missing during the discovery process (which in turn also prevents it from being written in the future if the discoverer has the intent to abort the txn if possible, which should be the only reason for running the discovery process).

The name "discovery" should change to something that implies the mutating nature. Open to suggestions, bit of a shame "resolution" is already taken.


docs/tech-notes/parallel-commit.md, line 116 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

or perhaps SQL can stop using DeleteRange

Is this the same as #23258?

Not the same reason, but the same wish, yep. Thanks for linking in the issue.


docs/tech-notes/parallel-commit.md, line 136 at r1 (raw file):
That's correct, though often they're hopefully written to the same memtable and wouldn't actually increase write amplification (right?)

It could reduce throughput; we're doing slightly more work.

as the set of keys modified in the transaction are written twice (at Stage and Commit)

is correct, as an artifact of not being able to "amend" a kv pair. The commit wouldn't add any keys, but it needs to flip a bool, so it has to cycle it all in and out, unfortunately.


docs/tech-notes/parallel-commit.md, line 147 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Resolve(k1) should go in the same batch as Commit(k1). If these are not in the same batch, they'll end up getting serialized by the command queue.

I agree that they should get in the same batch and will update the example and outline how this would be achieved. But why would they get serialized? A commit only reads the abort cache, a committing resolve doesn't care about the abort cache. What's the key blocking them? (In the example, Commit(k1) would not resolve local intents but as you suggest we should make that the default which requires some additional work in DistSender to not send duplicate intent resolutions for that span).


docs/tech-notes/parallel-commit.md, line 191 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What is the significance of this crash?

It makes sure that the coordinator (whoever receives the TransactionAbortedError) doesn't clean up after itself. No significance other than that. I'll update this example to make that clear, agreed that it invites confusion.


docs/tech-notes/parallel-commit.md, line 194 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

How does this determine that k3 is missing and not simply in flight?

The full story is a bit more complicated than I was able to squeeze into the example. What I assume here is that the conflicting client is willing to abort the transaction. So what it does is not only determine that k3 is missing, it prevents the intent from being written (see "Aborting a transaction" below). I'll clarify this.


docs/tech-notes/parallel-commit.md, line 203 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

In the basic protocol, a pusher can change the txn record from STAGED to ABORTED. But in the extended protocol, this must be disallowed, because the coordinator may already be in the process of resolving the intents. The extended protocol must make one of the intents unresolvable before it can change the txn status.

With both proposals, the pusher can only ever change from STAGED to ABORTED by preventing an intent from being written. Intent resolution happening always implies that the transaction is in fact committed (even if it still says STAGED on the tin). The only difference in the extended protocol is that we make use of the fact that DistSender finds out about that first and resolve early, and we pay a price for that during the discovery step.

There is a subtlety that I will add to this section, which is that preventing the write via the timestamp cache only prevents the current epoch of the transaction. We have to make sure that we don't fall into the following trap:

  1. prevent intent with timestamp cache record at t=100
  2. txn restarts, now at t=101
  3. txn (re)writes all of its intents and updates its STAGED record, so it is in fact committed
  4. client from 1. still aborts the txn record

So after preventing one of the writes, the staged transaction record can only be aborted if it hasn't since changed its provisional commit timestamp.
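The resulting guard can be stated as a minimal sketch (timestamps as plain integers, names invented):

```go
package main

import "fmt"

// mayAbortStaged captures the guard described above: after preventing a
// write at timestamp preventedAt, an aborter may flip a STAGED record
// to ABORTED only if the record's provisional commit timestamp has not
// moved past the timestamp at which the write was prevented.
func mayAbortStaged(recordCommitTS, preventedAt int64) bool {
	return recordCommitTS <= preventedAt
}

func main() {
	// Write prevented at t=100 against a record staged at t=100: abort OK.
	fmt.Println(mayAbortStaged(100, 100))
	// Txn restarted and re-staged at t=101: the prevention is stale and
	// the abort must not proceed.
	fmt.Println(mayAbortStaged(101, 100))
}
```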



@bdarnell
Contributor

Review status: all files reviewed at latest revision, 14 unresolved discussions, all commit checks successful.


docs/tech-notes/node-txn-ids.md, line 20 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I haven't fully thought out what I'm proposing (but something like this is necessary for the extended parallel commit proposal). I thought the new fields would be added to the values. There appears to be no reason to add them to the keys. All you need to be able to do is to check the value at a timestamped version and retrieve the ID. That also makes the migration straightforward if you populate the new fields only after a cluster version bump. In fact it seems so straightforward that I'm worried I'm missing something.

Backup/restore would definitely need to pick up these new fields too, so there would be a concern about restoring data that was backed up before the migration fired, though the concern would be somewhat theoretical. I'm actually not sure how restore deals with intents. Isn't there a correctness problem anyway?

  1. backup backs up an intent for a committed txn (but that record lives elsewhere)
  2. intent gets resolved, transaction record gets GC'ed
  3. backup is restored, so now the intent is back
  4. intent gets deleted since the txn record is long gone.

Maybe backup/restore does something smarter with intents. Just checking here.

OK, adding the new information to roachpb.Value (or changing the on-disk values to a new tag-compatible proto) is more feasible than changing the key encoding. It's still a pretty far-reaching change.

BACKUP (and dump) should resolve any intents (waiting out any pending txns) before proceeding. (unverified)


docs/tech-notes/node-txn-ids.md, line 32 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Right. A general reworking of the txnwaitqueue which currently isn't very smart addresses some of these issues. I do like the perceived simplicity of calling into the coordinator, but I agree that there are ways to get all that without putting a NodeID everywhere.

@spencerkimball had some ideas today about making the txnWaitQueue smarter.


docs/tech-notes/parallel-commit.md, line 94 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

No, you only need the intents that belong to the last batch. But practically I think we would write all of the intents into the record at this time (but making sure to split them in two sets, "last batch" vs "all previous batches").

If you write all intents, then anyone who stumbles across a staged transaction can verify and commit it. If you only write the ones in the last batch, then only the original gateway can commit it and anyone else who encounters the intents will have to abort it. This feels like it might lead to more aborts compared to the current design, but I'm not sure (depends on how the txnWaitQueue works for staged txns, I think)


docs/tech-notes/parallel-commit.md, line 96 at r1 (raw file):

which in turn also prevents it from being written in the future if the discoverer has the intent to abort the txn if possible

My question is what is the mechanism that prevents the intent from being written in the future? [ok, I see below that it's the tscache] What if Write(k2) gets delayed until after the txn has been staged and other txns are trying to abort the txn? I'm not convinced that everything is safely interlocked, at least in the extended case.

a missing intent is one which is mentioned in the intended writes set in the transaction record

What about commands like CPut that only sometimes write an intent? We have to populate the txn record's list of writes before we know whether an intent will be written or not.


docs/tech-notes/parallel-commit.md, line 136 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

That's correct, though often they're hopefully written to the same memtable and wouldn't actually increase write amplification (right?)

It could reduce throughput. We're doing slightly more work.

as the set of keys modified in the transaction are written twice (at Stage and Commit)

is correct, as an artifact of not being able to "amend" a kv pair. The commit wouldn't add any keys but it needs to flip a bool, so it has to cycle it all in and out, unfortunately.

If the txn writes go to the same memtable, they'll only be written to one sstable, but there will still be two WAL writes.


docs/tech-notes/parallel-commit.md, line 147 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I agree that they should get in the same batch and will update the example and outline how this would be achieved. But why would they get serialized? A commit only reads the abort cache, a committing resolve doesn't care about the abort cache. What's the key blocking them? (In the example, Commit(k1) would not resolve local intents but as you suggest we should make that the default which requires some additional work in DistSender to not send duplicate intent resolutions for that span).

EndTransaction currently declares the abort span for read/write whether it's a commit or abort:

spans.Add(spanset.SpanReadWrite, roachpb.Span{Key: keys.AbortSpanKey(header.RangeID, header.Txn.ID)})

Although I think you're right that it doesn't actually need to unless it's an abort and we're resolving local intents.


docs/tech-notes/parallel-commit.md, line 191 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

It makes sure that the coordinator (whoever receives the TransactionAbortedError) doesn't clean up after itself. No significance other than that. I'll update this example to make that clear, agreed that it invites confusion.

Ah, OK. I read this as the node responsible for k2 crashing.


docs/tech-notes/parallel-commit.md, line 203 at r1 (raw file):

With both proposals, the pusher can only ever change from STAGED to ABORTED by preventing an intent from being written.

OK, I missed the fact that CheckIntent populates the timestamp cache. (I was thinking it was like the tscache-oblivious ResolveIntent.) But to the point above, this requires that all intents be written to the txn record when it is staged, so that the txn can become committed even if the gateway node is gone.



@codecov-io

codecov-io commented Mar 27, 2018

Codecov Report

Merging #24194 into master will decrease coverage by 0.89%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #24194     +/-   ##
=========================================
- Coverage   75.65%   74.76%   -0.9%     
=========================================
  Files         867      845     -22     
  Lines      127703   125960   -1743     
=========================================
- Hits        96609    94168   -2441     
- Misses      23887    24762    +875     
+ Partials     7207     7030    -177
Impacted Files Coverage Δ
pkg/sql/opt/constraint/constraint_set.go 0% <0%> (-65.22%) ⬇️
pkg/sql/opt/constraint/columns.go 37.77% <0%> (-26.67%) ⬇️
pkg/sql/opt/constraint/span.go 71.87% <0%> (-16.67%) ⬇️
pkg/storage/batcheval/cmd_reverse_scan.go 88.88% <0%> (-11.12%) ⬇️
pkg/sql/opt/constraint/constraint.go 81.02% <0%> (-9.75%) ⬇️
pkg/sql/opt/constraint/spans.go 73.61% <0%> (-9.73%) ⬇️
pkg/sql/explain_distsql.go 77.77% <0%> (-7.41%) ⬇️
pkg/sql/opt_catalog.go 43.58% <0%> (-6.82%) ⬇️
pkg/sql/distsqlrun/metadata_test_sender.go 84.37% <0%> (-6.67%) ⬇️
pkg/sql/opt/constraint/key.go 88.49% <0%> (-6.2%) ⬇️
... and 290 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8d7113f...a2a46fc. Read the comment docs.

@danhhz
Contributor

danhhz commented Mar 27, 2018

Review status: all files reviewed at latest revision, 14 unresolved discussions, all commit checks successful.


docs/tech-notes/node-txn-ids.md, line 20 at r1 (raw file):

BACKUP (and dump) should resolve any intents (waiting out any pending txns) before proceeding. (unverified)

Yup. If backup hits an intent at or below the timestamp it's running at, it returns WriteIntentError and gets retried until no intents are outstanding. (So, backups never include intents)



@spencerkimball
Member

The changed transaction IDs proposal would be much better if it left out the direct-to-coordinator communication. I think that should not be conflated with the idea of a txn ID which is collapsed into each versioned value, because I find that part of the proposal not particularly useful and overly complicating. It just adds redundant mechanisms and IMO will make the system more difficult to understand, debug, and improve. The node ID should be part of the txn ID only because it provides uniqueness, not to add unnecessary new coordination through another node.

That said, I think the changed transaction IDs, despite the complexity they add, enable three key wins that justify them:

  • improved idempotency and reduction of ambiguous result errors
  • reduction of client-latency to one round of consensus for multi range txns
  • reduction of non-txn-record-range intent resolution from three rounds of consensus to two rounds of consensus

Review status: all files reviewed at latest revision, 15 unresolved discussions, all commit checks successful.


docs/tech-notes/node-txn-ids.md, line 9 at r1 (raw file):

- Conflict resolution can contact the gateway node directly, without going through consensus via the transaction record. In particular, this makes it feasible to not write the transaction record until late in the life of a transaction (which should have measurable performance implications), and does away with a lot of the complexity around having to track the `Writing` flag which has devoured countless engineering hours.
- MVCC versioned values contain a timestamp in the key which in the common case is equal or very close to the transaction's original timestamp, so in turn the overhead for storing the transaction ID in *committed* values becomes cheap (just add the NodeID part). This enables
  - CDC (so that committed values can be grouped into transactions)

We should discuss this. I'm not convinced this should be a feature of CDC, although it keeps coming up. I'd like to understand the motivation behind exposing anything about transactions to CDC.


docs/tech-notes/node-txn-ids.md, line 20 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

BACKUP (and dump) should resolve any intents (waiting out any pending txns) before proceeding. (unverified)

Yup. If backup hits an intent at or below the timestamp it's running at, it returns WriteIntentError and gets retried until no intents are outstanding. (So, backups never include intents)

Adding a delta in the value which adjusts the [possibly incorrect] timestamp in the MVCC key adds non-trivial conceptual complexity. This change must enable big benefits to justify it.


docs/tech-notes/node-txn-ids.md, line 26 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Yeah, in that regard, it does feel like a step backwards. But it doesn't close the door to communicating via the txn record in general, it just makes that not the default. Also, this idea of parallel writes and failing over coordinators seems to me like one of the things we're better off postponing basically forever (I might be wrong here).

Adding another way for transactions to communicate / coordinate is complicated; the benefits must be sizable.

Also, while Cockroach inherently handles scale for the current transaction coordination mechanism by splitting and rebalancing ranges in which transaction records live, the txn coordinators are more problematic, because they're easy targets for imbalance of client requests. So that is another drawback worth considering, in addition to the fact that referencing the gateway node ID specifically makes that association far more immutable than it is in principle, if not in practice, today.


docs/tech-notes/node-txn-ids.md, line 32 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

@spencerkimball had some ideas today about making the txnWaitQueue smarter.

This section about communicating with the coordinator directly has some gaps. There is no consensus now when waiting on a transaction to commit / abort in the push txn queue. The push txn evaluation fails, and the waiter gets added to the queue. Waiters are returned when the waitee is finalized. The only time we have consensus on a PushTxn is when the txn record isn't there. The same problem would afflict the pusher if talking to the coordinator – the txn would be absent and it would have to fall back to the existing PushTxn pathway and would go through consensus to write the ABORTED record.

@bdarnell, the idea of adding deadlock detection again, elsewhere, should raise the hairs on the back of your neck.

@tschottdorf, I think we can do much better on the txn wait queue for cases where there is dramatic contention, but that's a separate conversation.


docs/tech-notes/node-txn-ids.md, line 55 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

The CommitSpan (and the response cache, which was wildly inefficient) had the problem that you were never sure when you could remove items from them. With the CommitSpan, we would have to introduce reference counting. It should work, too, but would shift weight from storage space to complexity. Not sure which one is better but probably the complexity.

I like this property.


docs/tech-notes/parallel-commit.md, line 94 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

If you write all intents, then anyone who stumbles across a staged transaction can verify and commit it. If you only write the ones in the last batch, then only the original gateway can commit it and anyone else who encounters the intents will have to abort it. This feels like it might lead to more aborts compared to the current design, but I'm not sure (depends on how the txnWaitQueue works for staged txns, I think)

I agree. If you don't write all of the intents, then a concurrent reader / writer who is attempting to commit the staged transaction would otherwise not create a finalized transaction record that contained all of the intents. This could mess up eventual GC and lose writes via still-unresolved intents after the transaction record is cleaned up.



@tbg
Member Author

tbg commented Mar 28, 2018

@spencerkimball I agree that the coordinator based conflict resolution is not a clear win (and for that reason it's not mentioned in the parallel commits note). I'll make this less central and point out some pros and cons more concisely.


Review status: all files reviewed at latest revision, 15 unresolved discussions, all commit checks successful.


docs/tech-notes/node-txn-ids.md, line 9 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

We should discuss this. I'm not convinced this should be a feature of CDC, although it keeps coming up. I'd like to understand the motivation behind exposing anything about transactions to CDC.

Yeah, it's also a requirement that I'm not fond of, because the implementation is awkward. I think @danhhz is aware that this is something we'll need to figure out. I don't think it's set in stone at the moment. The reason I mention it here is that it would enable this should it be required; I'll make that clearer though.


docs/tech-notes/node-txn-ids.md, line 26 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Adding another way for transactions to communicate / coordinate is complicated; the benefits must be sizable.

Also, while Cockroach inherently handles scale for the current transaction coordination mechanism by splitting and rebalancing ranges in which transaction records live, the txn coordinators are more problematic, because they're easy targets for imbalance of client requests. So that is another drawback worth considering, in addition to the fact that referencing the gateway node ID specifically makes that association far more immutable than it is in principal, if not practice, today.

The idea is that it would remove the current mechanisms and the long tail of complexity that comes with BeginTransaction and the Writing flag which is very worthwhile. Other than that, I'm not sold on contact-the-coordinator, but it has advantages especially with the parallel commits.


docs/tech-notes/node-txn-ids.md, line 32 at r1 (raw file):

@bdarnell, the idea of adding deadlock detection again

It definitely wouldn't be added again, but moved.


docs/tech-notes/parallel-commit.md, line 94 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

I agree. If you don't write all of the intents, then a concurrent reader / writer who is attempting to commit the staged transaction would otherwise not create a finalized transaction record that contained all of the intents. This could mess up eventual GC and lose writes via still-unresolved intents after the transaction record is cleaned up.

Yeah, that's right. The committed record must contain all intents and there's no hope of getting that (in general) if the staged one doesn't. But you only need to touch the intents for the last batch when dealing with a STAGED record.


docs/tech-notes/parallel-commit.md, line 96 at r1 (raw file):
What kind of history are you concerned about? If a write gets delayed, it can get prevented or not. Only if the intent gets laid down there will be an attempt at committing.

What about commands like CPut that only sometimes write an intent?

Yeah, our current transactional model is flexible and allows you to continue using the transaction after an error, though as far as I'm aware we don't do that in practice. The idea would be that we stop doing that, at least in the last batch. Once you lay down a staged entry, it's all or nothing; you don't get to try again. (The proposal could be extended, but it adds more complexity and I don't think we should go there.)


docs/tech-notes/parallel-commit.md, line 136 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

If the txn writes go to the same memtable, they'll only be written to one sstable, but there will still be two WAL writes.

Good point. Yeah, that's a drawback. These losses can be recovered using the ideas in
#22349. You can hold the commit in memory for a short amount of time and try to piggy-back it on another proposal, thus saving the extra write, while you can already tell others about the committed record in the meantime. There are likely subtleties to that (GC for one) and we're likely not going to do that for quite some time. It seems like general Raft-level batching (i.e. folding qualifying nonoverlapping proposals arriving at about the same time into one) would address this kind of problem in general and reduce WAL writes across the board.


docs/tech-notes/parallel-commit.md, line 147 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

EndTransaction currently declares the abort span for read/write whether it's a commit or abort:

spans.Add(spanset.SpanReadWrite, roachpb.Span{Key: keys.AbortSpanKey(header.RangeID, header.Txn.ID)})

Although I think you're right that it doesn't actually need to unless it's an abort and we're resolving local intents.

Filed #24285.



@nvb
Contributor

nvb commented Mar 28, 2018

Thanks for splitting out the basic and extended proposals for parallel commits. To summarize our previous conversation at the risk of oversimplification, the tradeoff between the two approaches and the current state of Cockroach (in the happy path) comes down to:

current txn model:
- txn commit takes place after all writes have succeeded
- intent resolution begins after switching txn record from PENDING to COMMITTED 

basic proposal:
- txn commit takes place in parallel with the final batch of writes
- intent resolution begins after switching txn record from STAGING to COMMITTED

extended proposal:
- txn commit takes place in parallel with the final batch of writes
- intent resolution begins in parallel with switching txn record from STAGING to COMMITTED

So when we think about the effect of these two proposals, it might be fair to say that

  • the basic proposal reduces the latency of a given transaction for the client that issues it by 1 round trip
  • the extended proposal reduces the latency of a given transaction for the client that issues it and all clients that contend with it by 1 round trip

Review status: all files reviewed at latest revision, 15 unresolved discussions, all commit checks successful.


docs/tech-notes/parallel-commit.md, line 44 at r1 (raw file):

BeginTransaction(1)
CPut(1)
EndTransaction(1)

One interesting thing to note here is that this would currently be considered a 1PC transaction, which means that we would omit writing a txn record entirely. If there are other parallel batches then we'll need to avoid this optimization and write a txn record in the STAGED state immediately.


docs/tech-notes/parallel-commit.md, line 109 at r1 (raw file):

the batch which wrote the transaction record

I may have missed it, but nowhere in this proposal are you suggesting that we wait until STAGING time to write the transaction record, right? So you actually meant the batch which is STAGING the transaction record.

I think it's important to note somewhere that not only is this a set of point writes that the batch which commits the transaction record is intending to do, but they are a set of point writes that the batch which commits the transaction record is promising to do. Once a transaction moves into the STAGING state, it must succeed in writing an intent at every single one of these keys before being marked as COMMITTED.

@bdarnell may have been alluding to this earlier, but to be clear, this means that all supported operations that can be sent in parallel with a commit must write an intent on success, no matter what. We'll have to be careful about write requests that have no-op paths. This is a concern I've had about #16026 (comment) as well.


docs/tech-notes/parallel-commit.md, line 203 at r1 (raw file):

## Aborting a transaction

The canonical way of aborting a transaction remains writing the transaction record, which is possible unless it has state `COMMITTED`. Otherwise, it's possible that the transaction managed to write that txn record, but missed one of its intended writes, which we then must discover and prevent; for this we introduce a read-write command `CheckIntent` which checks the intent or versioned key at the commit timestamp (if any). If the write was not found and the reader intends to abort, the command populates the timestamp cache and returns with an error (to short-circuit fanout of the remainder of the batch).

Why does CheckIntent need to be read-write? Your plan is for this to be shared with #16026 (comment), right?


docs/tech-notes/parallel-commit.md, line 217 at r1 (raw file):

## 1PC Transactions

No change here, but the new code in `DistSender` needs to make sure to maintain the current behavior.

See my comment above. I think there is a subtle change here.



@tbg
Member Author

tbg commented Mar 28, 2018

@nvanbenschoten That's a good summary, I put that in the intro.

Pushed an update that mostly tries to improve the presentation (though I'm sure there are some easter eggs left). In adding more examples I also found an important case that I was neglecting, namely that in which one of the final writes actually pushes the timestamp (or sets the write_too_old flag). I haven't fleshed that out yet, but it's a pretty central case that I think we can handle but need to see how to efficiently do so.

I haven't bothered with the node id note for now, especially now that the parallel commit note focuses on the basic proposal and has a clean interface to the other one (we need txn ids in the values).


Review status: 1 of 2 files reviewed at latest revision, 19 unresolved discussions.


docs/tech-notes/parallel-commit.md, line 44 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

One interesting thing to note here is that this would currently be considered a 1PC transaction, which means that we would omit writing a txn record entirely. If there are other parallel batches then we'll need to avoid this optimization and write a txn record in the STAGED state immediately.

Heh, that's a good point, but for the purpose of the example, we should forget there's a 1PC path. I added a sentence to that effect.


docs/tech-notes/parallel-commit.md, line 94 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Yeah, that's right. The committed record must contain all intents and there's no hope of getting that (in general) if the staged one doesn't. But you only need to touch the intents for the last batch when dealing with a STAGED record.

Done.


docs/tech-notes/parallel-commit.md, line 109 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

the batch which wrote the transaction record

I may have missed it, but nowhere in this proposal are you suggesting that we wait until STAGING time to write the transaction record, right? So you actually meant the batch which is STAGING the transaction record.

I think it's important to note somewhere that not only is this a set of point writes that the batch which commits the transaction record is intending to do, but they are a set of point writes that the batch which commits the transaction record is promising to do. Once a transaction moves into the STAGING state, it must succeed in writing an intent at every single one of these keys before being marked as COMMITTED.

@bdarnell may have been alluding to this earlier, but to be clear, this means that all supported operations that can be sent in parallel with a commit must write an intent on success, no matter what. We'll have to be careful about write requests that have no-op paths. This is a concern I've had about #16026 (comment) as well.

Very good point about no-op writes, added a section.

The proposal doesn't care when you write the record. In the interest of not conflating more than I already have, I'd just stick with what's there: BeginTransaction creates the record eagerly so it'll be there when you stage. This doesn't seem required but it's orthogonal and so I think we're better off discussing it separately.


docs/tech-notes/parallel-commit.md, line 136 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Good point. Yeah, that's a drawback. These losses can be recovered using the ideas in
#22349. You can hold the commit in memory for a short amount and try to piggy-back it on another proposal, thus saving the extra write, while you can already tell others about the committed record in the meantime. There are likely subtleties to that (GC for one) and we're likely not going to do that for quite some time. This seems like general Raft-level batching (i.e. fold qualifying nonoverlapping proposals arriving at about the same time into one) would address this kind of problem in general and reduce WAL writes across the bank.

Done.


docs/tech-notes/parallel-commit.md, line 147 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Filed #24285.

Example is hopefully in better shape now, but I'm sure I missed some things still.


docs/tech-notes/parallel-commit.md, line 203 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why does CheckIntent need to be read-write? Your plan is for this to be shared with #16026 (comment), right?

This has changed to a read request (as it ought to) in the latest proposal. Wasn't thinking about #16026 in particular but it may be possible to have a common implementation. The one in #16026 needs to be fast; with the one here that's less critical.


docs/tech-notes/parallel-commit.md, line 217 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

See my comment above. I think there is a subtle change here.

Updated the wording, but there shouldn't be a real change here as discussed.



@bdarnell
Contributor

Reviewed 1 of 1 files at r2.
Review status: all files reviewed at latest revision, 18 unresolved discussions, some commit checks failed.


docs/tech-notes/node-txn-ids.md, line 20 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Adding a delta in the value which adjusts the [possibly incorrect] timestamp in the MVCC key adds non-trivial conceptual complexity. This change must enable big benefits to justify it.

The timestamp delta is essentially a specialized compression scheme. I think it can be pretty well isolated at the lower levels, so it should be manageable. But I'm still not sure it's worth it. The snappy compression that's already used may already be enough to reduce the size of the timestamp (especially for the common case when it is unchanged)


docs/tech-notes/node-txn-ids.md, line 32 at r1 (raw file):

@bdarnell, the idea of adding deadlock detection again, elsewhere, should raise the hairs on the back of your neck.

Definitely. I'm advocating making the txnWaitQueue (and all the push/conflict machinery around the transaction record) smarter instead of moving responsibility away from it.


docs/tech-notes/node-txn-ids.md, line 55 at r1 (raw file):

The CommitSpan (and the response cache, which was wildly inefficient) had the problem that you were never sure when you could remove items from them.

That's true of the AbortSpan too. If you manage to get a long running/replayed transaction after the 1h gc threshold, you can see anomalies. For a CommitSpan, the problem is that A) commits are more common than poisoning aborts so we must control the cost of these extra records and B) a GC'd AbortSpan record can only cause anomalies in transactions that are unable to be committed anyway, while a GC'd CommitSpan could (maybe?) allow a 1PC txn to apply twice.

If we're not too worried about B (and we probably shouldn't be with the min proposal timestamp blocking things from happening long after the fact), we could GC the transaction IDs out of the values after enough time has passed. We could even do this as a rocksdb compaction filter so we don't generate extra I/O to do it.


docs/tech-notes/parallel-commit.md, line 96 at r1 (raw file):

What kind of history are you concerned about?

  1. Stage txn1, concurrently write to keys A and B
  2. Staged txn and key A are written, write to key B gets delayed. Coordinator crashes.
  3. Two new transactions encounter this staged txn record and start to check the commit condition.
  4. Txn2 sees write A but not write B, so txn1 needs to be aborted.
  5. The write intent to key B applies.
  6. Txn3 sees writes A and B, so txn1 needs to be committed.
  7. Txn2 and Txn3 fire off their resolve and EndTransaction requests in parallel, potentially leading to inconsistent results.

I see now that this is handled by the fact that checking the intent updates the timestamp cache so the delayed write to key B can't happen.

Yeah, our current transactional model is flexible and allows you to continue using the transaction after an error, though as far as I'm aware we don't do that in practice. The idea would be that we stop doing that, at least in the last batch.

OK, I guess it's the case that now all of our transactional writes either write an intent or return an error (the no-ops that @nvanbenschoten is looking at are all non-transactional operations). But again, this seems like it might be moving in the wrong direction. We've talked about making ConditionFailedError a kind of result so you could more easily continue past it, and the general push to move more predicates and filters from SQL down to KV could lead to more kinds of conditional writes. It would be unfortunate if we had to give that up and force consensus writes for these no-ops. (Limiting this to writes in the last batch would mitigate that somewhat, at the expense of yet another special case that impacts transactional correctness)


docs/tech-notes/parallel-commit.md, line 27 at r2 (raw file):

The basic proposal can only commence intent resolution after marking the transaction record as `COMMITTED`. Swift intent removal is important to reduce wait times under contention, and the extended proposal improves on this by commencing intent resolution once the `STAGED` record and the intents have been laid down.

The price to pay for this extension is having to embed a transaction ID into every versioned value. This is an invasive but well-contained change that should not be taken lightly. The sibling tech note on [Node Transaction IDs](node-txn-ids.md) explores this circle of ideas, and in particular proposes a transaction ID scheme that avoids overhead for the per-version ID storage. The author is of the opinion that the basic proposal should be implemented first.

Here's an alternative to embedding the transaction ID: make ResolveIntent update the timestamp cache. This way, when a transaction is aborted and its intents resolved, any new value that gets written will have to be at a higher timestamp, and will therefore be distinguishable in the future.

The downside to this is that it pushes timestamps on conflicts that wouldn't today, but I think that's probably not a big deal. Most SQL writes use CPut which updates the timestamp cache anyway, the txnWaitQueue makes aborts relatively rare, and RefreshSpans makes timestamp pushes less painful when they happen.


docs/tech-notes/parallel-commit.md, line 152 at r2 (raw file):

If the check fails, the read returns with a structured error (`WritePreventedError`) but still populates the timestamp cache (similar to a conditional put), thus guaranteeing that the intent will not be laid down at that timestamp later on. Returning an error is a performance optimization: a (possibly moderately large) batch of these requests will be passed to `DistSender`, and returning a structured error early short-circuits this execution when it is no longer necessary; the caller interprets the structured error and knows that it has succeeded in preventing the promised write at the specified timestamp.

Note that `PreventIntentRequest` is odd in that it only really needs to populate the timestamp cache on error. This could be optimized but is likely not relevant, as use of the request in practice should be rare.

Shouldn't the error semantics be inverted now that this has been renamed to PreventIntent? It returns nil and updates the tscache if no intent is found, and returns an error if the value is there (or just do everything in the return value instead of using errors)

Updating the tscache should be cheap, since in the cases where we don't need to update it someone else will probably have updated the tscache for this key anyway.


docs/tech-notes/parallel-commit.md, line 156 at r2 (raw file):

### The full process

To run full status resolution, a client runs the following process.

What component is responsible for doing this? Is it part of the intentResolver's handling of WriteIntentErrors?


docs/tech-notes/parallel-commit.md, line 158 at r2 (raw file):

To run full status resolution, a client runs the following process.

1. retrieve the *promised writes* from the `STAGED` transaction record, and note the transaction ID and provisional commit timestamp. Note that we are guaranteed that [these are all point requests](#ranged-intents).

Add the delay step from the "when to run this" paragraph.


docs/tech-notes/parallel-commit.md, line 161 at r2 (raw file):

1. construct a batch at the provisional commit timestamp that contains a `PreventIntentRequest` for each *promised write*, and includes the provisional commit timestamp and the transaction ID.
1. Run the batch, and act depending on the outcome:
    1. on `WritePreventedError`, an intent was prevented and we can move forward and abort the transaction. But note a subtlety: the transaction may have restarted and written a new `STAGED` record with a higher provisional commit timestamp, and may now be (implicitly or explicitly) committed. Thus, if the provisional commit timestamp changed, we start over with the status resolution process. We may want to add a flag `abort_on_restart` to the transaction record which can be set in this case to avoid many iterations of this process when necessary.

Talk more about how you can restart after writing a STAGED record.

How do we discover that the provisional commit timestamp changed? Doesn't this effectively prevent us from resolving intents in parallel with marking the transaction as aborted? (maybe we don't care about that for aborts and only need to parallelize resolves for txns committed by their original coordinator)


docs/tech-notes/parallel-commit.md, line 163 at r2 (raw file):

Note that the fact that we discovered the transaction as committed implies that the original client, if still alive, can only ever issue an EndTransaction that attempts to COMMIT the transaction as well.

This is an interesting constraint. So far, a coordinator can always choose to abort its own transaction (even if it is already committed, the abort attempt will fail safely). This is embodied in methods like TxnCoordSender.tryAsyncAbort. We'll have to be careful to guard against aborting staged transactions (maybe changing tryAsyncAbort to use PushTxn, even though we just rejected such a change).



@tbg
Member Author

tbg commented Mar 30, 2018

Only comment responses in this round (i.e. no updates to the text).


Review status: all files reviewed at latest revision, 24 unresolved discussions, some commit checks failed.


docs/tech-notes/node-txn-ids.md, line 20 at r1 (raw file):
On a meta level, how do you think we should proceed with this part of the proposal? While I think some of the ideas are interesting, I don't see us doing anything here in the near term. Should this remain a rough brain dump (with the discussion worked in) that is checked in or just removed? I doubt it makes sense to turn it into an RFC at this point.

The snappy compression that's already used may already be enough

Yeah, maybe. Switching the txn id scheme is the big deal. Whether we add the specialized compression here is a small difference (I'd guess it's probably worth it, but don't know that).


docs/tech-notes/node-txn-ids.md, line 55 at r1 (raw file):

We could even do this as a rocksdb compaction filter so we don't generate extra I/O to do it.

Yeah, but these values would have to be replicated (ok, maybe they don't, and we just lose idempotency on lease change -- seems OK), so doing that is roughly as complex as making GC use compaction filters (which I think we'll end up doing (again) one day). We'd need to teach the consistency checker these semantics.


docs/tech-notes/parallel-commit.md, line 96 at r1 (raw file):

But again, this seems like it might be moving in the wrong direction.

Could you expand on that? As I see it, the impact I assign to parallel commit vastly outmatches the importance of optimizing for noop CPuts so I wonder where you're coming from.

Luckily I think you can have your cake and eat it too: You can actually contain this in DistSender by falling back to the basic proposal in these cases. When any writes come back as no-ops (they must tell you) in the last wave, DistSender waits until it has marked the record as COMMITTED to resolve intents (that's just the basic proposal). Conflicting writers eat a little more delay, but the client itself gets the optimal latency. You leave around a STAGED record that contains a missing write for a short moment, but this does not matter. If anyone starts status resolution and "prevents" that write, they may beat you and abort you, but that's a small price to pay. More than likely you will commit first and preempt them.


docs/tech-notes/parallel-commit.md, line 27 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Here's an alternative to embedding the transaction ID: make ResolveIntent update the timestamp cache. This way, when a transaction is aborted and its intents resolved, any new value that gets written will have to be at a higher timestamp, and will therefore be distinguishable in the future.

The downside to this is that it pushes timestamps on conflicts that wouldn't today, but I think that's probably not a big deal. Most SQL writes use CPut which updates the timestamp cache anyway, the txnWaitQueue makes aborts relatively rare, and RefreshSpans makes timestamp pushes less painful when they happen.

This only solves one half of the problem but now I see the convenient way out and I don't think we need to embed the txn ID everywhere. Exciting!!

The interesting half of the problem is if you can't prevent any of the writes. This means that for each key you check, you either find an intent of that txn at the right timestamp or a versioned value at the right timestamp. Let's assume that there's at least one versioned value; otherwise you're home free because the intents tell you whether the transaction is committed (note that the original client, even if it didn't receive all of the responses, must do proper status resolution so that it doesn't clobber around here while we decide).

So now we're in serious trouble. Did the transaction actually write everything and resolve it, or did it lay down a bunch of intents and then fail on that version at the same timestamp?[1]

If the latter, then we might (always?) see an intent of the transaction at a higher timestamp. If the "always" is true, then that actually solves the problem already. I'm just not sure that's really true (and whether we want to make it required forever; so far we do this to avoid starvation). It's definitely not true for 1PC transactions today (though that looks changeable).

Assuming we're still in need of a solution: if we're willing to accept a small probabilistic risk, DistSender can hash the values of all proposed writes together and embed the result in the staged transaction record. Then, the status resolution process can recompute the hash from the actual values and decide based on that: if the hash matches, we trust that the versions we found are legitimate and commit the transaction. If not, this proves that the transaction didn't actually manage to perform any of its writes (because it would only resolve intents when it knew they would become visible, so it couldn't have started doing that, so that versioned value was written by someone else). Note that we would only use the hash in the rare situation above, so what we're paying here is a tiny bit of CPU to compute the hash and the storage in the txn record.
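To make the idea concrete, here is a rough Go sketch of such a digest. This is a hypothetical illustration, not CockroachDB code; in particular, hashing sorted (key, value) pairs is an assumption on my part, since the comment above only says to hash the values together.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// hashPromisedWrites sketches the digest DistSender could embed in the
// STAGED transaction record. It hashes the promised writes' (key, value)
// pairs in sorted key order so the status resolution process can
// deterministically recompute it from the versioned values it finds.
func hashPromisedWrites(writes map[string][]byte) [sha256.Size]byte {
	keys := make([]string, 0, len(writes))
	for k := range writes {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order, independent of batch layout
	h := sha256.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write(writes[k])
	}
	var sum [sha256.Size]byte
	copy(sum[:], h.Sum(nil))
	return sum
}

func main() {
	// DistSender computes this when staging the final batch...
	staged := hashPromisedWrites(map[string][]byte{"k1": []byte("a"), "k2": []byte("b")})
	// ...and the resolver recomputes it from the values it actually finds.
	found := hashPromisedWrites(map[string][]byte{"k2": []byte("b"), "k1": []byte("a")})
	fmt.Println(staged == found) // a match means we trust the versions we found
}
```

A mismatch would tell the resolver that at least one of the versioned values it found was written by some other transaction.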


docs/tech-notes/parallel-commit.md, line 152 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Shouldn't the error semantics be inverted now that this has been renamed to PreventIntent? It returns nil and updates the tscache if no intent is found, and returns an error if the value is there (or just do everything in the return value instead of using errors)

Updating the tscache should be cheap, since in the cases where we don't need to update it someone else will probably have updated the tscache for this key anyway.

This was originally called CheckIntent, but I think highlighting the mutating nature makes sense. CheckOrPreventIntent is my next best name for this (returning an error on prevention); do you have a better idea?


docs/tech-notes/parallel-commit.md, line 156 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What component is responsible for doing this? Is it part of the intentResolver's handling of WriteIntentErrors?

That's an implementation detail, but since it has to be done by various parties (anyone seeing intents, also DistSender itself, the GC queue, ...) I think it makes sense to have it on the intent resolver.


docs/tech-notes/parallel-commit.md, line 161 at r2 (raw file):

Talk more about how you can restart after writing a STAGED record.

Will do, but haven't yet.

How do we discover that the provisional commit timestamp changed?

You have to check the timestamp as you attempt to change the status of the txn (so it's like a CPut on that field).

Doesn't this effectively prevent us from resolving intents in parallel with marking the transaction as aborted? (maybe we don't care about that for aborts and only need to parallelize resolves for txns committed by their original coordinator)

Yes, that's correct. You run status resolution, then read the txn record, then do stuff in parallel (or update the txn record first, but then you eat consensus latency for aborting the intents). This logic is needed by everything including a simple push, so I think it needs to be an admin command that replaces PushTxn. (See the other comment below).


docs/tech-notes/parallel-commit.md, line 163 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Note that the fact that we discovered the transaction as committed implies that the original client, if still alive, can only ever issue an EndTransaction that attempts to COMMIT the transaction as well.

This is an interesting constraint. So far, a coordinator can always choose to abort its own transaction (even if it is already committed, the abort attempt will fail safely). This is embodied in methods like TxnCoordSender.tryAsyncAbort. We'll have to be careful to guard against aborting staged transactions (maybe changing tryAsyncAbort to use PushTxn, even though we just rejected such a change).

Yeah, tryAsyncAbort is a bit of a tangle that I wish we had never introduced. But a similar problem presents itself on SQL-level rollbacks: if the last batch comes back with some obscure error, we issue an explicit abort today, which has the same problems. Note that PushTxn has to become an admin command because it needs to do status resolution. I think that means we need to make PushTxn an admin command and introduce a lower-level ChangeStatus command that is only issued after proving the true status (and has the right timestamp semantics, etc.).


Comments from Reviewable

@tbg
Member Author

tbg commented Mar 30, 2018

Review status: all files reviewed at latest revision, 24 unresolved discussions, some commit checks failed.


docs/tech-notes/parallel-commit.md, line 96 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

But again, this seems like it might be moving in the wrong direction.

Could you expand on that? As I see it, the impact I assign to parallel commit vastly outmatches the importance of optimizing for noop CPuts so I wonder where you're coming from.

Luckily, I think you can have your cake and eat it too: you can actually contain this in DistSender by falling back to the basic proposal in these cases. When any write comes back as a no-op (they must tell you) in the last wave, DistSender waits until it has marked the record as COMMITTED before resolving intents (that's just the basic proposal). Conflicting writers eat a little more delay, but the client itself gets the optimal latency. You leave around a STAGED record that contains a missing write for a short moment, but this doesn't matter. If anyone starts status resolution and "prevents" that write, they may beat you and abort you, but that's a small price to pay. More likely than not, you will commit first and preempt them.

Ah, disregard that last post. You can't just run the basic proposal. You have to wait to tell the client about the commit because you could get aborted.



@tbg
Member Author

tbg commented Mar 30, 2018

Note to self: a "promised write" also needs to contain the sequence number (in congruence with #16026). Otherwise, an intent written earlier in the life of the transaction and attempted-but-failed-to-be-overwritten in the last batch would not be detected as missing.

I have an inkling that this proposal is one that could actually profit from being model checked.


Review status: all files reviewed at latest revision, 24 unresolved discussions, some commit checks failed.



@tbg
Member Author

tbg commented Apr 1, 2018

Note to self (2): the STAGED transaction record also has to contain the CommitTrigger (which may need to be carried over to the committing EndTransaction that the status resolution process may have to synthesize)

@tbg tbg mentioned this pull request Apr 1, 2018
@a-robinson
Contributor

Reviewed 1 of 2 files at r1, 1 of 1 files at r2.
Review status: all files reviewed at latest revision, 24 unresolved discussions, some commit checks failed.


docs/tech-notes/parallel-commit.md, line 132 at r2 (raw file):

#### Ranged Intents

A problem we need to address is that transactions may contain *ranged intents*. They are a result of either ranged writes (`DeleteRange`) or the condensing of spans (for large transactions). For a transaction containing ranged writes in its final batch, it's expensive and complicated to accurately check the commit condition, and we consider it out of scope.

It'll be interesting to benchmark whether the threshold for a large transaction to switch to approximate intent spans is also a good threshold for this. The cost of checking for promised intents may get expensive way before that point (especially if we don't really nail down the heuristic of how long to wait as described in the "When to run this" section)


docs/tech-notes/parallel-commit.md, line 158 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Add the delay step from the "when to run this" paragraph.

It'd be nice if the delay happened as part of the PreventIntentRequest evaluation, rather than coming before sending the PreventIntentRequests. As proposed, a contended workload could run into this delay pretty frequently, which would be painful in a high-latency cluster given that changing the transaction record status requires a consensus round-trip.

Of course, if we change to resolving the transaction status by talking with the gateway, this problem goes away.


docs/tech-notes/parallel-commit.md, line 198 at r2 (raw file):

## Examples
The examples below illustrate common scenarios for the final batch (where `ki` lives on range `i`). Time flows from left to right and each line corresponds to one goroutine (which may be reused for compactness). A batch directed at a single range is grouped in curly braces.

Nit, but these would be much easier to read if they were tables (or images with arrows indicating dependencies).



@bdarnell
Contributor

bdarnell commented Apr 1, 2018

Review status: all files reviewed at latest revision, 26 unresolved discussions, some commit checks failed.


docs/tech-notes/node-txn-ids.md, line 20 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

On a meta level, how do you think we should proceed with this part of the proposal? While I think some of the ideas are interesting, I don't see us doing anything here in the near term. Should this remain a rough brain dump (with the discussion worked in) that is checked in or just removed? I doubt it makes sense to turn it into an RFC at this point.

The snappy compression that's already used may already be enough

Yeah, maybe. Switching the txn id scheme is the big deal. Whether we add the specialized compression here is a small difference (I'd guess it's probably worth it, but don't know that).

I think turning it into an RFC and merging it with a status of rejected or postponed would make sense, just to capture the discussion (this wouldn't have to meet the quality bar for an accepted RFC). We could also just leave the doc in a branch on your fork but that would leave it vulnerable to accidental deletion.


docs/tech-notes/node-txn-ids.md, line 55 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

We could even do this as a rocksdb compaction filter so we don't generate extra I/O to do it.

Yeah, but these values would have to be replicated (ok, maybe they don't, and we just lose idempotency on lease change -- seems OK), so doing that is roughly as complex as making GC use compaction filters (which I think we'll end up doing (again) one day). We'd need to teach the consistency checker these semantics.

Yeah, it's a little more complicated than GC via compaction filter because this would be rewriting data that is still live for reads. We'd have to special-case it in the consistency checker.


docs/tech-notes/parallel-commit.md, line 96 at r1 (raw file):

Could you expand on that? As I see it, the impact I assign to parallel commit vastly outmatches the importance of optimizing for noop CPuts so I wonder where you're coming from.

I think I agree, but we've been planning for a while to push more sql processing down to the KV layer and this has just come out of the blue. But I agree that improving best-case write latency is really important and this is one of our few avenues to big wins in this area.


docs/tech-notes/parallel-commit.md, line 27 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

This only solves one half of the problem but now I see the convenient way out and I don't think we need to embed the txn ID everywhere. Exciting!!

The interesting half of the problem is if you can't prevent any of the writes. This means that for each key you check, you either find an intent of that txn at the right timestamp or a versioned value at the right timestamp. Let's assume that there's at least one versioned value; otherwise you're home free because the txns tell you whether the transaction is committed (note that the original client, even if it didn't receive all of the responses, must do proper status resolution so that it doesn't clobber around here while we decide).

So now we're in serious trouble. Did the transaction actually write everything and resolve it, or did it lay down a bunch of intents and then fail on that version at the same timestamp?[1]

If the latter, then we might (always?) see an intent of the transaction at a higher timestamp. If the "always" is true, then that actually solves the problem already. I'm just not sure that's really true (and whether we want to make it required forever; so far we do this to avoid starvation). It's definitely not true for 1PC transactions today (though that looks changeable).

Assuming we're still in need of a solution: if we're willing to accept a small probabilistic risk, DistSender can hash the values of all proposed writes together and embed the result in the staged transaction record. Then, the status resolution process can recompute the hash from the actual values and decide based on that: if the hash matches, we trust that the versions we found are legitimate and commit the transaction. If not, this proves that the transaction didn't actually manage to perform any of its writes (because it would only resolve intents when it knew they would become visible, so it couldn't have started doing that, so that versioned value was written by someone else). Note that we would only use the hash in the rare situation above, so what we're paying here is a tiny bit of CPU to compute the hash and the storage in the txn record.

There appears to be a dangling footnote [1] in this comment.

I'm afraid I'm going to need a more detailed walkthrough of the problematic case here and how it's not true in 1PC txns today.

The hash solution is worth considering but I'd rather make sure we've covered all the alternatives first.


docs/tech-notes/parallel-commit.md, line 152 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

This was originally called CheckIntent, but I think highlighting the mutating nature makes sense. CheckOrPreventIntent is my next best name for this (returning an error on prevention), do you have a better idea?

Hmm. FixIntent, maybe, or does that sound too much like "repair" instead of "make unchangeable"?


docs/tech-notes/parallel-commit.md, line 158 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

It'd be nice if the delay happened as part of the PreventIntentRequest evaluation, rather than coming before sending the PreventIntentRequests. As proposed, a contended workload could run into this delay pretty frequently, which would be painful in a high-latency cluster given that changing the transaction record status requires a consensus round-trip.

Of course, if we change to resolving the transaction status by talking with the gateway, this problem goes away.

A delay in command evaluation seems like a bad idea since that blocks the command queue.



@tbg
Member Author

tbg commented Apr 2, 2018

Review status: all files reviewed at latest revision, 26 unresolved discussions, some commit checks failed.


docs/tech-notes/node-txn-ids.md, line 20 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think turning it into an RFC and merging it with a status of rejected or postponed would make sense, just to capture the discussion (this wouldn't have to meet the quality bar for an accepted RFC). We could also just leave the doc in a branch on your fork but that would leave it vulnerable to accidental deletion.

Ok. I'm going to let this fall behind for now. We'll keep the reviewable ref forever, right? Still committing it somewhere is nicer for discoverability. But the document needs a good reworking before I can justify that.


docs/tech-notes/parallel-commit.md, line 96 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Could you expand on that? As I see it, the impact I assign to parallel commit vastly outmatches the importance of optimizing for noop CPuts so I wonder where you're coming from.

I think I agree, but we've been planning for a while to push more sql processing down to the KV layer and this has just come out of the blue. But I agree that improving best-case write latency is really important and this is one of our few avenues to big wins in this area.

Yeah, doesn't hurt to take it into account. The pushdown is exclusively for reads, so I'm not concerned. Plus, distributed writes seem far away. But I'm also always surprised how much talk there is about it and have found that premature, so perhaps I'm the wrong judge.

FWIW, as you've pointed out, running CPuts that you expect to fail in the final batch is uncommon. If we ever need them, we can add a boolean must_intent on CPut that forces an intent to be written (can just write one that will be aborted by ResolveIntent no matter what, i.e. has a special flag) and coerce it to true in DistSender only when needed (in reality, we should have that flag on the batch and every writing request should be forced to be able to support it to avoid rot). You don't really pay a big price for it because that CPut is going through consensus in parallel with the remainder of the final batch (which is at the very least the staging of the txn record). Yes, your tail latencies will be a little higher, but now we're really talking fringe. Also, since the intent is "fake", contending readers can just push it ahead. Only writers need to force conflict (and also only because we don't allow more than one intent at a time). In particular, a CPut that does not match the last committed value can pretend it's a read and push the intent out of the way.

I'm still failing to see this become important, but if it does, I think the above solution works well, and the layers above DistSender can be oblivious to it (other than the effect on write contention handling due to the added intent).


docs/tech-notes/parallel-commit.md, line 27 at r2 (raw file):
Hmm, I remember putting the [1] in there but not where it was supposed to link.

I'm afraid I'm going to need a more detailed walkthrough of the problematic case here and how it's not true in 1PC txns today.

If you're confused, it's probably because I celebrated too early and looked only at the easy half of the hard half. (i.e. the hash solution is the best I've got so far).

For a very concrete example, let's say you have a transaction that in the last batch lays down a write on ranges 1 and 2 and commits on range 3. With the extended proposal, DistSender will send

Put(k1)                      ok
Put(k2)                      err (RPC timeout or WriteTooOld or anything else)
Stage(k3, promised=k1,k2)

and assume that the WriteTooOldError is for a committed write which happens to have the same timestamp as the staged transaction. The coordinator dies, so this transaction is not committed.

Now consider the problem the status resolution process has to solve. It sees the staged record and tries to prevent the writes at k1, k2. It visits k1, there's an intent at the right timestamp so no, this can't be prevented.

Now it looks at k2 and finds a committed version at the right timestamp. It has to decide whether this version was written by the txn (and of course it should eventually disprove it).

(my mistake was here - I was trying to use the fact that there isn't an intent here as proof that the gateway had not entered the commit phase but that's not true - it may have and all the intent resolutions + the commit may have failed, with the exception of the one to this key)

So the hash is the best I can do at this point.
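To pin down the decision procedure this walkthrough implies, here is a rough Go sketch of how the status resolution process might combine the per-key observations with the embedded hash. All names and types here are hypothetical illustrations of the proposal, not actual CockroachDB code.

```go
package main

import "fmt"

// observation is what the resolver sees for one promised write at the
// staged transaction's provisional commit timestamp.
type observation int

const (
	prevented    observation = iota // nothing there; we bumped the tscache
	intentFound                     // this txn's intent is in place
	versionFound                    // a committed value at the same timestamp
)

// resolveStaged sketches the decision: if any promised write was
// prevented, the txn can no longer commit. If only intents are found,
// it committed. If committed versions are found, we can't tell who
// wrote them, so we fall back to comparing the hash embedded in the
// STAGED record against one recomputed from the values actually found.
func resolveStaged(obs []observation, stagedHash, recomputedHash uint64) string {
	sawVersion := false
	for _, o := range obs {
		switch o {
		case prevented:
			return "ABORT" // a promised write can never happen now
		case versionFound:
			sawVersion = true
		}
	}
	if !sawVersion {
		return "COMMIT" // every promised write is an intent of this txn
	}
	if stagedHash == recomputedHash {
		return "COMMIT" // versions were (very likely) written by this txn
	}
	return "ABORT" // some version belongs to someone else
}

func main() {
	fmt.Println(resolveStaged([]observation{intentFound, prevented}, 1, 1))    // ABORT
	fmt.Println(resolveStaged([]observation{intentFound, versionFound}, 7, 7)) // COMMIT
}
```

The last branch is exactly the ambiguous case from the k1/k2 example above, and (as discussed below) the hash comparison is only probabilistically safe.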


docs/tech-notes/parallel-commit.md, line 132 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

It'll be interesting to benchmark whether the threshold for a large transaction to switch to approximate intent spans is also a good threshold for this. The cost of checking for promised intents may get expensive way before that point (especially if we don't really nail down the heuristic of how long to wait as described in the "When to run this" section)

Note that we usually never have to check promised intents. That happens only when a coordinator dies with a staged transaction record. Usually you still just push the transaction record from PENDING to ABORTED and that's that. So, it's OK if this is a little expensive! You are already dealing with a system in which there are unavailable ranges for ~4 seconds (and in fact some of those you may end up having to wait for).

Besides, checking for the point intents is always strictly cheaper than successfully running the last batch of the txn. This is because you're just trying to read all of the intents in the worst case (not even returning any data), but the last batch had to put intents on all of them. So there's not really any throttling necessary, assuming we're reasonably good at not creating dangerously large last batches in the first place. The limiting factor is really how much we're willing to plop into a txn record (but, we can shard it rather easily!)

For ranged intents, this section is still a little confused (and I shall update it); thanks for bringing that to my attention. First, TxnCoordSender's span condensing plays no role here because DistSender collects the promised writes itself and so it sees the real spans. And for trying to prevent an intent on the ranged span, I erroneously thought that you'd have to do more work than for the single-key case, but you don't really (except you have to do it for every key in the span). But the same argument holds: you're doing a lot less work than the batch did.

So it may actually be feasible to lift this restriction. I think that's well worth doing.


docs/tech-notes/parallel-commit.md, line 152 at r2 (raw file):

(or just do everything in the return value instead of using errors)

btw, I had thought about that, but then the natural next step is to send this in a batch with a limit, but DistSender is notoriously bad at parallelizing those, so that wasn't a good idea either.

Hmm. FixIntent, maybe, or does that sound too much like "repair" instead of "make unchangeable"?

Warms my heart less than CheckOrPreventIntent (which in turn also doesn't instill me with a desire for jubilation). Easy enough to change when inspiration strikes.


docs/tech-notes/parallel-commit.md, line 158 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

A delay in command evaluation seems like a bad idea since that blocks the command queue.

I think I wasn't particularly clear with the delay. This is because I backed out some ideas here and didn't update it accordingly. Basically the current contention resolution stays the same. You communicate via the txn record and the "delay" really just means "wait until the txn record is pushable based on heartbeat timeout". You don't eat the delay usually. You need a very specific situation in which a coordinator managed to STAGE the transaction but then died. In that situation, today, you'll also wait for twice the heartbeat timeout, and I was just proposing to keep that.


docs/tech-notes/parallel-commit.md, line 198 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Nit, but these would be much easier to read if they were tables (or images with arrows indicating dependencies).

I'll see what I can do.



@tbg
Member Author

tbg commented Apr 2, 2018 via email

@tbg
Member Author

tbg commented Apr 2, 2018 via email

@bdarnell
Contributor

bdarnell commented Apr 2, 2018

Review status: all files reviewed at latest revision, 26 unresolved discussions, some commit checks failed.


docs/tech-notes/parallel-commit.md, line 96 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Yeah, doesn't hurt to take it into account. The pushdown is exclusively for reads, so I'm not concerned. Plus, distributed writes seem far away. But I'm also always surprised how much talk there is about it and have found that premature, so perhaps I'm the wrong judge.

FWIW, as you've pointed out, running CPuts that you expect to fail in the final batch is uncommon. If we ever need them, we can add a boolean must_intent on CPut that forces an intent to be written (can just write one that will be aborted by ResolveIntent no matter what, i.e. has a special flag) and coerce it to true in DistSender only when needed (in reality, we should have that flag on the batch and every writing request should be forced to be able to support it to avoid rot). You don't really pay a big price for it because that CPut is going through consensus in parallel with the remainder of the final batch (which is at the very least the staging of the txn record). Yes, your tail latencies will be a little higher, but now we're really talking fringe. Also, since the intent is "fake", contending readers can just push it ahead. Only writers need to force conflict (and also only because we don't allow more than one intent at a time). In particular, a CPut that does not match the last committed value can pretend it's a read and push the intent out of the way.

I'm still failing to see this become important, but if it does, I think the above solution works well, and the layers above DistSender can be oblivious to it (other than the effect on write contention handling due to the added intent).

must_intent is tricky because we currently can't persist anything and return an error at the same time, so this would also require turning ConditionFailedError into a value (or allowing both errors and a non-empty WriteBatch).


docs/tech-notes/parallel-commit.md, line 27 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Hmm, I remember putting the [1] in there but not where it was supposed to link.

I'm afraid I'm going to need a more detailed walkthrough of the problematic case here and how it's not true in 1PC txns today.

If you're confused, it's probably because I celebrated too early and looked only at the easy half of the hard half. (i.e. the hash solution is the best I've got so far).

For a very concrete example, let's say you have a transaction that in the last batch lays down a write on ranges 1 and 2 and commits on range 3. With the extended proposal, DistSender will send

Put(k1)                      ok
Put(k2)                      err (RPC timeout or WriteTooOld or anything else)
Stage(k3, promised=k1,k2)

and assume that the WriteTooOldError is for a committed write which happens to have the same timestamp as the staged transaction. The coordinator dies, so this transaction is not committed.

Now consider the problem the status resolution process has to solve. It sees the staged record and tries to prevent the writes at k1, k2. It visits k1, there's an intent at the right timestamp so no, this can't be prevented.

Now it looks at k2 and finds a committed version at the right timestamp. It has to decide whether this version was written by the txn (and of course it should eventually disprove it).

(my mistake was here - I was trying to use the fact that there isn't an intent here as proof that the gateway had not entered the commit phase but that's not true - it may have and all the intent resolutions + the commit may have failed, with the exception of the one to this key)

So the hash is the best I can do at this point.

Ack.

Unfortunately, I don't think a hash can work either. Suppose the put at k1 is an insert, and the put at k2 is incrementing a counter. The conflicting transaction is an insert at k4 and incrementing the same counter at k2. This will hash identically, but one of the counter increments would be lost. I think the extended proposal will require transaction IDs to be stored in the persisted value.



@tbg
Member Author

tbg commented Apr 16, 2018

Review status: all files reviewed at latest revision, 22 unresolved discussions.


docs/RFCS/20180324_parallel_commit.md, line 258 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

But pushed timestamps and WriteTooOld are (first) handled at the TxnCoordSender level; we don't want to return those errors to the SQL client unless RefreshSpans has failed.

That's a good point, I hadn't looked at this code closely. Ultimately there will be some refactoring here because the implementation of how spans are refreshed really isn't sane use of client.Sender any more. Either way, DistSender needs to pass up the responses to what was written and the retry error, so that the refresh can happen appropriately. Let me know if the text I added needs more details.


docs/RFCS/20180324_parallel_commit.md, line 300 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Still unclear. Is this talking about the deletion of the transaction record in an EndTransaction with no external intents? I'd say something like this if I've got it right: "Transaction records for committed transactions are normally deleted as soon as they have no outstanding external intents. To address this race, we will introduce a distinction between committing EndTransactionRequests issued by the status resolution process and those issued by TxnCoordSender. The transaction record will only be deleted after the TxnCoordSender has sent its commit or through the GC queue."

Updated. I mean both what you say but also the eager cleanup after resolving the external intents. Used your sentence as well.



@tbg tbg force-pushed the technotes/parallelcommit branch from b65b4c7 to 7206092 Compare April 16, 2018 15:51
@andreimatei
Contributor

Review status: 0 of 1 files reviewed at latest revision, 22 unresolved discussions, some commit checks failed.


docs/RFCS/20180324_parallel_commit.md, line 202 at r6 (raw file):

the transaction is potentially abortable

This is something that I think is not very clear throughout the RFC - there's always the possibility of a race between a concurrent txn doing status resolution and a late write from the main txn. We don't want to abort transactions aggressively. I guess this is where the check in the status resolution process for abandoned transactions comes into play, right? Maybe you can make this more explicit.


docs/RFCS/20180324_parallel_commit.md, line 274 at r6 (raw file):

#### PreventIntentRequest

At the heart of the process is trying to prevent an intent of the transaction from being laid down (which is only possible if it isn't already there) at the provisional commit timestamp. To achieve this efficiently, we introduce a new point read request, `PreventIntentRequest`. This request populates the timestamp cache (as any read does) and returns whether there is an intent at the given key for the specified transaction and timestamp, with *at least* the specified sequence number. We don't check the exact sequence number because a batch could contain overlapping writes, in which case only the latest sequence number matters. If we trust that `PromisedWrites` has been faithfully populated, checking for "greater than or equal" is equivalent to (but simpler than) computing, and comparing against, the sequence number of the last write to a given key.

can't we use the AbortCache for this? Wouldn't that be better than using the TimestampCache?
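For what it's worth, the "at least the specified sequence number" predicate quoted above can be sketched like this. The types and names are hypothetical, a rough illustration of the check rather than the actual request evaluation code:

```go
package main

import "fmt"

// Intent is a hypothetical stand-in for a provisional write found at a key.
type Intent struct {
	TxnID string
	Ts    int64 // provisional commit timestamp
	Seq   int32 // sequence number of the write that laid it down
}

// satisfiesPromise reports whether the intent found at a key (nil if
// none) satisfies a promised write: same transaction, same timestamp,
// and at *at least* the promised sequence number, since overlapping
// writes in a batch mean only the latest sequence number survives.
func satisfiesPromise(found *Intent, txnID string, ts int64, seq int32) bool {
	return found != nil && found.TxnID == txnID && found.Ts == ts && found.Seq >= seq
}

func main() {
	in := &Intent{TxnID: "t1", Ts: 10, Seq: 7}
	fmt.Println(satisfiesPromise(in, "t1", 10, 5))  // later seq still satisfies: true
	fmt.Println(satisfiesPromise(nil, "t1", 10, 5)) // no intent: prevention succeeds: false
}
```

When this returns false, the request would also have bumped the timestamp cache, so the promised write can never land afterwards.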


docs/RFCS/20180324_parallel_commit.md, line 349 at r6 (raw file):

and might decide to try a different write after

A different write on the same key would be OK though, wouldn't it?


docs/RFCS/20180324_parallel_commit.md, line 355 at r6 (raw file):

Otherwise, if there is a transaction record, we can also run the process hoping for a positive outcome.

What do you mean "otherwise"? There's always a transaction record, isn't there?


Comments from Reviewable

Cut the commit latency for the final batch of a transaction in half,
from two rounds of consensus down to one, by addressing a shortcoming of
the transactional model which does not allow `EndTransaction` to be sent
in parallel with other writes.

This is achieved through the introduction of a new transaction status
`STAGED` which can be written in parallel with the final batch.

With proper batching, this change translates directly into a reduction
of SQL write latencies.

The latency until intents are resolved remains the same (two rounds of
consensus). Thus, contended workloads are expected to profit less from
this feature.

Release note: None
@tbg
Member Author

tbg commented Apr 17, 2018

Review status: 0 of 1 files reviewed at latest revision, 26 unresolved discussions, some commit checks failed.


docs/RFCS/20180324_parallel_commit.md, line 202 at r6 (raw file):
This example walks the line a little bit. On the one hand, I haven't actually explained everything yet, but I also want to give the example early. I'm a bit confused because this sentence that you're commenting on mentions that the abort would only happen if the txn looked abandoned though,

However, as explained later, concurrent transactions will only make use of this if the transaction record looked abandoned, which wouldn't be the case.

so I assume you were confused about this before this paragraph? I added something to the summary,

This is achieved through the introduction of a new transaction status STAGED which can be written in parallel with the final batch. This transaction status is usually short-lived; during failure events, transactions may be abandoned in this status and are recovered via a newly-introduced status resolution process.

Does that help at all?
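The status resolution process referenced in this thread can be sketched roughly as follows. This is a hedged sketch with hypothetical names, not the actual implementation; it assumes the process is only run against a STAGED transaction record that looks abandoned:

```go
package main

import "fmt"

// preventFn stands in for issuing a PreventIntentRequest for one promised
// write; it returns true if the intent is present (and can no longer be
// prevented), false if the write was prevented.
type preventFn func(key string, seq int) bool

// resolveStaged sketches status resolution: if every promised write left
// an intent, the transaction is in fact committed; if any write was
// prevented, the transaction can never commit and is aborted.
func resolveStaged(promised map[string]int, prevent preventFn) string {
	for key, seq := range promised {
		if !prevent(key, seq) {
			return "ABORTED"
		}
	}
	return "COMMITTED"
}

func main() {
	// Intents actually present on the ranges, keyed by sequence number.
	intents := map[string]int{"k2": 1, "k3": 2}
	check := func(key string, seq int) bool {
		s, ok := intents[key]
		return ok && s >= seq
	}
	fmt.Println(resolveStaged(map[string]int{"k2": 1, "k3": 2}, check)) // all promised intents present
	fmt.Println(resolveStaged(map[string]int{"k2": 1, "k4": 1}, check)) // k4 missing: prevented
}
```

Note that `prevent` must also bump the timestamp cache, so that the outcome is stable: a late write from the original transaction can no longer flip the answer after resolution has run.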


docs/RFCS/20180324_parallel_commit.md, line 274 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

can't we use the AbortCache for this? Wouldn't that be better than using the TimestampCache?

The abort cache is updated via Raft writes, so this would turn PreventIntent into a range write (which then eventually would need GC). This in itself doesn't seem like a problem (because it's supposed to be used rarely), but I also don't see why using the abort cache is any better. If the intent were to be written you'd likely populate the timestamp cache anyway, so why try to avoid it in the first place?
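The `PreventIntentRequest` semantics under discussion can be sketched as follows. These are hypothetical, simplified types (`Intent`, `Replica`), not the actual CockroachDB implementation; the real request is evaluated against the storage engine and the replica's timestamp cache:

```go
package main

import "fmt"

type Intent struct {
	TxnID string
	Ts    int64 // provisional commit timestamp
	Seq   int   // sequence number of the write
}

type Replica struct {
	intents map[string]Intent // key -> intent
	tsCache map[string]int64  // key -> high-water read timestamp
}

// PreventIntent reports whether the expected intent is present (found) or
// whether an intent of the same txn sits above the expected timestamp
// (pushed). In all cases it bumps the timestamp cache, so a late write at
// or below ts can no longer succeed.
func (r *Replica) PreventIntent(key, txnID string, ts int64, seq int) (found, pushed bool) {
	if r.tsCache[key] < ts {
		r.tsCache[key] = ts
	}
	in, ok := r.intents[key]
	if !ok || in.TxnID != txnID {
		return false, false
	}
	if in.Ts > ts {
		return false, true // txn was pushed or restarted; re-check the record
	}
	// ">=" rather than "==": a batch may contain overlapping writes, and
	// only the latest sequence number for a key matters.
	return in.Ts == ts && in.Seq >= seq, false
}

func main() {
	r := &Replica{
		intents: map[string]Intent{"k2": {TxnID: "t1", Ts: 10, Seq: 3}},
		tsCache: map[string]int64{},
	}
	found, _ := r.PreventIntent("k2", "t1", 10, 2)
	fmt.Println(found) // intent present at seq 3 >= 2
	found, _ = r.PreventIntent("k3", "t1", 10, 4)
	fmt.Println(found, r.tsCache["k3"]) // missing intent; ts cache bumped
}
```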


docs/RFCS/20180324_parallel_commit.md, line 349 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…
and might decide to try a different write after

A different write on the same key would be OK though, wouldn't it?

Yeah, in theory you could do that (if you also used the same sequence number), but it's a slippery slope. The code making the decision of which key to write next wouldn't be aware of any of these constraints.


docs/RFCS/20180324_parallel_commit.md, line 355 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…
Otherwise, if there is a transaction record, we can also run the process hoping for a positive outcome.

What do you mean "otherwise"? There's always a transaction record, isn't there?

The next paragraph explains this (the txn record could have become explicitly resolved and subsequently GC'ed).


Comments from Reviewable

@tbg tbg force-pushed the technotes/parallelcommit branch from 7206092 to a2a46fc Compare April 17, 2018 10:40
The proposed changed condition (referred to as the **commit condition**) is:

> A transaction is committed if and only if a) there is a transaction record with status COMMITTED, or b) one with status STAGED and all of the intents written in the last batch of that transaction are present.

Contributor

I found the term "last batch" confusing
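A rough sketch of the commit condition quoted above, with hypothetical types (`PromisedWrite` here stands in for the proto message of the same name, and `intentSeq` for the intents actually found on the ranges):

```go
package main

import "fmt"

type Status int

const (
	PENDING Status = iota
	STAGED
	COMMITTED
	ABORTED
)

// PromisedWrite names an intent that the final batch of the transaction
// promised to leave behind (key plus sequence number).
type PromisedWrite struct {
	Key string
	Seq int
}

// isCommitted implements the commit condition: a transaction is committed
// iff its record is COMMITTED, or it is STAGED and every promised write
// from the final batch left an intent.
func isCommitted(status Status, promised []PromisedWrite, intentSeq map[string]int) bool {
	if status == COMMITTED {
		return true
	}
	if status != STAGED {
		return false
	}
	for _, w := range promised {
		if seq, ok := intentSeq[w.Key]; !ok || seq < w.Seq {
			return false
		}
	}
	return true
}

func main() {
	promised := []PromisedWrite{{"k2", 1}, {"k3", 2}}
	fmt.Println(isCommitted(STAGED, promised, map[string]int{"k2": 1, "k3": 2})) // all intents present
	fmt.Println(isCommitted(STAGED, promised, map[string]int{"k2": 1}))          // k3 missing
}
```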

@nvb
Contributor

nvb commented Apr 17, 2018

Review status: 0 of 1 files reviewed at latest revision, 27 unresolved discussions.


docs/RFCS/20180324_parallel_commit.md, line 162 at r7 (raw file):

Instead, the request is retried and sent in the "conventional" way (the intent at k1 is resolved as part of the Commit):

Make a note here that in the first case the EndTxn req is sent with a status of COMMITTED because DistSender notices that it is sent to a single range. This is how the 1PC fast path picks up the request. However, in the second case the EndTxn req is sent with a status of STAGING because DistSender notices that it is sent to multiple ranges.


docs/RFCS/20180324_parallel_commit.md, line 187 at r7 (raw file):

Abort(k1) ok|
            |Resolve(k2)
            |Resolve(k3)

Where does all of this happen? Still in DistSender? This is an implementation issue so feel free to defer to later, but I'm curious what components are going to concern themselves with what operations.


docs/RFCS/20180324_parallel_commit.md, line 217 at r7 (raw file):

- commit trigger. Commit triggers are only used by internal transactions for which the transaction record's lease is usually colocated with the client running the transaction (so that only one extra round trip to the follower is necessary to commit on the slow path). Support for this can be added later: add a commit trigger to the `STAGED` proto and, if set, don't allow the commit trigger to change when the transaction commits explicitly.

If the batch is not eligible, today's behavior is applied, i.e. the batch is first sent without the commit, and then committed separately.

make a note that we skip the STAGED step entirely.
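The eligibility rules discussed in this thread can be sketched as a case split. This is a hypothetical `Batch` summary, not actual code; in reality the decision is spread across DistSender and the transaction coordinator:

```go
package main

import "fmt"

type Batch struct {
	NumRanges        int
	HasRangeRequests bool // e.g. DeleteRange
	HasCommitTrigger bool
}

// endTxnPlan sketches the case split: ineligible batches skip the STAGED
// step entirely and commit conventionally (writes first, then a separate
// COMMITTED EndTransaction); a single-range batch commits directly via
// the 1PC fast path; otherwise the txn record is written as STAGED in
// parallel with the final batch.
func endTxnPlan(b Batch) string {
	if b.HasRangeRequests || b.HasCommitTrigger {
		return "conventional"
	}
	if b.NumRanges == 1 {
		return "1PC"
	}
	return "parallel (STAGED)"
}

func main() {
	fmt.Println(endTxnPlan(Batch{NumRanges: 1}))
	fmt.Println(endTxnPlan(Batch{NumRanges: 3}))
	fmt.Println(endTxnPlan(Batch{NumRanges: 3, HasCommitTrigger: true}))
}
```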


docs/RFCS/20180324_parallel_commit.md, line 234 at r7 (raw file):

  // The parallel commit mechanism is only active for batches which
  // contain no range requests or commit trigger.
  repeated message PromisedWrite {

Make a note why this doesn't need to include an epoch number.


docs/RFCS/20180324_parallel_commit.md, line 234 at r7 (raw file):

  // The parallel commit mechanism is only active for batches which
  // contain no range requests or commit trigger.
  repeated message PromisedWrite {

No need to change anything here, but let's create this as a separate proto message called SequencedKey so that it can be shared between parallel commits and the OutstandingProposals tree needed for #16026.


docs/RFCS/20180324_parallel_commit.md, line 272 at r7 (raw file):

An alternative to be considered is to return an error instead. This doesn't work well for batches of push requests, though. If during implementation we decide for the first option, we may also consider removing `TransactionPushError` in the process.

#### PreventIntentRequest

I don't see any reason why this can't be used in place of the proposed CheckIntent rpc in #16026, other than maybe the name. I'm satisfied!


docs/RFCS/20180324_parallel_commit.md, line 274 at r7 (raw file):

#### PreventIntentRequest

At the heart of the process is trying to prevent an intent of the transaction to be laid down (which is only possible if it isn't already there) at at the provisional commit timestamp. To achieve this efficiently, we introduce a new point read request, `PreventIntentRequest`. This request populates the timestamp cache (as any read does) and returns whether there is an intent at the given key for the specified transaction, timestamp, and at *at least* the specified sequence number. We don't check the exact sequence number because a batch could contain overlapping writes, in which case only the latest sequence number matters. If we trust that `PromisedWrites` has been faithfully populated, checking for "greater than or equal" is equivalent to (but simpler than) computing and keeping only the last write to a given key's sequence number.

s/at *at/*at/


docs/RFCS/20180324_parallel_commit.md, line 278 at r7 (raw file):

The request also returns whether there was an intent of the transaction *above* the expected timestamp. If this happens, the transaction has restarted or been pushed, and should instruct the caller to check the transaction record for a new update (since status resolution isn't kicked off unless a transaction looks abandoned, this may not be worth it in practice).

As an optimization, we might return a structured error when an intent was prevented (but still populate the timestamp cache), to short-circuit execution of the remainder of the batch.

This is easily possible with the updatesTSCacheOnError flag.


docs/RFCS/20180324_parallel_commit.md, line 345 at r7 (raw file):

The set of **promised writes** are writes that *need to leave an intent*. This is in contrast to the remaining *intent spans*, which only need to be a *superset* of the actually written intents. Today, this is true: any successful "write" command leaves an intent.

But, it is a restriction for the future: We must never issue [no-op writes](https://github.com/cockroachdb/cockroach/issues/23942) in the final batch of a transaction, and appropriate care must be taken to preserve this invariant.

I'm curious what care you envision? Can we add an assertion somewhere on the write-path to enforce this?


docs/RFCS/20180324_parallel_commit.md, line 368 at r7 (raw file):

- in the likely case, none of the intents will be prevented (and they are found at the right timestamps), as they have been dispatched earlier; the `PreventIntentRequest`s succeed, as does the rest of the batch, and `DistSender` announces success to the client.
- an intent is prevented or found at the wrong timestamp. This is treated like a write having failed or having pushed the transaction, respectively.

When would an intent be found at the wrong timestamp?


Comments from Reviewable

@nvb nvb added the A-kv-transactions Relating to MVCC and the transactional model. label Apr 24, 2018
nvb added a commit to nvb/cockroach that referenced this pull request Jun 1, 2018
This change introduces a new request method called QueryIntent. The request
type draws parallels to the QueryTxn method, but it visits an intent instead
of a transaction record and returns whether the intent exists or not. The
request type also includes an "if missing" behavior, which allows clients
to optionally ensure that the request prevents a missing intent from ever
being written or return an error if the intent is found to be missing.

This request type was proposed/discussed in both cockroachdb#24194 and cockroachdb#16026. It is a
prerequisite for either proposal.

Release note: None
nvb added a commit to nvb/cockroach that referenced this pull request Jun 7, 2018

@nvb
Contributor

nvb commented Jun 9, 2018

A small point I realized with pipelined writes, and which I believe applies here as well, is that this algorithm and a lot of its assumptions only apply to SERIALIZABLE transactions. This is because reads conflicting with a SNAPSHOT transaction can push the transaction's intents forward without needing to prevent it from aborting. Without thinking about it too carefully, I don't know whether that means we should disable all of this for SNAPSHOT transactions or whether it just means that the status resolution process will need to be adjusted (made much simpler, I assume).

@bdarnell
Contributor

I'm not sure that changes much - snapshot transactions can have their intents pushed forward without aborting, but that's still a push operation on the transaction record, not something local to the intent.

Still, I'd rather disable this optimization for SNAPSHOT (if SNAPSHOT survives at all: #26475) than spend too much time reasoning about whether it's safe.

nvb added a commit to nvb/cockroach that referenced this pull request Jun 21, 2018

craig bot pushed a commit that referenced this pull request Jun 25, 2018
26335: roachpb: introduce QueryIntent request r=bdarnell a=nvanbenschoten

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@tbg
Member Author

tbg commented Aug 15, 2018

Merging as a draft to clear out my PR queue.

bors r+

craig bot pushed a commit that referenced this pull request Aug 15, 2018
24194: RFC: parallel commits r=tschottdorf a=tschottdorf

The parallel commits tech note describes a proposal for reducing commit
latencies. It comes in a basic and extended flavor, the latter of which
requiring more engineering work. I believe we should pursue the former
sooner rather than later, and make the latter contingent on the fate of
the required second tech note, which suggests a new format for
transaction IDs and explores some of the expected benefits.

Release note: None

28589: opt: fix panic caused by select from table with no columns r=rytaft a=rytaft

This commit fixes a panic caused by running `SELECT *` from a table
with no visible columns (e.g., a table with only the hidden rowid
column). The bug was caused by the `optbuilder` creating a `Presentation`
slice even when there are no columns to present. Ensuring that
the slice is nil in this case fixes the bug.

Fixes #28388

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
@craig
Contributor

craig bot commented Aug 15, 2018

Build succeeded

@craig craig bot merged commit a2a46fc into cockroachdb:master Aug 15, 2018
@tbg tbg deleted the technotes/parallelcommit branch August 20, 2018 13:44
@nvb
Contributor

nvb commented Oct 5, 2018

Performance Implications

There is an extra write to the WAL due to the newly introduced intermediate transaction status and as a result, throughput may decrease (which is bad) while latencies decrease (which is good). In the long term, we may be able to piggy-back the committing EndTransaction on other Raft traffic to recuperate that loss. See #22349 which discusses similar ideas in the context of intent resolution.

I've been thinking about this in the context of implicit transactions that span multiple ranges because of secondary indexes. When we are given what could be a 1PC batch (i.e. has a BeginTxn req and an EndTxn req) but that spans ranges, we can actually avoid any extra WAL writes by writing the txn record with the STAGED status initially. This will avoid any extra WAL writes or other expensive work, and should be a strict improvement over the current state. The only effective difference is that the second write to the txn record will be done asynchronously, so the client will be acknowledged earlier.

In fact, I think the performance implication of the extra transaction status is effectively negligible in all cases where the final parallel batch already contains a write to the range that contains the transaction record.
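The latency accounting behind this comment can be reduced to a toy model (hypothetical helper, not actual code): what changes is the number of consensus rounds on the client's critical path, not the total work performed.

```go
package main

import "fmt"

// syncConsensusRounds models the commit latency on the client's critical
// path. With parallel commits, the STAGED record is written alongside the
// final batch (one synchronous round); the COMMITTED rewrite and intent
// resolution move off the critical path and happen asynchronously.
func syncConsensusRounds(parallelCommit bool) int {
	if parallelCommit {
		return 1
	}
	return 2 // final batch, then a separate committing EndTransaction
}

func main() {
	fmt.Println(syncConsensusRounds(false), "->", syncConsensusRounds(true))
}
```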
