Conversation
|
I'll have more comments on the substance of these proposals later, but first let me say that these should be RFCs instead of tech notes. RFCs are for proposals; tech notes are for describing the way things are. Could the new STAGED transaction status be shared with XA prepare support (#22359) or is it orthogonal? |
I'm not opposed to turning this into an RFC when the time comes. I assume the same holds for follower reads then? |
|
From a casual skim of the description, the XA support seems very similar to the |
|
(and the commit latencies are not improved, but get worse -- come to think of it, the only similarity is really the introduction of a new transaction status). When a transaction is |
|
Review status: 0 of 2 files reviewed at latest revision, all discussions resolved, all commit checks successful. docs/tech-notes/parallel-commit.md, line 116 at r1 (raw file):
Is this the same as #23258? Comments from Reviewable |
|
Reviewed 1 of 2 files at r1. docs/tech-notes/node-txn-ids.md, line 18 at r1 (raw file):
A 4-byte node id plus a 12-byte timestamp is the same size as a UUID, so this is only more compact than the status quo when we're able to elide the timestamp (or use varint encoding). docs/tech-notes/node-txn-ids.md, line 20 at r1 (raw file):
So you're proposing that MVCC keys change from (user key, timestamp) to (user key, timestamp, gateway node ID, delta from orig timestamp)? Spell this out, including the byte-level encoding. How would you manage the migration? If we're going to make a change at this level we may want to consider it in conjunction with a change to the way historical versions are stored. docs/tech-notes/node-txn-ids.md, line 26 at r1 (raw file):
In some ways, this feels like a step backwards - it assumes that there is a single unchanging coordinator for the transaction. We're moving towards more parallelism of transactions and we aspire to one day handle coordinator failover (may not be practical until we swap out the pgwire protocol for something custom). docs/tech-notes/node-txn-ids.md, line 30 at r1 (raw file):
As an alternative to this, we could have the coordinator and every node that encounters a conflict maintain a streaming RPC to the leaseholder of the txn record's range. The heartbeat status could be maintained in memory on that range instead of going through consensus writes (on lease transfer, we could simply reset the clock on all pending txn's heartbeats). This gives us some of the benefits here without major low-level changes. docs/tech-notes/node-txn-ids.md, line 32 at r1 (raw file):
Once we stop sending no-op writes through raft (#23942), the consensus latency will mostly go away. I think the things that still go through consensus at that point will still need to. docs/tech-notes/node-txn-ids.md, line 55 at r1 (raw file):
We could also address this issue of idempotency by storing information about committed transactions outside of the values themselves (a CommitSpan, similar to the AbortSpan. We used to have something like this in the ResponseCache, but removed it because the extra I/O and storage was deemed unnecessary). Comments from Reviewable |
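To make the size arithmetic in the first comment concrete, here is a minimal Go sketch of the proposed compact txn reference: a fixed 4-byte node ID plus a varint-encoded timestamp delta. `encodeTxnRef` is a hypothetical helper for illustration, not a CockroachDB API.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeTxnRef encodes a (node ID, timestamp delta) pair. The delta is
// measured from the transaction's timestamp, so it is zero in the
// common case where the value's timestamp is unchanged.
func encodeTxnRef(nodeID uint32, tsDelta uint64) []byte {
	buf := make([]byte, 4+binary.MaxVarintLen64)
	binary.BigEndian.PutUint32(buf[:4], nodeID)
	n := binary.PutUvarint(buf[4:], tsDelta)
	return buf[:4+n]
}

func main() {
	// Common case: zero delta encodes in one varint byte, for 5 bytes
	// total versus 16 bytes for a UUID.
	fmt.Println(len(encodeTxnRef(7, 0))) // 5
	// A nonzero delta grows the encoding by a byte or two.
	fmt.Println(len(encodeTxnRef(7, 300))) // 6
}
```

This illustrates why the scheme only beats a UUID when the delta can be elided or varint-encoded, as the review comment points out.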
|
Reviewed 1 of 2 files at r1. docs/tech-notes/parallel-commit.md, line 94 at r1 (raw file):
All intents must be written to the transaction record when setting its status to STAGED, right? docs/tech-notes/parallel-commit.md, line 96 at r1 (raw file):
What exactly is a "missing intent"? What ensures that it won't be writeable in the future? So the difference between "basic" and "extended" is that the extended proposal sends ResolveIntents in parallel with changing the txn's status from STAGED to COMMITTED? docs/tech-notes/parallel-commit.md, line 136 at r1 (raw file):
This improves latency but could reduce throughput because docs/tech-notes/parallel-commit.md, line 147 at r1 (raw file):
Resolve(k1) should go in the same batch as Commit(k1). If these are not in the same batch, they'll end up getting serialized by the command queue. docs/tech-notes/parallel-commit.md, line 191 at r1 (raw file):
What is the significance of this crash? docs/tech-notes/parallel-commit.md, line 194 at r1 (raw file):
How does this determine that k3 is missing and not simply in flight? docs/tech-notes/parallel-commit.md, line 203 at r1 (raw file):
In the basic protocol, a pusher can change the txn record from STAGED to ABORTED. But in the extended protocol, this must be disallowed, because the coordinator may already be in the process of resolving the intents. The extended protocol must make one of the intents unresolvable before it can change the txn status. Comments from Reviewable |
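The status-resolution rule discussed in these threads can be sketched in a few lines: a STAGED record is implicitly committed iff every promised write has a matching intent. All names here are hypothetical simplifications, not the actual CockroachDB types.

```go
package main

import "fmt"

// stagedRecord stands in for a transaction record in the STAGED state.
type stagedRecord struct {
	promised []string // keys the txn promised to write intents on
}

// resolveStatus decides the fate of a staged txn. In the real protocol,
// before returning ABORTED the resolver must also *prevent* the missing
// intent from materializing later (e.g. via the timestamp cache), so
// that a slow write cannot flip the outcome afterwards.
func resolveStatus(rec stagedRecord, intentExists map[string]bool) string {
	for _, k := range rec.promised {
		if !intentExists[k] {
			return "ABORTED"
		}
	}
	return "COMMITTED"
}

func main() {
	rec := stagedRecord{promised: []string{"k1", "k2", "k3"}}
	fmt.Println(resolveStatus(rec, map[string]bool{"k1": true, "k2": true, "k3": true})) // COMMITTED
	fmt.Println(resolveStatus(rec, map[string]bool{"k1": true, "k2": true}))            // ABORTED
}
```

The review point above is exactly the caveat in the comment: a pusher may not move STAGED to ABORTED until one of the missing intents has been made unwriteable.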
|
Thanks for the comments, @bdarnell! I've responded to them but not updated the document yet. BTW, just in case you want to trim your efforts, I think the reasonable proposal to discuss is the basic DistSender one. The node timestamp proposal was required to make the extended DistSender proposal work, but I think this is a bit further out and accordingly it's a lot less thought out. Going from the basic to the extended DistSender proposal should be pretty straightforward when the discovery problem is solved, so going with the basic proposal first seems natural. Review status: all files reviewed at latest revision, 14 unresolved discussions, all commit checks successful. docs/tech-notes/node-txn-ids.md, line 18 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Yes, the reduction would be to 4 bytes in the common case since the timestamp is identical. For the case in which it's not, I was thinking we would use varint encoding. Note that the new data lives in the values, not the keys. docs/tech-notes/node-txn-ids.md, line 20 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
I haven't fully thought out what I'm proposing (but something like this is necessary for the extended parallel commit proposal). I thought the new fields would be added to the values. There appears to be no reason to add them to the keys. All you need to be able to do is to check the value at a timestamped version and retrieve the ID. That also makes the migration straightforward if you populate the new fields only after a cluster version bump. In fact it seems so straightforward that I'm worried I'm missing something. Backup/restore would definitely need to pick up these new fields too, so there would be a concern about restoring data that was backed up before the migration fired, though the concern would be somewhat theoretical. I'm actually not sure how restore deals with intents. Isn't there a correctness problem anyway?
Maybe backup/restore does something smarter with intents. Just checking here. docs/tech-notes/node-txn-ids.md, line 26 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Yeah, in that regard, it does feel like a step backwards. But it doesn't close the door to communicating via the txn record in general, just makes it not the default. Also this parallel writes and failing over coordinators seems to me like one of the things we're better off postponing basically forever (I might be wrong here). docs/tech-notes/node-txn-ids.md, line 30 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Yeah, that could be an alternative. I'll include it as such. docs/tech-notes/node-txn-ids.md, line 32 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Right. A general reworking of the txnwaitqueue, which currently isn't very smart, addresses some of these issues. I do like the perceived simplicity of calling into the coordinator, but I agree that there are ways to get all that without putting a docs/tech-notes/node-txn-ids.md, line 55 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
The docs/tech-notes/parallel-commit.md, line 94 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
No, you only need the intents that belong to the last batch. But practically I think we would write all of the intents into the record at this time (but making sure to split them in two sets, "last batch" vs "all previous batches"). docs/tech-notes/parallel-commit.md, line 96 at r1 (raw file):
a missing intent is one which is mentioned in the intended writes set in the transaction record, and is found as missing during the discovery process (which in turn also prevents it from being written in the future if the discoverer has the intent to abort the txn if possible, which should be the only reason for running the discovery process). The name "discovery" should change to something that implies the mutating nature. Open to suggestions, bit of a shame "resolution" is already taken. docs/tech-notes/parallel-commit.md, line 116 at r1 (raw file): Previously, danhhz (Daniel Harrison) wrote…
Not the same reason, but the same wish, yep. Thanks for linking in the issue. docs/tech-notes/parallel-commit.md, line 136 at r1 (raw file): It could reduce throughput. We're doing slightly more work.
is correct, as an artifact of not being able to "amend" a kv pair. The commit wouldn't add any keys but it needs to flip a bool, so it has to cycle it all in and out, unfortunately. docs/tech-notes/parallel-commit.md, line 147 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
I agree that they should get in the same batch and will update the example and outline how this would be achieved. But why would they get serialized? A commit only reads the abort cache, a committing resolve doesn't care about the abort cache. What's the key blocking them? (In the example, docs/tech-notes/parallel-commit.md, line 191 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
It makes sure that the coordinator (whoever receives the TransactionAbortedError) doesn't clean up after itself. No significance other than that. I'll update this example to make that clear, agreed that it invites confusion. docs/tech-notes/parallel-commit.md, line 194 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
The full story is a bit more complicated than I was able to squeeze into the example. What I assume here is that the conflicting client is willing to abort the transaction. So what it does is not only determine that k3 is missing, it prevents the intent from being written (see "Aborting a transaction" below). I'll clarify this. docs/tech-notes/parallel-commit.md, line 203 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
With both proposals, the pusher can only ever change from
There is a subtlety that I will add to this section, which is that preventing the write via the timestamp cache only prevents the current epoch of the transaction. We have to make sure that we don't fall into the following trap:
So after preventing one of the writes, the staged transaction record can only be aborted if it hasn't since changed its provisional commit timestamp. Comments from Reviewable |
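The conditional-abort rule stated just above can be sketched as a compare-and-set on the record's provisional commit timestamp. `txnRecord` and `tryAbortStaged` are hypothetical simplifications for illustration.

```go
package main

import "fmt"

// txnRecord is a simplified stand-in for the staged transaction record.
type txnRecord struct {
	status              string
	provisionalCommitTS int64
}

// tryAbortStaged applies the rule from the discussion: having prevented
// one of the promised writes at the observed provisional commit
// timestamp, a pusher may abort the record only if that timestamp has
// not changed since (otherwise the txn has moved to a new timestamp or
// epoch and the prevention no longer covers it).
func tryAbortStaged(rec *txnRecord, observedTS int64) bool {
	if rec.status != "STAGED" || rec.provisionalCommitTS != observedTS {
		return false // re-run discovery at the record's current timestamp
	}
	rec.status = "ABORTED"
	return true
}

func main() {
	rec := &txnRecord{status: "STAGED", provisionalCommitTS: 100}
	fmt.Println(tryAbortStaged(rec, 99))  // false: observation is stale
	fmt.Println(tryAbortStaged(rec, 100)) // true: abort succeeds
}
```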
|
Review status: all files reviewed at latest revision, 14 unresolved discussions, all commit checks successful. docs/tech-notes/node-txn-ids.md, line 20 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
OK, adding the new information to roachpb.Value (or changing the on-disk values to a new tag-compatible proto) is more feasible than changing the key encoding. It's still a pretty far-reaching change. BACKUP (and dump) should resolve any intents (waiting out any pending txns) before proceeding. (unverified) docs/tech-notes/node-txn-ids.md, line 32 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
@spencerkimball had some ideas today about making the txnWaitQueue smarter. docs/tech-notes/parallel-commit.md, line 94 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
If you write all intents, then anyone who stumbles across a staged transaction can verify and commit it. If you only write the ones in the last batch, then only the original gateway can commit it and anyone else who encounters the intents will have to abort it. This feels like it might lead to more aborts compared to the current design, but I'm not sure (depends on how the txnWaitQueue works for staged txns, I think) docs/tech-notes/parallel-commit.md, line 96 at r1 (raw file):
My question is what is the mechanism that prevents the intent from being written in the future? [ok, I see below that it's the tscache] What if
What about commands like CPut that only sometimes write an intent? We have to populate the txn record's list of writes before we know whether an intent will be written or not. docs/tech-notes/parallel-commit.md, line 136 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
If the txn writes go to the same memtable, they'll only be written to one sstable, but there will still be two WAL writes. docs/tech-notes/parallel-commit.md, line 147 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
EndTransaction currently declares the abort span for read/write whether it's a commit or abort: cockroach/pkg/storage/replica_command.go Line 165 in f111ff4 Although I think you're right that it doesn't actually need to unless it's an abort and we're resolving local intents. docs/tech-notes/parallel-commit.md, line 191 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Ah, OK. I read this as the node responsible for k2 crashing. docs/tech-notes/parallel-commit.md, line 203 at r1 (raw file):
OK, I missed the fact that CheckIntent populates the timestamp cache. (I was thinking it was like the tscache-oblivious ResolveIntent.) But to the point above, this requires that all intents be written to the txn record when it is staged so that the txn can become committed even if the gateway node is gone. Comments from Reviewable
Codecov Report
@@ Coverage Diff @@
## master #24194 +/- ##
=========================================
- Coverage 75.65% 74.76% -0.9%
=========================================
Files 867 845 -22
Lines 127703 125960 -1743
=========================================
- Hits 96609 94168 -2441
- Misses 23887 24762 +875
+ Partials 7207 7030 -177
Continue to review full report at Codecov.
|
|
Review status: all files reviewed at latest revision, 14 unresolved discussions, all commit checks successful. docs/tech-notes/node-txn-ids.md, line 20 at r1 (raw file):
Yup. If backup hits an intent at or below the timestamp it's running at, it returns WriteIntentError and gets retried until no intents are outstanding. (So, backups never include intents) Comments from Reviewable |
|
The changed transaction IDs proposal would be much better if it left out the direct-to-coordinator communication. I think that should not be conflated with the idea of a txn ID which is collapsed into each versioned value, because I find that part of the proposal not particularly useful and overly complicating. It just adds redundant mechanisms and IMO will make the system more difficult to understand, debug, and improve. The node ID should be part of the txn ID only because it provides uniqueness, not to add unnecessary new coordination through another node. That said, I think the changed transaction IDs, despite the complexity they add, enable three key wins that justify them:
Review status: all files reviewed at latest revision, 15 unresolved discussions, all commit checks successful. docs/tech-notes/node-txn-ids.md, line 9 at r1 (raw file):
We should discuss this. I'm not convinced this should be a feature of CDC, although it keeps coming up. I'd like to understand the motivation behind exposing anything about transactions to CDC. docs/tech-notes/node-txn-ids.md, line 20 at r1 (raw file): Previously, danhhz (Daniel Harrison) wrote…
Adding a delta in the value which adjusts the [possibly incorrect] timestamp in the MVCC key adds non-trivial conceptual complexity. This change must enable big benefits to justify it. docs/tech-notes/node-txn-ids.md, line 26 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Adding another way for transactions to communicate / coordinate is complicated; the benefits must be sizable. Also, while Cockroach inherently handles scale for the current transaction coordination mechanism by splitting and rebalancing ranges in which transaction records live, the txn coordinators are more problematic, because they're easy targets for imbalance of client requests. So that is another drawback worth considering, in addition to the fact that referencing the gateway node ID specifically makes that association far more immutable than it is in principle, if not practice, today. docs/tech-notes/node-txn-ids.md, line 32 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
This section about communicating with the coordinator directly has some gaps. There is no consensus now when waiting on a transaction to commit / abort in the push txn queue. The push txn evaluation fails, and the waiter gets added to the queue. Waiters are returned when the waitee is finalized. The only time we have consensus on a
@bdarnell, the idea of adding deadlock detection again, elsewhere, should raise hairs on the back of your neck. @tschottdorf, I think we can do much better on the txn wait queue for cases where there is dramatic contention, but that's a separate conversation. docs/tech-notes/node-txn-ids.md, line 55 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
I like this property. docs/tech-notes/parallel-commit.md, line 94 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
I agree. If you don't write all of the intents, then a concurrent reader / writer who is attempting to commit the staged transaction would otherwise not create a finalized transaction record that contained all of the intents. This could mess up eventual GC and lose writes via still-unresolved intents after the transaction record is cleaned up. Comments from Reviewable |
|
@spencerkimball I agree that the coordinator based conflict resolution is not a clear win (and for that reason it's not mentioned in the parallel commits note). I'll make this less central and point out some pros and cons more concisely. Review status: all files reviewed at latest revision, 15 unresolved discussions, all commit checks successful. docs/tech-notes/node-txn-ids.md, line 9 at r1 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
Yeah, it's also a requirement that I'm not fond of because the implementation is awkward. I think @danhhz is aware that this is something we'll need to figure out. I don't think it's set in stone at the moment. The reason I mention it here is that it would enable that, should it be required; I'll make that clearer though. docs/tech-notes/node-txn-ids.md, line 26 at r1 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
The idea is that it would remove the current mechanisms and the long tail of complexity that comes with docs/tech-notes/node-txn-ids.md, line 32 at r1 (raw file):
It definitely wouldn't be added again, but moved. docs/tech-notes/parallel-commit.md, line 94 at r1 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
Yeah, that's right. The committed record must contain all intents and there's no hope of getting that (in general) if the staged one doesn't. But you only need to touch the intents for the last batch when dealing with a STAGED record. docs/tech-notes/parallel-commit.md, line 96 at r1 (raw file):
Yeah, our current transactional model is flexible and allows you to continue using the transaction after an error, though as far as I'm aware we don't do that in practice. The idea would be that we stop doing that, at least in the last batch. Once you lay down a staged entry, it's all or nothing, you don't get to try again. (the proposal could be extended, but it adds more complexity and I don't think we should go there). docs/tech-notes/parallel-commit.md, line 136 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Good point. Yeah, that's a drawback. These losses can be recovered using the ideas in docs/tech-notes/parallel-commit.md, line 147 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Filed #24285. Comments from Reviewable |
|
Thanks for splitting out the basic and extended proposals for parallel commits. To summarize our previous conversation at the risk of oversimplification, the tradeoff between the two approaches and the current state of Cockroach (in the happy path) comes down to:
So when we think about the effect of these two proposals, it might be fair to say that
Review status: all files reviewed at latest revision, 15 unresolved discussions, all commit checks successful. docs/tech-notes/parallel-commit.md, line 44 at r1 (raw file):
One interesting thing to note here is that this would currently be considered a 1PC transaction, which means that we would omit writing a txn record entirely. If there are other parallel batches then we'll need to avoid this optimization and write a txn record in the docs/tech-notes/parallel-commit.md, line 109 at r1 (raw file):
I may have missed it, but nowhere in this proposal are you suggesting that we wait until STAGING time to write the transaction record, right? So you actually meant the batch which is STAGING the transaction record. I think it's important to note somewhere that not only is this a set of point writes that the batch which commits the transaction record is intending to do, but they are a set of point writes that the batch which commits the transaction record is promising to do. Once a transaction moves into the STAGING state, it must succeed in writing an intent at every single one of these keys before being marked as COMMITTED. @bdarnell may have been alluding to this earlier, but to be clear, this means that all supported operations that can be sent in parallel with a commit must write an intent on success, no matter what. We'll have to be careful about write requests that have no-op paths. This is a concern I've had about #16026 (comment) as well. docs/tech-notes/parallel-commit.md, line 203 at r1 (raw file):
Why does docs/tech-notes/parallel-commit.md, line 217 at r1 (raw file):
See my comment above. I think there is a subtle change here. Comments from Reviewable |
|
@nvanbenschoten That's a good summary, I put that in the intro. Pushed an update that mostly tries to improve the presentation (though I'm sure there are some easter eggs left). In adding more examples I also found an important case that I was neglecting, namely that in which one of the final writes actually pushes the timestamp (or sets the write_too_old flag). I haven't fleshed that out yet, but it's a pretty central case that I think we can handle but need to see how to efficiently do so. I haven't bothered with the node id note for now, especially now that the parallel commit note focuses on the basic proposal and has a clean interface to the other one (we need txn ids in the values). Review status: 1 of 2 files reviewed at latest revision, 19 unresolved discussions. docs/tech-notes/parallel-commit.md, line 44 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Heh, that's a good point, but for the purpose of the example, we should forget there's a 1PC path. I added a sentence to that effect. docs/tech-notes/parallel-commit.md, line 94 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. docs/tech-notes/parallel-commit.md, line 109 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Very good point about no-op writes, added a section. The proposal doesn't care when you write the record. In the interest of not conflating more than I already have, I'd just stick with what's there: docs/tech-notes/parallel-commit.md, line 136 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. docs/tech-notes/parallel-commit.md, line 147 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Example is hopefully in better shape now, but I'm sure I missed some things still. docs/tech-notes/parallel-commit.md, line 203 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This has changed to a read request (as it ought to) in the latest proposal. Wasn't thinking about #16026 in particular, but it may be possible to have a common implementation. The one in #16026 needs to be fast; with the one here, that's less critical. docs/tech-notes/parallel-commit.md, line 217 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Updated the wording, but there shouldn't be a real change here as discussed. Comments from Reviewable |
|
Reviewed 1 of 1 files at r2. docs/tech-notes/node-txn-ids.md, line 20 at r1 (raw file): Previously, spencerkimball (Spencer Kimball) wrote…
The timestamp delta is essentially a specialized compression scheme. I think it can be pretty well isolated at the lower levels, so it should be manageable. But I'm still not sure it's worth it. The snappy compression that's already used may already be enough to reduce the size of the timestamp (especially for the common case when it is unchanged) docs/tech-notes/node-txn-ids.md, line 32 at r1 (raw file):
Definitely. I'm advocating making the txnWaitQueue (and all the push/conflict machinery around the transaction record) smarter instead of moving responsibility away from it. docs/tech-notes/node-txn-ids.md, line 55 at r1 (raw file):
That's true of the AbortSpan too. If you manage to get a long running/replayed transaction after the 1h gc threshold, you can see anomalies. For a CommitSpan, the problem is that A) commits are more common than poisoning aborts so we must control the cost of these extra records and B) a GC'd AbortSpan record can only cause anomalies in transactions that are unable to be committed anyway, while a GC'd CommitSpan could (maybe?) allow a 1PC txn to apply twice. If we're not too worried about B (and we probably shouldn't be with the min proposal timestamp blocking things from happening long after the fact), we could GC the transaction IDs out of the values after enough time has passed. We could even do this as a rocksdb compaction filter so we don't generate extra I/O to do it. docs/tech-notes/parallel-commit.md, line 96 at r1 (raw file):
I see now that this is handled by the fact that checking the intent updates the timestamp cache so the delayed write to key B can't happen.
OK, I guess it's the case that now all of our transactional writes either write an intent or return an error (the no-ops that @nvanbenschoten is looking at are all non-transactional operations). But again, this seems like it might be moving in the wrong direction. We've talked about making ConditionFailedError a kind of result so you could more easily continue past it, and the general push to move more predicates and filters from SQL down to KV could lead to more kinds of conditional writes. It would be unfortunate if we had to give that up and force consensus writes for these no-ops. (Limiting this to writes in the last batch would mitigate that somewhat, at the expense of yet another special case that impacts transactional correctness) docs/tech-notes/parallel-commit.md, line 27 at r2 (raw file):
Here's an alternative to embedding the transaction ID: make
The downside to this is that it pushes timestamps on conflicts that wouldn't today, but I think that's probably not a big deal. Most SQL writes use CPut, which updates the timestamp cache anyway, the txnWaitQueue makes aborts relatively rare, and RefreshSpans makes timestamp pushes less painful when they happen. docs/tech-notes/parallel-commit.md, line 152 at r2 (raw file):
Shouldn't the error semantics be inverted now that this has been renamed to PreventIntent? It returns nil and updates the tscache if no intent is found, and returns an error if the value is there (or just do everything in the return value instead of using errors) Updating the tscache should be cheap, since in the cases where we don't need to update it someone else will probably have updated the tscache for this key anyway. docs/tech-notes/parallel-commit.md, line 156 at r2 (raw file):
What component is responsible for doing this? Is it part of the docs/tech-notes/parallel-commit.md, line 158 at r2 (raw file):
Add the delay step from the "when to run this" paragraph. docs/tech-notes/parallel-commit.md, line 161 at r2 (raw file):
Talk more about how you can restart after writing a STAGED record. How do we discover that the provisional commit timestamp changed? Doesn't this effectively prevent us from resolving intents in parallel with marking the transaction as aborted? (maybe we don't care about that for aborts and only need to parallelize resolves for txns committed by their original coordinator) docs/tech-notes/parallel-commit.md, line 163 at r2 (raw file):
This is an interesting constraint. So far, a coordinator can always choose to abort its own transaction (even if it is already committed, the abort attempt will fail safely). This is embodied in methods like TxnCoordSender.tryAsyncAbort. We'll have to be careful to guard against aborting staged transactions (maybe changing tryAsyncAbort to use PushTxn, even though we just rejected such a change). Comments from Reviewable |
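The inverted error semantics suggested for PreventIntent above can be sketched as follows. This is a hypothetical simplification (plain maps instead of the real intent storage and timestamp cache), not the actual implementation.

```go
package main

import "fmt"

// preventIntent reports whether the write at key could be prevented.
// If no intent exists, it bumps the timestamp cache so that a later
// write attempt at ts (or below) must fail; if the intent is already
// there, prevention fails and the caller must treat the write as done.
func preventIntent(key string, ts int64, intents map[string]bool, tscache map[string]int64) bool {
	if intents[key] {
		return false // intent already written; cannot be prevented
	}
	if tscache[key] < ts {
		tscache[key] = ts // poison: future writes at ts will be rejected
	}
	return true
}

func main() {
	intents := map[string]bool{"k1": true}
	tscache := map[string]int64{}
	fmt.Println(preventIntent("k1", 10, intents, tscache)) // false: intent exists
	fmt.Println(preventIntent("k3", 10, intents, tscache)) // true: tscache bumped
	fmt.Println(tscache["k3"])                             // 10
}
```

As the comment notes, updating the tscache unconditionally on the "not found" path is cheap, since contended keys likely have fresh tscache entries anyway.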
|
Only comment responses in this round (i.e. no updates to the text). Review status: all files reviewed at latest revision, 24 unresolved discussions, some commit checks failed. docs/tech-notes/node-txn-ids.md, line 20 at r1 (raw file):
Yeah, maybe. Switching the txn id scheme is the big deal. Whether we add the specialized compression here is a small difference (I'd guess it's probably worth it, but don't know that). docs/tech-notes/node-txn-ids.md, line 55 at r1 (raw file):
Yeah, but these values would have to be replicated (ok, maybe they don't, and we just lose idempotency on lease change -- seems OK), so doing that is roughly as complex as making GC use compaction filters (which I think we'll end up doing (again) one day). We'd need to teach the consistency checker these semantics. docs/tech-notes/parallel-commit.md, line 96 at r1 (raw file):
Could you expand on that? As I see it, the impact I assign to parallel commit vastly outmatches the importance of optimizing for noop
Luckily I think you can have your cake and eat it too: you can actually contain this in DistSender by falling back to the basic proposal in these cases. When any write comes back as a no-op (they must tell you) in the last wave, docs/tech-notes/parallel-commit.md, line 27 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
This only solves one half of the problem but now I see the convenient way out and I don't think we need to embed the txn ID everywhere. Exciting!! The interesting half of the problem is if you can't prevent any of the writes. This means that for each key you check, you either find an intent of that txn at the right timestamp or a versioned value at the right timestamp. Let's assume that there's at least one versioned value; otherwise you're home free because the txns tell you whether the transaction is committed (note that the original client, even if it didn't receive all of the responses, must do proper status resolution so that it doesn't clobber around here while we decide). So now we're in serious trouble. Did the transaction actually write everything and resolve it, or did it lay down a bunch of intents and then fail on that version at the same timestamp?[1] If the latter, then we might (always?) see an intent of the transaction at a higher timestamp. If the "always" is true, then that actually solves the problem already. I'm just not sure that's really true (and whether we want to make it required forever; so far we do this to avoid starvation). It's definitely not true for 1PC transactions today (though that looks changeable). Assuming we're still in need of a solution, if we're willing to eat a little bit of probabilistic issues, docs/tech-notes/parallel-commit.md, line 152 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
This was originally called docs/tech-notes/parallel-commit.md, line 156 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
That's an implementation detail, but since it has to be done by various parties (anyone seeing intents, also docs/tech-notes/parallel-commit.md, line 161 at r2 (raw file):
Will do, but haven't yet.
You have to check the timestamp as you attempt to change the status of the txn (so it's like a
Yes, that's correct. You run status resolution, then read the txn record, then do stuff in parallel (or update the txn record first, but then you eat consensus latency for aborting the intents). This logic is needed by everything including a simple push, so I think it needs to be an admin command that replaces PushTxn. (See the other comment below). docs/tech-notes/parallel-commit.md, line 163 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Yeah, Comments from Reviewable |
|
Review status: all files reviewed at latest revision, 24 unresolved discussions, some commit checks failed. docs/tech-notes/parallel-commit.md, line 96 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Ah, disregard that last post. You can't just run the basic proposal. You have to wait to tell the client about the commit because you could get aborted. Comments from Reviewable |
|
Note to self: a "promised write" also needs to contain the sequence number (in congruence with #16026). Otherwise, an intent written earlier in the life of the transaction and attempted-but-failed-to-be-overwritten in the last batch would not be detected as missing. I have an inkling that this proposal is one that could actually profit from being model checked. Review status: all files reviewed at latest revision, 24 unresolved discussions, some commit checks failed. Comments from Reviewable |
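The point about sequence numbers can be made concrete with a toy model. This is a hedged sketch (the `SequencedWrite` and `foundIntent` types and the `satisfies` helper are invented for illustration, not CockroachDB's actual code); it shows why a "promised write" must carry its sequence number: otherwise an intent laid down earlier in the transaction would wrongly satisfy the commit check.

```go
package main

import "fmt"

// SequencedWrite is a sketch of a "promised write": the key the final
// batch wrote to, plus the sequence number of that write.
type SequencedWrite struct {
	Key string
	Seq int32
}

// foundIntent models an intent observed during status resolution.
type foundIntent struct {
	Key string
	Seq int32
}

// satisfies reports whether an observed intent proves the promised write
// succeeded. Matching on the key alone is not enough: an intent written
// earlier in the transaction (lower sequence number) must not count as
// proof that the final batch's write went through.
func satisfies(promised SequencedWrite, intent foundIntent, checkSeq bool) bool {
	if promised.Key != intent.Key {
		return false
	}
	if !checkSeq {
		return true // buggy: a stale intent passes
	}
	return intent.Seq >= promised.Seq
}

func main() {
	// The final batch promised a write to "k" at sequence 7, but only an
	// earlier intent (sequence 3) made it to disk.
	promised := SequencedWrite{Key: "k", Seq: 7}
	stale := foundIntent{Key: "k", Seq: 3}

	fmt.Println("key-only check:", satisfies(promised, stale, false)) // true (wrongly treats the write as present)
	fmt.Println("with sequence: ", satisfies(promised, stale, true))  // false (correctly detected as missing)
}
```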
|
Note to self (2): the STAGED transaction record also has to contain the |
|
Reviewed 1 of 2 files at r1, 1 of 1 files at r2. docs/tech-notes/parallel-commit.md, line 132 at r2 (raw file):
It'll be interesting to benchmark whether the threshold for a large transaction to switch to approximate intent spans is also a good threshold for this. The cost of checking for promised intents may get expensive way before that point (especially if we don't really nail down the heuristic of how long to wait as described in the "When to run this" section) docs/tech-notes/parallel-commit.md, line 158 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
It'd be nice if the delay happened as part of the Of course, if we change to resolving the transaction status by talking with the gateway, this problem goes away. docs/tech-notes/parallel-commit.md, line 198 at r2 (raw file):
Nit, but these would be much easier to read if they were tables (or images with arrows indicating dependencies). Comments from Reviewable |
|
Review status: all files reviewed at latest revision, 26 unresolved discussions, some commit checks failed. docs/tech-notes/node-txn-ids.md, line 20 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
I think turning it into an RFC and merging it with a status of rejected or postponed would make sense, just to capture the discussion (this wouldn't have to meet the quality bar for an accepted RFC). We could also just leave the doc in a branch on your fork but that would leave it vulnerable to accidental deletion. docs/tech-notes/node-txn-ids.md, line 55 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Yeah, it's a little more complicated than GC via compaction filter because this would be rewriting data that is still live for reads. We'd have to special-case it in the consistency checker. docs/tech-notes/parallel-commit.md, line 96 at r1 (raw file):
I think I agree, but we've been planning for a while to push more sql processing down to the KV layer and this has just come out of the blue. But I agree that improving best-case write latency is really important and this is one of our few avenues to big wins in this area. docs/tech-notes/parallel-commit.md, line 27 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
There appears to be a dangling footnote I'm afraid I'm going to need a more detailed walkthrough of the problematic case here and how it's not true in 1PC txns today. The hash solution is worth considering but I'd rather make sure we've covered all the alternatives first. docs/tech-notes/parallel-commit.md, line 152 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Hmm. docs/tech-notes/parallel-commit.md, line 158 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
A delay in command evaluation seems like a bad idea since that blocks the command queue. Comments from Reviewable |
|
Review status: all files reviewed at latest revision, 26 unresolved discussions, some commit checks failed. docs/tech-notes/node-txn-ids.md, line 20 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Ok. I'm going to let this fall behind for now. We'll keep the reviewable ref forever, right? Still committing it somewhere is nicer for discoverability. But the document needs a good reworking before I can justify that. docs/tech-notes/parallel-commit.md, line 96 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Yeah, doesn't hurt to take it into account. The pushdown is exclusively for reads, so I'm not concerned. Plus, distributed writes seem far away. But I'm also always surprised how much talk there is about it and have found that premature, so perhaps I'm the wrong judge. FWIW, as you've pointed out, running I'm still failing to see this become important, but if it does, I think the above solution works well, and the layers above docs/tech-notes/parallel-commit.md, line 27 at r2 (raw file):
If you're confused, it's probably because I celebrated too early and looked only at the easy half of the hard half. (i.e. the hash solution is the best I've got so far). For a very concrete example, let's say you have a transaction that in the last batch lays down a write on ranges 1 and 2 and commits on range 3. With the extended proposal, and assume that the Now consider the problem the status resolution process has to solve. It sees the staged record and tries to prevent the writes at Now it looks at (my mistake was here - I was trying to use the fact that there isn't an intent here as proof that the gateway had not entered the commit phase but that's not true - it may have and all the intent resolutions + the commit may have failed, with the exception of the one to this key) So the hash is the best I can do at this point. docs/tech-notes/parallel-commit.md, line 132 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Note that we usually never have to check promised intents. That happens only when a coordinator dies with a staged transaction record. Usually you still just push the transaction record from PENDING to ABORTED and that's that. So, it's OK if this is a little expensive! You are already dealing with a system in which there are unavailable ranges for ~4 seconds (and in fact some of those you may end up having to wait for). Besides, checking for the point intents is always strictly cheaper than successfully running the last batch of the txn. This is because you're just trying to read all of the intents in the worst case (not even returning any data), but the last batch had to put intents on all of them. So there's not really any throttling necessary, assuming we're reasonably good at not creating dangerously large last batches in the first place. The limiting factor is really how much we're willing to plop into a txn record (but, we can shard it rather easily!) For ranged intents, this section is still a little confused (and I shall update it), thanks for bringing my attention to that. First, So it may actually be feasible to lift this restriction. I think that's well worth doing. docs/tech-notes/parallel-commit.md, line 152 at r2 (raw file):
btw, I had thought about that, but then the natural next step is to send this in a batch with a limit, but DistSender is notoriously bad at parallelizing those, so that wasn't a good idea either.
Warms my heart less than docs/tech-notes/parallel-commit.md, line 158 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
I think I wasn't particularly clear with the delay. This is because I backed out some ideas here and didn't update it accordingly. Basically the current contention resolution stays the same. You communicate via the txn record and the "delay" really just means "wait until the txn record is pushable based on heartbeat timeout". You don't eat the delay usually. You need a very specific situation in which a coordinator managed to docs/tech-notes/parallel-commit.md, line 198 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
I'll see what I can do. Comments from Reviewable |
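The status resolution process discussed in this thread can be sketched in a few lines. This is a hedged model (statuses, the `resolveStatus` name, and the `checkWrite` callback are illustrative, not the actual implementation): given a STAGED record, the resolver either observes every promised write or durably prevents one of them, which decides the transaction's fate.

```go
package main

import "fmt"

type TxnStatus int

const (
	Pending TxnStatus = iota
	Staged
	Committed
	Aborted
)

// resolveStatus models status resolution for a STAGED transaction record
// under the proposed commit condition. checkWrite stands in for a per-key
// operation that either observes the promised intent or durably prevents
// it from ever being written.
func resolveStatus(status TxnStatus, promised []string, checkWrite func(key string) bool) TxnStatus {
	if status != Staged {
		return status // non-STAGED records are handled the usual way
	}
	for _, key := range promised {
		if !checkWrite(key) {
			// One promised write was prevented: the transaction can never
			// satisfy the commit condition, so it is aborted.
			return Aborted
		}
	}
	// All promised writes are present: the commit condition holds.
	return Committed
}

func main() {
	intents := map[string]bool{"a": true, "b": true}
	check := func(key string) bool { return intents[key] }
	fmt.Println(resolveStatus(Staged, []string{"a", "b"}, check) == Committed) // true
	fmt.Println(resolveStatus(Staged, []string{"a", "c"}, check) == Aborted)   // true
}
```

Note how this matches the discussion above: the expensive all-keys walk only happens when a coordinator dies with a STAGED record; the common path never reaches the loop.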
|
Another note to self: if we include range promised writes, we may still run
into problems for the case in which there is a limit. In particular, when
that limit is on the batch. Though that limit forces DistSender into serial
execution,

-- Tobias
|
... so it's a lot of code golf and, I think, not anything we need to
address architecturally. Will add a section.

On Sun, Apr 1, 2018, 10:06 PM Tobias Schottdorf ***@***.***> wrote:

> Another note to self: if we include range promised writes, we may still
> run into problems for the case in which there is a limit. In particular,
> when that limit is on the batch. Though that limit forces DistSender into
> serial execution,
> -- Tobias

-- Tobias
|
Review status: all files reviewed at latest revision, 26 unresolved discussions, some commit checks failed. docs/tech-notes/parallel-commit.md, line 96 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
docs/tech-notes/parallel-commit.md, line 27 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Ack. Unfortunately, I don't think a hash can work either. Suppose the put at k1 is an insert, and the put at k2 is incrementing a counter. The conflicting transaction is an insert at k4 and incrementing the same counter at k2. This will hash identically, but one of the counter increments would be lost. I think the extended proposal will require transaction IDs to be stored in the persisted value. Comments from Reviewable |
|
Review status: all files reviewed at latest revision, 22 unresolved discussions. docs/RFCS/20180324_parallel_commit.md, line 258 at r3 (raw file): Previously, bdarnell (Ben Darnell) wrote…
That's a good point, I hadn't looked at this code closely. Ultimately there will be some refactoring here because the implementation of how spans are refreshed really isn't sane use of docs/RFCS/20180324_parallel_commit.md, line 300 at r3 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Updated. I mean both what you say but also the eager cleanup after resolving the external intents. Used your sentence as well. Comments from Reviewable |
|
Review status: 0 of 1 files reviewed at latest revision, 22 unresolved discussions, some commit checks failed. docs/RFCS/20180324_parallel_commit.md, line 202 at r6 (raw file):
This is something that I think is not very clear throughout the RFC - there's always the possibility of a race between a concurrent txn doing status resolution and a late write from the main txn. We don't want to abort transactions aggressively. I guess this is where the check in the status resolution process for abandoned transactions comes into play, right? Maybe you can make this more explicit. docs/RFCS/20180324_parallel_commit.md, line 274 at r6 (raw file):
can't we use the docs/RFCS/20180324_parallel_commit.md, line 349 at r6 (raw file):
A different write on the same key would be OK though, wouldn't it? docs/RFCS/20180324_parallel_commit.md, line 355 at r6 (raw file):
What do you mean "otherwise"? There's always a transaction record, isn't it? Comments from Reviewable |
Cut the commit latency for the final batch of a transaction in half, from two rounds of consensus down to one, by addressing a shortcoming of the transactional model which does not allow `EndTransaction` to be sent in parallel with other writes. This is achieved through the introduction of a new transaction status `STAGED` which can be written in parallel with the final batch. With proper batching, this change translates directly into a reduction of SQL write latencies. The latency until intents are resolved remains the same (two rounds of consensus). Thus, contended workloads are expected to profit less from this feature. Release note: None
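The latency claim in this commit message can be illustrated with a small timing model. This is a sketch under simplified assumptions (one consensus round modeled as a fixed sleep, no failures, no retries); it only shows why the client-visible commit collapses from two dependent rounds to one when `EndTransaction(STAGED)` joins the final wave.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// consensusRound stands in for one round of Raft consensus.
func consensusRound() { time.Sleep(10 * time.Millisecond) }

// commitSequential models today's behavior: the final batch of writes must
// finish before EndTransaction is sent, costing two dependent rounds.
func commitSequential(writes int) time.Duration {
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < writes; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); consensusRound() }()
	}
	wg.Wait()
	consensusRound() // EndTransaction(COMMITTED)
	return time.Since(start)
}

// commitParallel models the proposal: EndTransaction(STAGED) goes out in the
// same wave as the final writes, so the client sees a single round (intent
// resolution still happens afterwards, as the commit message notes).
func commitParallel(writes int) time.Duration {
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < writes+1; i++ { // +1 for EndTransaction(STAGED)
		wg.Add(1)
		go func() { defer wg.Done(); consensusRound() }()
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	fmt.Println("sequential:", commitSequential(3).Round(10*time.Millisecond))
	fmt.Println("parallel:  ", commitParallel(3).Round(10*time.Millisecond))
}
```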
|
Review status: 0 of 1 files reviewed at latest revision, 26 unresolved discussions, some commit checks failed. docs/RFCS/20180324_parallel_commit.md, line 202 at r6 (raw file):
so I assume you were confused about this before this paragraph? I added something to the summary,
Does that help at all? docs/RFCS/20180324_parallel_commit.md, line 274 at r6 (raw file): Previously, andreimatei (Andrei Matei) wrote…
The abort cache is updated via Raft writes, so this would turn docs/RFCS/20180324_parallel_commit.md, line 349 at r6 (raw file): Previously, andreimatei (Andrei Matei) wrote…
Yeah, in theory you could do that (if you also used the same sequence number), but it's a slippery slope. The code making the decision of which key to write next wouldn't be aware of any of these constraints. docs/RFCS/20180324_parallel_commit.md, line 355 at r6 (raw file): Previously, andreimatei (Andrei Matei) wrote…
The next paragraph explains this (the txn record could have become explicitly resolved and subsequently GC'ed). Comments from Reviewable |
> The proposed changed condition (referred to as the **commit condition**) is:
>
> > A transaction is committed if and only if a) there is a transaction record with status COMMITTED, or b) one with status STAGED and all of the intents written in the last batch of that transaction are present.
I found the term "last batch" confusing
|
Review status: 0 of 1 files reviewed at latest revision, 27 unresolved discussions. docs/RFCS/20180324_parallel_commit.md, line 162 at r7 (raw file):
Make a note here that in the first case the EndTxn req is sent with a status of docs/RFCS/20180324_parallel_commit.md, line 187 at r7 (raw file):
Where does all of this happen? Still in docs/RFCS/20180324_parallel_commit.md, line 217 at r7 (raw file):
make a note that we skip the docs/RFCS/20180324_parallel_commit.md, line 234 at r7 (raw file):
Make a note why this doesn't need to include an epoch number. docs/RFCS/20180324_parallel_commit.md, line 234 at r7 (raw file):
No need to change anything here, but let's create this as a separate proto message called docs/RFCS/20180324_parallel_commit.md, line 272 at r7 (raw file):
I don't see any reason why this can't be used in place of the proposed docs/RFCS/20180324_parallel_commit.md, line 274 at r7 (raw file):
docs/RFCS/20180324_parallel_commit.md, line 278 at r7 (raw file):
This is easily possible with the docs/RFCS/20180324_parallel_commit.md, line 345 at r7 (raw file):
I'm curious what care you envision? Can we add an assertion somewhere on the write-path to enforce this? docs/RFCS/20180324_parallel_commit.md, line 368 at r7 (raw file):
When would an intent be found at the wrong timestamp? Comments from Reviewable |
This change introduces a new request method called QueryIntent. The request type draws parallels to the QueryTxn method, but it visits an intent instead of a transaction record and returns whether the intent exists or not. The request type also includes an "if missing" behavior, which allows clients to optionally ensure that the request prevents a missing intent from ever being written or return an error if the intent is found to be missing. This request type was proposed/discussed in both cockroachdb#24194 and cockroachdb#16026. It is a prerequisite for either proposal. Release note: None
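As a rough illustration of the "if missing" behavior described in this commit message, here is a hedged sketch (the `store` type, the low-water map, and the method names are invented for illustration; the real implementation uses the timestamp cache and MVCC machinery):

```go
package main

import "fmt"

// store models a tiny key space with intents and a per-key low-water mark.
type store struct {
	intents  map[string]int64 // key -> intent timestamp
	lowWater map[string]int64 // writes at or below this timestamp are rejected
}

// queryIntent reports whether an intent exists at (key, ts). With
// preventIfMissing set, a missing intent is durably prevented by bumping the
// key's low-water mark so a late write at ts can no longer succeed.
func (s *store) queryIntent(key string, ts int64, preventIfMissing bool) bool {
	if its, ok := s.intents[key]; ok && its == ts {
		return true
	}
	if preventIfMissing && s.lowWater[key] < ts {
		s.lowWater[key] = ts
	}
	return false
}

// write models a late intent write; it fails if the timestamp was prevented.
func (s *store) write(key string, ts int64) bool {
	if ts <= s.lowWater[key] {
		return false
	}
	s.intents[key] = ts
	return true
}

func main() {
	s := &store{intents: map[string]int64{}, lowWater: map[string]int64{}}
	fmt.Println(s.queryIntent("k", 5, true)) // false: missing, and now prevented
	fmt.Println(s.write("k", 5))             // false: the late write is rejected
}
```

This is the property status resolution relies on: after a `QueryIntent` with the "if missing" behavior returns false, the answer is stable, because the write it was looking for can never land.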
|
A small point I realized with pipelined writes and believe applies here as well is that this algorithm and a lot of its assumptions only apply to |
|
I'm not sure that changes much - snapshot transactions can have their intents pushed forward without aborting, but that's still a push operation on the transaction record, not something local to the intent. Still, I'd rather disable this optimization for SNAPSHOT (if SNAPSHOT survives at all: #26475) than spend too much time reasoning about whether it's safe. |
26335: roachpb: introduce QueryIntent request r=bdarnell a=nvanbenschoten This change introduces a new request method called QueryIntent. The request type draws parallels to the QueryTxn method, but it visits an intent instead of a transaction record and returns whether the intent exists or not. The request type also includes an "if missing" behavior, which allows clients to optionally ensure that the request prevents a missing intent from ever being written or return an error if the intent is found to be missing. This request type was proposed/discussed in both #24194 and #16026. It is a prerequisite for either proposal. Release note: None Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
|
Merging as a draft to clear out my PR queue. bors r+ |
24194: RFC: parallel commits r=tschottdorf a=tschottdorf The parallel commits tech note describes a proposal for reducing commit latencies. It comes in a basic and extended flavor, the latter of which requiring more engineering work. I believe we should pursue the former sooner rather than later, and make the latter contingent on the fate of the required second tech note, which suggests a new format for transaction IDs and explores some of the expected benefits. Release note: None 28589: opt: fix panic caused by select from table with no columns r=rytaft a=rytaft This commit fixes a panic caused by running `SELECT *` from a table with no visible columns (e.g., a table with only the hidden rowid column). The bug was caused by the `optbuilder` creating a `Presentation` slice even when there are no columns to present. Ensuring that the slice is nil in this case fixes the bug. Fixes #28388 Release note: None Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com> Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
Build succeeded |
I've been thinking about this in the context of implicit transactions that span multiple ranges because of secondary indexes. When we are given what could be a 1PC batch (i.e. has a In fact, I think the performance implication of the extra transaction status is effectively negligible in all cases where the final parallel batch already contains a write to the range that contains the transaction record. |
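The observation above can be encoded as a small decision helper. This is a hypothetical sketch (the `chooseCommitPath` function, range IDs, and labels are invented); it just separates the three cases: a pure 1PC batch, a final wave that already touches the transaction record's range (so the STAGED record piggybacks at no extra consensus cost), and a final wave that does not.

```go
package main

import "fmt"

// chooseCommitPath picks a commit strategy for a batch carrying
// EndTransaction, based on which ranges the batch touches and which range
// holds the transaction record.
func chooseCommitPath(rangesTouched []int, txnRecordRange int) string {
	if len(rangesTouched) == 1 && rangesTouched[0] == txnRecordRange {
		return "1PC" // whole batch fits on one range: classic fast path
	}
	for _, r := range rangesTouched {
		if r == txnRecordRange {
			// The final wave already writes to the txn record's range, so
			// the STAGED record rides along in that sub-batch.
			return "parallel commit (piggybacked STAGED)"
		}
	}
	return "parallel commit (separate STAGED write)"
}

func main() {
	fmt.Println(chooseCommitPath([]int{1}, 1))    // 1PC
	fmt.Println(chooseCommitPath([]int{1, 2}, 1)) // parallel commit (piggybacked STAGED)
	fmt.Println(chooseCommitPath([]int{2, 3}, 1)) // parallel commit (separate STAGED write)
}
```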
The parallel commits tech note describes a proposal for reducing commit
latencies. It comes in a basic and extended flavor, the latter of which
requires more engineering work. I believe we should pursue the former
sooner rather than later, and make the latter contingent on the fate of
the required second tech note, which suggests a new format for
transaction IDs and explores some of the expected benefits.
Release note: None