storage, kv: remove the max intent limit #21078
spencerkimball merged 1 commit into cockroachdb:master
Conversation
Unit tests TBD. @tschottdorf what do you think of this approach?
Exciting! One random comment; it might be confused: intents are always written at And I've got a question: this change doesn't affect the way in which single large inserts are refused, right? Like, |
Good comment about epochs – that was an oversight which I've corrected. I had to add a new field to Transaction to track the earliest timestamp. Not too high a cost, but does decrease the elegance of this a tad. This won't change single large insert behavior. But it does let you do things like insert |
ae476da force-pushed to 0cf7f8f
I like the approach! Bucketing by range is a good idea (I always pictured we'd somehow try to figure out proximity by looking at the keys, but that seems strictly trickier and probably worse). Using the time-bounded iterator is interesting too, but it deserves a lot more focused attention. Hence my suggestion is to defer the time-boundedness optimization to a subsequent PR (that may or may not make it into 2.0). That would imply that you don't have to keep track of the first epoch timestamp, and condenses this PR sufficiently (for me) to shepherd it into 2.0. It would be a real bummer to rush a txn bug into the release, and there's no pressing need for us to make large transactions "fast" (for some definition of fast that I'm not sure either with or without the time bounded iterator will be reached). I also don't think you quite get to put that release note down that you have there yet. We allow some forms of larger transactions, but SQL needs to implement its batching on top of it, as Andrei pointed out (there is also the option to let DistSender break up these large batches, but that's worse since it wouldn't address the problem that the insert would still have to fit into memory completely, which isn't a given for I only skimmed the code, so apologies for this question: do you ever merge buckets (i.e. ranges) or do you end up with one span per range touched in the worst case? Keys which already exceed your chunk size quote will be interesting to test (ideally they shouldn't go through the somewhat expensive reduction process every time). cc @danhhz how do you think this would play with #21097? We'd want to run transactions that write lots of intents. I suppose we could create an empty and then have the op run a very long transaction, right? Or create a table with 10k big rows across a few ranges, which are then Review status: 0 of 28 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed. 
pkg/kv/txn_coord_sender.go, line 996 at r1 (raw file):
Why not just remove this? pkg/roachpb/data.go, line 927 at r1 (raw file):
For sanity, better to have this track the minimum non-zero seen. pkg/roachpb/data.go, line 1299 at r1 (raw file):
Hmm, I'm surprised this kind of code didn't exist elsewhere already. pkg/storage/gc_queue_test.go, line 742 at r1 (raw file):
Could you make this comment clearer (and make it a real comment, not an inline one)? pkg/storage/intent_resolver.go, line 653 at r1 (raw file):
This might be racy in subtle ways. Once you've passed Comments from Reviewable |
This doesn't give you "arbitrarily large" transactions, it just moves the limit from space (number of intents) to time (how much can you write before you start running into the tscache low-water mark?). That raises the limit, but it also changes the character of the error you get when you hit it: it becomes a retryable error. This increases the potential for pathological retry behavior for large transactions (which I also noted in #21056 (comment)). I think a transaction size limit can be a useful safety mechanism. Range intent resolutions are expensive because they block untouched keys in the command queue. In my applications in most cases I'd rather disallow the transaction than have a threshold at which it silently switches to a much more expensive mode. Maybe we should keep a session variable in place so you have some control over whether large transactions are allowed. I think the time-bounded iterator is a necessary part of this change. Without it, I think the intent resolution will be prohibitively expensive (unless we require some sort of opt-in to lift the txn size limit). Even with it, we'll need to be wary of accidentally-quadratic behavior as we do more range operations. Review status: 0 of 28 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed. Comments from Reviewable |
Why prohibitively expensive? I'm not sure why large transactions need to go from "fatal to the system" to "so fast you'd want to use them all the time" (which they wouldn't either way, I don't think). A point that I see is that it's undesirable for "regular" write operations that are just barely above the thresholds to block the whole range.
That's a good point. Write-only transactions (which can automatically be SNAPSHOT, and hopefully is most of the real-world large transactions) such as INSERT ... FROM SELECT ... are not affected by this, though. With the points that @bdarnell has raised, I'd be more comfortable discussing an RFC rather than the actual code at hand. |
I’ve come to a conclusion in all of the work I’ve been doing recently that
SNAPSHOT is actually far more important than we’ve been giving it credit
for. We have basically been talking solely about restarts but I think
that’s always been a weak justification for even having them.
However, in the context of CDC and large transactions, they are a necessary
solution. So I do believe this allows for arbitrarily large transactions. I
believe the recommendation should be that SNAPSHOT is appropriate for such
workloads and instead of treating it as the red headed stepchild of
isolation levels, we should be embracing it as a mechanism with a definite
and important use case.
@spencerkimball I don't disagree for write-only txns, but if you feel strongly about |
"Prohibitively" may be too strong, but I think having to scan base-layer sstables to look for intents (redundantly for each large transaction!) would put the performance of these transactions outside of reasonable expectations.
I go back and forth on SNAPSHOT - it's not SERIALIZABLE, but use of SNAPSHOT or weaker isolation levels is common without (much) ill effect. On the other hand, there's reason to believe that our SNAPSHOT has issues that non-distributed SNAPSHOT implementations don't (such as FK verification as discussed in #19876). This is an interesting non-restart-related use case for SNAPSHOT. However, the downside of locking the affected rows for the duration of the transaction seems significant.
I would turn this into "Low-latency CDC, large transactions, and serializability. Pick two". Giving up serializability requires an application-specific analysis of the required consistency, just like any reduction in isolation level. I think giving up large transactions is more likely to be the best option, followed by increasing the max transaction duration (increasing the latency of CDC). |
@tschottdorf thanks for pointing to that discussion. I think @bdarnell's first comment is spot on: these delete-range type anomalies are better described as "write skew". The @bdarnell, "giving up large transactions" does not feel like a permanent solution – and your choice of two of three desirable properties / capabilities feels like a fallacy. I agree we can't have them all at once, but there are shades of gray that are still incredibly useful, even vital. I would set up our general mission for design/engineering as follows: "deliver the capabilities necessary to perform workloads our customers desire to run, working with the constraints imposed by our evolving design." Are large transactions within our remit? I would argue they are, based on the general sense of pain that I experienced (and it sounds like @andreimatei did as well) when I saw the change to limit max intents go in in the first place. Now if handling 10MiB or 100MiB of keys in long-running transactions is possible with this change and |
Right. I was describing a multi-dimensional space (like "good, fast, cheap"), not three binary properties.
Yes, of course. Increasing our various arbitrary limits including transaction size will almost always make our product better. But it's a question of timing and opportunity cost. This change will have follow-up work to improve the SQL execution of certain changes that are not yet streamed, and may lead to firefighting. Is this the right limit to be fighting at this time? (According to airtable, this is on the roadmap for 2.2, although the github issue is tagged 2.0)
Do they? It seems to me that in most cases where very large transactions arise, it's for something like importing or cleaning up data in non-live tables and transactionality is not required. Repeating a DELETE with a LIMIT clause is not an aesthetically pleasing solution but it works.
Depends on the kind of anomaly. I don't think postgres will ever let you violate a foreign key constraint regardless of your isolation level. We do if you're in SNAPSHOT isolation. This seems like a big difference. |
@bdarnell, admittedly there is still work to do on this but as far as opportunity cost, this felt like an easy pickup. It took a day, and to me felt like a bug that should be cleaned up. The fact that there is still work on the sql side for some use cases which create large transactions seems beside the point. The use case I have in mind is straightforward and one I ran into recently while testing: I just want to open a txn and write a boatload of stuff. It didn’t seem like a big deal and keys were small and I hit the 100k limit and that seemed buggy. Your example about the multiple As to our snapshot mode allowing foreign key constraint violations, I wasn’t aware of that. How positive are we of that? Could you point me at the example? |
I think this is probably a good change, I'm just calling for more caution in how we implement it. The intent limit is a problem, but it is predictable and gives a message that accurately points to the transaction size as the culprit. If we remove it, we may trade it for other more subtle problems. I'm not looking forward to the first support request we get for problems related to large transactions. This change does more than enable transactions that would previously fail: it also changes some existing transactions (with fewer than 100k intents but whose keys add up to more than 256K) to use range intent resolutions instead of point resolutions. What will the consequences of this be? How will we test the behavior of large transactions and their impact on other traffic? What kinds of workloads do we need to be concerned about?
They shouldn't be "barred" from using a large transaction, but they also shouldn't be "barred" from using correlated subqueries, or complex joins, or updateable views, or from having tools to diagnose what's going on in their cluster. A mature SQL database has an enormous feature set. We only cover part of that, and of the part that we cover there are things that don't work as well as they should. At our current stage of development, expanding the set of things that are "supported" but don't work as well as they should would do more harm than good. I fear that large transactions would fall into this category.
As far as I know no one has put together a complete example yet, but FK validations do their reads at the OrigTimestamp, not the commit timestamp, so it should be straightforward.
@bdarnell thank you for the example. Yes that is indeed an example of write skew. It's really a shame that we didn't implement the @tschottdorf in my testing, where I modified the kv loadgen to generate huge transactions containing many batches of upserts, using big keys, not using a time bounded iterator shows a throughput curve that slopes down and to the right, and then bottoms out into a trickle. With the time bounded iterator, things actually look pretty good by comparison. I think it's necessary if we accept the goal is to make these kinds of transactions feasible. I'm going to work on an addendum to this PR which will allow Review status: 0 of 28 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed. pkg/kv/txn_coord_sender.go, line 996 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Because I don't want the memory on the coordinator exploding either. This came up in testing. pkg/roachpb/data.go, line 927 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/roachpb/data.go, line 1299 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
I've added a comment and made changes to my call site, because I think local spans should not be combined, period. pkg/storage/gc_queue_test.go, line 742 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/intent_resolver.go, line 653 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. Comments from Reviewable |
210a39a force-pushed to 5a8f933
The concept is straightforward: store read spans encountered during the course of executing a transaction, up to a maximum limit of bytes. Upon encountering a transaction retry error with `reason=RETRY_SERIALIZABLE`, attempt to "update" all of the read spans in order to avoid a full txn restart. For the case of write-only SERIALIZABLE transactions, we're now able to set the original timestamp equal to the commit timestamp before committing, entirely circumventing the possibility of a serializable restart. Otherwise, if all read spans can be updated, the original timestamp is moved to the commit timestamp (without a transaction restart), and the `EndTransaction` batch is re-submitted.

`UpdateRead` is a new KV RPC which re-reads a span, updating the timestamp cache and checking whether any more recent writes occurred to the span. It uses a time-bounded iterator for efficiency. Any writes or intents which have appeared between the transaction's original timestamp and its commit timestamp result in an error.

This change should allow long-lived SERIALIZABLE transactions in the same style as SNAPSHOT, and should decrease the occurrence of transaction retry errors which currently must be handled by client code. See cockroachdb#21078

Release note: None
Writing an intent while doing reads for foreign keys would be incredibly expensive and inhibit concurrency. This would almost certainly be worse for performance than simply running in SERIALIZABLE. OTOH, if we wanted to use a KV operation internally that both reads a value and writes an intent, we could do that for internal use without exposing it as Anyway, I'm going to lay off this PR for a while and focus on #21056 and #21140. All of these PRs have some amount of risk, but I think the upside is much higher with the other two so I'm more than willing to work through the complexity there. |
5a8f933 force-pushed to d8a113f
`(*rocksDBBatch).NewTimeboundedIterator` was caching the iterator in a way that would return the iterator in future calls to `(*rocksDBBatch).NewIterator`, which was a correctness problem. Luckily, the fallout was somewhat limited by the fact that we only use time bounded iterators when resolving an intent, and that is usually (but not always) the last operation carried out on the batch. This was introduced in cockroachdb#21078, which landed around Jan 15 2018, so the bug has not made it into a release yet, though we'll have to bump our alpha (cockroachdb#21657). Release note: None Fixes cockroachdb#21689.
The concept is straightforward: store spans encountered during the course of executing a transaction, up to a maximum limit of bytes. Upon encountering a transaction retry error with `reason=RETRY_SERIALIZABLE`, attempt to "update" all of the spans in order to avoid a full txn restart. For the case of write-only SERIALIZABLE transactions, we're now able to set the original timestamp equal to the commit timestamp before committing, entirely circumventing the possibility of a serializable restart. Otherwise, if all spans can be updated, the original timestamp is moved to the commit timestamp (without a transaction restart), and the `EndTransaction` batch is re-submitted.

`UpdateSpan` is a new KV RPC which re-reads a span, updating the timestamp cache and checking whether any more recent writes occurred to the span. It uses a time-bounded iterator for efficiency. Any writes or intents which have appeared between the transaction's original timestamp and its commit timestamp result in an error. Note that `UpdateSpan` has a bool indicating whether it should update the read or write timestamp cache. Only `DeleteRange` creates update spans which need to update the write timestamp cache.

This change should allow long-lived SERIALIZABLE transactions in the same style as SNAPSHOT, and should decrease the occurrence of transaction retry errors which currently must be handled by client code. See cockroachdb#21078

Release note (sql): significantly reduce the likelihood of serializable restarts seen by clients due to concurrent workloads.
The concept is straightforward: store spans encountered during the course of executing a transaction, up to a maximum limit of bytes. Upon encountering a transaction retry error with `reason=RETRY_SERIALIZABLE`, attempt to "refresh" all of the spans in order to avoid a full txn restart. For the case of write-only SERIALIZABLE transactions, we're now able to set the original timestamp equal to the commit timestamp before committing, entirely circumventing the possibility of a serializable restart. Otherwise, if all spans can be updated, the original timestamp is moved to the commit timestamp (without a transaction restart), and the `EndTransaction` batch is re-submitted.

`Refresh(Range)` is a new KV RPC which re-reads a span, updating the timestamp cache and checking whether any more recent writes occurred to the span. It uses a time-bounded iterator for efficiency. Any writes or intents which have appeared between the transaction's original timestamp and its commit timestamp result in an error. Note that `Refresh(Range)` has a bool indicating whether it should update the read or write timestamp cache. Only `DeleteRange` creates update spans which need to update the write timestamp cache.

This change should allow long-lived SERIALIZABLE transactions in the same style as SNAPSHOT, and should decrease the occurrence of transaction retry errors which currently must be handled by client code. See cockroachdb#21078

Release note (sql): significantly reduce the likelihood of serializable restarts seen by clients due to concurrent workloads.
Follow up to cockroachdb#23449. This change adds a variant to the roachtest `copy` test that runs the paginated copy from one table to another inside of a transaction. The changes made in cockroachdb#21078 allow this to work even with millions of rows in the source table. Release note: None
Before this change, each txnReqInterceptor held a reference to its parent TxnCoordSender's TxnCoordMeta. Each interceptor would update its individual part of the meta separately (under lock), but a TxnCoordSender could always reach into its own meta and perform any modifications it wanted. As such, there wasn't a hard abstraction boundary. An example of how this abstraction was broken was `tryAsyncAbort`, which used to pull the intents out from under a txnIntentCollector. This change removes this shared state. `txnReqInterceptors` now hold on to their own state and this is only accessible through new `populateMetaLocked` and `augmentMetaLocked` methods. Because of this change, `TxnCoordSender` no longer has any interest or knowledge about intents or refresh spans. These concepts are all completely contained in their respective interceptors. The change also moves away from holding all state in a `TxnCoordMeta` and allowing this state to be pulled out at any time. This is an important change as we look towards the `txnPipeliner` interceptor because that will hold more complex data structures of in-flight writes. These writes will need to be serialized into a `TxnCoordMeta` when creating LeafTxn coordinators, but this should be on-demand (see `populateMetaLocked`). A similar change can now be made in `txnIntentCollector` along the lines of this [suggestion](cockroachdb#21078 (comment)). I added a TODO to explore. This change also simplifies the handling of concurrent requests and their effect on refresh spans. In doing so, it revealed a race where the meta's Txn.RefreshedTimestamp was only updated under lock before performing the span refresh if a prefix of a batch was executed. If not (`if br == nil {` in `maybeRetrySend`) then spans could be "missed" by the refresh. This was subtle so this commit fixed the bug and made the race prevention more explicit. Release note: None
Instead of enforcing a limit of 100,000 keys, this change moves to keep the
total size of the transaction record under a new constant set in
base.ChunkRaftCommandThresholdBytes, currently defaulting to 256 KiB. This
same value is used by the gc queue to chunk gc key requests.
In the txn coordinator sender, we now condense affected key spans when they
in aggregate exceed the threshold. Condensing is done by sorting all the
existing key spans and then bucketing them by range using a range iterator.
We then sort the buckets from largest to smallest and condense buckets in
that order until the total size of the key spans is reduced to less than the
threshold for the transaction record size.
In order to make the intent resolution process more palatable now that it may
involve full-range scans, a time-bounded iterator is used when resolving
write intents for a range. Small changes were necessary to enable the use
of time bounded iterators with rocksdb batches. The more difficult re-org
was in making sure that we always have the final transaction record when
cleaning up intents for a transaction. This allows us to precisely set
the time bounds on the time-bounded iterator, specifically to
[txn.EpochZeroTimestamp, txn.Timestamp]. You'll notice the
EpochZeroTimestamp field. That has been added to the transaction, and
is only set when a txn is restarted, so that we can keep track of the
earliest timestamp the transaction might have written at.
Further changes have been made when resolving local intents to track an
"allowance" of resolves so we don't overload the
EndTransaction batch, and to the intent resolver to more carefully send spanned intent resolves.
The intent resolver command now fully supports the
MaxKeys parameter by returning a resume span.
Fixes #15849
Release note (sql change): Allow arbitrarily large transactions. There was
formerly a limit of 100,000 keys.