
storage, kv: remove the max intent limit #21078

Merged
spencerkimball merged 1 commit into cockroachdb:master from
spencerkimball:allow-arbitrary-txn-sizes
Jan 15, 2018

Conversation

@spencerkimball
Member

@spencerkimball spencerkimball commented Dec 28, 2017

Instead of enforcing a limit of 100,000 keys, this change keeps the total
size of the transaction record under a new constant,
base.ChunkRaftCommandThresholdBytes, which currently defaults to 256KiB.
The same value is used by the GC queue to chunk GC key requests.

In the txn coordinator sender, we now condense the affected key spans when,
in aggregate, they exceed the threshold. Condensing is done by sorting all
the existing key spans and then bucketing them by range using a range
iterator. We then sort the buckets from largest to smallest and condense
them in that order until the total size of the key spans drops below the
threshold for the transaction record size.
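A minimal sketch of this condensing pass, using illustrative types and field names rather than the actual roachpb/kv APIs, and assuming the spans have already been bucketed by range:

```go
package main

import (
	"fmt"
	"sort"
)

// Span is a simplified key span; these names are illustrative, not
// the real roachpb.Span API.
type Span struct {
	Key, EndKey string
}

func (s Span) size() int { return len(s.Key) + len(s.EndKey) }

func bucketSize(b []Span) int {
	n := 0
	for _, s := range b {
		n += s.size()
	}
	return n
}

// condense replaces each over-threshold bucket of spans (one bucket
// per range) with a single covering span, largest buckets first,
// until the total tracked size drops below maxBytes.
func condense(buckets [][]Span, maxBytes int) []Span {
	total := 0
	for _, b := range buckets {
		total += bucketSize(b)
	}
	// Largest buckets first: condensing them frees the most bytes.
	sort.Slice(buckets, func(i, j int) bool {
		return bucketSize(buckets[i]) > bucketSize(buckets[j])
	})
	var out []Span
	for _, b := range buckets {
		if total > maxBytes && len(b) > 1 {
			// Collapse the bucket to one span covering [min key, max end key).
			merged := b[0]
			for _, s := range b[1:] {
				if s.Key < merged.Key {
					merged.Key = s.Key
				}
				if s.EndKey > merged.EndKey {
					merged.EndKey = s.EndKey
				}
			}
			total += merged.size() - bucketSize(b)
			out = append(out, merged)
		} else {
			out = append(out, b...)
		}
	}
	return out
}

func main() {
	buckets := [][]Span{
		{{"a", "b"}, {"c", "d"}, {"e", "f"}}, // one range's spans
		{{"x", "y"}},                         // another range
	}
	fmt.Println(condense(buckets, 4)) // prints: [{a f} {x y}]
}
```

The tradeoff is that a condensed span later forces a ranged intent resolution over keys the transaction may never have touched, which is why the PR pairs this with the time-bounded iterator below.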

To make the intent resolution process tolerable now that it may involve
full-range scans, a time-bounded iterator is used when resolving write
intents for a range. Small changes were necessary to enable the use of
time-bounded iterators with RocksDB batches. The more difficult re-org was
in making sure that we always have the final transaction record when
cleaning up intents for a transaction. This allows us to set precise bounds
on the time-bounded iterator, specifically
[txn.EpochZeroTimestamp, txn.Timestamp]. The new EpochZeroTimestamp field
has been added to the transaction and is set only when a txn is restarted,
so that we can keep track of the earliest timestamp at which the
transaction might have written.

Further changes have been made when resolving local intents to track an
"allowance" of resolves so we don't overload the EndTransaction batch,
and to the intent resolver to more carefully send spanned intent resolves.
The intent resolver command now fully supports the MaxKeys parameter
by returning a resume span.
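The resume-span contract reduces to the usual pagination loop; resolveRange below is a stand-in for a ResolveIntentRange-style call, not the actual API:

```go
package main

import "fmt"

// resolveRange stands in for a ranged intent-resolution call that
// honors MaxKeys and returns a resume key when it stops early. The
// signature is illustrative; keys are assumed sorted.
func resolveRange(keys []string, start string, maxKeys int) (resolved int, resume string) {
	for _, k := range keys {
		if k < start {
			continue
		}
		if resolved == maxKeys {
			// Resume span begins at the first unprocessed key.
			return resolved, k
		}
		resolved++
	}
	return resolved, ""
}

func main() {
	keys := []string{"a", "b", "c", "d", "e"}
	start, total := "a", 0
	// Client-side pagination: keep reissuing from the resume span
	// until the server reports no remaining work.
	for {
		n, resume := resolveRange(keys, start, 2)
		total += n
		if resume == "" {
			break
		}
		start = resume
	}
	fmt.Println(total) // prints 5
}
```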

Fixes #15849

Release note (sql change): Allow arbitrarily large transactions. There was
formerly a limit of 100,000 keys.

@spencerkimball spencerkimball requested review from a team and tbg on December 28, 2017 04:34
@cockroach-teamcity
Member

This change is Reviewable

@spencerkimball
Member Author

Unittests TBD. @tschottdorf what do you think of this approach?

@andreimatei
Contributor

Exciting!
Please reference #15849 in the commit.

One random comment (though it might be confused): intents are always written at txn.OrigTimestamp (not at txn.Timestamp; they're moved there upon cleanup), so I believe you can set tighter bounds on that time-bound iterator. At least for transactions with a single epoch... For txns with multiple epochs, are your current bounds correctly cleaning up intents from old epochs?

And I've got a question: this change doesn't affect the way in which single large inserts are refused, right? Like, INSERT INTO foo SELECT * from bar continues to be rejected with an error about the size of the (single) Raft cmd generated by the INSERT, correct?

@spencerkimball
Member Author

spencerkimball commented Dec 28, 2017

Good comment about epochs – that was an oversight which I've corrected. I had to add a new field to Transaction to track the earliest timestamp. Not too high a cost, but does decrease the elegance of this a tad.

This won't change single large insert behavior. But it does let you do things like insert 20MiB of keys in batches of 1,000 to the same transaction and commit them without breaking the system. I see no reason, though, why the SQL layer won't be able to fix the INSERT INTO foo SELECT * FROM bar case – it's a "small" matter of streaming. This change certainly lays the groundwork for it though, by solving the too-many-intents problem.

@spencerkimball spencerkimball force-pushed the allow-arbitrary-txn-sizes branch from ae476da to 0cf7f8f on December 28, 2017 20:10
@tbg
Member

tbg commented Dec 28, 2017

I like the approach! Bucketing by range is a good idea (I always pictured we'd somehow try to figure out proximity by looking at the keys, but that seems strictly trickier and probably worse).

Using the time-bounded iterator is interesting too, but it deserves a lot more focused attention. Hence my suggestion is to defer the time-boundedness optimization to a subsequent PR (that may or may not make it into 2.0). That would imply that you don't have to keep track of the first epoch timestamp, and it condenses this PR sufficiently (for me) to shepherd it into 2.0. It would be a real bummer to rush a txn bug into the release, and there's no pressing need for us to make large transactions "fast" (for some definition of fast that I'm not sure will be reached either with or without the time-bounded iterator).

I also don't think you quite get to claim that release note yet. We allow some forms of larger transactions, but SQL needs to implement its batching on top of this, as Andrei pointed out (there is also the option of letting DistSender break up these large batches, but that's worse since it wouldn't address the problem that the insert would still have to fit into memory completely, which isn't a given for INSERT FROM SELECT ...).

I only skimmed the code, so apologies for this question: do you ever merge buckets (i.e. ranges) or do you end up with one span per range touched in the worst case?

Keys which already exceed your chunk size quota will be interesting to test (ideally they shouldn't go through the somewhat expensive reduction process every time).

cc @danhhz how do you think this would play with #21097? We'd want to run transactions that write lots of intents. I suppose we could create an empty table and then have the op run a very long transaction, right? Or create a table with 10k big rows across a few ranges, which are then DELETE FROM ...'ed -- seems to be a good fit!


Review status: 0 of 28 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


pkg/kv/txn_coord_sender.go, line 996 at r1 (raw file):

		var err error
		keys, err = tc.maybeCondenseIntentSpans(ctx, keys)

Why not just remove this?


pkg/roachpb/data.go, line 927 at r1 (raw file):

	}
	if t.EpochZeroTimestamp == (hlc.Timestamp{}) {
		t.EpochZeroTimestamp = o.EpochZeroTimestamp

For sanity, better to have this track the minimum non-zero seen.


pkg/roachpb/data.go, line 1299 at r1 (raw file):

// Combine creates a new span containing the full union of the key
// space covered by the two spans. This includes any key space not
// covered by either span, but between them if the spans are disjoint.

Hmm, I'm surprised this kind of code didn't exist elsewhere already.
It's dangerous to have general-purpose code like this, btw, because of local addressing. This isn't an issue at your callsite because you bucket.


pkg/storage/gc_queue_test.go, line 742 at r1 (raw file):

		// Set a high Timestamp to make sure it does not matter. Only
		// OrigTimestamp (and heartbeat) are used for GC decisions.
		txn.Timestamp.Forward(hlc.MaxTimestamp.Prev() /* can't use hlc.MaxTimestamp due to Transaction.TimeBound() */)

Could you make this comment clearer (and make it a real comment, not an inline one)?


pkg/storage/intent_resolver.go, line 653 at r1 (raw file):

				break
			}
			req.SetHeader(*resp.ResumeSpan)

This might be racy in subtle ways. Once you've passed req to god knows who, you can't be sure that they aren't holding on to it. (This is one of these "rare races", but let's err on the side of caution).


Comments from Reviewable

@bdarnell
Contributor

This doesn't give you "arbitrarily large" transactions, it just moves the limit from space (number of intents) to time (how much can you write before you start running into the tscache low-water mark?). That raises the limit, but it also changes the character of the error you get when you hit it: it becomes a retryable error. This increases the potential for pathological retry behavior for large transactions (which I also noted in #21056 (comment)).

I think a transaction size limit can be a useful safety mechanism. Range intent resolutions are expensive because they block untouched keys in the command queue. In most of my applications, I'd rather disallow the transaction than have a threshold at which it silently switches to a much more expensive mode. Maybe we should keep a session variable in place so you have some control over whether large transactions are allowed.

I think the time-bounded iterator is a necessary part of this change. Without it, I think the intent resolution will be prohibitively expensive (unless we require some sort of opt-in to lift the txn size limit). Even with it, we'll need to be wary of accidentally-quadratic behavior as we do more range operations.


Review status: 0 of 28 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


Comments from Reviewable

@tbg
Member

tbg commented Dec 29, 2017

@bdarnell

I think the time-bounded iterator is a necessary part of this change. Without it, I think the intent resolution will be prohibitively expensive (unless we require some sort of opt-in to lift the txn size limit). Even with it, we'll need to be wary of accidentally-quadratic behavior as we do more range operations.

Why prohibitively expensive? I'm not sure why large transactions need to go from "fatal to the system" to "so fast you'd want to use them all the time" (which they wouldn't either way, I don't think).

A point that I see is that it's undesirable for "regular" write operations that are just barely above the thresholds to block the whole range.

This doesn't give you "arbitrarily large" transactions, it just moves the limit from space (number of intents) to time (how much can you write before you start running into the tscache low-water mark?).

That's a good point. Write-only transactions such as INSERT ... FROM SELECT ... (which can automatically be SNAPSHOT, and which hopefully account for most real-world large transactions) are not affected by this, though.

With the points that @bdarnell has raised, I'd be more comfortable discussing an RFC rather than the actual code at hand.

@spencerkimball
Member Author

spencerkimball commented Dec 29, 2017 via email

@tbg
Member

tbg commented Dec 29, 2017

@spencerkimball I don't disagree for write-only txns, but if you feel strongly about SNAPSHOT for read-write txns, then you may want to participate in this discussion. Things are a lot less clear there.

@bdarnell
Contributor

Why prohibitively expensive?

"Prohibitively" may be too strong, but I think having to scan base-layer sstables to look for intents (redundantly for each large transaction!) would put the performance of these transactions outside of reasonable expectations.

Write-only transactions (which can automatically be SNAPSHOT, and hopefully is most of the real-world large transactions) such as INSERT ... FROM SELECT ... are not affected by this, though.

INSERT ... FROM SELECT ... is not a write-only transaction - it reads from both the source SELECT table and the destination (UPSERT FROM SELECT could avoid reading from the destination). Sometimes it would be safe to make it SNAPSHOT; sometimes not.

I’ve come to a conclusion in all of the work I’ve been doing recently that
SNAPSHOT is actually far more important than we’ve been giving it credit
for. We have basically been talking solely about restarts but I think
that’s always been a weak justification for even having them.

I go back and forth on SNAPSHOT - it's not SERIALIZABLE, but use of SNAPSHOT or weaker isolation levels is common without (much) ill effect. On the other hand, there's reason to believe that our SNAPSHOT has issues that non-distributed SNAPSHOT implementations don't (such as FK verification, as discussed in #19876).

This is an interesting non-restart-related use case for SNAPSHOT. However, the downside of locking the affected rows for the duration of the transaction seems significant.

However, in the context of CDC and large transactions, they are a necessary
solution.

I would turn this into "Low-latency CDC, large transactions, and serializability. Pick two". Giving up serializability requires an application-specific analysis of the required consistency, just like any reduction in isolation level. I think giving up large transactions is more likely to be the best option, followed by increasing the max transaction duration (increasing the latency of CDC).

@spencerkimball
Member Author

@tschottdorf thanks for pointing to that discussion. I think @bdarnell's first comment is spot on: these delete-range type anomalies are better described as "write skew". The RetryOnDelete flag was probably a mistake to add, and I think it ought to be removed, or at least not set when we do a DeleteRange in a SNAPSHOT isolation txn. As @andreimatei was strongly arguing in that discussion, the cover we're providing with the specific case of DeleteRange as a KV API primitive is illusory. Did anyone check what Postgres or Oracle do with DELETE FROM <table> WHERE x > 1 AND x < 10? At least Postgres definitively exhibits the anomaly. Why did we decide to "solve" this problem? Because I added a test case for it, and we didn't think hard enough about whether the test itself was the problem.

@bdarnell, "giving up large transactions" does not feel like a permanent solution – and your choice of two of three desirable properties / capabilities feels like a fallacy. I agree we can't have them all at once, but there are shades of gray that are still incredibly useful, even vital. I would set up our general mission for design/engineering as follows: "deliver the capabilities necessary to perform workloads our customers desire to run, working with the constraints imposed by our evolving design."

Are large transactions within our remit? I would argue they are, based on the general sense of pain that I experienced (and it sounds like @andreimatei did as well) when the change limiting max intents went in in the first place. Now if handling 10MiB or 100MiB of keys in long-running transactions is possible with this change and SNAPSHOT (I can successfully do both on my laptop), that feels to me like a solution. It's not perfect, certainly: you won't be able to make it work with our [current] implementation of SERIALIZABLE because of the timestamp cache limitations. However, I would ask that you consider this from a customer's point of view. They want to use CockroachDB, but they have to deal with some very large transactions here and there. Having the admittedly imperfect option of switching to SNAPSHOT is something they'd eagerly embrace; I don't think non-serializable anomalies are going to qualify their sense of relief.

@bdarnell
Contributor

"giving up large transactions" does not feel like a permanent solution – and your choice of two of three desirable properties / capabilities feels like a fallacy. I agree we can't have them all at once, but there are shades of gray that are still incredibly useful, even vital.

Right. I was describing a multi-dimensional space (like "good, fast, cheap"), not three binary properties.

Are large transactions within our remit?

Yes, of course. Increasing our various arbitrary limits including transaction size will almost always make our product better. But it's a question of timing and opportunity cost. This change will have follow-up work to improve the SQL execution of certain changes that are not yet streamed, and may lead to firefighting. Is this the right limit to be fighting at this time? (According to airtable, this is on the roadmap for 2.2, although the github issue is tagged 2.0)

they have to deal with some very large transactions here and there.

Do they? It seems to me that in most cases where very large transactions arise, it's for something like importing or cleaning up data in non-live tables and transactionality is not required. Repeating a DELETE with a LIMIT clause is not an aesthetically pleasing solution but it works.

I don't think non-serializable anomalies are going to qualify their sense of relief.

Depends on the kind of anomaly. I don't think postgres will ever let you violate a foreign key constraint regardless of your isolation level. We do if you're in SNAPSHOT isolation. This seems like a big difference.

@spencerkimball
Member Author

@bdarnell, admittedly there is still work to do on this, but as far as opportunity cost goes, this felt like an easy pickup. It took a day, and to me it felt like a bug that should be cleaned up. The fact that there is still work on the SQL side for some use cases which create large transactions seems beside the point.

The use case I have in mind is straightforward and one I ran into recently while testing: I just want to open a txn and write a boatload of stuff. It didn’t seem like a big deal (the keys were small), yet I hit the 100k limit, and that seemed buggy.

Your example about the multiple DELETE FROM statements with a limit underscores the point. I suspect that’s a common enough use case and I don’t see why people should be barred from using a txn to do it when that will likely feel natural to some users. So your comment that it’s not aesthetically pleasing but it works, only applies to databases other than cockroach – at least the “works” part. It sort of works in Cockroach, just don’t use any txns.

As to our snapshot mode allowing foreign key constraint violations, I wasn’t aware of that. How positive are we of that? Could you point me at the example?

@bdarnell
Contributor

I think this is probably a good change, I'm just calling for more caution in how we implement it. The intent limit is a problem, but it is predictable and gives a message that accurately points to the transaction size as the culprit. If we remove it, we may trade it for other more subtle problems. I'm not looking forward to the first support request we get for problems related to large transactions.

This change does more than enable transactions that would previously fail: it also changes some existing transactions (with fewer than 100k intents but whose keys add up to more than 256K) to use range intent resolutions instead of point resolutions. What will the consequences of this be? How will we test the behavior of large transactions and their impact on other traffic? What kinds of workloads do we need to be concerned about?

I don’t see why people should be barred from using a txn to do it when that will likely feel natural to some users.

They shouldn't be "barred" from using a large transaction, but they also shouldn't be "barred" from using correlated subqueries, or complex joins, or updateable views, or from having tools to diagnose what's going on in their cluster. A mature SQL database has an enormous feature set. We only cover part of that, and of the part that we cover there are things that don't work as well as they should. At our current stage of development, expanding the set of things that are "supported" but don't work as well as they should would do more harm than good. I fear that large transactions would fall into this category.

As to our snapshot mode allowing foreign key constraint violations, I wasn’t aware of that. How positive are we of that? Could you point me at the example?

As far as I know no one has put together a complete example yet, but FK validations do their reads at the OrigTimestamp, not the commit timestamp, so it should be straightforward.

  1. Insert row P into parent table
  2. Start transaction to insert row C into child table, with FK pointing to P
  3. Delete row P
  4. Transaction for row C gets pushed to after the deletion of row P
  5. Row C commits after retry. The FK validation passes because it read from the time at step 2, but it now points to a non-existent row.
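The five steps above can be simulated with a toy MVCC store (a deliberate simplification to illustrate the anomaly, not CockroachDB's actual storage engine):

```go
package main

import "fmt"

// version records whether a key exists as of a timestamp; versions
// are assumed to be stored in ascending timestamp order.
type version struct {
	ts     int64
	exists bool
}

type store map[string][]version

// readAt returns whether key exists as of timestamp ts.
func (s store) readAt(key string, ts int64) bool {
	exists := false
	for _, v := range s[key] {
		if v.ts <= ts {
			exists = v.exists
		}
	}
	return exists
}

func main() {
	s := store{}
	s["parent/P"] = append(s["parent/P"], version{ts: 1, exists: true})  // 1. insert P
	origTS := int64(2)                                                   // 2. child txn starts
	s["parent/P"] = append(s["parent/P"], version{ts: 3, exists: false}) // 3. delete P
	commitTS := int64(4)                                                 // 4. child txn pushed past the delete

	// 5. The FK validation reads at origTS, so it still sees P ...
	fmt.Println("FK check passes:", s.readAt("parent/P", origTS))
	// ... but at the commit timestamp P is already gone.
	fmt.Println("parent exists at commit:", s.readAt("parent/P", commitTS))
}
```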

@spencerkimball
Member Author

@bdarnell thank you for the example. Yes that is indeed an example of write skew. It's really a shame that we didn't implement the FOR UPDATE locks on reads... Another example where something useful didn't fit well for all use cases and was scotched after a month of analysis. Seems that the simple fix to make SNAPSHOT isolation work properly would have been to tag that bool in our scans when checking foreign key constraints. Alas, the functionality didn't make the cut because it had impedance mismatch with strict interpretations of SQL guarantees. I draw a wholly opposite conclusion about caution. I think we have an overabundance of it, and it seems to be leading to slower, and worse, outcomes. YMMV.

@tschottdorf in my testing, where I modified the kv loadgen to generate huge transactions containing many batches of upserts with big keys, not using a time-bounded iterator shows a throughput curve that slopes down and to the right, then bottoms out into a trickle. With the time-bounded iterator, things look pretty good by comparison. I think it's necessary if we accept the goal of making these kinds of transactions feasible.

I'm going to work on an addendum to this PR which will allow SERIALIZABLE transactions to avoid restarts and hopefully run arbitrarily long, despite the time bounds currently on the timestamp cache and the proposed time bounds for follower reads. There will again be some gotchas, but my feeling is it will enable functionality we actually need to build in the 2.0 / 2.1 timeframe.


Review status: 0 of 28 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


pkg/kv/txn_coord_sender.go, line 996 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Why not just remove this?

Because I don't want the memory on the coordinator exploding either. This came up in testing.


pkg/roachpb/data.go, line 927 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

For sanity, better to have this track the minimum non-zero seen.

Done.


pkg/roachpb/data.go, line 1299 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Hmm, I'm surprised this kind of code didn't exist elsewhere already.
It's dangerous to have general-purpose code like this, btw, because of local addressing. This isn't an issue at your callsite because you bucket.

I've added a comment and made changes to my call site, because I think local spans should not be combined, period.


pkg/storage/gc_queue_test.go, line 742 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Could you make this comment clearer (and make it a real comment, not an inline one)?

Done.


pkg/storage/intent_resolver.go, line 653 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

This might be racy in subtle ways. Once you've passed req to god knows who, you can't be sure that they aren't holding on to it. (This is one of these "rare races", but let's err on the side of caution).

Done.


Comments from Reviewable

@spencerkimball spencerkimball force-pushed the allow-arbitrary-txn-sizes branch 2 times, most recently from 210a39a to 5a8f933 on December 31, 2017 15:50
spencerkimball added a commit to spencerkimball/cockroach that referenced this pull request Jan 1, 2018
The concept is straightforward: store read spans encountered during
the course of executing a transaction, up to a maximum limit of bytes.
Upon encountering a transaction retry error with `reason=RETRY_SERIALIZABLE`,
attempt to "update" all of the read spans in order to avoid a full txn
restart. For the case of write-only SERIALIZABLE transactions, we're now
able to set the original timestamp equal to the commit timestamp before
committing, entirely circumventing the possibility of a serializable
restart. Otherwise, if all read spans can be updated, the original timestamp
is moved to the commit timestamp (without a transaction restart), and the
`EndTransaction` batch is re-submitted.

`UpdateRead` is a new KV RPC which re-reads a span, updating the timestamp
cache and checking whether any more recent writes occurred to the span.
It uses a time-bounded iterator for efficiency. Any writes or intents
which have appeared between the transaction's original timestamp and its
commit timestamp result in an error.

This change should allow long-lived SERIALIZABLE transactions in the same
style as SNAPSHOT, and should decrease the occurrence of transaction retry
errors which currently must be handled by client code.

See cockroachdb#21078

Release note: None
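The refresh idea in the commit message above reduces to one check: did any conflicting write land in a read span between the transaction's original and commit timestamps? A sketch with illustrative names (the real RPC uses a time-bounded iterator over storage rather than an explicit list of writes):

```go
package main

import "fmt"

// write is a key written at a timestamp, standing in for what a
// time-bounded scan over (origTS, commitTS] would surface.
type write struct {
	key string
	ts  int64
}

// canRefresh reports whether the transaction's reads can be
// considered to have happened at commitTS: true iff no write in the
// interval (origTS, commitTS] touched a key the transaction read.
func canRefresh(reads []string, writes []write, origTS, commitTS int64) bool {
	for _, w := range writes {
		if w.ts <= origTS || w.ts > commitTS {
			continue // outside the refresh window
		}
		for _, k := range reads {
			if k == w.key {
				return false // conflicting write appeared; must restart
			}
		}
	}
	return true
}

func main() {
	reads := []string{"a", "b"}
	writes := []write{{key: "c", ts: 5}} // concurrent write outside our read set
	if canRefresh(reads, writes, 3, 7) {
		fmt.Println("refreshed: commit without restart")
	}
	writes = append(writes, write{key: "a", ts: 6})
	fmt.Println(canRefresh(reads, writes, 3, 7)) // prints false
}
```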
@bdarnell
Contributor

bdarnell commented Jan 1, 2018

It's really a shame that we didn't implement the FOR UPDATE locks on reads... Another example where something useful didn't fit well for all use cases and was scotched after a month of analysis. Seems that the simple fix to make SNAPSHOT isolation work properly would have been to tag that bool in our scans when checking foreign key constraints.

Writing an intent while doing reads for foreign keys would be incredibly expensive and inhibit concurrency. This would almost certainly be worse for performance than simply running in SERIALIZABLE.

OTOH, if we wanted to use a KV operation internally that both reads a value and writes an intent, we could do that for internal use without exposing it as SELECT FOR UPDATE. The reason we abandoned SFU wasn't that it "didn't fit well for all use cases"; it was that the semantics we could provide would not have met the needs of literally any of the uses of SFU that had come to our attention (all of which were using SFU as an advisory lock for actions outside the transaction).

Anyway, I'm going to lay off this PR for a while and focus on #21056 and #21140. All of these PRs have some amount of risk, but I think the upside is much higher with the other two so I'm more than willing to work through the complexity there.

@spencerkimball spencerkimball force-pushed the allow-arbitrary-txn-sizes branch from 5a8f933 to d8a113f on January 1, 2018 22:23
spencerkimball added a commit to spencerkimball/cockroach that referenced this pull request Jan 2, 2018
tbg added a commit to tbg/cockroach that referenced this pull request Jan 23, 2018
`(*rocksDBBatch).NewTimeboundedIterator` was caching the iterator in a
way that would return the iterator in future calls to
`(*rocksDBBatch).NewIterator`, which was a correctness problem.

Luckily, the fallout was somewhat limited by the fact that we only use
time bounded iterators when resolving an intent, and that is usually
(but not always) the last operation carried out on the batch.

This was introduced in cockroachdb#21078, which landed around Jan 15 2018, so
the bug has not made it into a release yet, though we'll have to bump
our alpha (cockroachdb#21657).

Release note: None

Fixes cockroachdb#21689.
spencerkimball added a commit to spencerkimball/cockroach that referenced this pull request Jan 30, 2018
The concept is straightforward: store spans encountered during
the course of executing a transaction, up to a maximum limit of bytes.
Upon encountering a transaction retry error with `reason=RETRY_SERIALIZABLE`,
attempt to "update" all of the spans in order to avoid a full txn
restart. For the case of write-only SERIALIZABLE transactions, we're now
able to set the original timestamp equal to the commit timestamp before
committing, entirely circumventing the possibility of a serializable
restart. Otherwise, if all spans can be updated, the original timestamp
is moved to the commit timestamp (without a transaction restart), and the
`EndTransaction` batch is re-submitted.

`UpdateSpan` is a new KV RPC which re-reads a span, updating the timestamp
cache and checking whether any more recent writes occurred to the span.
It uses a time-bounded iterator for efficiency. Any writes or intents
which have appeared between the transaction's original timestamp and its
commit timestamp result in an error. Note that `UpdateSpan` has a bool
indicating whether it should update the read or write timestamp cache.
Only `DeleteRange` creates update spans which need to update the write
timestamp cache.

This change should allow long-lived SERIALIZABLE transactions in the same
style as SNAPSHOT, and should decrease the occurrence of transaction retry
errors which currently must be handled by client code.

See cockroachdb#21078

Release note (sql): significantly reduce the likelihood of serializable
restarts seen by clients due to concurrent workloads.
spencerkimball added a commit to spencerkimball/cockroach that referenced this pull request Jan 30, 2018
spencerkimball added a commit to spencerkimball/cockroach that referenced this pull request Jan 31, 2018
The concept is straightforward: store spans encountered during
the course of executing a transaction, up to a maximum limit of bytes.
Upon encountering a transaction retry error with `reason=RETRY_SERIALIZABLE`,
attempt to "refresh" all of the spans in order to avoid a full txn
restart. For the case of write-only SERIALIZABLE transactions, we're now
able to set the original timestamp equal to the commit timestamp before
committing, entirely circumventing the possibility of a serializable
restart. Otherwise, if all spans can be updated, the original timestamp
is moved to the commit timestamp (without a transaction restart), and the
`EndTransaction` batch is re-submitted.

`Refresh(Range)` is a new KV RPC which re-reads a span, updating the timestamp
cache and checking whether any more recent writes occurred to the span.
It uses a time-bounded iterator for efficiency. Any writes or intents
which have appeared between the transaction's original timestamp and its
commit timestamp result in an error. Note that `Refresh(Range)` has a bool
indicating whether it should update the read or write timestamp cache.
Only `DeleteRange` creates update spans which need to update the write
timestamp cache.
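The conflict check performed by the refresh can be sketched as below. All names here are illustrative: the real `Refresh(Range)` evaluates against MVCC state via a time-bounded iterator, while this sketch models the writes it would surface as a plain slice.

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for hlc.Timestamp.
type Timestamp int64

// keyWrite models a committed write or intent a time-bounded
// iterator over (origTS, commitTS] might surface within the span.
type keyWrite struct {
	key string
	ts  Timestamp
}

// canRefresh reports whether a span refresh succeeds: it fails if
// any write landed after the transaction's original timestamp and
// at or before its provisional commit timestamp.
func canRefresh(writes []keyWrite, origTS, commitTS Timestamp) bool {
	for _, w := range writes {
		if w.ts > origTS && w.ts <= commitTS {
			return false // a more recent write invalidates the read
		}
	}
	return true
}

func main() {
	writes := []keyWrite{{"a", 5}, {"b", 12}}
	// "b" at ts=12 falls inside (10, 20], so the refresh fails.
	fmt.Println(canRefresh(writes, 10, 20)) // prints "false"
}
```

The time bounds are what make this cheap: the iterator only visits SSTables containing data in the (original timestamp, commit timestamp] window, so an empty window over a large span is still fast.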

This change should allow long-lived SERIALIZABLE transactions in the same
style as SNAPSHOT, and should decrease the occurrence of transaction retry
errors which currently must be handled by client code.

See cockroachdb#21078

Release note (sql): significantly reduce the likelihood of serializable
restarts seen by clients due to concurrent workloads.
spencerkimball added a commit to spencerkimball/cockroach that referenced this pull request Jan 31, 2018
spencerkimball added a commit to spencerkimball/cockroach that referenced this pull request Feb 1, 2018
nvb added a commit to nvb/cockroach that referenced this pull request Mar 8, 2018
Follow up to cockroachdb#23449.

This change adds a variant to the roachtest `copy` test that
runs the paginated copy from one table to another inside of
a transaction. The changes made in cockroachdb#21078 allow this to work
even with millions of rows in the source table.

Release note: None
nvb added a commit to nvb/cockroach that referenced this pull request Jun 8, 2018
Before this change, each txnReqInterceptor held a reference to its parent
TxnCoordSender's TxnCoordMeta. Each interceptor would update its individual
part of the meta separately (under lock), but a TxnCoordSender could always
reach into its own meta and perform any modifications it wanted. As such,
there wasn't a hard abstraction boundary. An example of how this abstraction
was broken was `tryAsyncAbort`, which used to pull the intents out from under
a txnIntentCollector.

This change removes this shared state. `txnReqInterceptors` now hold on to
their own state and this is only accessible through new `populateMetaLocked`
and `augmentMetaLocked` methods. Because of this change, `TxnCoordSender`
no longer has any interest or knowledge about intents or refresh spans.
These concepts are all completely contained in their respective interceptors.

The change also moves away from holding all state in a `TxnCoordMeta` and
allowing this state to be pulled out at any time. This is an important
change as we look towards the `txnPipeliner` interceptor because that will
hold more complex data structures of in-flight writes. These writes will
need to be serialized into a `TxnCoordMeta` when creating LeafTxn coordinators,
but this should be on-demand (see `populateMetaLocked`). A similar change
can now be made in `txnIntentCollector` along the lines of this
[suggestion](cockroachdb#21078 (comment)).
I added a TODO to explore.
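The populate/augment boundary can be sketched as follows. This is a simplified model under assumed names: the real `TxnCoordMeta` and interceptors carry far more state, and locking is elided.

```go
package main

import "fmt"

// txnMeta is a simplified stand-in for TxnCoordMeta: the serializable
// snapshot of coordinator state handed to LeafTxn coordinators.
type txnMeta struct {
	RefreshSpans []string
}

// txnInterceptor owns its state privately; the coordinator can only
// read it via populateMetaLocked and merge external state back in via
// augmentMetaLocked, enforcing the abstraction boundary.
type txnInterceptor struct {
	refreshSpans []string
}

// populateMetaLocked serializes the interceptor's private state into
// the meta on demand, e.g. when creating a LeafTxn coordinator.
func (i *txnInterceptor) populateMetaLocked(meta *txnMeta) {
	meta.RefreshSpans = append([]string(nil), i.refreshSpans...)
}

// augmentMetaLocked folds state returned from elsewhere (e.g. a
// finished LeafTxn) back into the interceptor's private state.
func (i *txnInterceptor) augmentMetaLocked(meta txnMeta) {
	i.refreshSpans = append(i.refreshSpans, meta.RefreshSpans...)
}

func main() {
	root := &txnInterceptor{refreshSpans: []string{"a-c"}}
	var meta txnMeta
	root.populateMetaLocked(&meta) // serialized only on demand
	leaf := &txnInterceptor{}
	leaf.augmentMetaLocked(meta)
	fmt.Println(leaf.refreshSpans)
}
```

Because the meta is built on demand rather than shared, nothing outside the interceptor can mutate its spans behind its back, which is exactly the `tryAsyncAbort`-style breakage this change removes.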

This change also simplifies the handling of concurrent requests and their
effect on refresh spans. In doing so, it revealed a race where the meta's
Txn.RefreshedTimestamp was only updated under lock before performing the span
refresh if a prefix of a batch was executed. If not (`if br == nil {` in
`maybeRetrySend`) then spans could be "missed" by the refresh. This was
subtle so this commit fixed the bug and made the race prevention more explicit.

Release note: None
nvb added a commit to nvb/cockroach that referenced this pull request Jun 14, 2018
nvb added a commit to nvb/cockroach that referenced this pull request Jun 18, 2018
Successfully merging this pull request may close these issues.

stability: Support for large transactions

6 participants