storage, kv: remove the max intent limit #21078
spencerkimball merged 1 commit into cockroachdb:master
Conversation
Unit tests TBD. @tschottdorf what do you think of this approach?
Exciting! One random comment; it might be confused: intents are always written at And I've got a question: this change doesn't affect the way in which single large inserts are refused, right? Like, |
Good comment about epochs – that was an oversight which I've corrected. I had to add a new field to Transaction to track the earliest timestamp. Not too high a cost, but does decrease the elegance of this a tad. This won't change single large insert behavior. But it does let you do things like insert |
ae476da force-pushed to 0cf7f8f
I like the approach! Bucketing by range is a good idea (I always pictured we'd somehow try to figure out proximity by looking at the keys, but that seems strictly trickier and probably worse). Using the time-bounded iterator is interesting too, but it deserves a lot more focused attention. Hence my suggestion is to defer the time-boundedness optimization to a subsequent PR (that may or may not make it into 2.0). That would imply that you don't have to keep track of the first epoch timestamp, and condenses this PR sufficiently (for me) to shepherd it into 2.0. It would be a real bummer to rush a txn bug into the release, and there's no pressing need for us to make large transactions "fast" (for some definition of fast that I'm not sure either with or without the time bounded iterator will be reached). I also don't think you quite get to put that release note down that you have there yet. We allow some forms of larger transactions, but SQL needs to implement its batching on top of it, as Andrei pointed out (there is also the option to let DistSender break up these large batches, but that's worse since it wouldn't address the problem that the insert would still have to fit into memory completely, which isn't a given for I only skimmed the code, so apologies for this question: do you ever merge buckets (i.e. ranges) or do you end up with one span per range touched in the worst case? Keys which already exceed your chunk size quote will be interesting to test (ideally they shouldn't go through the somewhat expensive reduction process every time). cc @danhhz how do you think this would play with #21097? We'd want to run transactions that write lots of intents. I suppose we could create an empty and then have the op run a very long transaction, right? Or create a table with 10k big rows across a few ranges, which are then Review status: 0 of 28 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed. 
pkg/kv/txn_coord_sender.go, line 996 at r1 (raw file):
Why not just remove this? pkg/roachpb/data.go, line 927 at r1 (raw file):
For sanity, better to have this track the minimum non-zero seen. pkg/roachpb/data.go, line 1299 at r1 (raw file):
Hmm, I'm surprised this kind of code didn't exist elsewhere already. pkg/storage/gc_queue_test.go, line 742 at r1 (raw file):
Could you make this comment clearer (and make it a real comment, not an inline one)? pkg/storage/intent_resolver.go, line 653 at r1 (raw file):
This might be racy in subtle ways. Once you've passed Comments from Reviewable |
This doesn't give you "arbitrarily large" transactions, it just moves the limit from space (number of intents) to time (how much can you write before you start running into the tscache low-water mark?). That raises the limit, but it also changes the character of the error you get when you hit it: it becomes a retryable error. This increases the potential for pathological retry behavior for large transactions (which I also noted in #21056 (comment)). I think a transaction size limit can be a useful safety mechanism. Range intent resolutions are expensive because they block untouched keys in the command queue. In my applications in most cases I'd rather disallow the transaction than have a threshold at which it silently switches to a much more expensive mode. Maybe we should keep a session variable in place so you have some control over whether large transactions are allowed. I think the time-bounded iterator is a necessary part of this change. Without it, I think the intent resolution will be prohibitively expensive (unless we require some sort of opt-in to lift the txn size limit). Even with it, we'll need to be wary of accidentally-quadratic behavior as we do more range operations. Review status: 0 of 28 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed. Comments from Reviewable |
Why prohibitively expensive? I'm not sure why large transactions need to go from "fatal to the system" to "so fast you'd want to use them all the time" (which they wouldn't either way, I don't think). A point that I see is that it's undesirable for "regular" write operations that are just barely above the thresholds to block the whole range.
That's a good point. Write-only transactions (which can automatically be SNAPSHOT, and hopefully is most of the real-world large transactions) such as INSERT ... FROM SELECT ... are not affected by this, though. With the points that @bdarnell has raised, I'd be more comfortable discussing an RFC rather than the actual code at hand. |
I’ve come to a conclusion in all of the work I’ve been doing recently that
SNAPSHOT is actually far more important than we’ve been giving it credit
for. We have basically been talking solely about restarts but I think
that’s always been a weak justification for even having them.
However, in the context of CDC and large transactions, they are a necessary
solution. So I do believe this allows for arbitrarily large transactions. I
believe the recommendation should be that SNAPSHOT is appropriate for such
workloads and instead of treating it as the red headed stepchild of
isolation levels, we should be embracing it as a mechanism with a definite
and important use case.
@spencerkimball I don't disagree for write-only txns, but if you feel strongly about |
"Prohibitively" may be too strong, but I think having to scan base-layer sstables to look for intents (redundantly for each large transaction!) would put the performance of these transactions outside of reasonable expectations.
I go back and forth on SNAPSHOT - it's not SERIALIZABLE, but use of SNAPSHOT or weaker isolation levels is common without (much) ill effect. On the other hand, there's reason to believe that our SNAPSHOT has issues that non-distributed SNAPSHOT implementations don't (such as FK verification as discussed in #19876). This is an interesting non-restart-related use case for SNAPSHOT. However, the downside of locking the affected rows for the duration of the transaction seems significant.
I would turn this into "Low-latency CDC, large transactions, and serializability. Pick two". Giving up serializability requires an application-specific analysis of the required consistency, just like any reduction in isolation level. I think giving up large transactions is more likely to be the best option, followed by increasing the max transaction duration (increasing the latency of CDC). |
@tschottdorf thanks for pointing to that discussion. I think @bdarnell's first comment is spot on: these delete-range type anomalies are better described as "write skew". The @bdarnell, "giving up large transactions" does not feel like a permanent solution – and your choice of two of three desirable properties / capabilities feels like a fallacy. I agree we can't have them all at once, but there are shades of gray that are still incredibly useful, even vital. I would set up our general mission for design/engineering as follows: "deliver the capabilities necessary to perform workloads our customers desire to run, working with the constraints imposed by our evolving design." Are large transactions within our remit? I would argue they are, based on the general sense of pain that I experienced (and it sounds like @andreimatei did as well) when I saw the change to limit max intents go in in the first place. Now if handling 10MiB or 100MiB of keys in long-running transactions is possible with this change and |
Right. I was describing a multi-dimensional space (like "good, fast, cheap"), not three binary properties.
Yes, of course. Increasing our various arbitrary limits including transaction size will almost always make our product better. But it's a question of timing and opportunity cost. This change will have follow-up work to improve the SQL execution of certain changes that are not yet streamed, and may lead to firefighting. Is this the right limit to be fighting at this time? (According to airtable, this is on the roadmap for 2.2, although the github issue is tagged 2.0)
Do they? It seems to me that in most cases where very large transactions arise, it's for something like importing or cleaning up data in non-live tables and transactionality is not required. Repeating a DELETE with a LIMIT clause is not an aesthetically pleasing solution but it works.
Depends on the kind of anomaly. I don't think postgres will ever let you violate a foreign key constraint regardless of your isolation level. We do if you're in SNAPSHOT isolation. This seems like a big difference. |
@bdarnell, admittedly there is still work to do on this but as far as opportunity cost, this felt like an easy pickup. It took a day, and to me felt like a bug that should be cleaned up. The fact that there is still work on the sql side for some use cases which create large transactions seems beside the point. The use case I have in mind is straightforward and one I ran into recently while testing: I just want to open a txn and write a boatload of stuff. It didn’t seem like a big deal and keys were small and I hit the 100k limit and that seemed buggy. Your example about the multiple As to our snapshot mode allowing foreign key constraint violations, I wasn’t aware of that. How positive are we of that? Could you point me at the example? |
I think this is probably a good change, I'm just calling for more caution in how we implement it. The intent limit is a problem, but it is predictable and gives a message that accurately points to the transaction size as the culprit. If we remove it, we may trade it for other more subtle problems. I'm not looking forward to the first support request we get for problems related to large transactions. This change does more than enable transactions that would previously fail: it also changes some existing transactions (with fewer than 100k intents but whose keys add up to more than 256K) to use range intent resolutions instead of point resolutions. What will the consequences of this be? How will we test the behavior of large transactions and their impact on other traffic? What kinds of workloads do we need to be concerned about?
They shouldn't be "barred" from using a large transaction, but they also shouldn't be "barred" from using correlated subqueries, or complex joins, or updateable views, or from having tools to diagnose what's going on in their cluster. A mature SQL database has an enormous feature set. We only cover part of that, and of the part that we cover there are things that don't work as well as they should. At our current stage of development, expanding the set of things that are "supported" but don't work as well as they should would do more harm than good. I fear that large transactions would fall into this category.
As far as I know no one has put together a complete example yet, but FK validations do their reads at the OrigTimestamp, not the commit timestamp, so it should be straightforward.
@bdarnell thank you for the example. Yes that is indeed an example of write skew. It's really a shame that we didn't implement the @tschottdorf in my testing, where I modified the kv loadgen to generate huge transactions containing many batches of upserts, using big keys, not using a time bounded iterator shows a throughput curve that slopes down and to the right, and then bottoms out into a trickle. With the time bounded iterator, things actually look pretty good by comparison. I think it's necessary if we accept the goal is to make these kinds of transactions feasible. I'm going to work on an addendum to this PR which will allow Review status: 0 of 28 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed. pkg/kv/txn_coord_sender.go, line 996 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Because I don't want the memory on the coordinator exploding either. This came up in testing. pkg/roachpb/data.go, line 927 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/roachpb/data.go, line 1299 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
I've added a comment and made changes to my call site, because I think local spans should not be combined, period. pkg/storage/gc_queue_test.go, line 742 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/intent_resolver.go, line 653 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. Comments from Reviewable |
210a39a force-pushed to 5a8f933
The concept is straightforward: store read spans encountered during the course of executing a transaction, up to a maximum limit of bytes. Upon encountering a transaction retry error with `reason=RETRY_SERIALIZABLE`, attempt to "update" all of the read spans in order to avoid a full txn restart. For the case of write-only SERIALIZABLE transactions, we're now able to set the original timestamp equal to the commit timestamp before committing, entirely circumventing the possibility of a serializable restart. Otherwise, if all read spans can be updated, the original timestamp is moved to the commit timestamp (without a transaction restart), and the `EndTransaction` batch is re-submitted.

`UpdateRead` is a new KV RPC which re-reads a span, updating the timestamp cache and checking whether any more recent writes occurred to the span. It uses a time-bounded iterator for efficiency. Any writes or intents which have appeared between the transaction's original timestamp and its commit timestamp result in an error.

This change should allow long-lived SERIALIZABLE transactions in the same style as SNAPSHOT, and should decrease the occurrence of transaction retry errors which currently must be handled by client code. See cockroachdb#21078

Release note: None
Writing an intent while doing reads for foreign keys would be incredibly expensive and inhibit concurrency. This would almost certainly be worse for performance than simply running in SERIALIZABLE. OTOH, if we wanted to use a KV operation internally that both reads a value and writes an intent, we could do that for internal use without exposing it as Anyway, I'm going to lay off this PR for a while and focus on #21056 and #21140. All of these PRs have some amount of risk, but I think the upside is much higher with the other two so I'm more than willing to work through the complexity there. |
5a8f933 force-pushed to d8a113f
`(*rocksDBBatch).NewTimeboundedIterator` was caching the iterator in a way that would return the iterator in future calls to `(*rocksDBBatch).NewIterator`, which was a correctness problem. Luckily, the fallout was somewhat limited by the fact that we only use time bounded iterators when resolving an intent, and that is usually (but not always) the last operation carried out on the batch. This was introduced in cockroachdb#21078, which landed around Jan 15 2018, so the bug has not made it into a release yet, though we'll have to bump our alpha (cockroachdb#21657). Release note: None Fixes cockroachdb#21689.
The concept is straightforward: store spans encountered during the course of executing a transaction, up to a maximum limit of bytes. Upon encountering a transaction retry error with `reason=RETRY_SERIALIZABLE`, attempt to "update" all of the spans in order to avoid a full txn restart. For the case of write-only SERIALIZABLE transactions, we're now able to set the original timestamp equal to the commit timestamp before committing, entirely circumventing the possibility of a serializable restart. Otherwise, if all spans can be updated, the original timestamp is moved to the commit timestamp (without a transaction restart), and the `EndTransaction` batch is re-submitted.

`UpdateSpan` is a new KV RPC which re-reads a span, updating the timestamp cache and checking whether any more recent writes occurred to the span. It uses a time-bounded iterator for efficiency. Any writes or intents which have appeared between the transaction's original timestamp and its commit timestamp result in an error. Note that `UpdateSpan` has a bool indicating whether it should update the read or write timestamp cache. Only `DeleteRange` creates update spans which need to update the write timestamp cache.

This change should allow long-lived SERIALIZABLE transactions in the same style as SNAPSHOT, and should decrease the occurrence of transaction retry errors which currently must be handled by client code. See cockroachdb#21078

Release note (sql): significantly reduce the likelihood of serializable restarts seen by clients due to concurrent workloads.
The concept is straightforward: store spans encountered during the course of executing a transaction, up to a maximum limit of bytes. Upon encountering a transaction retry error with `reason=RETRY_SERIALIZABLE`, attempt to "refresh" all of the spans in order to avoid a full txn restart. For the case of write-only SERIALIZABLE transactions, we're now able to set the original timestamp equal to the commit timestamp before committing, entirely circumventing the possibility of a serializable restart. Otherwise, if all spans can be updated, the original timestamp is moved to the commit timestamp (without a transaction restart), and the `EndTransaction` batch is re-submitted.

`Refresh(Range)` is a new KV RPC which re-reads a span, updating the timestamp cache and checking whether any more recent writes occurred to the span. It uses a time-bounded iterator for efficiency. Any writes or intents which have appeared between the transaction's original timestamp and its commit timestamp result in an error. Note that `Refresh(Range)` has a bool indicating whether it should update the read or write timestamp cache. Only `DeleteRange` creates update spans which need to update the write timestamp cache.

This change should allow long-lived SERIALIZABLE transactions in the same style as SNAPSHOT, and should decrease the occurrence of transaction retry errors which currently must be handled by client code. See cockroachdb#21078

Release note (sql): significantly reduce the likelihood of serializable restarts seen by clients due to concurrent workloads.
Follow up to cockroachdb#23449. This change adds a variant to the roachtest `copy` test that runs the paginated copy from one table to another inside of a transaction. The changes made in cockroachdb#21078 allow this to work even with millions of rows in the source table. Release note: None
Before this change, each txnReqInterceptor held a reference to its parent TxnCoordSender's TxnCoordMeta. Each interceptor would update its individual part of the meta separately (under lock), but a TxnCoordSender could always reach into its own meta and perform any modifications it wanted. As such, there wasn't a hard abstraction boundary. An example of how this abstraction was broken was `tryAsyncAbort`, which used to pull the intents out from under a txnIntentCollector. This change removes this shared state. `txnReqInterceptors` now hold on to their own state and this is only accessible through new `populateMetaLocked` and `augmentMetaLocked` methods. Because of this change, `TxnCoordSender` no longer has any interest or knowledge about intents or refresh spans. These concepts are all completely contained in their respective interceptors. The change also moves away from holding all state in a `TxnCoordMeta` and allowing this state to be pulled out at any time. This is an important change as we look towards the `txnPipeliner` interceptor because that will hold more complex data structures of in-flight writes. These writes will need to be serialized into a `TxnCoordMeta` when creating LeafTxn coordinators, but this should be on-demand (see `populateMetaLocked`). A similar change can now be made in `txnIntentCollector` along the lines of this [suggestion](cockroachdb#21078 (comment)). I added a TODO to explore. This change also simplifies the handling of concurrent requests and their effect on refresh spans. In doing so, it revealed a race where the meta's Txn.RefreshedTimestamp was only updated under lock before performing the span refresh if a prefix of a batch was executed. If not (`if br == nil {` in `maybeRetrySend`) then spans could be "missed" by the refresh. This was subtle so this commit fixed the bug and made the race prevention more explicit. Release note: None
Instead of enforcing a limit of 100,000 keys, this change moves to keep the
total size of the transaction record under a new constant set in
base.ChunkRaftCommandThresholdBytes, currently defaulting to 256 KiB. This
same value is used by the gc queue to chunk gc key requests.
In the txn coordinator sender, we now condense affected key spans when they
in aggregate exceed the threshold. Condensing is done by sorting all the
existing key spans and then bucketing them by range using a range iterator.
We then sort the buckets from largest to smallest and condense buckets in
that order until the total size of the key spans is reduced to less than the
threshold for the transaction record size.
In order to make the intent resolution process more palatable now that it may
involve full-range scans, a time-bounded iterator is used when resolving
write intents for a range. Small changes were necessary to enable the use
of time bounded iterators with rocksdb batches. The more difficult re-org
was in making sure that we always have the final transaction record when
cleaning up intents for a transaction. This allows us to precisely set
the time bounds on the time-bounded iterator, specifically to
[txn.EpochZeroTimestamp, txn.Timestamp]. You'll notice the
EpochZeroTimestamp field. That has been added to the transaction, and
is only set when a txn is restarted, so that we can keep track of the
earliest timestamp the transaction might have written at.
Further changes have been made when resolving local intents to track an
"allowance" of resolves so we don't overload the
EndTransaction batch, and to the intent resolver to more carefully send spanned intent resolves.
The intent resolver command now fully supports the
MaxKeys parameter by returning a resume span.
Fixes #15849
Release note (sql change): Allow arbitrarily large transactions. There was
formerly a limit of 100,000 keys.