sql: fix internal executor usage in leaf txn by andreimatei · Pull Request #45966 · cockroachdb/cockroach

andreimatei · 2020-03-10T20:41:21Z

The internal executor was setting a priority on a leaf txn, which is not
allowed. It was supposed to be a no-op since the priority was the same
as the existing one, but it crashed nonetheless. This patch makes the
no-op even more no-op.

Fixes #45924

Release note: None

Release justification: crash fix

cockroach-teamcity · 2020-03-10T20:41:28Z

This change is

andreimatei · 2020-03-10T20:42:57Z

@jordanlewis pretty please find me a volunteer to write a test for this. We need something that uses the internal executor on remote notes, through DistSQL. I tried simple stuff in the vein of what SQLSmith did in the linked issue, but couldn't repro immediately and I'd rather pass it to someone else.

andreimatei · 2020-04-15T15:54:28Z

SQLSmith hit this again.
Jordan, help me out with testing this.

The query that SQLSmith says it ran, I think, is:

        sqlsmith.go:169,sqlsmith.go:199,test_runner.go:753: error: pq: internal error: "$0"::GEOMETRY: invalid cast: unknown -> geometry
                stmt:
                WITH
                        with_32520 (col_220310)
                                AS (
                                        SELECT
                                                *
                                        FROM
                                                (
                                                        VALUES
                                                                ('0101000000000000000000F03F000000000000F03F':::GEOMETRY),
                                                                ('0101000000000000000000F03F000000000000F03F':::GEOMETRY),
                                                                ('0101000000000000000000F03F000000000000F03F':::GEOMETRY)
                                                )
                                                        AS tab_95769 (col_220310)
                                        EXCEPT
                                                SELECT * FROM (VALUES (NULL)) AS tab_95770 (col_220311)
                                                INTERSECT ALL SELECT * FROM (VALUES (NULL)) AS tab_95771 (col_220312)
                                ),
                        with_32521 (col_220313, col_220314)
                                AS (
                                        SELECT
                                                *
                                        FROM
                                                (
                                                        VALUES
                                                                (NULL, 'da58c55f-7009-4a6b-ba23-0350742084d8':::UUID),
                                                                (7656877633584715458:::INT8, '5f6cb3d7-aa75-4649-83bf-b79d222aea38':::UUID),
                                                                ((-2693926275475684144):::INT8, '6b330e4a-bf91-4a5f-854d-b1ddc8712721':::UUID)
                                                )
                                                        AS tab_95772 (col_220313, col_220314)
                                )
                SELECT
                        tab_95773.col1_4 AS col_220315, tab_95773.col1_8 AS col_220316
                FROM
                        defaultdb.public.table1@table1_col1_9_col1_11_col1_6_col1_12_col1_3_col1_5_col1_2_col1_0_col1_8_col1_7_col1_1_col1_4_key
                                AS tab_95773,
                        with_32520 AS cte_ref_9570,
                        defaultdb.public.table2@[0] AS tab_95774
                WHERE
                        true
                LIMIT
                        4:::INT8;

andreimatei · 2020-08-21T22:44:13Z

friendly ping

asubiotto · 2020-09-09T14:20:04Z

Added a test that reproduces the panic deterministically on master. However, it seems like calls to StartServer hang and timeout while ensuring migrations through the internal executor. I'm not sure from a first glance how that has been caused. @andreimatei could you take a look? Otherwise I'll come back to this later.

asubiotto · 2020-09-11T10:21:25Z

Fixed the hang which was just a double mutex acquisition with the new code. I'm now running into:

panic: txn.ConfigureStepping() only allowed in RootTxn

goroutine 1168 [running]:
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).ConfigureStepping(0xc0018d2630, 0x9065260, 0xc000ad1bc0, 0x9083c01, 0xc002132600)
	/Users/asubiotto/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:1262 +0x177
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState(0xc001ac5000, 0x9065260, 0xc000ad1bc0, 0x9083ca0, 0xc002132640, 0x847d134, 0x8, 0x0, 0x0, 0x0, ...)
	/Users/asubiotto/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:575 +0xa05
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt(0xc001ac5000, 0x9065260, 0xc000ad1bc0, 0x9083ca0, 0xc002132640, 0x847d134, 0x8, 0x0, 0x0, 0x0, ...)
	/Users/asubiotto/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:116 +0x9e8
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd(0xc001ac5000, 0x90651a0, 0xc0022b6900, 0x0, 0x0)
	/Users/asubiotto/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1459 +0x1c34
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(0xc001ac5000, 0x9065260, 0xc000ad1860, 0xc000a52640, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/Users/asubiotto/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1388 +0x1fc
github.com/cockroachdb/cockroach/pkg/sql.(*InternalExecutor).initConnEx.func1(0xc001ac5000, 0x9065260, 0xc000ad1860, 0xc000d4d260, 0xc001c6ef00, 0xc0018d2630, 0xc0020863f0)
	/Users/asubiotto/go/src/github.com/cockroachdb/cockroach/pkg/sql/internal.go:175 +0x65
created by github.com/cockroachdb/cockroach/pkg/sql.(*InternalExecutor).initConnEx
	/Users/asubiotto/go/src/github.com/cockroachdb/cockroach/pkg/sql/internal.go:174 +0x325

While running the test. I added a check to only configure stepping and step if the txn is not a leaf txn as well as one other check in conn_executor_exec.go but :ihavenoideawhatimdoing:

This is now RFAL @andreimatei PTAL

andreimatei

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @asubiotto)

pkg/sql/conn_executor_exec.go, line 561 at r1 (raw file):

	// here.)

	// We first ensure stepping mode is enabled if the txn is not a leaf txn.

Checking the txn type in the connEx seems wrong to me. I'd rather have a field on the connEx that explicitly disables stepping, set it in newConnExWithTxn regardless of leaf vs root txn, and hang a comment there saying that internal executors just don't get to have stepping and thus shouldn't be used for queries where the read set overlaps the write set.
But rather, I'm not actually sure why we don't support stepping on leaves. @knz do you remember?

pkg/sql/conn_executor_exec.go, line 575 at r1 (raw file):

	// single/common function. That would be where the stepping mode
	// gets enabled once for all SQL statements executed "underneath".
	if ex.state.mu.txn.Type() != kv.LeafTxn {

nit: switch this to == RootTxn

pkg/sql/conn_executor_exec.go, line 620 at r1 (raw file):

	txn := ex.state.mu.txn

	if !os.ImplicitTxn.Get() && (txn.Type() != kv.LeafTxn && txn.IsSerializablePushAndRefreshNotPossible()) {

Why was this necessary exactly?
I would hope that canAutoRetry is always false below when the connEx was built with newConnExecutorWithTxn, but after looking for a minute I can't say what would make it so.

knz

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @asubiotto)

pkg/sql/conn_executor_exec.go, line 561 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Checking the txn type in the connEx seems wrong to me. I'd rather have a field on the connEx that explicitly disables stepping, set it in newConnExWithTxn regardless of leaf vs root txn, and hang a comment there saying that internal executors just don't get to have stepping and thus shouldn't be used for queries where the read set overlaps the write set.
But rather, I'm not actually sure why we don't support stepping on leaves. @knz do you remember?

A couple of separate things

stepping in the leaf is non-sensical, because the stepping increases the sequence number and the sequence number must be monotonic across all leafs (and the root). Since we don't mutually synchronize all the leaves with each other, the monotonic behavior is impossible to implement.
internal executors should be able to do stepping, as stepping is required to execute single statements (e.g. a mutation with a cascade needs two steps)

andreimatei

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @asubiotto, and @knz)

pkg/sql/conn_executor_exec.go, line 561 at r1 (raw file):

internal executors should be able to do stepping, as stepping is required to execute single statements (e.g. a mutation with a cascade needs two steps)

Well for whatever reason (I guess some class of crdb_internal functions or casts?) we seem to need internal executors in leaf txns, so I guess these guys don't actually need any stepping since they're not doing mutations.

stepping in the leaf is non-sensical, because the stepping increases the sequence number and the sequence number must be monotonic across all leafs (and the root). Since we don't mutually synchronize all the leaves with each other, the monotonic behavior is impossible to implement.

Well I was thinking we just make all the stepping-stuff no-ops on the argument that no mutations are going on concurrently. I was thinking I'd prefer that over testing for the txn type here as this patch does.

asubiotto

TFTR

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @asubiotto)

pkg/sql/conn_executor_exec.go, line 561 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

internal executors should be able to do stepping, as stepping is required to execute single statements (e.g. a mutation with a cascade needs two steps)

Well for whatever reason (I guess some class of crdb_internal functions or casts?) we seem to need internal executors in leaf txns, so I guess these guys don't actually need any stepping since they're not doing mutations.

stepping in the leaf is non-sensical, because the stepping increases the sequence number and the sequence number must be monotonic across all leafs (and the root). Since we don't mutually synchronize all the leaves with each other, the monotonic behavior is impossible to implement.

Well I was thinking we just make all the stepping-stuff no-ops on the argument that no mutations are going on concurrently. I was thinking I'd prefer that over testing for the txn type here as this patch does.

Do we really want to make stepping stuff no-ops? It seems weird for me that we would allow a caller to call ConfigureStepping for example and not panic or return an error (as a side note, we probably should never panic in this case) which could lead the caller to mistakenly believe the operation was successful. I added an explicit field on connEx

pkg/sql/conn_executor_exec.go, line 575 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: switch this to == RootTxn

Switched to check the connexecutor boolean

pkg/sql/conn_executor_exec.go, line 620 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Why was this necessary exactly?
I would hope that canAutoRetry is always false below when the connEx was built with newConnExecutorWithTxn, but after looking for a minute I can't say what would make it so.

Because IsSerializablePushAndRefreshNotPossible would panic if called on a leaf txn.

andreimatei

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @asubiotto, and @knz)

pkg/sql/conn_executor_exec.go, line 561 at r1 (raw file):

stepping in the leaf is non-sensical, because the stepping increases the sequence number and the sequence number must be monotonic across all leafs (and the root).

It doesn't... Does it? Stepping only records the current sequence number, it doesn't change it. So now I again think that the code would just work on leaves. There'd be nothing about seq nums to communicate back to root because the leaf is by definition r/o so the seq nums haven't advanced. The only information about stepping that'd get lost between leaf and root is the enablement of the stepping on the leaf, but I don't think that's a big deal.

But this patch seems pretty smelly to me now. I think it's quite unfortunate for the connEx to look at what type of transaction it's been given and change its behavior accordingly. Introducing the new transactionSteppingDisabled field did not make it any better - I was hoping we'd find a higher-level concept to latch the change on behavior on (like whether the session is read-only or read-write). I'm thinking the statements being run either need the stepping, in which case it's incorrect to not enable it, or don't in which case I'm hoping the txn code would just work fine and naturally be a no-op.

It also bothers me that further down we check this transactionSteppingDisabled for something that has nothing to do with stepping. I think we can get IsSerializablePushAndRefreshNotPossible to do the right thing one way or another again with no conditional logic in the connEx.

I'd like to make sure that we actually need to work on this. I guess the connEx was never meant to be used with a leaf (or more like the other way around - the leag was never meant to serve more then the specific needs of the TableReaders, not the much broader surface on the connEx). How exactly does the connEx end up being used in a leaf? Does it end up being used on a remote node, or is it happening on the gateway? Is it about casts that need to use it?

pkg/sql/conn_executor_exec.go, line 561 at r2 (raw file):

	// here.)

	// We first ensure stepping mode is enabled if the txn is not a leaf txn.

the new comment is now stale; I'd remove it.

asubiotto

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @asubiotto)

pkg/sql/conn_executor_exec.go, line 561 at r1 (raw file):

I'd like to make sure that we actually need to work on this. I guess the connEx was never meant to be used with a leaf (or more like the other way around - the leag was never meant to serve more then the specific needs of the TableReaders, not the much broader surface on the connEx). How exactly does the connEx end up being used in a leaf? Does it end up being used on a remote node, or is it happening on the gateway? Is it about casts that need to use it?

Based on the stack trace in #45924 (comment), this is executed on a remote node to perform a cast. The query in question is retrieving an oid (using queryOidWithJoin). I think disallowing internal executor usage in leaf txns is a non-starter so we have to find a way to make this work.

But this patch seems pretty smelly to me now. I think it's quite unfortunate for the connEx to look at what type of transaction it's been given and change its behavior accordingly. Introducing the new transactionSteppingDisabled field did not make it any better - I was hoping we'd find a higher-level concept to latch the change on behavior on (like whether the session is read-only or read-write). I'm thinking the statements being run either need the stepping, in which case it's incorrect to not enable it, or don't in which case I'm hoping the txn code would just work fine and naturally be a no-op.

I'm not sure if there is a higher-level concept, is there? An internal executor is created without any specific settings. I agree that it's not perfect, but I think we should move forward with this patch because it fixes the immediate panic and is self-contained (we need to backport to release-20.2) and work on improving this situation down the line.

It also bothers me that further down we check this transactionSteppingDisabled for something that has nothing to do with stepping. I think we can get IsSerializablePushAndRefreshNotPossible to do the right thing one way or another again with no conditional logic in the connEx.

I'm happy to change that but I need your input. How do I implement IsSerializablePushAndRefreshNotPossible for leaf txns? Just remove the check or return false unconditionally? If we remove this, we'll just have a single check for a leaf txn (I assume I might as well remove the boolean) which I think is acceptable.

andreimatei

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @asubiotto, and @knz)

pkg/sql/conn_executor_exec.go, line 561 at r1 (raw file):

I'm not sure if there is a higher-level concept, is there? An internal executor is created without any specific settings.

I think the TxnCoordSender code for ConfigureStepping and Step will just work and naturally be a no-op when executing read-only statements (which are the only ones that can be executed in a leaf). So I believe we can just remove the respective assertions in those methods that we're not in a leaf. At least I don't see a reason to not do that.
For sanity, I would also try to make newConnExecutorWithTxn set the transaction_read_only session variable on the executor's session.

I'm happy to change that but I need your input. How do I implement IsSerializablePushAndRefreshNotPossible for leaf txns?

My hope here too is that we don't actually need any conditional logic anywhere. I think IsSerializablePushAndRefreshNotPossible works just fine in leaves; I think we can just remove the assertion that it's in a root. Rafa introduced that assertion, but I'm not entirely sure if he had a particular reason for it (although I don't blame him for adding it, I think it's sane to limit what can be called on leaves until it turns out there's a reason do allow it).
I would hope that the canAutoRetry variable below ends up always being false below in executors created with newConnExecutorWithTxn, regardless of whether we're in a root or a leaf. I looked briefly and couldn't convince myself that that's the case, but I think that's the sane thing to do - when the connEx is running in a higher level txn, it doesn't have information about how far back it can rewind, so I hope that the way in which we set it up makes that work. Maybe this line does it?

cockroach/pkg/sql/internal.go

Line 122 in 1d38d0b

lastDelivered: -1,

asubiotto

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @asubiotto)

pkg/sql/conn_executor_exec.go, line 561 at r1 (raw file):

For sanity, I would also try to make newConnExecutorWithTxn set the transaction_read_only session variable on the executor's session.

I don't understand this. What guarantees do we have that it's a read-only txn? We still want to support creating internal executors with a root txn that performs mutations.

I removed the assertions and the checks and my unit test runs fine. Let's see what CI says about it.

andreimatei

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @asubiotto, and @knz)

pkg/sql/conn_executor_exec.go, line 561 at r1 (raw file):

I don't understand this. What guarantees do we have that it's a read-only txn? We still want to support creating internal executors with a root txn that performs mutations.

Sorry, I meant do that if the txn is a leaf (I think looking at the leaf vs root in the ctor is fine).

asubiotto

Alright, I think this is ready for a final look.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @asubiotto)

pkg/sql/conn_executor_exec.go, line 561 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I don't understand this. What guarantees do we have that it's a read-only txn? We still want to support creating internal executors with a root txn that performs mutations.

Sorry, I meant do that if the txn is a leaf (I think looking at the leaf vs root in the ctor is fine).

👍 Done.

pkg/sql/conn_executor_exec.go, line 561 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

the new comment is now stale; I'd remove it.

Done.

andreimatei

LGTM but let's give @knz a chance to look this over.

The commit message is no longer complete; say something more broadly about internal executors in leaves. Feel free to take authorship too.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @asubiotto)

pkg/sql/conn_executor_internal_test.go, line 380 at r3 (raw file):

}

func TestInternalExecutorInLeafTxnDoesNotPanic(t *testing.T) {

if this doesn't need to be in the sql package, move it to internal_test.go; that's where most of the tests are.

asubiotto · 2020-09-17T17:42:41Z

Reworked commit message and moved test.

knz

modulo two polish items see below

Reviewed 1 of 5 files at r1, 1 of 3 files at r2, 4 of 4 files at r3, 2 of 2 files at r4.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)

pkg/sql/conn_executor.go, line 698 at r4 (raw file):

) *connExecutor {
	ex := s.newConnExecutor(ctx, sd, sdDefaults, stmtBuf, clientComm, memMetrics, srvMetrics, appStats)
	if txn.Type() == kv.LeafTxn {

Can you add an explanatory comment here that motivates this piece of code, for the benefit of the future maintainer.

pkg/sql/txn_state.go, line 217 at r4 (raw file):

	} else {
		if priority != roachpb.UnspecifiedUserPriority {
			panic(fmt.Sprintf("unexpected priority when using an existing txn: %s", priority))

panic(errors.AssertionFailedf(...)) please

Internal executors use a conn executor under the hood, which sets a priority and configures stepping on the transaction it is provided with. Both operations would previously cause panics when used with a leaf transaction. This commit reworks setting transaction priority on leaf transactions to avoid crashing when the new priority is the same as the old one, and allows stepping to be configured on a leaf transaction since doing so is a noop. Fixes cockroachdb#45924 Release note: None Release justification: crash fix

asubiotto

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @knz)

pkg/sql/conn_executor.go, line 698 at r4 (raw file):

Previously, knz (kena) wrote…

Can you add an explanatory comment here that motivates this piece of code, for the benefit of the future maintainer.

Done.

pkg/sql/txn_state.go, line 217 at r4 (raw file):

Previously, knz (kena) wrote…

panic(errors.AssertionFailedf(...)) please

Done.

asubiotto · 2020-09-18T14:56:09Z

TFTR

bors r=knz,andreimatei

craig · 2020-09-18T16:06:46Z

Build succeeded:

GitHub CI (Cockroach)

asubiotto requested a review from jordanlewis September 1, 2020 16:46

asubiotto self-assigned this Sep 1, 2020

asubiotto removed the request for review from jordanlewis September 1, 2020 16:47

asubiotto force-pushed the sql.fix-txn-priority branch from 611b4b6 to 56ad76c Compare September 9, 2020 14:18

asubiotto force-pushed the sql.fix-txn-priority branch from 56ad76c to 94aca84 Compare September 11, 2020 10:20

asubiotto assigned andreimatei Sep 11, 2020

andreimatei commented Sep 14, 2020

View reviewed changes

knz reviewed Sep 14, 2020

View reviewed changes

andreimatei commented Sep 14, 2020

View reviewed changes

asubiotto reviewed Sep 15, 2020

View reviewed changes

asubiotto force-pushed the sql.fix-txn-priority branch from 94aca84 to 9af3b93 Compare September 15, 2020 10:23

andreimatei commented Sep 15, 2020

View reviewed changes

asubiotto reviewed Sep 16, 2020

View reviewed changes

andreimatei commented Sep 16, 2020

View reviewed changes

asubiotto force-pushed the sql.fix-txn-priority branch from 9af3b93 to 1e73ca1 Compare September 16, 2020 15:03

asubiotto reviewed Sep 16, 2020

View reviewed changes

andreimatei commented Sep 16, 2020

View reviewed changes

asubiotto force-pushed the sql.fix-txn-priority branch from 1e73ca1 to 7209aca Compare September 17, 2020 09:15

asubiotto reviewed Sep 17, 2020

View reviewed changes

andreimatei commented Sep 17, 2020

View reviewed changes

asubiotto force-pushed the sql.fix-txn-priority branch from 7209aca to 9a3d56d Compare September 17, 2020 17:42

knz approved these changes Sep 17, 2020

View reviewed changes

asubiotto force-pushed the sql.fix-txn-priority branch from 9a3d56d to bd255e5 Compare September 18, 2020 06:50

asubiotto reviewed Sep 18, 2020

View reviewed changes

craig bot merged commit 9787fcd into cockroachdb:master Sep 18, 2020

asubiotto mentioned this pull request Sep 21, 2020

release-20.2: sql: fix internal executor usage in leaf txn #54613

Merged

andreimatei deleted the sql.fix-txn-priority branch January 21, 2022 17:59

Conversation

andreimatei commented Mar 10, 2020

Uh oh!

cockroach-teamcity commented Mar 10, 2020

Uh oh!

andreimatei commented Mar 10, 2020

Uh oh!

andreimatei commented Apr 15, 2020

Uh oh!

andreimatei commented Aug 21, 2020

Uh oh!

asubiotto commented Sep 9, 2020

Uh oh!

asubiotto commented Sep 11, 2020

Uh oh!

andreimatei left a comment

Choose a reason for hiding this comment

Uh oh!

knz left a comment

Choose a reason for hiding this comment

Uh oh!

andreimatei left a comment

Choose a reason for hiding this comment

Uh oh!

asubiotto left a comment

Choose a reason for hiding this comment

Uh oh!

andreimatei left a comment

Choose a reason for hiding this comment

Uh oh!

asubiotto left a comment

Choose a reason for hiding this comment

Uh oh!

andreimatei left a comment

Choose a reason for hiding this comment

Uh oh!

asubiotto left a comment

Choose a reason for hiding this comment

Uh oh!

andreimatei left a comment

Choose a reason for hiding this comment

Uh oh!

asubiotto left a comment

Choose a reason for hiding this comment

Uh oh!

andreimatei left a comment

Choose a reason for hiding this comment

Uh oh!

asubiotto commented Sep 17, 2020

Uh oh!

knz left a comment

Choose a reason for hiding this comment

Uh oh!

asubiotto left a comment

Choose a reason for hiding this comment

Uh oh!

asubiotto commented Sep 18, 2020

Uh oh!

craig bot commented Sep 18, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants