Skip to content

sql: fix internal executor usage in leaf txn#45966

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
andreimatei:sql.fix-txn-priority
Sep 18, 2020
Merged

sql: fix internal executor usage in leaf txn#45966
craig[bot] merged 1 commit intocockroachdb:masterfrom
andreimatei:sql.fix-txn-priority

Conversation

@andreimatei
Copy link
Copy Markdown
Contributor

The internal executor was setting a priority on a leaf txn, which is not
allowed. It was supposed to be a no-op since the priority was the same
as the existing one, but it crashed nonetheless. This patch makes the
no-op even more no-op.

Fixes #45924

Release note: None

Release justification: crash fix

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@andreimatei
Copy link
Copy Markdown
Contributor Author

@jordanlewis pretty please find me a volunteer to write a test for this. We need something that uses the internal executor on remote notes, through DistSQL. I tried simple stuff in the vein of what SQLSmith did in the linked issue, but couldn't repro immediately and I'd rather pass it to someone else.

@andreimatei
Copy link
Copy Markdown
Contributor Author

SQLSmith hit this again.
Jordan, help me out with testing this.

The query that SQLSmith says it ran, I think, is:

        sqlsmith.go:169,sqlsmith.go:199,test_runner.go:753: error: pq: internal error: "$0"::GEOMETRY: invalid cast: unknown -> geometry
                stmt:
                WITH
                        with_32520 (col_220310)
                                AS (
                                        SELECT
                                                *
                                        FROM
                                                (
                                                        VALUES
                                                                ('0101000000000000000000F03F000000000000F03F':::GEOMETRY),
                                                                ('0101000000000000000000F03F000000000000F03F':::GEOMETRY),
                                                                ('0101000000000000000000F03F000000000000F03F':::GEOMETRY)
                                                )
                                                        AS tab_95769 (col_220310)
                                        EXCEPT
                                                SELECT * FROM (VALUES (NULL)) AS tab_95770 (col_220311)
                                                INTERSECT ALL SELECT * FROM (VALUES (NULL)) AS tab_95771 (col_220312)
                                ),
                        with_32521 (col_220313, col_220314)
                                AS (
                                        SELECT
                                                *
                                        FROM
                                                (
                                                        VALUES
                                                                (NULL, 'da58c55f-7009-4a6b-ba23-0350742084d8':::UUID),
                                                                (7656877633584715458:::INT8, '5f6cb3d7-aa75-4649-83bf-b79d222aea38':::UUID),
                                                                ((-2693926275475684144):::INT8, '6b330e4a-bf91-4a5f-854d-b1ddc8712721':::UUID)
                                                )
                                                        AS tab_95772 (col_220313, col_220314)
                                )
                SELECT
                        tab_95773.col1_4 AS col_220315, tab_95773.col1_8 AS col_220316
                FROM
                        defaultdb.public.table1@table1_col1_9_col1_11_col1_6_col1_12_col1_3_col1_5_col1_2_col1_0_col1_8_col1_7_col1_1_col1_4_key
                                AS tab_95773,
                        with_32520 AS cte_ref_9570,
                        defaultdb.public.table2@[0] AS tab_95774
                WHERE
                        true
                LIMIT
                        4:::INT8;

@andreimatei
Copy link
Copy Markdown
Contributor Author

friendly ping

@asubiotto asubiotto self-assigned this Sep 1, 2020
@asubiotto asubiotto removed the request for review from jordanlewis September 1, 2020 16:47
@asubiotto asubiotto force-pushed the sql.fix-txn-priority branch from 611b4b6 to 56ad76c Compare September 9, 2020 14:18
@asubiotto
Copy link
Copy Markdown
Contributor

Added a test that reproduces the panic deterministically on master. However, it seems like calls to StartServer hang and timeout while ensuring migrations through the internal executor. I'm not sure from a first glance how that has been caused. @andreimatei could you take a look? Otherwise I'll come back to this later.

@asubiotto
Copy link
Copy Markdown
Contributor

Fixed the hang which was just a double mutex acquisition with the new code. I'm now running into:

panic: txn.ConfigureStepping() only allowed in RootTxn

goroutine 1168 [running]:
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).ConfigureStepping(0xc0018d2630, 0x9065260, 0xc000ad1bc0, 0x9083c01, 0xc002132600)
	/Users/asubiotto/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:1262 +0x177
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState(0xc001ac5000, 0x9065260, 0xc000ad1bc0, 0x9083ca0, 0xc002132640, 0x847d134, 0x8, 0x0, 0x0, 0x0, ...)
	/Users/asubiotto/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:575 +0xa05
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt(0xc001ac5000, 0x9065260, 0xc000ad1bc0, 0x9083ca0, 0xc002132640, 0x847d134, 0x8, 0x0, 0x0, 0x0, ...)
	/Users/asubiotto/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:116 +0x9e8
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd(0xc001ac5000, 0x90651a0, 0xc0022b6900, 0x0, 0x0)
	/Users/asubiotto/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1459 +0x1c34
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(0xc001ac5000, 0x9065260, 0xc000ad1860, 0xc000a52640, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/Users/asubiotto/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1388 +0x1fc
github.com/cockroachdb/cockroach/pkg/sql.(*InternalExecutor).initConnEx.func1(0xc001ac5000, 0x9065260, 0xc000ad1860, 0xc000d4d260, 0xc001c6ef00, 0xc0018d2630, 0xc0020863f0)
	/Users/asubiotto/go/src/github.com/cockroachdb/cockroach/pkg/sql/internal.go:175 +0x65
created by github.com/cockroachdb/cockroach/pkg/sql.(*InternalExecutor).initConnEx
	/Users/asubiotto/go/src/github.com/cockroachdb/cockroach/pkg/sql/internal.go:174 +0x325

While running the test. I added a check to only configure stepping and step if the txn is not a leaf txn as well as one other check in conn_executor_exec.go but :ihavenoideawhatimdoing:

This is now RFAL @andreimatei PTAL

Copy link
Copy Markdown
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @asubiotto)


pkg/sql/conn_executor_exec.go, line 561 at r1 (raw file):

	// here.)

	// We first ensure stepping mode is enabled if the txn is not a leaf txn.

Checking the txn type in the connEx seems wrong to me. I'd rather have a field on the connEx that explicitly disables stepping, set it in newConnExWithTxn regardless of leaf vs root txn, and hang a comment there saying that internal executors just don't get to have stepping and thus shouldn't be used for queries where the read set overlaps the write set.
But rather, I'm not actually sure why we don't support stepping on leaves. @knz do you remember?


pkg/sql/conn_executor_exec.go, line 575 at r1 (raw file):

	// single/common function. That would be where the stepping mode
	// gets enabled once for all SQL statements executed "underneath".
	if ex.state.mu.txn.Type() != kv.LeafTxn {

nit: switch this to == RootTxn


pkg/sql/conn_executor_exec.go, line 620 at r1 (raw file):

	txn := ex.state.mu.txn

	if !os.ImplicitTxn.Get() && (txn.Type() != kv.LeafTxn && txn.IsSerializablePushAndRefreshNotPossible()) {

Why was this necessary exactly?
I would hope that canAutoRetry is always false below when the connEx was built with newConnExecutorWithTxn, but after looking for a minute I can't say what would make it so.

Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @asubiotto)


pkg/sql/conn_executor_exec.go, line 561 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Checking the txn type in the connEx seems wrong to me. I'd rather have a field on the connEx that explicitly disables stepping, set it in newConnExWithTxn regardless of leaf vs root txn, and hang a comment there saying that internal executors just don't get to have stepping and thus shouldn't be used for queries where the read set overlaps the write set.
But rather, I'm not actually sure why we don't support stepping on leaves. @knz do you remember?

A couple of separate things

  • stepping in the leaf is non-sensical, because the stepping increases the sequence number and the sequence number must be monotonic across all leafs (and the root). Since we don't mutually synchronize all the leaves with each other, the monotonic behavior is impossible to implement.

  • internal executors should be able to do stepping, as stepping is required to execute single statements (e.g. a mutation with a cascade needs two steps)

Copy link
Copy Markdown
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @asubiotto, and @knz)


pkg/sql/conn_executor_exec.go, line 561 at r1 (raw file):

internal executors should be able to do stepping, as stepping is required to execute single statements (e.g. a mutation with a cascade needs two steps)

Well for whatever reason (I guess some class of crdb_internal functions or casts?) we seem to need internal executors in leaf txns, so I guess these guys don't actually need any stepping since they're not doing mutations.

stepping in the leaf is non-sensical, because the stepping increases the sequence number and the sequence number must be monotonic across all leafs (and the root). Since we don't mutually synchronize all the leaves with each other, the monotonic behavior is impossible to implement.

Well I was thinking we just make all the stepping-stuff no-ops on the argument that no mutations are going on concurrently. I was thinking I'd prefer that over testing for the txn type here as this patch does.

Copy link
Copy Markdown
Contributor

@asubiotto asubiotto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @asubiotto)


pkg/sql/conn_executor_exec.go, line 561 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

internal executors should be able to do stepping, as stepping is required to execute single statements (e.g. a mutation with a cascade needs two steps)

Well for whatever reason (I guess some class of crdb_internal functions or casts?) we seem to need internal executors in leaf txns, so I guess these guys don't actually need any stepping since they're not doing mutations.

stepping in the leaf is non-sensical, because the stepping increases the sequence number and the sequence number must be monotonic across all leafs (and the root). Since we don't mutually synchronize all the leaves with each other, the monotonic behavior is impossible to implement.

Well I was thinking we just make all the stepping-stuff no-ops on the argument that no mutations are going on concurrently. I was thinking I'd prefer that over testing for the txn type here as this patch does.

Do we really want to make stepping stuff no-ops? It seems weird for me that we would allow a caller to call ConfigureStepping for example and not panic or return an error (as a side note, we probably should never panic in this case) which could lead the caller to mistakenly believe the operation was successful. I added an explicit field on connEx


pkg/sql/conn_executor_exec.go, line 575 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: switch this to == RootTxn

Switched to check the connexecutor boolean


pkg/sql/conn_executor_exec.go, line 620 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Why was this necessary exactly?
I would hope that canAutoRetry is always false below when the connEx was built with newConnExecutorWithTxn, but after looking for a minute I can't say what would make it so.

Because IsSerializablePushAndRefreshNotPossible would panic if called on a leaf txn.

Copy link
Copy Markdown
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @asubiotto, and @knz)


pkg/sql/conn_executor_exec.go, line 561 at r1 (raw file):

stepping in the leaf is non-sensical, because the stepping increases the sequence number and the sequence number must be monotonic across all leafs (and the root).

It doesn't... Does it? Stepping only records the current sequence number, it doesn't change it. So now I again think that the code would just work on leaves. There'd be nothing about seq nums to communicate back to root because the leaf is by definition r/o so the seq nums haven't advanced. The only information about stepping that'd get lost between leaf and root is the enablement of the stepping on the leaf, but I don't think that's a big deal.


But this patch seems pretty smelly to me now. I think it's quite unfortunate for the connEx to look at what type of transaction it's been given and change its behavior accordingly. Introducing the new transactionSteppingDisabled field did not make it any better - I was hoping we'd find a higher-level concept to latch the change on behavior on (like whether the session is read-only or read-write). I'm thinking the statements being run either need the stepping, in which case it's incorrect to not enable it, or don't in which case I'm hoping the txn code would just work fine and naturally be a no-op.

It also bothers me that further down we check this transactionSteppingDisabled for something that has nothing to do with stepping. I think we can get IsSerializablePushAndRefreshNotPossible to do the right thing one way or another again with no conditional logic in the connEx.

I'd like to make sure that we actually need to work on this. I guess the connEx was never meant to be used with a leaf (or more like the other way around - the leag was never meant to serve more then the specific needs of the TableReaders, not the much broader surface on the connEx). How exactly does the connEx end up being used in a leaf? Does it end up being used on a remote node, or is it happening on the gateway? Is it about casts that need to use it?


pkg/sql/conn_executor_exec.go, line 561 at r2 (raw file):

	// here.)

	// We first ensure stepping mode is enabled if the txn is not a leaf txn.

the new comment is now stale; I'd remove it.

Copy link
Copy Markdown
Contributor

@asubiotto asubiotto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @asubiotto)


pkg/sql/conn_executor_exec.go, line 561 at r1 (raw file):

I'd like to make sure that we actually need to work on this. I guess the connEx was never meant to be used with a leaf (or more like the other way around - the leag was never meant to serve more then the specific needs of the TableReaders, not the much broader surface on the connEx). How exactly does the connEx end up being used in a leaf? Does it end up being used on a remote node, or is it happening on the gateway? Is it about casts that need to use it?

Based on the stack trace in #45924 (comment), this is executed on a remote node to perform a cast. The query in question is retrieving an oid (using queryOidWithJoin). I think disallowing internal executor usage in leaf txns is a non-starter so we have to find a way to make this work.

But this patch seems pretty smelly to me now. I think it's quite unfortunate for the connEx to look at what type of transaction it's been given and change its behavior accordingly. Introducing the new transactionSteppingDisabled field did not make it any better - I was hoping we'd find a higher-level concept to latch the change on behavior on (like whether the session is read-only or read-write). I'm thinking the statements being run either need the stepping, in which case it's incorrect to not enable it, or don't in which case I'm hoping the txn code would just work fine and naturally be a no-op.

I'm not sure if there is a higher-level concept, is there? An internal executor is created without any specific settings. I agree that it's not perfect, but I think we should move forward with this patch because it fixes the immediate panic and is self-contained (we need to backport to release-20.2) and work on improving this situation down the line.

It also bothers me that further down we check this transactionSteppingDisabled for something that has nothing to do with stepping. I think we can get IsSerializablePushAndRefreshNotPossible to do the right thing one way or another again with no conditional logic in the connEx.

I'm happy to change that but I need your input. How do I implement IsSerializablePushAndRefreshNotPossible for leaf txns? Just remove the check or return false unconditionally? If we remove this, we'll just have a single check for a leaf txn (I assume I might as well remove the boolean) which I think is acceptable.

Copy link
Copy Markdown
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @asubiotto, and @knz)


pkg/sql/conn_executor_exec.go, line 561 at r1 (raw file):

I'm not sure if there is a higher-level concept, is there? An internal executor is created without any specific settings.

I think the TxnCoordSender code for ConfigureStepping and Step will just work and naturally be a no-op when executing read-only statements (which are the only ones that can be executed in a leaf). So I believe we can just remove the respective assertions in those methods that we're not in a leaf. At least I don't see a reason to not do that.
For sanity, I would also try to make newConnExecutorWithTxn set the transaction_read_only session variable on the executor's session.

I'm happy to change that but I need your input. How do I implement IsSerializablePushAndRefreshNotPossible for leaf txns?

My hope here too is that we don't actually need any conditional logic anywhere. I think IsSerializablePushAndRefreshNotPossible works just fine in leaves; I think we can just remove the assertion that it's in a root. Rafa introduced that assertion, but I'm not entirely sure if he had a particular reason for it (although I don't blame him for adding it, I think it's sane to limit what can be called on leaves until it turns out there's a reason do allow it).
I would hope that the canAutoRetry variable below ends up always being false below in executors created with newConnExecutorWithTxn, regardless of whether we're in a root or a leaf. I looked briefly and couldn't convince myself that that's the case, but I think that's the sane thing to do - when the connEx is running in a higher level txn, it doesn't have information about how far back it can rewind, so I hope that the way in which we set it up makes that work. Maybe this line does it?

lastDelivered: -1,

Copy link
Copy Markdown
Contributor

@asubiotto asubiotto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @asubiotto)


pkg/sql/conn_executor_exec.go, line 561 at r1 (raw file):

For sanity, I would also try to make newConnExecutorWithTxn set the transaction_read_only session variable on the executor's session.

I don't understand this. What guarantees do we have that it's a read-only txn? We still want to support creating internal executors with a root txn that performs mutations.

I removed the assertions and the checks and my unit test runs fine. Let's see what CI says about it.

Copy link
Copy Markdown
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @asubiotto, and @knz)


pkg/sql/conn_executor_exec.go, line 561 at r1 (raw file):

I don't understand this. What guarantees do we have that it's a read-only txn? We still want to support creating internal executors with a root txn that performs mutations.

Sorry, I meant do that if the txn is a leaf (I think looking at the leaf vs root in the ctor is fine).

Copy link
Copy Markdown
Contributor

@asubiotto asubiotto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I think this is ready for a final look.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @asubiotto)


pkg/sql/conn_executor_exec.go, line 561 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I don't understand this. What guarantees do we have that it's a read-only txn? We still want to support creating internal executors with a root txn that performs mutations.

Sorry, I meant do that if the txn is a leaf (I think looking at the leaf vs root in the ctor is fine).

👍 Done.


pkg/sql/conn_executor_exec.go, line 561 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

the new comment is now stale; I'd remove it.

Done.

Copy link
Copy Markdown
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but let's give @knz a chance to look this over.

The commit message is no longer complete; say something more broadly about internal executors in leaves. Feel free to take authorship too.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @asubiotto)


pkg/sql/conn_executor_internal_test.go, line 380 at r3 (raw file):

}

func TestInternalExecutorInLeafTxnDoesNotPanic(t *testing.T) {

if this doesn't need to be in the sql package, move it to internal_test.go; that's where most of the tests are.

@asubiotto
Copy link
Copy Markdown
Contributor

Reworked commit message and moved test.

Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: modulo two polish items see below

Reviewed 1 of 5 files at r1, 1 of 3 files at r2, 4 of 4 files at r3, 2 of 2 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)


pkg/sql/conn_executor.go, line 698 at r4 (raw file):

) *connExecutor {
	ex := s.newConnExecutor(ctx, sd, sdDefaults, stmtBuf, clientComm, memMetrics, srvMetrics, appStats)
	if txn.Type() == kv.LeafTxn {

Can you add an explanatory comment here that motivates this piece of code, for the benefit of the future maintainer.


pkg/sql/txn_state.go, line 217 at r4 (raw file):

	} else {
		if priority != roachpb.UnspecifiedUserPriority {
			panic(fmt.Sprintf("unexpected priority when using an existing txn: %s", priority))

panic(errors.AssertionFailedf(...)) please

Internal executors use a conn executor under the hood, which sets a priority
and configures stepping on the transaction it is provided with. Both operations
would previously cause panics when used with a leaf transaction.

This commit reworks setting transaction priority on leaf transactions to avoid
crashing when the new priority is the same as the old one, and allows stepping
to be configured on a leaf transaction since doing so is a noop.

Fixes cockroachdb#45924

Release note: None

Release justification: crash fix
Copy link
Copy Markdown
Contributor

@asubiotto asubiotto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @knz)


pkg/sql/conn_executor.go, line 698 at r4 (raw file):

Previously, knz (kena) wrote…

Can you add an explanatory comment here that motivates this piece of code, for the benefit of the future maintainer.

Done.


pkg/sql/txn_state.go, line 217 at r4 (raw file):

Previously, knz (kena) wrote…

panic(errors.AssertionFailedf(...)) please

Done.

@asubiotto
Copy link
Copy Markdown
Contributor

TFTR

bors r=knz,andreimatei

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Sep 18, 2020

Build succeeded:

@craig craig bot merged commit 9787fcd into cockroachdb:master Sep 18, 2020
@andreimatei andreimatei deleted the sql.fix-txn-priority branch January 21, 2022 17:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

panic: set user priority called on leaf txn

5 participants