Skip to content

jobs: prevent a deadlock during upgrades#100579

Merged
craig[bot] merged 3 commits intocockroachdb:masterfrom
knz:20230404-job-deadlock
Apr 6, 2023
Merged

jobs: prevent a deadlock during upgrades#100579
craig[bot] merged 3 commits intocockroachdb:masterfrom
knz:20230404-job-deadlock

Conversation

@knz
Copy link
Copy Markdown
Contributor

@knz knz commented Apr 4, 2023

Needed for #100436.
Informs #100578
Epic: None

Prior to this patch, if multiple SQL instances were started
side-by-side but behind on migrations, they would deadlock on
performing their migrations because distsql on each instance would be
unable to reach other other instances.

More generally, we're finding it undesirable for the jobs subsystem to
operate on system.jobs / job_info using distributed queries.

This patch fixes it by disabling query distribution during job
operations.

The testing here is implicit when the test TestServerControllerMultiNodeTenantStartup is stressed - this used to deadlock under stress without this patch.

@knz knz requested review from adityamaru and yuzefovich April 4, 2023 11:12
@knz knz requested a review from a team as a code owner April 4, 2023 11:12
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Apr 4, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@knz
Copy link
Copy Markdown
Contributor Author

knz commented Apr 4, 2023

The CI failure seems unrelated to this change (maybe_stressrace - see internal discussion). The jobs package passes a stress test on my machine.

@knz knz force-pushed the 20230404-job-deadlock branch from c51d193 to add193d Compare April 4, 2023 13:41
Copy link
Copy Markdown
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

I believe at some point we already had a code like this (i.e. disabling DistSQL either with old upgrade manager (I think the upgrades have been changed recently, right?) or at about preStart of the SQL server), but I couldn't quickly find it right now. Do you recall something like this?

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru)

@knz knz force-pushed the 20230404-job-deadlock branch from add193d to 383d50b Compare April 5, 2023 17:18
@knz knz requested a review from a team as a code owner April 5, 2023 17:18
@knz
Copy link
Copy Markdown
Contributor Author

knz commented Apr 5, 2023

i might add a test to check the override mechanism on its own

@knz
Copy link
Copy Markdown
Contributor Author

knz commented Apr 5, 2023

@yuzefovich could you have another look?

and @rafiss I'd like to know if you support this general direction.

@knz knz requested review from rafiss and yuzefovich April 5, 2023 17:24
Copy link
Copy Markdown
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: but I agree another pair of eyes would be good here.

Reviewed 3 of 3 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru, @knz, and @rafiss)


pkg/sql/internal.go line 1527 at r2 (raw file):

}

// SessionDataOverride returns a function that can be used to override some

nit: perhaps s/returns a/is a/g.

Copy link
Copy Markdown
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru and @knz)


pkg/sql/internal.go line 1529 at r2 (raw file):

// SessionDataOverride returns a function that can be used to override some
// fields in the session data.
type SessionDataOverride = func(sd *sessiondata.SessionData)

maybe we could add some commentary on how this differs from sessiondata.InternalExecutorOverride.

some details worth mentioning: this SessionDataOverride one is used when creating the isql.DB, the sessiondata.InternalExecutorOverride is used when invoking the internal executor and it overrides SessionDataOverride

@knz knz force-pushed the 20230404-job-deadlock branch from 383d50b to 0709809 Compare April 5, 2023 21:35
@knz knz requested a review from a team as a code owner April 5, 2023 21:35
@knz knz force-pushed the 20230404-job-deadlock branch from 0709809 to 9446118 Compare April 5, 2023 21:36
Copy link
Copy Markdown
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also added unit testing for the new mechanism.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @adityamaru, @rafiss, and @yuzefovich)


pkg/sql/internal.go line 1527 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: perhaps s/returns a/is a/g.

Done.


pkg/sql/internal.go line 1529 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

maybe we could add some commentary on how this differs from sessiondata.InternalExecutorOverride.

some details worth mentioning: this SessionDataOverride one is used when creating the isql.DB, the sessiondata.InternalExecutorOverride is used when invoking the internal executor and it overrides SessionDataOverride

Good idea, done.

@knz knz force-pushed the 20230404-job-deadlock branch from 9446118 to e5113fa Compare April 5, 2023 21:44
Copy link
Copy Markdown
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 9 of 9 files at r3, 4 of 4 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @adityamaru, @knz, and @rafiss)


pkg/sql/internal_test.go line 636 at r4 (raw file):

		row, err := txn.QueryRow(ctx, "test", txn.KV(), "show default_int_size")
		require.NoError(t, err)
		assert.Equal(t, "'0'", row[0].String())

nit: why use both require and assert? The former could handle the latter's cases too I think.

Copy link
Copy Markdown
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @adityamaru, @rafiss, and @yuzefovich)


pkg/sql/internal_test.go line 636 at r4 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: why use both require and assert? The former could handle the latter's cases too I think.

require does log.Fatal under the hood; so if there are multiple errors the first one will mask errors that came afterwards.

With assert, if there are multiple failures I get to see them all side-by-side.

@knz knz force-pushed the 20230404-job-deadlock branch from e5113fa to 6cf1f5f Compare April 5, 2023 22:00
@knz knz added the backport-23.1.x PAST MAINTENANCE SUPPORT: 23.1 patch releases via ER request only label Apr 5, 2023
@knz
Copy link
Copy Markdown
Contributor Author

knz commented Apr 5, 2023

bors r=yuzefovich

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Apr 6, 2023

This PR was included in a batch that timed out, it will be automatically retried

knz added 3 commits April 6, 2023 14:07
Prior to this patch, if multiple SQL instances were started
side-by-side but behind on migrations, they would deadlock on
performing their migrations because distsql on each instance would be
unable to reach other other instances.

More generally, we're finding it undesirable for the jobs subsystem to
operate on system.jobs / job_info using distributed queries.

This patch fixes it by disabling query distribution during job
operations.

Release note: None
@knz knz force-pushed the 20230404-job-deadlock branch from 6cf1f5f to 9d4a2a1 Compare April 6, 2023 12:32
@knz knz requested a review from a team as a code owner April 6, 2023 12:32
@craig
Copy link
Copy Markdown
Contributor

craig bot commented Apr 6, 2023

Canceled.

@knz
Copy link
Copy Markdown
Contributor Author

knz commented Apr 6, 2023

bors r=yuzefovich

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Apr 6, 2023

Build succeeded:

@craig craig bot merged commit 15182c7 into cockroachdb:master Apr 6, 2023
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Apr 6, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from b776994 to blathers/backport-release-23.1-100579: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-23.1.x PAST MAINTENANCE SUPPORT: 23.1 patch releases via ER request only

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants