Skip to content

kvserver/concurrency: disable lock-table when not leaseholder#45830

Merged
nvb merged 1 commit intocockroachdb:masterfrom
nvb:nvanbenschoten/lockTableDisable
Mar 9, 2020
Merged

kvserver/concurrency: disable lock-table when not leaseholder#45830
nvb merged 1 commit intocockroachdb:masterfrom
nvb:nvanbenschoten/lockTableDisable

Conversation

@nvb
Copy link
Copy Markdown
Contributor

@nvb nvb commented Mar 7, 2020

Fixes #38257.
Fixes #45792.
Potentially fixes other instability over the past 4 days.

In #38257, We tracked down that a request was getting stuck in a
lock-table wait-queue that was on a follower replica. This was possible
even though we cleared the lock-table when a replica loses the lease
because the request was re-discovering a lock and inserting it into the
lock-table after the table had been cleared. Once in this state, the
follower's lock-table would never hear about updates to the lock, so
even when the request reached out and resolved the lock, it was not able
to proceed.

This commit fixes this bug by adding a "disabled" state to the lock-table.
When disabled, the lock-table does not track any locks or, by extension,
any wait-queues. This allows requests to proceed past the lock-table and
notice that the replica has lost its lease, which bounces it back to the
client.

@nvb nvb requested review from sumeerbhola and tbg March 7, 2020 05:33
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 8 of 8 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)


pkg/kv/kvserver/concurrency/lock_table.go, line 1631 at r1 (raw file):

	// by AcquireLock or AddDiscoveredLock, we'll never be able to form a
	// wait-queue or update a lock. However, it's not clear whether making
	// that change makes things more clear or less. I'm open to suggestions.

I would leave a comment and remove the code. You have test coverage for this case, right?

Copy link
Copy Markdown
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resoled the lock

resolved

:lgtm: after tbg's comment is addressed.

Reviewed 7 of 8 files at r1.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/concurrency/testdata/lock_table/disable, line 39 at r1 (raw file):

infinite loop

what infinite loop is being referred to here?

@nvb nvb force-pushed the nvanbenschoten/lockTableDisable branch from 6fd02a4 to 9f1ae4a Compare March 9, 2020 18:20
Copy link
Copy Markdown
Contributor Author

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved

Done.

TFTRs!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @sumeerbhola and @tbg)


pkg/kv/kvserver/concurrency/lock_table.go, line 1631 at r1 (raw file):

I would leave a comment and remove the code

Done.

You have test coverage for this case, right?

Yes, all of the subtests in TestConcurrencyManagerBasic/range_state_listener hit this.


pkg/kv/kvserver/concurrency/testdata/lock_table/disable, line 39 at r1 (raw file):

Previously, sumeerbhola wrote…
infinite loop

what infinite loop is being referred to here?

Updated to:

# NOTE: this won't end up in an infinite loop of scanning a disabled lock-table
# and discovering but ignoring the same lock in practice because the second pass
# through evaluation is likely to hit a NotLeaseholderError, bouncing the
# request back to the client.

Fixes cockroachdb#38257.
Fixes cockroachdb#45792.
Potentially fixes other instability over the past 4 days.

In cockroachdb#38257, We tracked down that a request was getting stuck in a
lock-table wait-queue that was on a follower replica. This was possible
even though we cleared the lock-table when a replica loses the lease
because the request was re-discovering a lock and inserting it into the
lock-table after the table had been cleared. Once in this state, the
follower's lock-table would never hear about updates to the lock, so
even when the request reached out and resolved the lock, it was not able
to proceed.

This commit fixes this bug by adding a "disabled" state to the lock-table.
When disabled, the lock-table does not track any locks or, by extension,
any wait-queues. This allows requests to proceed past the lock-table and
notice that the replica has lost its lease, which bounces it back to the
client.
@nvb nvb force-pushed the nvanbenschoten/lockTableDisable branch from 9f1ae4a to 280e694 Compare March 9, 2020 18:22
@nvb nvb merged commit 8108414 into cockroachdb:master Mar 9, 2020
@nvb nvb deleted the nvanbenschoten/lockTableDisable branch March 16, 2020 16:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sql: TestDropDatabaseDeleteData failed sql/logictest: TestParallel/subquery_retry_multinode timed out under stress

4 participants