Skip to content

kv: use evictionToken to coalesce RangeLookups on SendError#38527

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
jeffrey-xiao:use-eviction-token
Jul 9, 2019
Merged

kv: use evictionToken to coalesce RangeLookups on SendError#38527
craig[bot] merged 1 commit intocockroachdb:masterfrom
jeffrey-xiao:use-eviction-token

Conversation

@jeffrey-xiao
Copy link
Copy Markdown
Contributor

@jeffrey-xiao jeffrey-xiao commented Jun 27, 2019

Also fixes an issue where a meta2 range lookup could be coalesced with a
meta1 range lookup and cause a deadlock.

Fixes #28967

Release note: None

I've collected metrics to compare this PR against master (15d4329) under two workloads while the cluster is running kv50.

  1. Split workload: Two workers splitting random ranges: ALTER TABLE kv.kv SPLIT AT VALUES (CAST(ROUND(RANDOM() * 9223372036854775808) AS INT)) WITH EXPIRATION '1s'
  2. Relocate workload: Two workers relocating random ranges: ALTER TABLE kv.kv EXPERIMENTAL_RELOCATE SELECT ARRAY[%s, %s, %s], CAST(ROUND(RANDOM() * 9223372036854775808) AS INT). The new set of replicas is a random subset of the available replicas of size 3.

More details can be found in this script.

image

@jeffrey-xiao jeffrey-xiao requested review from a team, ajwerner and nvb June 27, 2019 22:17
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 4 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jeffrey-xiao, and @nvanbenschoten)


pkg/kv/dist_sender_test.go, line 3089 at r1 (raw file):

		rs, err := keys.Range(ba)
		if err != nil {
			t.Fatal(err)

If you call t.Fatal from a goroutine other than the main one, you're going to have a bad time. Instead call t.Error here and then return the error. The test will fail but it won't hang.

FailNow must be called from the goroutine running the test or benchmark function, not from other goroutines created during the test

https://golang.org/pkg/testing/#T.FailNow


pkg/kv/dist_sender_test.go, line 3108 at r1 (raw file):

			var kv roachpb.KeyValue
			if err := kv.Value.SetProto(&testMetaRangeDescriptor); err != nil {
				t.Fatal(err)

here too


pkg/kv/dist_sender_test.go, line 3119 at r1 (raw file):

		var kv roachpb.KeyValue
		if err := kv.Value.SetProto(&testUserRangeDescriptor); err != nil {
			t.Fatal(err)

and here


pkg/kv/dist_sender_test.go, line 3155 at r1 (raw file):

	batchWaitGroup.Add(2)
	go func() {

nit: make a helper that takes a string rather than duplicating this code

@jeffrey-xiao jeffrey-xiao force-pushed the use-eviction-token branch 2 times, most recently from d97bee5 to e271833 Compare June 28, 2019 19:20
Copy link
Copy Markdown
Contributor Author

@jeffrey-xiao jeffrey-xiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jeffrey-xiao, and @nvanbenschoten)


pkg/kv/dist_sender_test.go, line 3089 at r1 (raw file):

Previously, ajwerner wrote…

If you call t.Fatal from a goroutine other than the main one, you're going to have a bad time. Instead call t.Error here and then return the error. The test will fail but it won't hang.

FailNow must be called from the goroutine running the test or benchmark function, not from other goroutines created during the test

https://golang.org/pkg/testing/#T.FailNow

There is some retry logic in dist_sender that makes the test hang and not fail hard when I provide an error. I've opted to call waitGroupBatch.Donewhen a fatal error is encountered in the go routines so the test fails fast.

Copy link
Copy Markdown
Contributor Author

@jeffrey-xiao jeffrey-xiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/kv/dist_sender_test.go, line 3089 at r1 (raw file):

Previously, jeffrey-xiao (Jeffrey Xiao) wrote…

There is some retry logic in dist_sender that makes the test hang and not fail hard when I provide an error. I've opted to call waitGroupBatch.Donewhen a fatal error is encountered in the go routines so the test fails fast.

Ignore my previous comment -- returning nil and calling t.Error causes the test to fail fast.


pkg/kv/dist_sender_test.go, line 3108 at r1 (raw file):

Previously, ajwerner wrote…

here too

Done.


pkg/kv/dist_sender_test.go, line 3119 at r1 (raw file):

Previously, ajwerner wrote…

and here

Done.

Copy link
Copy Markdown
Contributor

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 4 files at r1, 1 of 2 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jeffrey-xiao, and @nvanbenschoten)


pkg/kv/dist_sender_test.go, line 3108 at r1 (raw file):

Previously, jeffrey-xiao (Jeffrey Xiao) wrote…

Done.

I'm not seeing the update.


pkg/kv/dist_sender_test.go, line 3053 at r2 (raw file):

}

func TestEvictionTokenCoalesce(t *testing.T) {

Just to confirm, this test fails without your change in DistSender, right?


pkg/util/syncutil/singleflight/singleflight.go, line 154 at r2 (raw file):

	g.mu.Lock()
	defer g.mu.Unlock()
	if g.m == nil {

I don't think you need this. Accessing a nil map should behave correctly. https://play.golang.org/p/Kv4LuuCQe-y

Copy link
Copy Markdown
Contributor Author

@jeffrey-xiao jeffrey-xiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/kv/dist_sender_test.go, line 3108 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'm not seeing the update.

I had to override MaxRetries in RPCRetryOptions to get the test to terminate if I'm returning nil, err or br, nil where br.Error = err. Unfortunately, it calls t.Error multiple times per batch so the error stack trace is fairly confusing. I think just returning br, nil is fine since it fails fast and provides a clean error stack.


pkg/kv/dist_sender_test.go, line 3053 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Just to confirm, this test fails without your change in DistSender, right?

Yep, this test fails without supplying the eviction token.

@jeffrey-xiao jeffrey-xiao force-pushed the use-eviction-token branch 2 times, most recently from 1ca847b to f693165 Compare July 2, 2019 19:11
Copy link
Copy Markdown
Contributor

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jeffrey-xiao, and @nvanbenschoten)


pkg/cmd/roachtest/bank.go, line 457 at r4 (raw file):

			nodes[i]++
		}
		sort.Ints(nodes)

Could you add a comment explaining why we do this? What exactly was deadlocking?


pkg/kv/dist_sender_test.go, line 3108 at r1 (raw file):

Previously, jeffrey-xiao (Jeffrey Xiao) wrote…

I had to override MaxRetries in RPCRetryOptions to get the test to terminate if I'm returning nil, err or br, nil where br.Error = err. Unfortunately, it calls t.Error multiple times per batch so the error stack trace is fairly confusing. I think just returning br, nil is fine since it fails fast and provides a clean error stack.

Did we explore assigning br.Error = roachpb.NewError(err)? It's still generally bad form to fail the test from within these mocking functions,

@jeffrey-xiao jeffrey-xiao force-pushed the use-eviction-token branch 2 times, most recently from 738fee1 to 7fa15a7 Compare July 3, 2019 20:16
Copy link
Copy Markdown
Contributor Author

@jeffrey-xiao jeffrey-xiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/cmd/roachtest/bank.go, line 457 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you add a comment explaining why we do this? What exactly was deadlocking?

Oops, misread the code.


pkg/kv/dist_sender_test.go, line 3108 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Did we explore assigning br.Error = roachpb.NewError(err)? It's still generally bad form to fail the test from within these mocking functions,

dist_sender didn't properly propagate errors from range_cache, but now br.Error = roachpb.NewError(err) works.

Copy link
Copy Markdown
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 2 files at r2, 1 of 2 files at r3, 4 of 4 files at r5.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jeffrey-xiao)


pkg/kv/dist_sender_test.go, line 3105 at r5 (raw file):

		if bytes.HasPrefix(rs.Key, keys.Meta1Prefix) {
			// Querying meta 1 range.
			br := &roachpb.BatchResponse{}

The linter is complaining about this shadowing of br.


pkg/kv/range_cache.go, line 146 at r5 (raw file):

		}
	}
	ret.WriteString(string(key))

nit: is there compiler intrinsics to avoid a copy here when casting from bytes to a string? If not can you use Write() and cast to a []byte instead of the string cast?

@jeffrey-xiao jeffrey-xiao requested review from ajwerner and nvb July 8, 2019 20:21
Copy link
Copy Markdown
Contributor

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 4 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @jeffrey-xiao, and @nvanbenschoten)


pkg/kv/dist_sender.go, line 1415 at r6 (raw file):

			if err != nil {
				log.VErrEventf(ctx, 1, "range descriptor re-lookup failed: %s", err)
				pErr = roachpb.NewError(err)

What's up with this? It looks like it will be dropped in most cases because we'll retry. Is the idea to return an error if we don't retry because we've timed out? If so, please leave a comment.


pkg/kv/range_cache.go, line 135 at r5 (raw file):

	// to indicate a meta2 lookup.
	if key.AsRawKey().Compare(keys.Meta1KeyMax) < 0 {
		ret.WriteString("1:")

I'm thinking that we should write Meta1Prefix and Meta2Prefix directly. This string doesn't need to be pretty, but we do want to make sure that we don't get any unexpected aliasing between lookups. For instance, imaging what would happen if two lookups ran concurrently with the keys /meta1/test and 1:test right now.


pkg/kv/range_cache.go, line 146 at r5 (raw file):

Previously, ajwerner wrote…

nit: is there compiler intrinsics to avoid a copy here when casting from bytes to a string? If not can you use Write() and cast to a []byte instead of the string cast?

+1

Copy link
Copy Markdown
Contributor Author

@jeffrey-xiao jeffrey-xiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner and @nvanbenschoten)


pkg/kv/dist_sender_test.go, line 3105 at r5 (raw file):

Previously, ajwerner wrote…

The linter is complaining about this shadowing of br.

Done.

@jeffrey-xiao jeffrey-xiao requested a review from nvb July 9, 2019 00:43
@jeffrey-xiao jeffrey-xiao force-pushed the use-eviction-token branch 2 times, most recently from aa0b328 to 818539f Compare July 9, 2019 17:45
Also fixed an issue where a meta2 range lookup could be coalesced with a
meta1 range lookup and cause a deadlock.

Release note: None
Copy link
Copy Markdown
Contributor

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 3 of 3 files at r7.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner)

@jeffrey-xiao
Copy link
Copy Markdown
Contributor Author

TFTRs!

bors r+

craig bot pushed a commit that referenced this pull request Jul 9, 2019
38527: kv: use evictionToken to coalesce RangeLookups on SendError r=jeffrey-xiao a=jeffrey-xiao

Also fixes an issue where a meta2 range lookup could be coalesced with a
meta1 range lookup and cause a deadlock.

Fixes #28967

Release note: None

I've collected metrics to compare this PR against master (`15d4329`) under two workloads while the cluster is running `kv50`.

1. Split workload: Two workers splitting random ranges: `ALTER TABLE kv.kv SPLIT AT VALUES (CAST(ROUND(RANDOM() * 9223372036854775808) AS INT)) WITH EXPIRATION '1s'`
2. Relocate workload: Two workers relocating random ranges: `ALTER TABLE kv.kv EXPERIMENTAL_RELOCATE SELECT ARRAY[%s, %s, %s], CAST(ROUND(RANDOM() * 9223372036854775808) AS INT)`. The new set of replicas is a random subset of the available replicas of size 3.

More details can be found in this [script](https://gist.github.com/jeffrey-xiao/2af9de98a56d8921dc7efa0cf7fb5b5e).

![image](https://user-images.githubusercontent.com/8853434/60836298-592e9400-a193-11e9-9c46-920cc2539982.png)

Co-authored-by: Jeffrey Xiao <jeffrey.xiao1998@gmail.com>
@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jul 9, 2019

Build succeeded

@craig craig bot merged commit 37a38ed into cockroachdb:master Jul 9, 2019
@jeffrey-xiao jeffrey-xiao deleted the use-eviction-token branch July 9, 2019 18:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

kv: use evictionToken to coalesce RangeLookups on RangeNotFoundErrors

4 participants