workload/tpcc: fix race condition in scattering tables #56942

Merged

craig[bot] merged 1 commit into cockroachdb:master from thoszhang:fix-tpcc-scatter-race on Nov 21, 2020
Conversation

@thoszhang thoszhang commented Nov 20, 2020

When iterating over the names of tables to scatter, we were sharing the
loop variable between the goroutines, which is racy and caused a test
failure due to reading a garbage string value for the table name. In
general, this function may not scatter each table exactly once. This PR
fixes the bug by moving the read out of the goroutine.

Closes #56892.

Release note (bug fix): Fixed a race condition in the tpcc workload
with the --scatter flag where tables could be scattered multiple times
or not at all.

@thoszhang thoszhang requested a review from nvb November 20, 2020 03:00
@cockroach-teamcity
Member

This change is Reviewable

Member

@yuzefovich yuzefovich left a comment

:lgtm:

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@thoszhang force-pushed the fix-tpcc-scatter-race branch from e303535 to ec75805 on November 20, 2020 04:27
@thoszhang
Author

Updated with a slightly better way of doing this.

Contributor

@nvb nvb left a comment


:lgtm: nice find.

Reviewed 1 of 1 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale)

@thoszhang
Author

TFTRs

bors r+

@craig
Contributor

craig bot commented Nov 20, 2020

Build failed:

@thoszhang
Author

	bank.go:189,retry.go:197,bank.go:182,bank.go:430,acceptance.go:104,test_runner.go:755: pq: query execution canceled due to statement timeout

	cluster.go:1651,context.go:140,cluster.go:1640,test_runner.go:836: dead node detection: /go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor local --oneshot --ignore-empty-nodes: exit status 1 4: dead
		3: 2710
		2: dead
		1: 2195
		Error: UNCLASSIFIED_PROBLEM: 4: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) secondary error attachment
		  | 2: dead
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | main.glob..func14
		  |   | 	/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  |   | main.wrap.func1
		  |   | 	/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  |   | github.com/spf13/cobra.(*Command).execute
		  |   | 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  |   | github.com/spf13/cobra.(*Command).ExecuteC
		  |   | 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  |   | github.com/spf13/cobra.(*Command).Execute
		  |   | 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  |   | main.main
		  |   | 	/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1850
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:204
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		  | Wraps: (2) 2: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1850
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (4) 4: dead
		Error types: (1) errors.Unclassified (2) *secondary.withSecondaryError (3) *withstack.withStack (4) *errutil.leafError

This is unrelated.

@thoszhang
Author

bors r+

@craig
Contributor

craig bot commented Nov 21, 2020

Build failed:

@thoszhang
Author

I'm seeing acceptance/version-upgrade and acceptance/bank/cluster-recovery failures on other branches, and the acceptance roachtests don't run tpcc at all, so I'm confident these are unrelated flakes.

bors r+

@craig
Contributor

craig bot commented Nov 21, 2020

Build failed (retrying...):

@craig
Contributor

craig bot commented Nov 21, 2020

Build succeeded:

@craig craig bot merged commit f5f0f11 into cockroachdb:master Nov 21, 2020
nvb added a commit to nvb/cockroach that referenced this pull request Dec 17, 2020
Fixes cockroachdb#48255.
Fixes cockroachdb#53443.
Fixes cockroachdb#54258.
Fixes cockroachdb#54570.
Fixes cockroachdb#55599.
Fixes cockroachdb#55688.
Fixes cockroachdb#55817.
Fixes cockroachdb#55939.
Fixes cockroachdb#56996.
Fixes cockroachdb#57062.
Fixes cockroachdb#57864.

This needs to be backported to `release-20.1` and `release-20.2`

In cockroachdb#55688 (comment),
we saw that the failures to create load generators in tpccbench were due to
long-running SCATTER operations. These operations weren't stuck, but were very
slow due to the amount of data being moved and the 2MiB/s limit on snapshots. In
hindsight, this should have been expected, as scatter has the potential to
rebalance data and was being run on datasets on the order of 100s of GBs or even
TBs in size.

But this alone did not explain why we used to see this issue infrequently and
only recently began seeing it regularly. We determined that the most likely
reason why this has recently gotten worse is because of cockroachdb#56942. That PR fixed a
race condition in tpcc's `scatterRanges` function which often resulted in 9
scatters of the `warehouse` table instead of 1 scatter of each table in the
database. So before this PR, we were often (but not always, due to the racy
nature of the bug) avoiding the scatter on all but the dataset's smallest table.
After this PR, we were always scattering all 9 tables in the dataset, leading to
much larger rebalancing.

To address these issues, this commit removes the per-iteration scattering in
tpccbench. Scattering on each search iteration was a misguided decision. It
wasn't needed because we already scatter once during dataset import (with a
higher `kv.snapshot_rebalance.max_rate`). It was also disruptive as it had the
potential to slow down the test significantly and cause issues like the one we
are fixing here.

With this change, I've seen `tpccbench/nodes=6/cpu=16/multi-az` go from failing
6 out of 10 times to succeeding 10 out of 10 times. This change appears to have
no impact on performance.
craig bot pushed a commit that referenced this pull request Dec 17, 2020

58014: roachtest/tpcc: don't scatter on each tpccbench search iteration r=ajwerner a=nvanbenschoten
craig bot pushed a commit that referenced this pull request Dec 17, 2020

58014: roachtest/tpcc: don't scatter on each tpccbench search iteration r=nvanbenschoten a=nvanbenschoten
nvb added a commit to nvb/cockroach that referenced this pull request Dec 17, 2020

nvb added a commit to nvb/cockroach that referenced this pull request Dec 17, 2020

Development

Successfully merging this pull request may close these issues.

roachtest: alterpk-tpcc-250 failed

4 participants