workload/tpcc: fix race condition in scattering tables #56942

Merged

craig[bot] merged 1 commit into cockroachdb:master from thoszhang:fix-tpcc-scatter-race on Nov 21, 2020
Conversation

@thoszhang thoszhang commented Nov 20, 2020

When iterating over the names of tables to scatter, we were sharing the
loop variable between the goroutines, which is racy and caused a test
failure due to reading a garbage string value for the table name. In
general, this function may not scatter each table exactly once. This PR
fixes the bug by moving the read out of the goroutine.

Closes #56892.

Release note (bug fix): Fixed a race condition in the tpcc workload
with the --scatter flag where tables could be scattered multiple times
or not at all.

@thoszhang thoszhang requested a review from nvb November 20, 2020 03:00
@cockroach-teamcity
Member

This change is Reviewable

Member

@yuzefovich yuzefovich left a comment

:lgtm:

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@thoszhang force-pushed the fix-tpcc-scatter-race branch from e303535 to ec75805 on November 20, 2020 04:27
@thoszhang
Author

Updated with a slightly better way of doing this.

Contributor

@nvb nvb left a comment


:lgtm: nice find.

Reviewed 1 of 1 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale)

@thoszhang
Author

TFTRs

bors r+

@craig
Contributor

craig bot commented Nov 20, 2020

Build failed:

@thoszhang
Author

	bank.go:189,retry.go:197,bank.go:182,bank.go:430,acceptance.go:104,test_runner.go:755: pq: query execution canceled due to statement timeout

	cluster.go:1651,context.go:140,cluster.go:1640,test_runner.go:836: dead node detection: /go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor local --oneshot --ignore-empty-nodes: exit status 1 4: dead
		3: 2710
		2: dead
		1: 2195
		Error: UNCLASSIFIED_PROBLEM: 4: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) secondary error attachment
		  | 2: dead
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | main.glob..func14
		  |   | 	/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  |   | main.wrap.func1
		  |   | 	/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  |   | github.com/spf13/cobra.(*Command).execute
		  |   | 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  |   | github.com/spf13/cobra.(*Command).ExecuteC
		  |   | 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  |   | github.com/spf13/cobra.(*Command).Execute
		  |   | 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  |   | main.main
		  |   | 	/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1850
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:204
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		  | Wraps: (2) 2: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1850
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (4) 4: dead
		Error types: (1) errors.Unclassified (2) *secondary.withSecondaryError (3) *withstack.withStack (4) *errutil.leafError

This is unrelated.

@thoszhang
Author

bors r+

@craig
Contributor

craig bot commented Nov 21, 2020

Build failed:

@thoszhang
Author

I'm seeing acceptance/version-upgrade and acceptance/bank/cluster-recovery failures on other branches, and the acceptance roachtests don't run tpcc at all, so I'm confident these are unrelated flakes.

bors r+

@craig
Contributor

craig bot commented Nov 21, 2020

Build failed (retrying...):

@craig
Contributor

craig bot commented Nov 21, 2020

Build succeeded:

@craig craig bot merged commit f5f0f11 into cockroachdb:master Nov 21, 2020
nvb added a commit to nvb/cockroach that referenced this pull request Dec 17, 2020
Fixes cockroachdb#48255.
Fixes cockroachdb#53443.
Fixes cockroachdb#54258.
Fixes cockroachdb#54570.
Fixes cockroachdb#55599.
Fixes cockroachdb#55688.
Fixes cockroachdb#55817.
Fixes cockroachdb#55939.
Fixes cockroachdb#56996.
Fixes cockroachdb#57062.
Fixes cockroachdb#57864.

This needs to be backported to `release-20.1` and `release-20.2`

In cockroachdb#55688 (comment),
we saw that the failures to create load generators in tpccbench were due to
long-running SCATTER operations. These operations weren't stuck, but were very
slow due to the amount of data being moved and the 2MiB/s limit on snapshots. In
hindsight, this should have been expected, as scatter has the potential to
rebalance data and was being run on datasets on the order of 100s of GBs or even
TBs in size.

But this alone did not explain why we used to see this issue infrequently and
only recently began seeing it regularly. We determined that the most likely
reason why this has recently gotten worse is because of cockroachdb#56942. That PR fixed a
race condition in tpcc's `scatterRanges` function which often resulted in 9
scatters of the `warehouse` table instead of 1 scatter of each table in the
database. So before this PR, we were often (but not always, due to the racy
nature of the bug) avoiding the scatter on all but the dataset's smallest table.
After this PR, we were always scattering all 9 tables in the dataset, leading to
much larger rebalancing.

To address these issues, this commit removes the per-iteration scattering in
tpccbench. Scattering on each search iteration was a misguided decision. It
wasn't needed because we already scatter once during dataset import (with a
higher `kv.snapshot_rebalance.max_rate`). It was also disruptive as it had the
potential to slow down the test significantly and cause issues like the one we
are fixing here.

With this change, I've seen `tpccbench/nodes=6/cpu=16/multi-az` go from failing
6 out of 10 times to succeeding 10 out of 10 times. This change appears to have
no impact on performance.
craig bot pushed a commit that referenced this pull request Dec 17, 2020

58014: roachtest/tpcc: don't scatter on each tpccbench search iteration r=ajwerner a=nvanbenschoten
craig bot pushed a commit that referenced this pull request Dec 17, 2020

58014: roachtest/tpcc: don't scatter on each tpccbench search iteration r=nvanbenschoten a=nvanbenschoten
nvb added a commit to nvb/cockroach that referenced this pull request Dec 17, 2020

nvb added a commit to nvb/cockroach that referenced this pull request Dec 17, 2020

Development

Successfully merging this pull request may close these issues.

roachtest: alterpk-tpcc-250 failed

4 participants