
roachtest/tpcc: don't scatter on each tpccbench search iteration#58014

Merged
craig[bot] merged 1 commit into cockroachdb:master from nvb:nvanbenschoten/tpccScatter
Dec 17, 2020

Conversation

@nvb (Contributor) commented Dec 17, 2020

Fixes #48255.
Fixes #53443.
Fixes #54258.
Fixes #54570.
Fixes #55599.
Fixes #55688.
Fixes #55817.
Fixes #55939.
Fixes #56996.
Fixes #57062.
Fixes #57864.

This needs to be backported to `release-20.1` and `release-20.2`.

In #55688 (comment),
we saw that the failures to create load generators in tpccbench were due to
long-running SCATTER operations. These operations weren't stuck, but were very
slow due to the amount of data being moved and the 2MiB/s limit on snapshots. In
hindsight, this should have been expected, as scatter has the potential to
rebalance data and was being run on datasets on the order of 100s of GBs or even
TBs in size.
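For context, both knobs involved here are plain SQL. The setting name and statement below are real CockroachDB syntax, but this is only an illustration of the mechanism described above, not part of this PR's diff, and the values are made up:

```sql
-- Illustrative only: the snapshot rate limit is a cluster setting (the
-- 2MiB/s figure above was the default at the time), so a higher value
-- speeds up the data movement that a scatter can trigger.
SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '32MiB';

-- SCATTER asks the cluster to randomly redistribute the table's replicas
-- and leases; on a table of 100s of GBs this can move a lot of data.
ALTER TABLE tpcc.warehouse SCATTER;
```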

But this alone did not explain why we used to see this issue infrequently and
only recently began seeing it regularly. We determined that the most likely
reason why this has recently gotten worse is because of #56942. That PR fixed a
race condition in tpcc's `scatterRanges` function which often resulted in 9
scatters of the `warehouse` table instead of 1 scatter of each table in the
database. So before this PR, we were often (but not always, due to the racy
nature of the bug) avoiding the scatter on all but the dataset's smallest table.
After this PR, we were always scattering all 9 tables in the dataset, leading to
much larger rebalancing.

To address these issues, this commit removes the per-iteration scattering in
tpccbench. Scattering on each search iteration was a misguided decision. It
wasn't needed because we already scatter once during dataset import (with a
higher `kv.snapshot_rebalance.max_rate`). It was also disruptive, as it had the
potential to slow down the test significantly and cause issues like the one we
are fixing here.

With this change, I've seen tpccbench/nodes=6/cpu=16/multi-az go from failing
6 out of 10 times to succeeding 10 out of 10 times. This change appears to have
no impact on performance.

@nvb nvb requested a review from ajwerner December 17, 2020 02:56
@cockroach-teamcity (Member) commented:

This change is Reviewable

@ajwerner (Contributor) left a comment


:lgtm:

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

nvb commented Dec 17, 2020

TFTR!

bors r+

craig bot commented Dec 17, 2020

Build failed:

tbg commented Dec 17, 2020

@ajwerner you may be interested in the CI failure here:

https://teamcity.cockroachdb.com/viewLog.html?buildId=2526656&buildTypeId=Cockroach_UnitTests

COMMIT; ERROR: restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh): "sql txn" meta={id=534c0646 key=/Table/SystemConfigSpan/Start pri=0.00786986 epo=0 ts=1608191420.048229865,1 min=1608191419.859501730,0 seq=8} lock=true stat=PENDING rts=1608191419.859501730,0 wto=false max=1608191420.359501730,0 (SQLSTATE 40001)
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x12a2a06]

goroutine 39 [running]:
panic(0x1e36480, 0x519e690)
	/usr/local/go/src/runtime/panic.go:1064 +0x545 fp=0xc000fbd718 sp=0xc000fbd650 pc=0x4504a5
runtime.panicmem(...)
	/usr/local/go/src/runtime/panic.go:212
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:742 +0x413 fp=0xc000fbd748 sp=0xc000fbd718 pc=0x467073
github.com/cockroachdb/cockroach/pkg/sql/types.(*T).Family(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/types/types.go:1124
github.com/cockroachdb/cockroach/pkg/sql/types.(*T).Equivalent(0x0, 0xc000d178c0, 0xc0003b4688)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/types/types.go:1752 +0x26 fp=0xc000fbd790 sp=0xc000fbd748 pc=0x12a2a06
github.com/cockroachdb/cockroach/pkg/workload/schemachange.(*operationGenerator).setColumnType(0xc001075320, 0xc00059e540, 0x1a, 0xc000372570, 0xc, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/workload/schemachange/operation_generator.go:1499 +0x395 fp=0xc000fbd960 sp=0xc000fbd790 pc=0x187a095
github.com/cockroachdb/cockroach/pkg/workload/schemachange.(*operationGenerator).randOp(0xc001075320, 0xc00059e540, 0x0, 0x10c6156, 0x5df, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/workload/schemachange/operation_generator.go:230 +0x20b fp=0xc000fbda78 sp=0xc000fbd960 pc=0x186ed4b
github.com/cockroachdb/cockroach/pkg/workload/schemachange.(*schemaChangeWorker).runInTxn(0xc001075380, 0xc00059e540, 0x0, 0x0, 0xc00059e540, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/workload/schemachange/schemachange.go:239 +0xff fp=0xc000fbdc38 sp=0xc000fbda78 pc=0x1882f1f
github.com/cockroachdb/cockroach/pkg/workload/schemachange.(*schemaChangeWorker).run(0xc001075380, 0x3fc8840, 0xc00107c040, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/workload/schemachange/schemachange.go:322 +0x125 fp=0xc000fbdea8 sp=0xc000fbdc38 pc=0x1884525
github.com/cockroachdb/cockroach/pkg/workload/schemachange.(*schemaChangeWorker).run-fm(0x3fc8840, 0xc00107c040, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/workload/schemachange/schemachange.go:313 +0x3e fp=0xc000fbdee0 sp=0xc000fbdea8 pc=0x18879fe
github.com/cockroachdb/cockroach/pkg/workload/cli.workerRun(0x3fc8840, 0xc00107c040, 0xc000221320, 0xc000910670, 0x0, 0xc000dea7b0)
	/go/src/github.com/cockroachdb/cockroach/pkg/workload/cli/run.go:237 +0xaf fp=0xc000fbdf58 sp=0xc000fbdee0 pc=0x13f482f
github.com/cockroachdb/cockroach/pkg/workload/cli.runRun.func2.1(0xc000df2000, 0xc0000b42c0, 0xc000221320, 0x0, 0x3fc8840, 0xc00107c040, 0xc000910670, 0x0, 0xc000dea7b0)
	/go/src/github.com/cockroachdb/cockroach/pkg/workload/cli/run.go:426 +0xec fp=0xc000fbdf98 sp=0xc000fbdf58 pc=0x13f87cc
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1374 +0x1 fp=0xc000fbdfa0 sp=0xc000fbdf98 pc=0x4890e1
created by github.com/cockroachdb/cockroach/pkg/workload/cli.runRun.func2
	/go/src/github.com/cockroachdb/cockroach/pkg/workload/cli/run.go:416 +0x111

It's not related to this PR, so giving it another kick

bors r=ajwerner

craig bot pushed a commit that referenced this pull request Dec 17, 2020
58014: roachtest/tpcc: don't scatter on each tpccbench search iteration r=ajwerner a=nvanbenschoten

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
craig bot commented Dec 17, 2020

Build failed:

tbg commented Dec 17, 2020

Huh, same failure. Is it related to this PR?

  CREATE TABLE public.table5 AS SELECT public.table3.col3_0, public.table3.col3_6, public.table3.col3_8, public.table3.col3_9, public.table3.col3_3, public.table3.col3_7, public.table3.col3_14, public.table3.col3_11, public.table3.col3_1, public.table3.col3_2, public.table3.col3_13, public.table3.col3_5 FROM public.table3;
  DROP TABLE public.table3;
COMMIT; ERROR: restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh): "sql txn" meta={id=7f685523 key=/Table/SystemConfigSpan/Start pri=0.01048706 epo=0 ts=1608197798.596377823,1 min=1608197798.515293728,0 seq=18} lock=true stat=PENDING rts=1608197798.515293728,0 wto=false max=1608197799.015293728,0 (SQLSTATE 40001)
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x12a2a06]

goroutine 50 [running]:
panic(0x1e36480, 0x519e690)
	/usr/local/go/src/runtime/panic.go:1064 +0x545 fp=0xc0010d9718 sp=0xc0010d9650 pc=0x4504a5
runtime.panicmem(...)
	/usr/local/go/src/runtime/panic.go:212
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:742 +0x413 fp=0xc0010d9748 sp=0xc0010d9718 pc=0x467073
github.com/cockroachdb/cockroach/pkg/sql/types.(*T).Family(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/types/types.go:1124
github.com/cockroachdb/cockroach/pkg/sql/types.(*T).Equivalent(0x0, 0xc00109e3c0, 0xc000e00aa0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/types/types.go:1752 +0x26 fp=0xc0010d9790 sp=0xc0010d9748 pc=0x12a2a06
github.com/cockroachdb/cockroach/pkg/workload/schemachange.(*operationGenerator).setColumnType(0xc00107d1d0, 0xc0011020f0, 0x1a, 0xc000408680, 0xc, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/workload/schemachange/operation_generator.go:1499 +0x395 fp=0xc0010d9960 sp=0xc0010d9790 pc=0x187a095
github.com/cockroachdb/cockroach/pkg/workload/schemachange.(*operationGenerator).randOp(0xc00107d1d0, 0xc0011020f0, 0x3, 0x5, 0x0, 0x7f840075abb0, 0x20, 0x28)
	/go/src/github.com/cockroachdb/cockroach/pkg/workload/schemachange/operation_generator.go:230 +0x20b fp=0xc0010d9a78 sp=0xc0010d9960 pc=0x186ed4b
github.com/cockroachdb/cockroach/pkg/workload/schemachange.(*schemaChangeWorker).runInTxn(0xc00107d230, 0xc0011020f0, 0x0, 0x0, 0xc0011020f0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/workload/schemachange/schemachange.go:239 +0xff fp=0xc0010d9c38 sp=0xc0010d9a78 pc=0x1882f1f
github.com/cockroachdb/cockroach/pkg/workload/schemachange.(*schemaChangeWorker).run(0xc00107d230, 0x3fc8840, 0xc001076900, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/workload/schemachange/schemachange.go:322 +0x125 fp=0xc0010d9ea8 sp=0xc0010d9c38 pc=0x1884525
github.com/cockroachdb/cockroach/pkg/workload/schemachange.(*schemaChangeWorker).run-fm(0x3fc8840, 0xc001076900, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/workload/schemachange/schemachange.go:313 +0x3e fp=0xc0010d9ee0 sp=0xc0010d9ea8 pc=0x18879fe
github.com/cockroachdb/cockroach/pkg/workload/cli.workerRun(0x3fc8840, 0xc001076900, 0xc0004392c0, 0xc000a0a2b0, 0x0, 0xc000b04870)
	/go/src/github.com/cockroachdb/cockroach/pkg/workload/cli/run.go:237 +0xaf fp=0xc0010d9f58 sp=0xc0010d9ee0 pc=0x13f482f
github.com/cockroachdb/cockroach/pkg/workload/cli.runRun.func2.1(0xc000ad7030, 0xc000578080, 0xc0004392c0, 0x0, 0x3fc8840, 0xc001076900, 0xc000a0a2b0, 0x1, 0xc000b04870)
	/go/src/github.com/cockroachdb/cockroach/pkg/workload/cli/run.go:426 +0xec fp=0xc0010d9f98 sp=0xc0010d9f58 pc=0x13f87cc
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1374 +0x1 fp=0xc0010d9fa0 sp=0xc0010d9f98 pc=0x4890e1
created by github.com/cockroachdb/cockroach/pkg/workload/cli.runRun.func2
	/go/src/github.com/cockroachdb/cockroach/pkg/workload/cli/run.go:416 +0x111

@ajwerner commented:
Ack. @jayshrivastava can you take a look at these failures?

@jayshrivastava commented:
@tbg Tracking the issue in #58017. I figured out what causes it and can get a fix up today. Feel free to retry the build since this occurs randomly. You're pretty lucky to have gotten two consecutive segs! It took me all morning to get a repro 😄 .

nvb commented Dec 17, 2020

Thanks for tracking this down @jayshrivastava. I'll give this another spin and see if I get lucky again 😃

bors r+

craig bot commented Dec 17, 2020

Build succeeded:
