roachtest/tpcc: don't scatter on each tpccbench search iteration#58014
Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom Dec 17, 2020
Merged
roachtest/tpcc: don't scatter on each tpccbench search iteration#58014craig[bot] merged 1 commit intocockroachdb:masterfrom
craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
Fixes cockroachdb#48255. Fixes cockroachdb#53443. Fixes cockroachdb#54258. Fixes cockroachdb#54570. Fixes cockroachdb#55599. Fixes cockroachdb#55688. Fixes cockroachdb#55817. Fixes cockroachdb#55939. Fixes cockroachdb#56996. Fixes cockroachdb#57062. Fixes cockroachdb#57864. This needs to be backported to `release-20.1` and `release-20.2` In cockroachdb#55688 (comment), we saw that the failures to create load generators in tpccbench were due to long-running SCATTER operations. These operations weren't stuck, but were very slow due to the amount of data being moved and the 2MiB/s limit on snapshots. In hindsight, this should have been expected, as scatter has the potential to rebalance data and was being run of datasets on the order of 100s of GBs or even TBs in size. But this alone did not explain why we used to see this issue infrequently and only recently began seeing it regularly. We determined that the most likely reason why this has recently gotten worse is because of cockroachdb#56942. That PR fixed a race condition in tpcc's `scatterRanges` function which often resulted in 9 scatters of the `warehouse` table instead of 1 scatter of each table in the database. So before this PR, we were often (but not always due to the racey nature of the bug) avoiding the scatter on all but the dataset's smallest table. After this PR, we were always scattering all 9 tables in the dataset, leading to much larger rebalancing. To address these issues, this commit removes the per-iteration scattering in tpccbench. Scattering on each search iteration was a misguided decision. It wasn't needed because we already scatter once during dataset import (with a higher `kv.snapshot_rebalance.max_rate`). It was also disruptive as it had the potential to slow down the test significantly and cause issues like the one were are fixing here. With this change, I've seen `tpccbench/nodes=6/cpu=16/multi-az` go from failing 6 out of 10 times to succeeding 10 out of 10 times. This change appears to have no impact on performance.
Member
ajwerner
approved these changes
Dec 17, 2020
Contributor
ajwerner
left a comment
There was a problem hiding this comment.
Reviewed 1 of 1 files at r1.
Reviewable status:complete! 1 of 0 LGTMs obtained
Contributor
Author
|
TFTR! bors r+ |
Contributor
|
Build failed: |
Member
|
@ajwerner you may be interested in the CI failure here: https://teamcity.cockroachdb.com/viewLog.html?buildId=2526656&buildTypeId=Cockroach_UnitTests It's not related to this PR, so giving it another kick bors r=ajwerner |
craig bot
pushed a commit
that referenced
this pull request
Dec 17, 2020
58014: roachtest/tpcc: don't scatter on each tpccbench search iteration r=ajwerner a=nvanbenschoten Fixes #48255. Fixes #53443. Fixes #54258. Fixes #54570. Fixes #55599. Fixes #55688. Fixes #55817. Fixes #55939. Fixes #56996. Fixes #57062. Fixes #57864. This needs to be backported to `release-20.1` and `release-20.2` In #55688 (comment), we saw that the failures to create load generators in tpccbench were due to long-running SCATTER operations. These operations weren't stuck, but were very slow due to the amount of data being moved and the 2MiB/s limit on snapshots. In hindsight, this should have been expected, as scatter has the potential to rebalance data and was being run of datasets on the order of 100s of GBs or even TBs in size. But this alone did not explain why we used to see this issue infrequently and only recently began seeing it regularly. We determined that the most likely reason why this has recently gotten worse is because of #56942. That PR fixed a race condition in tpcc's `scatterRanges` function which often resulted in 9 scatters of the `warehouse` table instead of 1 scatter of each table in the database. So before this PR, we were often (but not always due to the racey nature of the bug) avoiding the scatter on all but the dataset's smallest table. After this PR, we were always scattering all 9 tables in the dataset, leading to much larger rebalancing. To address these issues, this commit removes the per-iteration scattering in tpccbench. Scattering on each search iteration was a misguided decision. It wasn't needed because we already scatter once during dataset import (with a higher `kv.snapshot_rebalance.max_rate`). It was also disruptive as it had the potential to slow down the test significantly and cause issues like the one were are fixing here. With this change, I've seen `tpccbench/nodes=6/cpu=16/multi-az` go from failing 6 out of 10 times to succeeding 10 out of 10 times. This change appears to have no impact on performance. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Contributor
|
Build failed: |
Member
|
Huh, same failure. Is it related to this PR? |
Contributor
|
Ack. @jayshrivastava can you take a look at these failures? |
Contributor
Contributor
Author
|
Thanks for tracking this down @jayshrivastava. I'll give this another spin and see if I get lucky again 😃 bors r+ |
Contributor
|
Build succeeded: |
This was referenced Dec 17, 2020
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #48255.
Fixes #53443.
Fixes #54258.
Fixes #54570.
Fixes #55599.
Fixes #55688.
Fixes #55817.
Fixes #55939.
Fixes #56996.
Fixes #57062.
Fixes #57864.
This needs to be backported to
release-20.1andrelease-20.2In #55688 (comment),
we saw that the failures to create load generators in tpccbench were due to
long-running SCATTER operations. These operations weren't stuck, but were very
slow due to the amount of data being moved and the 2MiB/s limit on snapshots. In
hindsight, this should have been expected, as scatter has the potential to
rebalance data and was being run of datasets on the order of 100s of GBs or even
TBs in size.
But this alone did not explain why we used to see this issue infrequently and
only recently began seeing it regularly. We determined that the most likely
reason why this has recently gotten worse is because of #56942. That PR fixed a
race condition in tpcc's
scatterRangesfunction which often resulted in 9scatters of the
warehousetable instead of 1 scatter of each table in thedatabase. So before this PR, we were often (but not always due to the racey
nature of the bug) avoiding the scatter on all but the dataset's smallest table.
After this PR, we were always scattering all 9 tables in the dataset, leading to
much larger rebalancing.
To address these issues, this commit removes the per-iteration scattering in
tpccbench. Scattering on each search iteration was a misguided decision. It
wasn't needed because we already scatter once during dataset import (with a
higher
kv.snapshot_rebalance.max_rate). It was also disruptive as it had thepotential to slow down the test significantly and cause issues like the one were
are fixing here.
With this change, I've seen
tpccbench/nodes=6/cpu=16/multi-azgo from failing6 out of 10 times to succeeding 10 out of 10 times. This change appears to have
no impact on performance.