sql: deflake TestParallel #106495
This change is an attempt to fix the flakiness of the TestParallel test with the following updates:
- increased the `range max bytes` setting, as it was increased in a37e053.
- disabled automatic stats collection for system tables; this is done in addition to the already disabled `stats.AutomaticStatisticsClusterMode` setting.

Resolves: cockroachdb#101614

Release note: None
Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR. My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
j82w
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @gtr, @koorosh, @maryliag, @xinhaoz, and @zachlite)
-- commits line 4 at r1:
This test has not failed since May 12th. Were you able to reproduce the failures locally?
-- commits line 7 at r1:
What is the reason for disabling this setting?
koorosh
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @gtr, @j82w, @maryliag, @xinhaoz, and @zachlite)
Previously, j82w (Jake) wrote…
This test has not failed since May 12th. Were you able to reproduce the failures locally?
Yes, it is easy to reproduce with the `--deadlock` option enabled:
dev test pkg/sql/logictest:logictest_test --stress --timeout=10860s --deadlock --ignore-cache -v
Previously, j82w (Jake) wrote…
What is the reason for disabling this setting?
These stats cause the test to fail when the test cluster stops (without a graceful shutdown).
Previously, koorosh (Andrii Vorobiov) wrote…
Why does it cause the test to fail? Just want to confirm that the failure is not a bug that we ignore by disabling the cluster setting.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @gtr, @j82w, @maryliag, @xinhaoz, and @zachlite)
Previously, j82w (Jake) wrote…
Why does it cause the test to fail? Just want to confirm that the failure is not a bug that we ignore by disabling the cluster setting.
- The disabled cluster settings don't affect the `CREATE STATISTICS` query, which is what is under test;
- The logs don't indicate any failures during stats creation, but show lots of errors after the servers shut down (see the log snippet below);
- Other `logictests` commonly disable these options to make tests stable, e.g.:
https://github.com/cockroachdb/cockroach/blob/65cc2617a9e1cd66471edf86671d283e20ec0a0c/pkg/sql/logictest/logic.go#L1330C6-L1335
-- or --
cockroach/pkg/sql/importer/import_stmt_test.go
Lines 6659 to 6665 in 65cc261
As an example, here's an excerpt of the logs where the test finishes successfully but background jobs keep posting log messages afterwards:
...
I230711 19:35:49.466554 1734723 sql/logictest/logic.go:1165 [-] 5595 --- progress: testdata/parallel_test/create_stats/create_stats: 1 statements
[19:35:49] --- progress: testdata/parallel_test/create_stats/create_stats: 1 statements
[19:35:49] --- done: testdata/parallel_test/create_stats/create_stats with config : 1 tests, 0 failures
I230711 19:35:49.470646 1734723 sql/logictest/logic.go:1165 [-] 5596 --- done: testdata/parallel_test/create_stats/create_stats with config : 1 tests, 0 failures
W230711 19:35:50.070992 1647415 2@rpc/clock_offset.go:291 [T1,n2,rnode=2,raddr=127.0.0.1:51836,class=default,rpc] 5597 latency jump (prev avg 23.69ms, current 357.90ms)
I230711 19:35:50.097918 30356 kv/kvserver/store_raft.go:670 [T1,n1,s1,r157/1:/System/tsd/cr.node.sql.…,raft] 5598 raft ready handling: 0.58s [append=0.00s, apply=0.00s, , other=0.58s], wrote []; node might be overloaded
...
W230711 19:37:40.270654 491251 kv/kvserver/liveness/liveness.go:861 [T1,n2,liveness-hb] 9709 slow heartbeat took 2.783604959s; err=context deadline exceeded
W230711 19:37:40.271067 491251 kv/kvserver/liveness/liveness.go:763 [T1,n2,liveness-hb] 9710 failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 3.031s (given timeout 3s): context deadline exceeded
W230711 19:37:40.271067 491251 kv/kvserver/liveness/liveness.go:763 [T1,n2,liveness-hb] 9710 +
W230711 19:37:40.271067 491251 kv/kvserver/liveness/liveness.go:763 [T1,n2,liveness-hb] 9710 +An inability to maintain liveness will prevent a node from participating in a
W230711 19:37:40.271067 491251 kv/kvserver/liveness/liveness.go:763 [T1,n2,liveness-hb] 9710 +cluster. If this problem persists, it may be a sign of resource starvation or
W230711 19:37:40.271067 491251 kv/kvserver/liveness/liveness.go:763 [T1,n2,liveness-hb] 9710 +of network connectivity problems. For help troubleshooting, visit:
W230711 19:37:40.271067 491251 kv/kvserver/liveness/liveness.go:763 [T1,n2,liveness-hb] 9710 +
W230711 19:37:40.271067 491251 kv/kvserver/liveness/liveness.go:763 [T1,n2,liveness-hb] 9710 + https://www.cockroachlabs.com/docs/stable/cluster-setup-troubleshooting.html#node-liveness-issues
The 3rd line indicates `1 tests, 0 failures`, so there were no errors during execution. After the test is done, lots of errors occur, and they mostly indicate incorrect termination of background jobs due to the ungraceful shutdown.
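To check this reading of the log mechanically, here is a minimal Go sketch. It assumes the log format seen in the excerpt above: each entry starts with a severity letter followed by the date, and the test runner prints a `--- done:` marker. `countAfterDone` and `isWarningOrError` are hypothetical helpers written for illustration, not part of CockroachDB or the logictest framework. The sample input reuses four lines from the excerpt.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sampleLog holds four lines copied from the excerpt above.
const sampleLog = `
I230711 19:35:49.470646 1734723 sql/logictest/logic.go:1165 [-] 5596 --- done: testdata/parallel_test/create_stats/create_stats with config : 1 tests, 0 failures
W230711 19:35:50.070992 1647415 2@rpc/clock_offset.go:291 [T1,n2,rnode=2,raddr=127.0.0.1:51836,class=default,rpc] 5597 latency jump (prev avg 23.69ms, current 357.90ms)
I230711 19:35:50.097918 30356 kv/kvserver/store_raft.go:670 [T1,n1,s1,r157/1:/System/tsd/cr.node.sql.…,raft] 5598 raft ready handling: 0.58s [append=0.00s, apply=0.00s, , other=0.58s], wrote []; node might be overloaded
W230711 19:37:40.270654 491251 kv/kvserver/liveness/liveness.go:861 [T1,n2,liveness-hb] 9709 slow heartbeat took 2.783604959s; err=context deadline exceeded
`

// isWarningOrError reports whether a log line starts with severity W or E
// followed by a digit (the start of the date), as in the excerpt.
// Assumption: other severities (I, F) are intentionally not counted here.
func isWarningOrError(line string) bool {
	if len(line) < 2 {
		return false
	}
	sev, next := line[0], line[1]
	return (sev == 'W' || sev == 'E') && next >= '0' && next <= '9'
}

// countAfterDone returns the number of warning/error lines that appear after
// the most recent "--- done:" marker, and whether any marker was seen.
// Continuation lines that repeat the severity prefix are counted as lines too.
func countAfterDone(logText string) (afterDone int, sawDone bool) {
	sc := bufio.NewScanner(strings.NewReader(logText))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if strings.Contains(line, "--- done:") {
			sawDone = true
			afterDone = 0 // restart the count at the latest marker
			continue
		}
		if sawDone && isWarningOrError(line) {
			afterDone++
		}
	}
	return afterDone, sawDone
}

func main() {
	n, ok := countAfterDone(sampleLog)
	fmt.Printf("warning/error lines after done marker: %d (marker seen: %t)\n", n, ok)
}
```

A non-zero count after a marker that reports `0 failures` is consistent with the explanation above: the noise comes after the test itself has finished.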
bors r+
Build succeeded: