Skip to content

sql: deflake TestParallel#106495

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
koorosh:sql-deflake-test-parallel
Jul 11, 2023
Merged

sql: deflake TestParallel#106495
craig[bot] merged 1 commit intocockroachdb:masterfrom
koorosh:sql-deflake-test-parallel

Conversation

@koorosh
Copy link
Copy Markdown
Contributor

@koorosh koorosh commented Jul 10, 2023

This change is attempt to fix flakiness of TestParallel test with following updates:

  • increased range max bytes setting as it was increased in a37e053.
  • disabled automatic stats collection for system tables, it has be done in addition to already disabled stats.AutomaticStatisticsClusterMode
    setting.

Resolves: #101614

Release note: None

This change is attempt to fix flakiness of TestParallel test
with following updates:
- increased `range max bytes` setting as it was increased in a37e053.
- disabled automatic stats collection for system tables,
it has be done in addition to already disabled `stats.AutomaticStatisticsClusterMode`
 setting.

 Resolves: cockroachdb#101614

Release note: None
@koorosh koorosh added the backport-23.1.x PAST MAINTENANCE SUPPORT: 23.1 patch releases via ER request only label Jul 10, 2023
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Jul 10, 2023

Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added the O-community Originated from the community label Jul 10, 2023
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@koorosh koorosh removed the O-community Originated from the community label Jul 10, 2023
@j82w j82w requested a review from a team July 10, 2023 12:36
Copy link
Copy Markdown
Contributor

@j82w j82w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @gtr, @koorosh, @maryliag, @xinhaoz, and @zachlite)


-- commits line 4 at r1:
This test has not failed since May 12th. Were you able to reproduce the failures locally?


-- commits line 7 at r1:
What is the reason for disabling this setting?

Copy link
Copy Markdown
Contributor Author

@koorosh koorosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @gtr, @j82w, @maryliag, @xinhaoz, and @zachlite)


-- commits line 4 at r1:

Previously, j82w (Jake) wrote…

This test has not failed since May 12th. Were you able to reproduce the failures locally?

yes, it is easy to reproduce with --deadlock enabled option.

dev test pkg/sql/logictest:logictest_test --stress --timeout=10860s --deadlock --ignore-cache -v

-- commits line 7 at r1:

Previously, j82w (Jake) wrote…

What is the reason for disabling this setting?

these stats cause test fail on test cluster stop (without graceful shutdown).

@j82w
Copy link
Copy Markdown
Contributor

j82w commented Jul 11, 2023

-- commits line 7 at r1:

Previously, koorosh (Andrii Vorobiov) wrote…

these stats cause test fail on test cluster stop (without graceful shutdown).

Why does it cause the test to fail? Just want to confirm that the failure is not a bug that we ignore by disabling the cluster setting.

Copy link
Copy Markdown
Contributor Author

@koorosh koorosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @gtr, @j82w, @maryliag, @xinhaoz, and @zachlite)


-- commits line 7 at r1:

Previously, j82w (Jake) wrote…

Why does it cause the test to fail? Just want to confirm that the failure is not a bug that we ignore by disabling the cluster setting.

  1. Disabled cluster settings don't affect CREATE STATISTICS query which is under test;
  2. Logs don't indicate any failures during stats creation but lots of errors after servers shutdown (see log snippet below)
  3. Other logictests is commonly disable this options to make tests stable ie:
    https://github.com/cockroachdb/cockroach/blob/65cc2617a9e1cd66471edf86671d283e20ec0a0c/pkg/sql/logictest/logic.go#L1330C6-L1335
    -- or --
    // Disable stats collection on system tables before the cluster is started,
    // otherwise there is a race condition where stats may be collected before we
    // can disable them with `SET CLUSTER SETTING`. We disable stats collection on
    // system tables so that we can collect statistics on the imported tables
    // within the retry time limit.
    st := cluster.MakeClusterSettings()
    stats.AutomaticStatisticsOnSystemTables.Override(context.Background(), &st.SV, false)

As an example, here's an excerpt of logs where tests finishes with success but background job posts

...
I230711 19:35:49.466554 1734723 sql/logictest/logic.go:1165 \[-\] 5595 --- progress: testdata/parallel\_test/create\_stats/create\_stats: 1 statements  
\[19:35:49\] --- progress: testdata/parallel\_test/create\_stats/create\_stats: 1 statements  
\[19:35:49\] --- done: testdata/parallel\_test/create\_stats/create\_stats with config : 1 tests, 0 failures  
I230711 19:35:49.470646 1734723 sql/logictest/logic.go:1165 \[-\] 5596 --- done: testdata/parallel\_test/create\_stats/create\_stats with config : 1 tests, 0 failures  
W230711 19:35:50.070992 1647415 2@rpc/clock\_offset.go:291 \[T1,n2,rnode=2,raddr=127.0.0.1:51836,class=default,rpc\] 5597 latency jump (prev avg 23.69ms, current 357.90ms)  
I230711 19:35:50.097918 30356 kv/kvserver/store\_raft.go:670 \[T1,n1,s1,r157/1:/System/tsd/cr.node.sql.…,raft\] 5598 raft ready handling: 0.58s \[append=0.00s, apply=0.00s, , other=0.58s\], wrote \[\]; node might be overloaded  
...
W230711 19:37:40.270654 491251 kv/kvserver/liveness/liveness.go:861  [T1,n2,liveness-hb] 9709  slow heartbeat took 2.783604959s; err=context deadline exceeded
W230711 19:37:40.271067 491251 kv/kvserver/liveness/liveness.go:763  [T1,n2,liveness-hb] 9710  failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 3.031s (given timeout 3s): context deadline exceeded
W230711 19:37:40.271067 491251 kv/kvserver/liveness/liveness.go:763  [T1,n2,liveness-hb] 9710 +
W230711 19:37:40.271067 491251 kv/kvserver/liveness/liveness.go:763  [T1,n2,liveness-hb] 9710 +An inability to maintain liveness will prevent a node from participating in a
W230711 19:37:40.271067 491251 kv/kvserver/liveness/liveness.go:763  [T1,n2,liveness-hb] 9710 +cluster. If this problem persists, it may be a sign of resource starvation or
W230711 19:37:40.271067 491251 kv/kvserver/liveness/liveness.go:763  [T1,n2,liveness-hb] 9710 +of network connectivity problems. For help troubleshooting, visit:
W230711 19:37:40.271067 491251 kv/kvserver/liveness/liveness.go:763  [T1,n2,liveness-hb] 9710 +
W230711 19:37:40.271067 491251 kv/kvserver/liveness/liveness.go:763  [T1,n2,liveness-hb] 9710 +    https://www.cockroachlabs.com/docs/stable/cluster-setup-troubleshooting.html#node-liveness-issues

On 3rd line it indicates that `1 tests, 0 failures  ` so no errors during execution and after test is done lots of errors occurs that mostly indicate incorrect termination of background jobs that due to ungraceful shutdown.


<!-- Sent from Reviewable.io -->

Copy link
Copy Markdown
Contributor

@j82w j82w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @gtr, @maryliag, @xinhaoz, and @zachlite)

@koorosh
Copy link
Copy Markdown
Contributor Author

koorosh commented Jul 11, 2023

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jul 11, 2023

Build succeeded:

@craig craig bot merged commit cc97c06 into cockroachdb:master Jul 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-23.1.x PAST MAINTENANCE SUPPORT: 23.1 patch releases via ER request only

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sql/logictest: TestParallel failed

3 participants