workload: introduce timeout for pre-warming connection pool#101786
craig[bot] merged 2 commits into master
Conversation
Epic: none Release note: None
tbg
left a comment
Still have some confusions and I think it could be simplified a bit, in particular around the distribute() function.
Also, meta question, don't we want pre-warming to be optional? ./cockroach workload run X has the flag --tolerate-errors. Ideally we want to be able to run it against a cluster with a node down (but some node reachable) and it would start up. We shouldn't insist on all provided connection strings working.
I think the old behavior was the same, though. I think --tolerate-errors only ever worked reliably if the errors occurred "late enough", when the workload was properly running. Still, food for thought.
Friendly ping. We have a test that frequently times out in connection warmup (after 1h30m).
sean-
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @tbg)
pkg/workload/pgx_helpers.go line 333 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This isn't new code, but I'm wondering why we are not passing `MaxConns` into `distribute`, but instead are allowing `distribute` to generate numbers that are larger than what we want, which will then result in the total number of conns being lower than `numConns`?
We automatically set the connection count if a user specifies 0 connections. In this case, the user specified a specific number of connections, which we then distribute across the supplied URLs. We still set a per-pool max connection count. If a user specifies a max connection count of 100, only passes in a handful of URLs, and then sets the number of connections to 10K, we clamp the number of connections to 100 because that was the max the user specified. Arguably, there's a bug here where this should only be applied if `maxPoolConns` is also greater than zero.
pkg/workload/pgx_helpers.go line 344 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Can we fix this up in the previous for loop? As is, I'm confused by why we add `p.maxConns` to `numWarmupConns` before then overwriting `p.maxConns` with 1. I also don't understand why `distribute` would ever return a zero for a pool. Can we not fold all of these semantics into `distribute` and simplify the code here?
Please take a look at the updated revision. I didn't like how this was earlier, either, and think this is easier to follow.
pkg/workload/pgx_helpers.go line 357 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Isn't this just `minWarmupTime > maxWarmupTime`, which is `15s > 5m`, and so this just evaluates to false? Don't you want to do this adjustment after having adjusted `warmupTime`? If we enter the `case` on line 355 we won't enter the one on 357.
It is. This switch block used to have three cases in it, and I prefer switch+case over if/else blocks. The gist of this code block is to create a sliding scale for the warmup time with a min and a max, though it wasn't actually clamping. Updated.
Interrupting target instances during prewarming shouldn't hang workload: introduce a timeout for prewarming connections. Connections will have between 15s and 5min to warm up before the context expires. Epic: none Release note: None
bd1cd19 to
de709fa
Compare
sean-
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @tbg)
pkg/workload/pgx_helpers.go line 382 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I can't comment on it, but line 395 below (`break WARMUP`) is dead code. When a `Done()` channel is closed, `Err()` is guaranteed to return an error (that's in the contract), so you just need a blanket `return err` inside of the `case`.
Done.
sean-
left a comment
@tbg I agree pre-warming should be optional and was trying not to tinker with the original thrust of the code.
@renatolabs : this will provide an upper bound for establishing connections, which will be helpful in flaky environments.
I'm not opposed to making connection prewarming a soft error, in that a failure warming up connections will allow workload to continue. Expecting 100% of connections to succeed when benchmarking a distributed system is unrealistic when running workers with instance preemption. I've pushed a small change so that WarmupConns() logs on error instead of exiting. Plumbing --tolerate-errors down is doable but seems heavyweight.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @tbg)
Epic: none Release note: None Fixes: #102687
fbae134 to
9f71433
Compare
sean-
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @tbg)
pkg/workload/pgx_helpers.go line 217 at r4 (raw file):
```go
if err := m.WarmupConns(ctx, cfg.WarmupConns); err != nil {
	log.Warningf(ctx, "warming up connection pool failed (%v), continuing workload", err)
}
```
@tbg / @renatolabs : this is a change in behavior, but I think would help our overall testing with the new max connection timeouts added elsewhere in this PR.
CC @srosenberg
renatolabs
left a comment
I'll make sure this fixes the failures I've been seeing on roachtests (it likely will, especially now that we don't return an error if warmup times out). @tbg should also take another look.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @sean-, @srosenberg, and @tbg)
pkg/workload/pgx_helpers.go line 344 at r1 (raw file):
Previously, sean- (Sean Chittenden) wrote…
Please take a look at the updated revision. I didn't like how this was earlier, either, and think this is easier to follow.
This code also suggests poolMaxConns is never <= 0. Otherwise, we could be setting maxConns to 0 here, which is probably not valid.
pkg/workload/pgx_helpers.go line 292 at r3 (raw file):
```go
// Tune max conns for the pool
switch {
case totalNumConns == 0 && poolMaxConns > 0:
```
Can `poolMaxConns` be <= 0? According to the documentation, the default is "the greater of 4 or runtime.NumCPU()", and custom values are integers greater than 0.
If `pgxpool` guarantees that invariant, we could get rid of the `case` below and simplify this logic (took me a while to understand).
pkg/workload/pgx_helpers.go line 217 at r4 (raw file):
Previously, sean- (Sean Chittenden) wrote…
@tbg / @renatolabs : this is a change in behavior, but I think would help our overall testing with the new max connection timeouts added elsewhere in this PR.
CC @srosenberg
Makes sense to me.
sean-
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @tbg)
pkg/workload/pgx_helpers.go line 344 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
This code also suggests `poolMaxConns` is never `<= 0`. Otherwise, we could be setting `maxConns` to `0` here, which is probably not valid.
If we set maxConns to 0 here, we rely on the pgxpool defaults.
pkg/workload/pgx_helpers.go line 292 at r3 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Can `poolMaxConns` be `<= 0`? According to the documentation, the default is "the greater of 4 or runtime.NumCPU()", and custom values are integers greater than 0. If `pgxpool` guarantees that invariant, we could get rid of the `case` below and simplify this logic (took me a while to understand).
poolMaxConns is passed down from the CLI, so it's possible, yes. pgxpool does not guarantee these minimums. The default, however, uses this heuristic: https://github.com/jackc/pgx/blob/master/pgxpool/pool.go#L289-L304
Confirming the changes here fix the failures I've been seeing in roachtests.
tbg
left a comment
Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan and @renatolabs)
bors r+

Build failed:

bors r+

The failed test was unrelated, from what I can tell based on a quick glance. Re-submitted.

Build failed (retrying...):

bors r+

Already running a review

Build succeeded: