
workload: introduce timeout for pre-warming connection pool#101786

Merged
craig[bot] merged 2 commits into master from cmd-workload-timeout-conn-est on May 11, 2023

Conversation

@sean-
Contributor

@sean- sean- commented Apr 18, 2023

Interrupting target instances during prewarming shouldn't prevent workload from proceeding: introduce a timeout for prewarming connections. Connections will have between 15s and 5min to warm up before the context expires.

Epic: none

@sean- sean- requested a review from tbg April 18, 2023 19:57
@sean- sean- self-assigned this Apr 18, 2023
@sean- sean- requested a review from a team as a code owner April 18, 2023 19:57
@sean- sean- requested review from herkolategan and renatolabs and removed request for a team April 18, 2023 19:57
@cockroach-teamcity
Member

This change is Reviewable

renatolabs pushed a commit to renatolabs/cockroach that referenced this pull request Apr 26, 2023
Epic: none

Release note: None
Member

@tbg tbg left a comment


I still have some confusion, and I think this could be simplified a bit, in particular around the distribute() function.

Also, a meta question: don't we want pre-warming to be optional? ./cockroach workload run X has the flag --tolerate-errors. Ideally we'd be able to run it against a cluster with a node down (but some node reachable) and it would still start up. We shouldn't insist on all provided connection strings working.
I think the old behavior was the same, though. I think --tolerate-errors only ever worked reliably if the errors occurred "late enough", once the workload was properly running. Still, food for thought.

@renatolabs

Friendly ping.

We have a test that frequently times out in connection warmup (after 1h30m) with workload run tpcc --tolerate-errors (#102687).

Contributor Author

@sean- sean- left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @tbg)


pkg/workload/pgx_helpers.go line 333 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This isn't new code, but I'm wondering why we are not passing MaxConns into distribute but instead are allowing distribute to generate numbers that are larger than what we want, which will then result in the total number of conns being lower than numConns?

We automatically set the connection count if a user specifies 0 connections. In this case, the user specified a specific number of connections, which we then distribute across the supplied URLs. We still set a per-pool max connection count. If a user specifies a max connection count of 100, passes in only a handful of URLs, and sets the number of connections to 10K, we clamp the number of connections to 100 because that was the max the user specified. Arguably, there's a bug here: this clamp should only apply if maxPoolConns is also greater than zero.
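The distribution-plus-clamp behavior being discussed can be sketched as follows. This is a hypothetical illustration, not the actual workload code; the function name and signature are assumptions.

```go
package main

import "fmt"

// distributeConns splits numConns as evenly as possible across numURLs
// pools, clamping each pool's share to maxPoolConns when that limit is
// positive (a sketch of the behavior described in the review thread).
func distributeConns(numConns, numURLs, maxPoolConns int) []int {
	out := make([]int, numURLs)
	for i := range out {
		share := numConns / numURLs
		if i < numConns%numURLs {
			share++ // spread the remainder over the first pools
		}
		if maxPoolConns > 0 && share > maxPoolConns {
			share = maxPoolConns // the per-pool clamp discussed above
		}
		out[i] = share
	}
	return out
}

func main() {
	fmt.Println(distributeConns(10, 3, 0))      // [4 3 3]
	fmt.Println(distributeConns(10000, 2, 100)) // [100 100]
}
```

With this shape, the total across pools can end up lower than numConns when the clamp kicks in, which is exactly the interaction tbg asked about.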


pkg/workload/pgx_helpers.go line 344 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Can we fix this up in the previous for loop? As is, I'm confused by why we add p.maxConns to numWarmupConns before then overwriting p.maxConns with 1. I also don't understand why distribute would ever return a zero for a pool. Can we not fold all of these semantics into distribute and simplify the code here?

Please take a look at the updated revision. I didn't like how this was earlier, either, and think this is easier to follow.


pkg/workload/pgx_helpers.go line 357 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Isn't this just minWarmupTime > maxWarmupTime, which is 15s > 5m, and so this just evaluates to false? Don't you want to do this adjustment after having adjusted warmupTime? If we enter the case on line 355, we won't enter the one on line 357.¹

It is. This switch block used to have three cases in it, and I prefer switch+case over if/else blocks. The gist of this code block is to create a sliding scale for the warmup time with a min and a max, though as written it wasn't clamping anything. Updated.

Footnotes

  1. https://play.golang.com/p/3w3m6qyGedh

@sean- sean- requested a review from a team May 9, 2023 19:34
@sean- sean- requested a review from a team as a code owner May 9, 2023 19:34
@sean- sean- removed request for a team, bananabrick, lidorcarmel, michae2 and samiskin May 9, 2023 19:39
@sean- sean- force-pushed the cmd-workload-timeout-conn-est branch from bd1cd19 to de709fa Compare May 9, 2023 19:44
Contributor Author

@sean- sean- left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @tbg)


pkg/workload/pgx_helpers.go line 382 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I can't comment on it directly, but line 395 below (break WARMUP) is dead code. When a Done() channel is closed, Err() is guaranteed to return an error (that's in the contract), so you just need a blanket return err inside the case.

Done.

Contributor Author

@sean- sean- left a comment


@tbg I agree pre-warming should be optional and was trying not to tinker with the original thrust of the code.

@renatolabs: this will provide an upper bound on establishing connections, which will be helpful in flaky environments.

I'm not opposed to making connection prewarming a soft error, i.e., a failure to warm up connections will still allow workload to continue. Expecting 100% of connections to succeed when benchmarking a distributed system is unrealistic when running workers with instance preemption. I've pushed a small change so that WarmupConns() logs on error instead of exiting. Plumbing --tolerate-errors down is doable but seems heavyweight.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @tbg)

@sean- sean- requested a review from tbg May 9, 2023 19:59
@sean- sean- force-pushed the cmd-workload-timeout-conn-est branch from fbae134 to 9f71433 Compare May 9, 2023 20:15
Contributor Author

@sean- sean- left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @tbg)


pkg/workload/pgx_helpers.go line 217 at r4 (raw file):

	if err := m.WarmupConns(ctx, cfg.WarmupConns); err != nil {
		log.Warningf(ctx, "warming up connection pool failed (%v), continuing workload", err)

@tbg / @renatolabs : this is a change in behavior, but I think would help our overall testing with the new max connection timeouts added elsewhere in this PR.

CC @srosenberg


@renatolabs renatolabs left a comment


:lgtm:. I'll make sure this fixes the failures I've been seeing in roachtests (it likely will, especially now that we don't return an error if warmup times out). @tbg should also take another look.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @sean-, @srosenberg, and @tbg)


pkg/workload/pgx_helpers.go line 344 at r1 (raw file):

Previously, sean- (Sean Chittenden) wrote…

Please take a look at the updated revision. I didn't like how this was earlier, either, and think this is easier to follow.

This code also suggests poolMaxConns is never <= 0. Otherwise, we could be setting maxConns to 0 here, which is probably not valid.


pkg/workload/pgx_helpers.go line 292 at r3 (raw file):

		// Tune max conns for the pool
		switch {
		case totalNumConns == 0 && poolMaxConns > 0:

Can poolMaxConns be <= 0? According to the documentation, the default is the greater of 4 or runtime.NumCPU(), and custom values are integers greater than 0.

If pgxpool guarantees that invariant, we could get rid of the case below and simplify this logic (it took me a while to understand).


pkg/workload/pgx_helpers.go line 217 at r4 (raw file):

Previously, sean- (Sean Chittenden) wrote…

@tbg / @renatolabs : this is a change in behavior, but I think would help our overall testing with the new max connection timeouts added elsewhere in this PR.

CC @srosenberg

Makes sense to me.

Contributor Author

@sean- sean- left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @tbg)


pkg/workload/pgx_helpers.go line 344 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

This code also suggests poolMaxConns is never <= 0. Otherwise, we could be setting maxConns to 0 here, which is probably not valid.

If we set maxConns to 0 here, we rely on the pgxpool defaults.


pkg/workload/pgx_helpers.go line 292 at r3 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Can poolMaxConns be <= 0? According to the documentation, the default is the greater of 4 or runtime.NumCPU(), and custom values are integer greater than 0.

If pgxpool guarantees that invariant, we could get rid of the case below and simplify this logic (took me a while to understand).

poolMaxConns is passed down from the CLI, so yes, it's possible. pgxpool does not guarantee these minimums. The default, however, uses this heuristic: https://github.com/jackc/pgx/blob/master/pgxpool/pool.go#L289-L304

@renatolabs

Confirming the changes here fix the failures I've been seeing in roachtests.

Member

@tbg tbg left a comment


Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan and @renatolabs)

@sean-
Contributor Author

sean- commented May 11, 2023

bors r+

@craig
Contributor

craig bot commented May 11, 2023

Build failed:

@sean-
Contributor Author

sean- commented May 11, 2023

bors r+

@sean-
Contributor Author

sean- commented May 11, 2023

Failed test was:

  FAIL: //pkg/ccl/serverccl/diagnosticsccl:diagnosticsccl_test (see /home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/testlogs/pkg/ccl/serverccl/diagnosticsccl/diagnosticsccl_test/test.log)

and unrelated, from what I can tell based on a quick glance. Re-submitted.

@craig
Contributor

craig bot commented May 11, 2023

Build failed (retrying...):

@sean-
Contributor Author

sean- commented May 11, 2023

bors r+

@craig
Contributor

craig bot commented May 11, 2023

Already running a review

@craig craig bot merged commit aa2c52b into master May 11, 2023
@craig
Contributor

craig bot commented May 11, 2023

Build succeeded:
