workload: introduce timeout for pre-warming connection pool#101786
craig[bot] merged 2 commits into master
Conversation
Epic: none Release note: None
tbg
left a comment
Still have some confusions and I think it could be simplified a bit, in particular around the distribute() function.
Also, meta question, don't we want pre-warming to be optional? ./cockroach workload run X has the flag --tolerate-errors. Ideally we want to be able to run it against a cluster with a node down (but some node reachable) and it would start up. We shouldn't insist on all provided connection strings working.
I think the old behavior was the same, though. I think --tolerate-errors only ever worked reliably if the errors occurred "late enough", when the workload was properly running. Still, food for thought.
Friendly ping. We have a test that frequently times out in connection warmup (after 1h30m).
sean-
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @tbg)
pkg/workload/pgx_helpers.go line 333 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This isn't new code, but I'm wondering why we are not passing `MaxConns` into `distribute`, but instead are allowing `distribute` to generate numbers that are larger than what we want, which will then result in the total number of conns being lower than `numConns`?
We automatically set the connection count if a user specifies 0 connections. In this case, the user specified a specific number of connections, which we then distribute across the supplied URLs. We still set a per-pool max connection count. If a user specifies a max connection count of 100, only passes in a handful of URLs, and then sets the number of connections to 10K, we clamp the number of connections to 100 because that was the max the user specified. Arguably, there's a bug here where this should only be applied if `maxPoolConns` is also greater than zero.
pkg/workload/pgx_helpers.go line 344 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Can we fix this up in the previous for loop? As is, I'm confused by why we add `p.maxConns` to `numWarmupConns` before then overwriting `p.maxConns` with 1. I also don't understand why `distribute` would ever return a zero for a pool. Can we not fold all of these semantics into `distribute` and simplify the code here?
Please take a look at the updated revision. I didn't like how this was earlier, either, and think this is easier to follow.
pkg/workload/pgx_helpers.go line 357 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Isn't this just `minWarmupTime > maxWarmupTime`, which is `15s > 5m`, and so this just evaluates to false? Don't you want to do this adjustment after having adjusted `warmupTime`? If we enter the `case` on line 355 we won't enter the one on 357.
It is. This switch block used to have three cases in it, and I prefer switch+case over if/else blocks. The gist of this code block is to create a sliding scale for the warmup time with a min and a max, though it wasn't actually clamping. Updated.
Interrupting target instances during prewarming shouldn't hang workload: introduce a timeout for prewarming connections. Connections will have between 15s and 5min to warm up before the context expires. Epic: none Release note: None
bd1cd19 to
de709fa
Compare
sean-
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @tbg)
pkg/workload/pgx_helpers.go line 382 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I can't comment on it, but line 395 below (`break WARMUP`) is dead code. When a `Done()` channel is closed, `Err()` is guaranteed to return an error (that's in the contract), so you just need a blanket `return err` inside of the `case`.
Done.
sean-
left a comment
@tbg I agree pre-warming should be optional and was trying not to tinker with the original thrust of the code.
@renatolabs : this will provide an upper bound for establishing connections, which will be helpful in flaky environments.
I'm not opposed to making connection prewarming a soft error, in that a failure warming up connections will allow workload to continue. Expecting 100% of connections to succeed when benchmarking a distributed system is unrealistic when running workers with instance preemption. I've pushed a small change so that WarmupConns() logs on error instead of exiting. Plumbing --tolerate-errors down is doable but seems heavyweight.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @tbg)
Epic: none Release note: None Fixes: #102687
fbae134 to
9f71433
Compare
sean-
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @tbg)
pkg/workload/pgx_helpers.go line 217 at r4 (raw file):
```go
if err := m.WarmupConns(ctx, cfg.WarmupConns); err != nil {
	log.Warningf(ctx, "warming up connection pool failed (%v), continuing workload", err)
}
```
@tbg / @renatolabs : this is a change in behavior, but I think would help our overall testing with the new max connection timeouts added elsewhere in this PR.
CC @srosenberg
renatolabs
left a comment
I'll make sure this fixes the failures I've been seeing on roachtests (it likely will, especially now that we don't return an error if warmup times out). @tbg should also take another look.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @sean-, @srosenberg, and @tbg)
pkg/workload/pgx_helpers.go line 344 at r1 (raw file):
Previously, sean- (Sean Chittenden) wrote…
Please take a look at the updated revision. I didn't like how this was earlier, either, and think this is easier to follow.
This code also suggests poolMaxConns is never <= 0. Otherwise, we could be setting maxConns to 0 here, which is probably not valid.
pkg/workload/pgx_helpers.go line 292 at r3 (raw file):
```go
// Tune max conns for the pool
switch {
case totalNumConns == 0 && poolMaxConns > 0:
```
Can `poolMaxConns` be <= 0? According to the documentation, the default is "the greater of 4 or runtime.NumCPU()", and custom values are integers greater than 0.
If `pgxpool` guarantees that invariant, we could get rid of the `case` below and simplify this logic (took me a while to understand).
pkg/workload/pgx_helpers.go line 217 at r4 (raw file):
Previously, sean- (Sean Chittenden) wrote…
@tbg / @renatolabs : this is a change in behavior, but I think would help our overall testing with the new max connection timeouts added elsewhere in this PR.
CC @srosenberg
Makes sense to me.
sean-
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @tbg)
pkg/workload/pgx_helpers.go line 344 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
This code also suggests `poolMaxConns` is never `<= 0`. Otherwise, we could be setting `maxConns` to `0` here, which is probably not valid.
If we set maxConns to 0 here, we rely on the pgxpool defaults.
pkg/workload/pgx_helpers.go line 292 at r3 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Can `poolMaxConns` be `<= 0`? According to the documentation, the default is "the greater of 4 or runtime.NumCPU()", and custom values are integers greater than 0. If `pgxpool` guarantees that invariant, we could get rid of the `case` below and simplify this logic (took me a while to understand).
poolMaxConns is passed down from the CLI, so it's possible, yes. pgxpool does not guarantee these minimums. The default, however, uses this heuristic: https://github.com/jackc/pgx/blob/master/pgxpool/pool.go#L289-L304
Confirming the changes here fix the failures I've been seeing in roachtests.
tbg
left a comment
Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan and @renatolabs)
bors r+

Build failed:

bors r+

The failed test was unrelated, from what I can tell based on a quick glance. Re-submitted.

Build failed (retrying...):

bors r+

Already running a review

Build succeeded: