cli/zip: issue per-node requests concurrently#59863
cli/zip: issue per-node requests concurrently#59863craig[bot] merged 4 commits intocockroachdb:masterfrom
Conversation
426a084 to
9dcbf15
Compare
tbg
left a comment
There was a problem hiding this comment.
Nice. Do we have any parallel testing though? I know we won't be able to say too much about the output (unless we invest in the logging), but just kicking the tires would be worth it in itself imo.
Reviewed 4 of 11 files at r2, 11 of 11 files at r3.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @itsbilal and @knz)
pkg/cli/context.go, line 321 at r3 (raw file):
// Whether to run the collection sequentially. The code attempts to // access multiple nodes concurrently by default. sequential bool
Just a suggestion: you could have this be a --parallelism flag that defaults to 15 instead. This seems more general and avoids the opaque hard-coding of the value 15.
pkg/cli/zip.go, line 99 at r3 (raw file):
fn func(ctx context.Context, node statuspb.NodeStatus) error, ) error { if zipCtx.sequential {
If this were a parallelism flag, you could lose the special casing here. Actually can't you lose it now? I see below that in this case the semaphore is initialized with a size of one. If the special casing adds value, please add a small comment explaining why.
Oh, ha. It's the determinism for testing, right? You should be able to get that back if below, you do zc.sem.Acquire(ctx, 1) on the main goroutine and the defer on the spawned one (which is maybe even better anyway? Spawns goroutines only when they also commence their work).
pkg/cli/zip_per_node.go, line 109 at r3 (raw file):
go func(ctx context.Context, i int) { defer wg.Done() zc.sem.Acquire(ctx, 1)
The CPU profiles are most useful when they are captured at around the same time. I think this is fine, consider adding a comment though. Ideally, in a future where https://github.com/tbg/cockroach/tree/pprof-experiment is productionized, we want to "make sure" (to the degree the parallelism allows it) to run these at the same time, as it will be more important then (labels then propagate across RPCs, but only if the sender is profiling, too)
808fb94 to
d3b5474
Compare
knz
left a comment
There was a problem hiding this comment.
Do we have any parallel testing though?
Concurrent output was completely busted. I reworked the output code by adding a concurrent progress reporter, then added a test to go with it (sorting the lines to make the output deterministic).
PTAL
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @itsbilal and @tbg)
pkg/cli/context.go, line 321 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Just a suggestion: you could have this be a
--parallelismflag that defaults to15instead. This seems more general and avoids the opaque hard-coding of the value 15.
Done.
(NB: you mean concurrency. Concurrency is the degree of flexibility you give to the scheduler. Parallelism is the degree to which things actually run simultaneously.)
pkg/cli/zip.go, line 99 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
If this were a
parallelismflag, you could lose the special casing here. Actually can't you lose it now? I see below that in this case the semaphore is initialized with a size of one. If the special casing adds value, please add a small comment explaining why.Oh, ha. It's the determinism for testing, right? You should be able to get that back if below, you do
zc.sem.Acquire(ctx, 1)on the main goroutine and the defer on the spawned one (which is maybe even better anyway? Spawns goroutines only when they also commence their work).
I think I like the special case here because it clarifies / documents what is going on below.
pkg/cli/zip_per_node.go, line 109 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
The CPU profiles are most useful when they are captured at around the same time. I think this is fine, consider adding a comment though. Ideally, in a future where https://github.com/tbg/cockroach/tree/pprof-experiment is productionized, we want to "make sure" (to the degree the parallelism allows it) to run these at the same time, as it will be more important then (labels then propagate across RPCs, but only if the sender is profiling, too)
I removed the semaphore here and added your explanation as a comment on the function.
knz
left a comment
There was a problem hiding this comment.
(Rebased on top of #60529)
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @itsbilal and @tbg)
d3b5474 to
1db22cd
Compare
tbg
left a comment
There was a problem hiding this comment.
Reviewed 12 of 12 files at r4.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @itsbilal, @knz, and @tbg)
pkg/cli/zip.go, line 99 at r3 (raw file):
Previously, knz (kena) wrote…
I think I like the special case here because it clarifies / documents what is going on below.
I mostly just worry that we're not testing the "complex" case appropriately, but I see that you added parallel testing too, so no objection.
pkg/cli/zip.go, line 147 at r5 (raw file):
status := serverpb.NewStatusClient(conn) admin := serverpb.NewAdminClient(conn) s.done()
Is it fine to leak these on error returns?
pkg/cli/zip_helpers.go, line 291 at r5 (raw file):
func newZipReporter(format string, args ...interface{}) *zipReporter { return &zipReporter{ flowing: zipCtx.concurrency == 1,
Is this idiomatic? I always try to keep constructors "pure". You could anchor newZipReporter on zipCtx perhaps?
knz
left a comment
There was a problem hiding this comment.
TFYR. I have rebased this and plan to merge if CI is happy.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @itsbilal, @knz, and @tbg)
pkg/cli/zip.go, line 147 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Is it fine to leak these on error returns?
There were two missing calls to s.fail() in here. Good catch.
pkg/cli/zip_helpers.go, line 291 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Is this idiomatic? I always try to keep constructors "pure". You could anchor
newZipReporteronzipCtxperhaps?
Good idea. Done.
48e3205 to
7bf14f5
Compare
|
CI currently failing due to #64079. Will address that first |
Release note (bug fix): Fixed a bug where multiple concurrent invocations of `debug zip` could yield cluster instability. This bug had been present since CockroachDB v20.1.
Release note (cli change): `cockroach debug zip` now attempts to pull data from multiple nodes concurrently, up to 15 nodes at a time. This change is meant to accelerate the data collection when a cluster contains multiple nodes. This behavior can be changed with the new command-line flag `--concurrency`.
Release note (cli change): The format of the informational messages printed by `cockroach debug zip` has changed to make it clearer what step is ongoing, when concurrent execution is enabled.
This was suggested as a readability improvement by tbg. Release note: None
|
ok this appears to be green bors r=tbg |
|
Build succeeded: |
|
cc @thtruo |
Requested by @ajwerner and @andreimatei
First commit from #64081.
Release note (cli change):
cockroach debug zipnow attempts to pulldata from multiple nodes concurrently, up to 15 nodes at a time. This
change is meant to accelerate the data collection when a cluster
contains multiple nodes. This behavior can be changed with the new
command-line flag
--concurrency.