Skip to content

cli/zip: issue per-node requests concurrently#59863

Merged
craig[bot] merged 4 commits intocockroachdb:masterfrom
knz:20210205-zip-conc
Apr 22, 2021
Merged

cli/zip: issue per-node requests concurrently#59863
craig[bot] merged 4 commits intocockroachdb:masterfrom
knz:20210205-zip-conc

Conversation

@knz
Copy link
Copy Markdown
Contributor

@knz knz commented Feb 5, 2021

Requested by @ajwerner and @andreimatei
First commit from #64081.

Release note (cli change): cockroach debug zip now attempts to pull
data from multiple nodes concurrently, up to 15 nodes at a time. This
change is meant to accelerate the data collection when a cluster
contains multiple nodes. This behavior can be changed with the new
command-line flag --concurrency.

@knz knz requested review from itsbilal and tbg February 5, 2021 20:49
@knz knz requested a review from a team as a code owner February 5, 2021 20:49
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@knz knz force-pushed the 20210205-zip-conc branch from 426a084 to 9dcbf15 Compare February 5, 2021 20:59
Copy link
Copy Markdown
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Do we have any parallel testing though? I know we won't be able to say too much about the output (unless we invest in the logging), but just kicking the tires would be worth it in itself imo.

Reviewed 4 of 11 files at r2, 11 of 11 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal and @knz)


pkg/cli/context.go, line 321 at r3 (raw file):

	// Whether to run the collection sequentially. The code attempts to
	// access multiple nodes concurrently by default.
	sequential bool

Just a suggestion: you could have this be a --parallelism flag that defaults to 15 instead. This seems more general and avoids the opaque hard-coding of the value 15.


pkg/cli/zip.go, line 99 at r3 (raw file):

	fn func(ctx context.Context, node statuspb.NodeStatus) error,
) error {
	if zipCtx.sequential {

If this were a parallelism flag, you could lose the special casing here. Actually can't you lose it now? I see below that in this case the semaphore is initialized with a size of one. If the special casing adds value, please add a small comment explaining why.

Oh, ha. It's the determinism for testing, right? You should be able to get that back if below, you do zc.sem.Acquire(ctx, 1) on the main goroutine and the defer on the spawned one (which is maybe even better anyway? Spawns goroutines only when they also commence their work).


pkg/cli/zip_per_node.go, line 109 at r3 (raw file):

		go func(ctx context.Context, i int) {
			defer wg.Done()
			zc.sem.Acquire(ctx, 1)

The CPU profiles are most useful when they are captured at around the same time. I think this is fine, consider adding a comment though. Ideally, in a future where https://github.com/tbg/cockroach/tree/pprof-experiment is productionized, we want to "make sure" (to the degree the parallelism allows it) to run these at the same time, as it will be more important then (labels then propagate across RPCs, but only if the sender is profiling, too)

@knz knz force-pushed the 20210205-zip-conc branch 2 times, most recently from 808fb94 to d3b5474 Compare February 12, 2021 11:01
Copy link
Copy Markdown
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any parallel testing though?

Concurrent output was completely busted. I reworked the output code by adding a concurrent progress reporter, then added a test to go with it (sorting the lines to make the output deterministic).

PTAL

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal and @tbg)


pkg/cli/context.go, line 321 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Just a suggestion: you could have this be a --parallelism flag that defaults to 15 instead. This seems more general and avoids the opaque hard-coding of the value 15.

Done.

(NB: you mean concurrency. Concurrency is the degree of flexibility you give to the scheduler. Parallelism is the degree to which things actually run simultaneously.)


pkg/cli/zip.go, line 99 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

If this were a parallelism flag, you could lose the special casing here. Actually can't you lose it now? I see below that in this case the semaphore is initialized with a size of one. If the special casing adds value, please add a small comment explaining why.

Oh, ha. It's the determinism for testing, right? You should be able to get that back if below, you do zc.sem.Acquire(ctx, 1) on the main goroutine and the defer on the spawned one (which is maybe even better anyway? Spawns goroutines only when they also commence their work).

I think I like the special case here because it clarifies / documents what is going on below.


pkg/cli/zip_per_node.go, line 109 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

The CPU profiles are most useful when they are captured at around the same time. I think this is fine, consider adding a comment though. Ideally, in a future where https://github.com/tbg/cockroach/tree/pprof-experiment is productionized, we want to "make sure" (to the degree the parallelism allows it) to run these at the same time, as it will be more important then (labels then propagate across RPCs, but only if the sender is profiling, too)

I removed the semaphore here and added your explanation as a comment on the function.

Copy link
Copy Markdown
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Rebased on top of #60529)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal and @tbg)

@knz knz force-pushed the 20210205-zip-conc branch from d3b5474 to 1db22cd Compare February 15, 2021 13:15
Copy link
Copy Markdown
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 12 of 12 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @itsbilal, @knz, and @tbg)


pkg/cli/zip.go, line 99 at r3 (raw file):

Previously, knz (kena) wrote…

I think I like the special case here because it clarifies / documents what is going on below.

I mostly just worry that we're not testing the "complex" case appropriately, but I see that you added parallel testing too, so no objection.


pkg/cli/zip.go, line 147 at r5 (raw file):

	status := serverpb.NewStatusClient(conn)
	admin := serverpb.NewAdminClient(conn)
	s.done()

Is it fine to leak these on error returns?


pkg/cli/zip_helpers.go, line 291 at r5 (raw file):

func newZipReporter(format string, args ...interface{}) *zipReporter {
	return &zipReporter{
		flowing: zipCtx.concurrency == 1,

Is this idiomatic? I always try to keep constructors "pure". You could anchor newZipReporter on zipCtx perhaps?

@tbg tbg self-requested a review February 16, 2021 09:32
@knz knz force-pushed the 20210205-zip-conc branch from 1db22cd to f3e3d43 Compare April 22, 2021 13:31
Copy link
Copy Markdown
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFYR. I have rebased this and plan to merge if CI is happy.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @itsbilal, @knz, and @tbg)


pkg/cli/zip.go, line 147 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Is it fine to leak these on error returns?

There were two missing calls to s.fail() in here. Good catch.


pkg/cli/zip_helpers.go, line 291 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Is this idiomatic? I always try to keep constructors "pure". You could anchor newZipReporter on zipCtx perhaps?

Good idea. Done.

@knz
Copy link
Copy Markdown
Contributor Author

knz commented Apr 22, 2021

CI currently failing due to #64079. Will address that first

Release note (bug fix): Fixed a bug where multiple concurrent
invocations of `debug zip` could yield cluster instability. This bug
had been present since CockroachDB v20.1.
knz added 3 commits April 22, 2021 18:50
Release note (cli change): `cockroach debug zip` now attempts to pull
data from multiple nodes concurrently, up to 15 nodes at a time. This
change is meant to accelerate the data collection when a cluster
contains multiple nodes. This behavior can be changed with the new
command-line flag `--concurrency`.
Release note (cli change): The format of the informational messages
printed by `cockroach debug zip` has changed to make it clearer
what step is ongoing, when concurrent execution is enabled.
This was suggested as a readability improvement by tbg.

Release note: None
@knz knz force-pushed the 20210205-zip-conc branch from ffb4a20 to a623493 Compare April 22, 2021 16:51
@knz
Copy link
Copy Markdown
Contributor Author

knz commented Apr 22, 2021

ok this appears to be green

bors r=tbg

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Apr 22, 2021

Build succeeded:

@craig craig bot merged commit 97955d4 into cockroachdb:master Apr 22, 2021
@knz knz deleted the 20210205-zip-conc branch April 22, 2021 18:49
@knz
Copy link
Copy Markdown
Contributor Author

knz commented Apr 22, 2021

cc @thtruo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants