Skip to content

server: expose goroutine dump with labels in cli/debug#74070

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
mneverov:debug-1
Dec 22, 2021
Merged

server: expose goroutine dump with labels in cli/debug#74070
craig[bot] merged 1 commit intocockroachdb:masterfrom
mneverov:debug-1

Conversation

@mneverov
Copy link
Copy Markdown
Contributor

server: expose goroutine dump with labels in cli/debug

Fixes: #73539

Release note: None

@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Dec 19, 2021

Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

I have added a few people who may be able to assist in reviewing:

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Dec 19, 2021
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@blathers-crl blathers-crl bot requested a review from dt December 19, 2021 14:25
@mneverov mneverov marked this pull request as draft December 19, 2021 16:44
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Dec 19, 2021

Thank you for updating your pull request.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Copy link
Copy Markdown
Contributor Author

@mneverov mneverov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt)

a discussion (no related file):
@dt ptal.

I understood the task as "do not replace existing goroutine dump with format?debug=2, but add another with format ?debug=1. That's why I just plumbed the new dump to the existing one and separated them with ----.
It can be more sophisticated and a new flag can be added cockroach debug zip --goroutine-dump-with-labels. It would require introducing a new stacks type like StacksType_GOROUTINE_STACKS_LABELS. Don't know the usecase, hence chose the simplest implementation.
The test TestStatusLocalStacks was failing under the testrace because it reads goroutine labels at the same time pebble table_cache is adding/deleting them. Since this functionality (debug command) is only for debugging, I skipped the test under race. I don't see other options tbh.
Finally, I noticed goroutinedumper that does the same with the old ?debug=2 format. Does it need to be fixed too?


@mneverov mneverov marked this pull request as ready for review December 19, 2021 20:12
@dt
Copy link
Copy Markdown
Contributor

dt commented Dec 20, 2021

@mneverov Thanks for picking this up!

You reading matches my intent -- don't replace the the current debug=2 output which we generally find most useful, but add the debug=1 output along-side.

That said, I do wonder if just concatenating it as is currently done here might lead to a little bit of confusion in our current debugging workflows -- we often, when debugging, grep these files to answer questions like "how many threads are running x" or something, and having both in the same file could potentially be confusing?

Maybe something we could just adjust to, but I wonder how hard it'd be to make a separate file, side-by-side? I suppose that'd require parameterizing the endpoint as to which it returns, and then having cli/debug fetch them separately, to separate files -- I'm not sure if that's a practical suggestion or not. If not, we could see how people who use these regularly (our internal support folks) would feel about the single combined file as it stands here. Does a separate file seem like it'd be much more complex?

@tbg
Copy link
Copy Markdown
Member

tbg commented Dec 20, 2021

Thanks for working on this, Max! Haven't reviewed the code but my take on the above comments is that we would want two separate files (where the current goroutine dump file keeps the same name and the ?debug=1 version gets a new name).

Copy link
Copy Markdown
Contributor Author

@mneverov mneverov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added debug parameter to the stack endpoint.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt)

}
return &serverpb.JSONResponse{Data: buf[:length]}, nil
var buf bytes.Buffer

Copy link
Copy Markdown
Contributor

@dt dt Dec 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might want to default to 2 if req.Debug is 0, e.g.

debug := 2
if req.Debug > 0 {
  debug = int(req.Debug)
}

So that if an older client, that doesn't set req.Debug at all, is asking for stacks from a new node, it still gets what it expects (2, rather than 0) ?

Copy link
Copy Markdown
Contributor Author

@mneverov mneverov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt)


pkg/server/status.go, line 1235 at r2 (raw file):

Previously, dt (David Taylor) wrote…

I think we might want to default to 2 if req.Debug is 0, e.g.

debug := 2
if req.Debug > 0 {
  debug = int(req.Debug)
}

So that if an older client, that doesn't set req.Debug at all, is asking for stacks from a new node, it still gets what it expects (2, rather than 0) ?

Agree, I was thinking about introducing a enum but decided against. Fixed as suggested.

@tbg tbg requested a review from dt December 21, 2021 08:34
Copy link
Copy Markdown
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Had some suggestions, but don't think this is more than a round of two of reviews away from being able to merge, thank you for taking this on!

Reviewed 5 of 6 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @mneverov, and @tbg)


pkg/server/status.go, line 1239 at r3 (raw file):

			debug = int(req.Debug)
		}
		if err = pprof.Lookup("goroutine").WriteTo(&buf, debug); err != nil {

if err :=


pkg/server/serverpb/status.proto, line 570 at r3 (raw file):

  string node_id = 1;
  StacksType type = 2;
  int64 debug = 3;

Isn't this what the StacksType should already communicate? Right now, there is only one enum value:

enum StacksType {
GOROUTINE_STACKS = 0;
reserved 1;
}

There used to be another one back in the day (when we had heavier C dependencies) but I think at this point it would be reasonable to introduce a second enum value

GOROUTINE_STACKS_DEBUG_1 (and add a comment that GOROUTINE_STACKS corresponds to GOROUTINE_STACKS_DEBUG_2; I think we probably want to avoid actually renaming the option since it might be used through the JSON API).

This would then replace the debug flag (we'd map GOROUTINE_STACKS to 2, and GOROUTINE_STACKS_DEBUG_1 to 1 in the Stacks implementation)


pkg/server/status_test.go, line 495 at r3 (raw file):

}

// TestStatusLocalStacks verifies that goroutine stack traces are available

Any reason you moved this code block? Code movement makes reviews difficult by obscuring the diff. Could you move this back up above TestStatusJson?


pkg/server/status_test.go, line 500 at r3 (raw file):

	defer leaktest.AfterTest(t)()
	defer log.Scope(t).Close(t)
	// skipping because statusServer#Stacks reads goroutine labels simultaneously

// Skipping ... (*statusServer).Stacks reads goroutine ....

Are you saying that the problem is that by the time you dump stack sometimes pebble hasn't managed to write the labels yet? I think that race has nothing to do with stressrace; this can in theory always happen and the solution would be to use testutils.SucceedsSoon in the test below (i.e. after a little bit of waiting we see the labels). With that done, I don't see a reason for this test to be flaky.

@tbg tbg self-requested a review December 21, 2021 08:34
@mneverov mneverov force-pushed the debug-1 branch 2 times, most recently from b0accb6 to ff21e79 Compare December 21, 2021 11:11
Copy link
Copy Markdown
Contributor Author

@mneverov mneverov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @mneverov, and @tbg)


pkg/server/serverpb/status.proto, line 570 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Isn't this what the StacksType should already communicate? Right now, there is only one enum value:

enum StacksType {
GOROUTINE_STACKS = 0;
reserved 1;
}

There used to be another one back in the day (when we had heavier C dependencies) but I think at this point it would be reasonable to introduce a second enum value

GOROUTINE_STACKS_DEBUG_1 (and add a comment that GOROUTINE_STACKS corresponds to GOROUTINE_STACKS_DEBUG_2; I think we probably want to avoid actually renaming the option since it might be used through the JSON API).

This would then replace the debug flag (we'd map GOROUTINE_STACKS to 2, and GOROUTINE_STACKS_DEBUG_1 to 1 in the Stacks implementation)

fixed, thx.


pkg/server/status_test.go, line 495 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Any reason you moved this code block? Code movement makes reviews difficult by obscuring the diff. Could you move this back up above TestStatusJson?

The reason was to keep grpc tests close to each other, but i see that there are other grpc tests scattered over the file. Moved back.


pkg/server/status_test.go, line 500 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

// Skipping ... (*statusServer).Stacks reads goroutine ....

Are you saying that the problem is that by the time you dump stack sometimes pebble hasn't managed to write the labels yet? I think that race has nothing to do with stressrace; this can in theory always happen and the solution would be to use testutils.SucceedsSoon in the test below (i.e. after a little bit of waiting we see the labels). With that done, I don't see a reason for this test to be flaky.

The test is not flaky it fails everytime make testrace PKG=./pkg/server TESTS=TestStatusLocalStacks:

WARNING: DATA RACE
Read at 0x00c000010020 by goroutine 339:
  runtime/pprof.(*labelMap).String()
      /usr/local/go/src/runtime/pprof/label.go:46 +0x53
  runtime/pprof.printCountProfile.func1()
      /usr/local/go/src/runtime/pprof/pprof.go:415 +0x178
  runtime/pprof.printCountProfile()
      /usr/local/go/src/runtime/pprof/pprof.go:424 +0x2f1
  runtime/pprof.writeRuntimeProfile()
      /usr/local/go/src/runtime/pprof/pprof.go:732 +0x1b8
  runtime/pprof.writeGoroutine()
      /usr/local/go/src/runtime/pprof/pprof.go:684 +0x8f
  runtime/pprof.(*Profile).WriteTo()
      /usr/local/go/src/runtime/pprof/pprof.go:331 +0x1b0
  github.com/cockroachdb/cockroach/pkg/server.(*statusServer).Stacks()
...
Previous write at 0x00c000010020 by goroutine 38:
  runtime/pprof.WithLabels()
      /usr/local/go/src/runtime/pprof/label.go:60 +0x88
  runtime/pprof.Do()
      /usr/local/go/src/runtime/pprof/runtime.go:38 +0xf2
  github.com/cockroachdb/pebble.(*tableCacheShard).releaseLoop()
      /home/mneverov/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/table_cache.go:292 +0x91
  github.com/cockroachdb/pebble.(*tableCacheShard).init·dwrap·51()
      /home/mneverov/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/table_cache.go:284 +0x39

it's standard go data race checker.

Copy link
Copy Markdown
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just need to touch up the test, then good to go.

Reviewed 6 of 6 files at r4, all commit messages.
Dismissed @dt from a discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @mneverov)


pkg/server/status_test.go, line 495 at r3 (raw file):

Previously, mneverov (Max Neverov) wrote…

The reason was to keep grpc tests close to each other, but i see that there are other grpc tests scattered over the file. Moved back.

Ah, ok. Well, if you are moving code, you can do it in a separate commit, but I would probably just leave it in its original place in this case, for simplicity.


pkg/server/status_test.go, line 500 at r3 (raw file):

Previously, mneverov (Max Neverov) wrote…

The test is not flaky it fails everytime make testrace PKG=./pkg/server TESTS=TestStatusLocalStacks:

WARNING: DATA RACE
Read at 0x00c000010020 by goroutine 339:
  runtime/pprof.(*labelMap).String()
      /usr/local/go/src/runtime/pprof/label.go:46 +0x53
  runtime/pprof.printCountProfile.func1()
      /usr/local/go/src/runtime/pprof/pprof.go:415 +0x178
  runtime/pprof.printCountProfile()
      /usr/local/go/src/runtime/pprof/pprof.go:424 +0x2f1
  runtime/pprof.writeRuntimeProfile()
      /usr/local/go/src/runtime/pprof/pprof.go:732 +0x1b8
  runtime/pprof.writeGoroutine()
      /usr/local/go/src/runtime/pprof/pprof.go:684 +0x8f
  runtime/pprof.(*Profile).WriteTo()
      /usr/local/go/src/runtime/pprof/pprof.go:331 +0x1b0
  github.com/cockroachdb/cockroach/pkg/server.(*statusServer).Stacks()
...
Previous write at 0x00c000010020 by goroutine 38:
  runtime/pprof.WithLabels()
      /usr/local/go/src/runtime/pprof/label.go:60 +0x88
  runtime/pprof.Do()
      /usr/local/go/src/runtime/pprof/runtime.go:38 +0xf2
  github.com/cockroachdb/pebble.(*tableCacheShard).releaseLoop()
      /home/mneverov/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/table_cache.go:292 +0x91
  github.com/cockroachdb/pebble.(*tableCacheShard).init·dwrap·51()
      /home/mneverov/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/table_cache.go:284 +0x39

it's standard go data race checker.

Interesting. This data race makes absolutely no sense to me. WithLabels makes a copy of the incoming LabelSet, and Do passes the resulting ctx to SetGoroutineLabels only after WithLabels has returned. I was able to reproduce with just this simple test:

var labelSet = pprof.Labels("foo", "bar")

func TestRace(t *testing.T) {
	done := make(chan struct{})
	go func() {
		pprof.Do(context.Background(), labelSet, func(ctx context.Context) {
			<-done
		})
	}()
	var buf bytes.Buffer
	require.NoError(t, pprof.Lookup("goroutine").WriteTo(&buf, 1))
}

I agree with skipping for now. My interest is piqued, but it shouldn't hold up this PR.


pkg/server/status_test.go, line 98 at r4 (raw file):

	defer leaktest.AfterTest(t)()
	defer log.Scope(t).Close(t)
	// skipping because statusServer#Stacks reads goroutine labels simultaneously

Old comment about capitalization and (*statusServer).Stack still applies (same for pebble ref below).


pkg/server/status_test.go, line 100 at r4 (raw file):

	// skipping because statusServer#Stacks reads goroutine labels simultaneously
	// with pebble/tableCacheShard#releaseLoop writes to the goroutine labels.
	skip.UnderRace(t)

skip.UnderRaceWithIssue(t, 74133) (I just filed #74133).


pkg/server/status_test.go, line 102 at r4 (raw file):

	skip.UnderRace(t)

	tempDir, cleanupFn := testutils.TempDir(t)

t.TempDir() is simpler (we still need to remove testutils.TempDir, it is no longer needed).


pkg/server/status_test.go, line 105 at r4 (raw file):

	defer cleanupFn()

	storeSpec := base.StoreSpec{Path: tempDir}

Why is this even needed? Why doesn't starting a vanilla test server work?

Consider dramatically simplifying this test using this snippet:

	ctx := context.Background()
	var args base.TestClusterArgs
	tc := testcluster.StartTestCluster(t, 1 /* nodes */, args)
	defer tc.Stopper().Stop(ctx)
	
	cc, err := tc.GetStatusClient(ctx, t, 0 /* idx */)
	require.NoError(t, err)
	_ = cc.Stacks

Copy link
Copy Markdown
Contributor Author

@mneverov mneverov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @mneverov, and @tbg)


pkg/server/status_test.go, line 495 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ah, ok. Well, if you are moving code, you can do it in a separate commit, but I would probably just leave it in its original place in this case, for simplicity.

👍


pkg/server/status_test.go, line 98 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Old comment about capitalization and (*statusServer).Stack still applies (same for pebble ref below).

Couldn't find comment about capitalization, so I removed the comment because of the linked issue.


pkg/server/status_test.go, line 105 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Why is this even needed? Why doesn't starting a vanilla test server work?

Consider dramatically simplifying this test using this snippet:

	ctx := context.Background()
	var args base.TestClusterArgs
	tc := testcluster.StartTestCluster(t, 1 /* nodes */, args)
	defer tc.Stopper().Stop(ctx)
	
	cc, err := tc.GetStatusClient(ctx, t, 0 /* idx */)
	require.NoError(t, err)
	_ = cc.Stacks

Can't use testcluster because it creates a circular dependency: server (status_test is in server package) -> testcluster -> server.
Removed unnecessary tempDir.

Copy link
Copy Markdown
Contributor Author

@mneverov mneverov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @mneverov, and @tbg)


pkg/server/status_test.go, line 500 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Interesting. This data race makes absolutely no sense to me. WithLabels makes a copy of the incoming LabelSet, and Do passes the resulting ctx to SetGoroutineLabels only after WithLabels has returned. I was able to reproduce with just this simple test:

var labelSet = pprof.Labels("foo", "bar")

func TestRace(t *testing.T) {
	done := make(chan struct{})
	go func() {
		pprof.Do(context.Background(), labelSet, func(ctx context.Context) {
			<-done
		})
	}()
	var buf bytes.Buffer
	require.NoError(t, pprof.Lookup("goroutine").WriteTo(&buf, 1))
}

I agree with skipping for now. My interest is piqued, but it shouldn't hold up this PR.

now several cli tests failed because of the same reason. Added skip.UnderRaceWithIssue(t, 74133) to them.

Copy link
Copy Markdown
Contributor Author

@mneverov mneverov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @mneverov, and @tbg)


pkg/server/status_ext_test.go, line 30 at r8 (raw file):

// TestStatusLocalStacks verifies that goroutine stack traces are available
// via the /_status/stacks/local endpoint.
func TestStatusLocalStacks(t *testing.T) {

@tbg thx for helping me out with this.

Copy link
Copy Markdown
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @mneverov, and @tbg)


pkg/server/status_ext_test.go, line 30 at r8 (raw file):

Previously, mneverov (Max Neverov) wrote…

@tbg thx for helping me out with this.

Np! Sorry, meant to submit a comment about this but forgot to actually send it.


pkg/server/status_test.go, line 500 at r3 (raw file):

Previously, mneverov (Max Neverov) wrote…

now several cli tests failed because of the same reason. Added skip.UnderRaceWithIssue(t, 74133) to them.

Ugh, that is a bummer, but I see that it's necessary. I fiddled a bit with my example above to see if we can work around the false positive somehow, but I don't see how. There are probably more tests that we would need to skip. To make it obvious which ones might catch the race (and to avoid the race in tests that are not sensitive to whether the ?debug=1 is included or not), can you make the collection of the ?debug=1 file conditional on util.RaceEnabled? If this is false, do the current thing. if this is true, instead of hitting the Stacks endpoint, write the same file but the contents will just be "disabled in race mode, see ".

I think then you can remove all of the race-unskips except for the one in the main test in this PR (I think, or at least we will see in CI, deterministically). You'll want to comment on the issue that this conditional can be removed once the issue is closed.


pkg/server/status_test.go, line 105 at r4 (raw file):

Previously, mneverov (Max Neverov) wrote…

Can't use testcluster because it creates a circular dependency: server (status_test is in server package) -> testcluster -> server.
Removed unnecessary tempDir.

Ah, but then the solution is to move this test to a file that belongs to package server_test. I pushed a commit to your branch to that effect and made the simplification.

Copy link
Copy Markdown
Contributor Author

@mneverov mneverov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @mneverov, and @tbg)


pkg/server/status_test.go, line 500 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ugh, that is a bummer, but I see that it's necessary. I fiddled a bit with my example above to see if we can work around the false positive somehow, but I don't see how. There are probably more tests that we would need to skip. To make it obvious which ones might catch the race (and to avoid the race in tests that are not sensitive to whether the ?debug=1 is included or not), can you make the collection of the ?debug=1 file conditional on util.RaceEnabled? If this is false, do the current thing. if this is true, instead of hitting the Stacks endpoint, write the same file but the contents will just be "disabled in race mode, see ".

I think then you can remove all of the race-unskips except for the one in the main test in this PR (I think, or at least we will see in CI, deterministically). You'll want to comment on the issue that this conditional can be removed once the issue is closed.

Added the condition. Will update the issue with the link to the file/line after the PR is merged to master.

Copy link
Copy Markdown
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: there was one more lint failure which I just fixed.
Thanks a bunch, Max! There was a lint and codegen failure but I just amended this PR, should be good to go now.

bors r+

Reviewed 2 of 2 files at r7, 8 of 8 files at r8, 2 of 2 files at r9.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @dt, @mneverov, and @tbg)

@tbg tbg self-requested a review December 22, 2021 15:14
@craig
Copy link
Copy Markdown
Contributor

craig bot commented Dec 22, 2021

Build succeeded:

@craig craig bot merged commit 8307814 into cockroachdb:master Dec 22, 2021
@mneverov
Copy link
Copy Markdown
Contributor Author

there was one more lint failure which I just fixed.

sorry about extra work and thanks for your help!

@dt
Copy link
Copy Markdown
Contributor

dt commented Feb 14, 2022

blathers backport 21.2

@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Feb 14, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from ad4ba0a to blathers/backport-release-21.2-74070: POST https://api.github.com/repos/cockroachlabs/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

O-community Originated from the community X-blathers-triaged blathers was able to find an owner

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cli/debug: include ?debug=1 goroutine dump in debug.zip

4 participants