server: expose goroutine dump with labels in cli/debug#74070
server: expose goroutine dump with labels in cli/debug#74070craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
|
Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR. My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI. I have added a few people who may be able to assist in reviewing: 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
|
Thank you for updating your pull request. My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
mneverov
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dt)
a discussion (no related file):
@dt ptal.
I understood the task as "do not replace existing goroutine dump with format?debug=2, but add another with format ?debug=1. That's why I just plumbed the new dump to the existing one and separated them with ----.
It can be more sophisticated and a new flag can be added cockroach debug zip --goroutine-dump-with-labels. It would require introducing a new stacks type like StacksType_GOROUTINE_STACKS_LABELS. Don't know the usecase, hence chose the simplest implementation.
The test TestStatusLocalStacks was failing under the testrace because it reads goroutine labels at the same time pebble table_cache is adding/deleting them. Since this functionality (debug command) is only for debugging, I skipped the test under race. I don't see other options tbh.
Finally, I noticed goroutinedumper that does the same with the old ?debug=2 format. Does it need to be fixed too?
|
@mneverov Thanks for picking this up! You reading matches my intent -- don't replace the the current debug=2 output which we generally find most useful, but add the debug=1 output along-side. That said, I do wonder if just concatenating it as is currently done here might lead to a little bit of confusion in our current debugging workflows -- we often, when debugging, grep these files to answer questions like "how many threads are running x" or something, and having both in the same file could potentially be confusing? Maybe something we could just adjust to, but I wonder how hard it'd be to make a separate file, side-by-side? I suppose that'd require parameterizing the endpoint as to which it returns, and then having cli/debug fetch them separately, to separate files -- I'm not sure if that's a practical suggestion or not. If not, we could see how people who use these regularly (our internal support folks) would feel about the single combined file as it stands here. Does a separate file seem like it'd be much more complex? |
|
Thanks for working on this, Max! Haven't reviewed the code but my take on the above comments is that we would want two separate files (where the current goroutine dump file keeps the same name and the ?debug=1 version gets a new name). |
mneverov
left a comment
There was a problem hiding this comment.
Added debug parameter to the stack endpoint.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dt)
pkg/server/status.go
Outdated
| } | ||
| return &serverpb.JSONResponse{Data: buf[:length]}, nil | ||
| var buf bytes.Buffer | ||
|
|
There was a problem hiding this comment.
I think we might want to default to 2 if req.Debug is 0, e.g.
debug := 2
if req.Debug > 0 {
debug = int(req.Debug)
}So that if an older client, that doesn't set req.Debug at all, is asking for stacks from a new node, it still gets what it expects (2, rather than 0) ?
mneverov
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dt)
pkg/server/status.go, line 1235 at r2 (raw file):
Previously, dt (David Taylor) wrote…
I think we might want to default to
2if req.Debug is 0, e.g.debug := 2 if req.Debug > 0 { debug = int(req.Debug) }So that if an older client, that doesn't set req.Debug at all, is asking for stacks from a new node, it still gets what it expects (2, rather than 0) ?
Agree, I was thinking about introducing a enum but decided against. Fixed as suggested.
tbg
left a comment
There was a problem hiding this comment.
Looks good! Had some suggestions, but don't think this is more than a round of two of reviews away from being able to merge, thank you for taking this on!
Reviewed 5 of 6 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @dt, @mneverov, and @tbg)
pkg/server/status.go, line 1239 at r3 (raw file):
debug = int(req.Debug) } if err = pprof.Lookup("goroutine").WriteTo(&buf, debug); err != nil {
if err :=
pkg/server/serverpb/status.proto, line 570 at r3 (raw file):
string node_id = 1; StacksType type = 2; int64 debug = 3;
Isn't this what the StacksType should already communicate? Right now, there is only one enum value:
cockroach/pkg/server/serverpb/status.proto
Lines 560 to 563 in 00cf436
There used to be another one back in the day (when we had heavier C dependencies) but I think at this point it would be reasonable to introduce a second enum value
GOROUTINE_STACKS_DEBUG_1 (and add a comment that GOROUTINE_STACKS corresponds to GOROUTINE_STACKS_DEBUG_2; I think we probably want to avoid actually renaming the option since it might be used through the JSON API).
This would then replace the debug flag (we'd map GOROUTINE_STACKS to 2, and GOROUTINE_STACKS_DEBUG_1 to 1 in the Stacks implementation)
pkg/server/status_test.go, line 495 at r3 (raw file):
} // TestStatusLocalStacks verifies that goroutine stack traces are available
Any reason you moved this code block? Code movement makes reviews difficult by obscuring the diff. Could you move this back up above TestStatusJson?
pkg/server/status_test.go, line 500 at r3 (raw file):
defer leaktest.AfterTest(t)() defer log.Scope(t).Close(t) // skipping because statusServer#Stacks reads goroutine labels simultaneously
// Skipping ... (*statusServer).Stacks reads goroutine ....
Are you saying that the problem is that by the time you dump stack sometimes pebble hasn't managed to write the labels yet? I think that race has nothing to do with stressrace; this can in theory always happen and the solution would be to use testutils.SucceedsSoon in the test below (i.e. after a little bit of waiting we see the labels). With that done, I don't see a reason for this test to be flaky.
b0accb6 to
ff21e79
Compare
mneverov
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dt, @mneverov, and @tbg)
pkg/server/serverpb/status.proto, line 570 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Isn't this what the StacksType should already communicate? Right now, there is only one enum value:
cockroach/pkg/server/serverpb/status.proto
Lines 560 to 563 in 00cf436
There used to be another one back in the day (when we had heavier C dependencies) but I think at this point it would be reasonable to introduce a second enum value
GOROUTINE_STACKS_DEBUG_1(and add a comment that GOROUTINE_STACKS corresponds to GOROUTINE_STACKS_DEBUG_2; I think we probably want to avoid actually renaming the option since it might be used through the JSON API).This would then replace the
debugflag (we'd map GOROUTINE_STACKS to 2, and GOROUTINE_STACKS_DEBUG_1 to 1 in theStacksimplementation)
fixed, thx.
pkg/server/status_test.go, line 495 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Any reason you moved this code block? Code movement makes reviews difficult by obscuring the diff. Could you move this back up above
TestStatusJson?
The reason was to keep grpc tests close to each other, but i see that there are other grpc tests scattered over the file. Moved back.
pkg/server/status_test.go, line 500 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
// Skipping ... (*statusServer).Stacks reads goroutine ....
Are you saying that the problem is that by the time you dump stack sometimes pebble hasn't managed to write the labels yet? I think that race has nothing to do with stressrace; this can in theory always happen and the solution would be to use
testutils.SucceedsSoonin the test below (i.e. after a little bit of waiting we see the labels). With that done, I don't see a reason for this test to be flaky.
The test is not flaky it fails everytime make testrace PKG=./pkg/server TESTS=TestStatusLocalStacks:
WARNING: DATA RACE
Read at 0x00c000010020 by goroutine 339:
runtime/pprof.(*labelMap).String()
/usr/local/go/src/runtime/pprof/label.go:46 +0x53
runtime/pprof.printCountProfile.func1()
/usr/local/go/src/runtime/pprof/pprof.go:415 +0x178
runtime/pprof.printCountProfile()
/usr/local/go/src/runtime/pprof/pprof.go:424 +0x2f1
runtime/pprof.writeRuntimeProfile()
/usr/local/go/src/runtime/pprof/pprof.go:732 +0x1b8
runtime/pprof.writeGoroutine()
/usr/local/go/src/runtime/pprof/pprof.go:684 +0x8f
runtime/pprof.(*Profile).WriteTo()
/usr/local/go/src/runtime/pprof/pprof.go:331 +0x1b0
github.com/cockroachdb/cockroach/pkg/server.(*statusServer).Stacks()
...
Previous write at 0x00c000010020 by goroutine 38:
runtime/pprof.WithLabels()
/usr/local/go/src/runtime/pprof/label.go:60 +0x88
runtime/pprof.Do()
/usr/local/go/src/runtime/pprof/runtime.go:38 +0xf2
github.com/cockroachdb/pebble.(*tableCacheShard).releaseLoop()
/home/mneverov/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/table_cache.go:292 +0x91
github.com/cockroachdb/pebble.(*tableCacheShard).init·dwrap·51()
/home/mneverov/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/table_cache.go:284 +0x39
it's standard go data race checker.
tbg
left a comment
There was a problem hiding this comment.
Just need to touch up the test, then good to go.
Reviewed 6 of 6 files at r4, all commit messages.
Dismissed @dt from a discussion.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @dt and @mneverov)
pkg/server/status_test.go, line 495 at r3 (raw file):
Previously, mneverov (Max Neverov) wrote…
The reason was to keep grpc tests close to each other, but i see that there are other grpc tests scattered over the file. Moved back.
Ah, ok. Well, if you are moving code, you can do it in a separate commit, but I would probably just leave it in its original place in this case, for simplicity.
pkg/server/status_test.go, line 500 at r3 (raw file):
Previously, mneverov (Max Neverov) wrote…
The test is not flaky it fails everytime
make testrace PKG=./pkg/server TESTS=TestStatusLocalStacks:WARNING: DATA RACE Read at 0x00c000010020 by goroutine 339: runtime/pprof.(*labelMap).String() /usr/local/go/src/runtime/pprof/label.go:46 +0x53 runtime/pprof.printCountProfile.func1() /usr/local/go/src/runtime/pprof/pprof.go:415 +0x178 runtime/pprof.printCountProfile() /usr/local/go/src/runtime/pprof/pprof.go:424 +0x2f1 runtime/pprof.writeRuntimeProfile() /usr/local/go/src/runtime/pprof/pprof.go:732 +0x1b8 runtime/pprof.writeGoroutine() /usr/local/go/src/runtime/pprof/pprof.go:684 +0x8f runtime/pprof.(*Profile).WriteTo() /usr/local/go/src/runtime/pprof/pprof.go:331 +0x1b0 github.com/cockroachdb/cockroach/pkg/server.(*statusServer).Stacks() ... Previous write at 0x00c000010020 by goroutine 38: runtime/pprof.WithLabels() /usr/local/go/src/runtime/pprof/label.go:60 +0x88 runtime/pprof.Do() /usr/local/go/src/runtime/pprof/runtime.go:38 +0xf2 github.com/cockroachdb/pebble.(*tableCacheShard).releaseLoop() /home/mneverov/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/table_cache.go:292 +0x91 github.com/cockroachdb/pebble.(*tableCacheShard).init·dwrap·51() /home/mneverov/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/table_cache.go:284 +0x39it's standard go data race checker.
Interesting. This data race makes absolutely no sense to me. WithLabels makes a copy of the incoming LabelSet, and Do passes the resulting ctx to SetGoroutineLabels only after WithLabels has returned. I was able to reproduce with just this simple test:
var labelSet = pprof.Labels("foo", "bar")
func TestRace(t *testing.T) {
done := make(chan struct{})
go func() {
pprof.Do(context.Background(), labelSet, func(ctx context.Context) {
<-done
})
}()
var buf bytes.Buffer
require.NoError(t, pprof.Lookup("goroutine").WriteTo(&buf, 1))
}I agree with skipping for now. My interest is piqued, but it shouldn't hold up this PR.
pkg/server/status_test.go, line 98 at r4 (raw file):
defer leaktest.AfterTest(t)() defer log.Scope(t).Close(t) // skipping because statusServer#Stacks reads goroutine labels simultaneously
Old comment about capitalization and (*statusServer).Stack still applies (same for pebble ref below).
pkg/server/status_test.go, line 100 at r4 (raw file):
// skipping because statusServer#Stacks reads goroutine labels simultaneously // with pebble/tableCacheShard#releaseLoop writes to the goroutine labels. skip.UnderRace(t)
skip.UnderRaceWithIssue(t, 74133) (I just filed #74133).
pkg/server/status_test.go, line 102 at r4 (raw file):
skip.UnderRace(t) tempDir, cleanupFn := testutils.TempDir(t)
t.TempDir() is simpler (we still need to remove testutils.TempDir, it is no longer needed).
pkg/server/status_test.go, line 105 at r4 (raw file):
defer cleanupFn() storeSpec := base.StoreSpec{Path: tempDir}
Why is this even needed? Why doesn't starting a vanilla test server work?
Consider dramatically simplifying this test using this snippet:
ctx := context.Background()
var args base.TestClusterArgs
tc := testcluster.StartTestCluster(t, 1 /* nodes */, args)
defer tc.Stopper().Stop(ctx)
cc, err := tc.GetStatusClient(ctx, t, 0 /* idx */)
require.NoError(t, err)
_ = cc.Stacks
mneverov
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dt, @mneverov, and @tbg)
pkg/server/status_test.go, line 495 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Ah, ok. Well, if you are moving code, you can do it in a separate commit, but I would probably just leave it in its original place in this case, for simplicity.
👍
pkg/server/status_test.go, line 98 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Old comment about capitalization and
(*statusServer).Stackstill applies (same for pebble ref below).
Couldn't find comment about capitalization, so I removed the comment because of the linked issue.
pkg/server/status_test.go, line 105 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Why is this even needed? Why doesn't starting a vanilla test server work?
Consider dramatically simplifying this test using this snippet:
ctx := context.Background() var args base.TestClusterArgs tc := testcluster.StartTestCluster(t, 1 /* nodes */, args) defer tc.Stopper().Stop(ctx) cc, err := tc.GetStatusClient(ctx, t, 0 /* idx */) require.NoError(t, err) _ = cc.Stacks
Can't use testcluster because it creates a circular dependency: server (status_test is in server package) -> testcluster -> server.
Removed unnecessary tempDir.
mneverov
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dt, @mneverov, and @tbg)
pkg/server/status_test.go, line 500 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Interesting. This data race makes absolutely no sense to me.
WithLabelsmakes a copy of the incomingLabelSet, andDopasses the resultingctxtoSetGoroutineLabelsonly afterWithLabelshas returned. I was able to reproduce with just this simple test:var labelSet = pprof.Labels("foo", "bar") func TestRace(t *testing.T) { done := make(chan struct{}) go func() { pprof.Do(context.Background(), labelSet, func(ctx context.Context) { <-done }) }() var buf bytes.Buffer require.NoError(t, pprof.Lookup("goroutine").WriteTo(&buf, 1)) }I agree with skipping for now. My interest is piqued, but it shouldn't hold up this PR.
now several cli tests failed because of the same reason. Added skip.UnderRaceWithIssue(t, 74133) to them.
mneverov
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dt, @mneverov, and @tbg)
pkg/server/status_ext_test.go, line 30 at r8 (raw file):
// TestStatusLocalStacks verifies that goroutine stack traces are available // via the /_status/stacks/local endpoint. func TestStatusLocalStacks(t *testing.T) {
@tbg thx for helping me out with this.
tbg
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dt, @mneverov, and @tbg)
pkg/server/status_ext_test.go, line 30 at r8 (raw file):
Previously, mneverov (Max Neverov) wrote…
@tbg thx for helping me out with this.
Np! Sorry, meant to submit a comment about this but forgot to actually send it.
pkg/server/status_test.go, line 500 at r3 (raw file):
Previously, mneverov (Max Neverov) wrote…
now several cli tests failed because of the same reason. Added
skip.UnderRaceWithIssue(t, 74133)to them.
Ugh, that is a bummer, but I see that it's necessary. I fiddled a bit with my example above to see if we can work around the false positive somehow, but I don't see how. There are probably more tests that we would need to skip. To make it obvious which ones might catch the race (and to avoid the race in tests that are not sensitive to whether the ?debug=1 is included or not), can you make the collection of the ?debug=1 file conditional on util.RaceEnabled? If this is false, do the current thing. if this is true, instead of hitting the Stacks endpoint, write the same file but the contents will just be "disabled in race mode, see ".
I think then you can remove all of the race-unskips except for the one in the main test in this PR (I think, or at least we will see in CI, deterministically). You'll want to comment on the issue that this conditional can be removed once the issue is closed.
pkg/server/status_test.go, line 105 at r4 (raw file):
Previously, mneverov (Max Neverov) wrote…
Can't use
testclusterbecause it creates a circular dependency: server (status_test is in server package) -> testcluster -> server.
Removed unnecessarytempDir.
Ah, but then the solution is to move this test to a file that belongs to package server_test. I pushed a commit to your branch to that effect and made the simplification.
mneverov
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dt, @mneverov, and @tbg)
pkg/server/status_test.go, line 500 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Ugh, that is a bummer, but I see that it's necessary. I fiddled a bit with my example above to see if we can work around the false positive somehow, but I don't see how. There are probably more tests that we would need to skip. To make it obvious which ones might catch the race (and to avoid the race in tests that are not sensitive to whether the ?debug=1 is included or not), can you make the collection of the
?debug=1file conditional onutil.RaceEnabled? If this is false, do the current thing. if this is true, instead of hitting theStacksendpoint, write the same file but the contents will just be "disabled in race mode, see ".I think then you can remove all of the race-unskips except for the one in the main test in this PR (I think, or at least we will see in CI, deterministically). You'll want to comment on the issue that this conditional can be removed once the issue is closed.
Added the condition. Will update the issue with the link to the file/line after the PR is merged to master.
Fixes: cockroachdb#73539 Release note: None
tbg
left a comment
There was a problem hiding this comment.
there was one more lint failure which I just fixed.
Thanks a bunch, Max! There was a lint and codegen failure but I just amended this PR, should be good to go now.
bors r+
Reviewed 2 of 2 files at r7, 8 of 8 files at r8, 2 of 2 files at r9.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @dt, @mneverov, and @tbg)
|
Build succeeded: |
sorry about extra work and thanks for your help! |
|
blathers backport 21.2 |
|
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from ad4ba0a to blathers/backport-release-21.2-74070: POST https://api.github.com/repos/cockroachlabs/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 21.2 failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
server: expose goroutine dump with labels in cli/debug
Fixes: #73539
Release note: None