Make testserver multi-node more robust. #148
Conversation
RichardJCai
commented
Sep 21, 2022
- Fix race condition in pollListeningURLFile.
- Add Opts for init node timeout and poll listening url.
- InitTimeoutOpt, PollListenURLTimeoutOpt
- Parallelize restart test
pawalt
left a comment
Is there any way we can test the added arguments? Maybe make a testserver configuration we know will time out on init, and time it to make sure the option is propagated properly?
Reviewed all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rafiss, @RichardJCai, and @ZhouXing19)
testserver/testserver.go line 814 at r1 (raw file):
ts.mu.RLock()
nodeDir := fmt.Sprintf("%s%d", ts.baseDir, i)
Why do we need to remove the individual directories if we're going to removeAll after anyway? Is there some case in which we'd fail to remove a single node's directory but not the others?
testserver/testserver_test.go line 463 at r1 (raw file):
for j := 0; j < 3; j++ {
    for {
        port, err := getFreePort()
This is a very edge case, but this could race if another process is attempting to get a port from :0 at the same time. If that process does a TCP listen on :0 between when these ports are allocated and the testserver is started, it could get assigned one of the ports you've allocated here.
I think if you just feed the testserver :0 as the port to listen on, it'll start up guaranteeing port uniqueness, and you won't have to do any of this allocation.
RichardJCai
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @pawalt, @rafiss, and @ZhouXing19)
testserver/testserver.go line 814 at r1 (raw file):
Previously, pawalt (Peyton Walters) wrote…
Why do we need to remove the individual directories if we're going to removeAll after anyway? Is there some case in which we'd fail to remove a single node's directory but not the others?
They're separate directories, not nested. I.e. the base one looks like cockroach-testserver123 and the node directories are cockroach-testserver1230, cockroach-testserver1231, ...
testserver/testserver_test.go line 463 at r1 (raw file):
This is a very edge case, but this could race if another process is attempting to get a port from :0 at the same time. If that process does a TCP listen on :0 between when these ports are allocated and the testserver is started, it could get assigned one of the ports you've allocated here.
Yeah I thought about using :0 but we need to know what address to feed into the --join flag which includes the port. Pretty annoying problem here.
pawalt
left a comment
LGTM. Your call on testing the new arguments. Functionality looks pretty obvious, but never hurts to have some tests.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @rafiss, @RichardJCai, and @ZhouXing19)
testserver/testserver_test.go line 463 at r1 (raw file):
Previously, RichardJCai (Richard Cai) wrote…
This is a very edge case, but this could race if another process is attempting to get a port from :0 at the same time. If that process does a TCP listen on :0 between when these ports are allocated and the testserver is started, it could get assigned one of the ports you've allocated here.
Yeah I thought about using :0 but we need to know what address to feed into the --join flag, which includes the port. Pretty annoying problem here.
Hmmm that is annoying. Thanks for the clarification.
I'm allergic to adding tests for timeouts. Jk, I'll add a simple one.
Actually I'm gonna merge this as-is; adding a test hook for the timeout is somewhat annoying, and I'm gonna try to get the testserver into CRDB before I leave.