Skip to content

acceptance: port the PauseUnpause test to use real kafka in docker#27134

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
danhhz:cdc_acceptance
Jul 9, 2018
Merged

acceptance: port the PauseUnpause test to use real kafka in docker#27134
craig[bot] merged 1 commit intocockroachdb:masterfrom
danhhz:cdc_acceptance

Conversation

@danhhz
Copy link
Copy Markdown
Contributor

@danhhz danhhz commented Jul 3, 2018

There's enough complexity in Kafka that there's no way to have any
confidence in end-to-end correctness testing based on a mock. We need
real Kafka. Both kafka and its zookeeper dependence are java, and so we
need to run them in Docker for these tests to be portable and
reproducible. (I know, I know.)

The major trick here is that kafka has an internal mechanism that lets
you talk to any node and it redirects you to the one that has the data
you're looking for. This is also used internally by the system. These
are configured as advertised hosts.

So, our CockroachDB changefeed needs to be able to reach Kafka at some
address configured in the CREATE CHANGEFEED. Then it receives the list
of advertised host:ports from Kafka, and it needs to be able to contact
all of those. The same is true for the kafka nodes and the consumer that
the test uses for to make assertions. Docker for mac really makes this
difficult. The kafka nodes are running in docker, so they want to be
able to talk to each other by their docker hostnames. We could also run
CockroachDB inside docker, but the test is running outside (where the
docker hosts don't resolve) and there's no easy way to change that.

The easiest thing would be docker's --network=host, but alas that's
not available with docker for mac.

Kafka (theoretically) allows for multiple sets of named advertised
listeners, differentiated by port. When you connect, it sends back the
ones relevant to the port you connected to. This is designed for exactly
this sort of situation. But this is tragically underdocumented and after
literally tens of hours I could not get this to work.

We could run some program inside docker to consume and proxy that
information out to the test somehow (e.g. tailing
kafka-console-consumer), but this is likely to introduce the same sort
of bugs we'd have with the mock.

We could also use --network=container instead of bridge networking,
which lets us share the same network namespace between a bunch of
containers. Then we make everything run on unique ports (and export
them) and use "localhost" for the host everywhere. This works well and
might be what we have to do if we need to run multi-node Kafka clusters.
However, all the necessary ports have to be exported from the first
container started, which requires some major surgery to DockerCluster.

In the end, what we do is similar. Zookeeper and Kafka are assigned
unique ports that are unassigned on the host. They run on that port
inside docker and it's mapped to the same port on the host. (Zookeeper
doesn't need to be available externally, but it was easy and sometimes
it's nice for debugging the test.) A one node Kafka cluster can talk to
itself on localhost and the unique port. CockroachDB also can, but only
from outside docker. And... uh... we're done. \o/

@danhhz danhhz added the do-not-merge bors won't merge a PR with this label. label Jul 3, 2018
@danhhz danhhz requested a review from a team July 3, 2018 18:30
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@danhhz danhhz requested a review from a team July 3, 2018 22:52
@danhhz danhhz force-pushed the cdc_acceptance branch 8 times, most recently from edc4dc6 to e75e746 Compare July 4, 2018 00:09
@danhhz danhhz changed the title [WIP] acceptance: port the PauseUnpause test to use real kafka in docker acceptance: port the PauseUnpause test to use real kafka in docker Jul 4, 2018
@danhhz danhhz removed the do-not-merge bors won't merge a PR with this label. label Jul 4, 2018
@danhhz danhhz requested review from nvb and tbg July 4, 2018 00:19
@danhhz
Copy link
Copy Markdown
Contributor Author

danhhz commented Jul 4, 2018

I think this is (finally) ready to go.

Copy link
Copy Markdown
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How long does this test take on CI? I think you might be able to cut down on it by only setting up a single pod that is shared across the tests (setup time is the big problem in these tests). I've written this below, but this seems.. quite complex to set up, much more complex than something I'd like to do in regular CI. OTOH, I've always disliked how roachtest only catches problems after they're merged, so we don't really have any other options as you pointed out.

Reviewed 17 of 17 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/ccl/acceptanceccl/.gitignore, line 6 at r1 (raw file):

*.backup
*.tfstate
# TODO(tschottdorf): looks unused. Removable?

Remove it. I think the certs are in .localcluster*?


pkg/ccl/acceptanceccl/cdc_kafka_test.go, line 58 at r1 (raw file):

	defer k.Close(ctx)

	defer func(prev time.Duration) { jobs.DefaultAdoptInterval = prev }(jobs.DefaultAdoptInterval)

I guess you gotta do what you gotta do, but doing this from an acceptance test leaves a pretty bad taste in my mouth. How about an env var?


pkg/ccl/acceptanceccl/cdc_kafka_test.go, line 121 at r1 (raw file):

}

func getOpenPort() (string, error) {

You know this, but this doesn't guarantee that the port is open - just makes it more likely. I think in TC we've seen ports grabbed by other processes now that we're using fixed ports to run bare-metal CockroachDB in some acceptance tests.


pkg/ccl/acceptanceccl/cdc_kafka_test.go, line 181 at r1 (raw file):

// we're done. \o/
//
// This is a monstrosity, so please fix it if you can figure out a better way.

Just make it a roachtest? I've become more critical over time about overly complex tests like these. The acceptance package already took upwards of 20min for a cycle last I checked, and I doubt it has gone down since.

There's enough complexity in Kafka that there's no way to have any
confidence in end-to-end correctness testing based on a mock. We need
real Kafka. Both kafka and its zookeeper dependence are java, and so we
need to run them in Docker for these tests to be portable and
reproducible. (I know, I know.)

The major trick here is that kafka has an internal mechanism that lets
you talk to any node and it redirects you to the one that has the data
you're looking for. This is also used internally by the system. These
are configured as advertised hosts.

So, our CockroachDB changefeed needs to be able to reach Kafka at some
address configured in the CREATE CHANGEFEED. Then it receives the list
of advertised host:ports from Kafka, and it needs to be able to contact
all of those. The same is true for the kafka nodes and the consumer that
the test uses for to make assertions. Docker for mac really makes this
difficult. The kafka nodes are running in docker, so they want to be
able to talk to each other by their docker hostnames. We could also run
CockroachDB inside docker, but the test is running outside (where the
docker hosts don't resolve) and there's no easy way to change that.

The easiest thing would be docker's `--network=host`, but alas that's
not available with docker for mac.

Kafka (theoretically) allows for multiple sets of named advertised
listeners, differentiated by port. When you connect, it sends back the
ones relevant to the port you connected to. This is designed for exactly
this sort of situation. But this is tragically underdocumented and after
*literally tens of hours* I could not get this to work.

We could run some program inside docker to consume and proxy that
information out to the test somehow (e.g. tailing
`kafka-console-consumer`), but this is likely to introduce the same sort
of bugs we'd have with the mock.

We could also use `--network=container` instead of bridge networking,
which lets us share the same network namespace between a bunch of
containers. Then we make everything run on unique ports (and export
them) and use "localhost" for the host everywhere. This works well and
might be what we have to do if we need to run multi-node Kafka clusters.
However, all the necessary ports have to be exported from the first
container started, which requires some major surgery to DockerCluster.

In the end, what we do is similar. Zookeeper and Kafka are assigned
unique ports that are unassigned on the host. They run on that port
inside docker and it's mapped to the same port on the host. (Zookeeper
doesn't need to be available externally, but it was easy and sometimes
it's nice for debugging the test.) A one node Kafka cluster can talk to
itself on localhost and the unique port. CockroachDB also can, but only
from outside docker. And... uh... we're done. \o/

Release note: None
Copy link
Copy Markdown
Contributor Author

@danhhz danhhz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

roachtest is really good at testing large scale behavior, but I don't think it replaces everything that the acceptance package does and I don't think it should. The "remote" tests have gotten ripped out of acceptance, but we've never really gone and cleaned it up to specialize in the sort of small end-to-end tests that it could/should be for now.

Anyway, some additionally backstory. I'm about to switch to the async kafka producer, but I can't be confident in the correctness of the tests when we're using my fake kafka thingy instead of real kafka. (I also have an upcoming PR that verifies that we stick to the ordering/partitioning guarantees that we've decided on that needs a real kafka.) So I'm pretty blocked until this merges.

I'm totally open to revisiting how all this works, but at the moment, I really need to make progress on the impl. If you really don't like how this works, can we merge it as-is, continue having this discussion, and then I'll promise to address it in the testing period after the first beta gets cut?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/ccl/acceptanceccl/.gitignore, line 6 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Remove it. I think the certs are in .localcluster*?

Done.


pkg/ccl/acceptanceccl/cdc_kafka_test.go, line 58 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I guess you gotta do what you gotta do, but doing this from an acceptance test leaves a pretty bad taste in my mouth. How about an env var?

Heh. This is code movement and I already have an issue for it from the review of when it was introduced : - ) #25622


pkg/ccl/acceptanceccl/cdc_kafka_test.go, line 121 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

You know this, but this doesn't guarantee that the port is open - just makes it more likely. I think in TC we've seen ports grabbed by other processes now that we're using fixed ports to run bare-metal CockroachDB in some acceptance tests.

I don't know actually. My understanding from stack overflow is that this gets an open port, gives it up, and then the caller immediately uses it. So there's a small race, but that seems fine. It seems to have been working that way in practice, picking ports up in the 50000s on my laptop. Is that not the case?


pkg/ccl/acceptanceccl/cdc_kafka_test.go, line 181 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Just make it a roachtest? I've become more critical over time about overly complex tests like these. The acceptance package already took upwards of 20min for a cycle last I checked, and I doubt it has gone down since.

The run for this PR was 13m19s on CI, 3 minutes of which is TestBuildInfo (which really seems like it shouldn't take that long). This test was 50s, but most of that is downloading the docker images (which I hope should be shared between multiple tests in a single run).

I thought a lot about roachtests for this, and we even have a cdc one already, but in the end it just doesn't seem appropriate for this to me. This is going to be expanded with end-to-end tests to check that we live up to the ordering guarantees we claim. Really, I feel like this should be moved into the changefeedccl directory once I extract some of the docker helper code (so, not in this PR).

Additionally, making kafka in local mode work in a roachtest is going to be even more complex than this is. And I really need local mode for development.

Copy link
Copy Markdown
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I think we're close to the point where we want to run some tests in regular CI only when requested (for example, you as the main developer of CDC would have enough discipline to opt into this test during your PRs), but I agree that this isn't something to be done here.

Reviewed 1 of 1 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/ccl/acceptanceccl/cdc_kafka_test.go, line 121 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

I don't know actually. My understanding from stack overflow is that this gets an open port, gives it up, and then the caller immediately uses it. So there's a small race, but that seems fine. It seems to have been working that way in practice, picking ports up in the 50000s on my laptop. Is that not the case?

The concern is more on bigger shared CI machines, which doesn't apply to our environment (though as I mention above, I suspect it's happening occasionally). Nothing to worry about too much here, I think.


pkg/ccl/acceptanceccl/cdc_kafka_test.go, line 181 at r1 (raw file):

3 minutes of which is TestBuildInfo (which really seems like it shouldn't take that long)

Probably downloaded a big Docker image?

Copy link
Copy Markdown
Contributor Author

@danhhz danhhz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're close to the point where we want to run some tests in regular CI only when requested (for example, you as the main developer of CDC would have enough discipline to opt into this test during your PRs), but I agree that this isn't something to be done here.

Yeah, this sounds like a good option.

Filed #27292 to make sure this doesn't get lost.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/ccl/acceptanceccl/cdc_kafka_test.go, line 181 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

3 minutes of which is TestBuildInfo (which really seems like it shouldn't take that long)

Probably downloaded a big Docker image?

Looks like it's using TestLocal, so I don't think it's docker. I tried to repro it locally with some additional debugging, but it's finishing in 2s.

@danhhz
Copy link
Copy Markdown
Contributor Author

danhhz commented Jul 9, 2018

bors r=tschottdorf

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jul 9, 2018

👎 Rejected by code reviews

Copy link
Copy Markdown
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bors r+

craig bot pushed a commit that referenced this pull request Jul 9, 2018
27134: acceptance: port the PauseUnpause test to use real kafka in docker r=tschottdorf a=danhhz

There's enough complexity in Kafka that there's no way to have any
confidence in end-to-end correctness testing based on a mock. We need
real Kafka. Both kafka and its zookeeper dependence are java, and so we
need to run them in Docker for these tests to be portable and
reproducible. (I know, I know.)

The major trick here is that kafka has an internal mechanism that lets
you talk to any node and it redirects you to the one that has the data
you're looking for. This is also used internally by the system. These
are configured as advertised hosts.

So, our CockroachDB changefeed needs to be able to reach Kafka at some
address configured in the CREATE CHANGEFEED. Then it receives the list
of advertised host:ports from Kafka, and it needs to be able to contact
all of those. The same is true for the kafka nodes and the consumer that
the test uses for to make assertions. Docker for mac really makes this
difficult. The kafka nodes are running in docker, so they want to be
able to talk to each other by their docker hostnames. We could also run
CockroachDB inside docker, but the test is running outside (where the
docker hosts don't resolve) and there's no easy way to change that.

The easiest thing would be docker's `--network=host`, but alas that's
not available with docker for mac.

Kafka (theoretically) allows for multiple sets of named advertised
listeners, differentiated by port. When you connect, it sends back the
ones relevant to the port you connected to. This is designed for exactly
this sort of situation. But this is tragically underdocumented and after
*literally tens of hours* I could not get this to work.

We could run some program inside docker to consume and proxy that
information out to the test somehow (e.g. tailing
`kafka-console-consumer`), but this is likely to introduce the same sort
of bugs we'd have with the mock.

We could also use `--network=container` instead of bridge networking,
which lets us share the same network namespace between a bunch of
containers. Then we make everything run on unique ports (and export
them) and use "localhost" for the host everywhere. This works well and
might be what we have to do if we need to run multi-node Kafka clusters.
However, all the necessary ports have to be exported from the first
container started, which requires some major surgery to DockerCluster.

In the end, what we do is similar. Zookeeper and Kafka are assigned
unique ports that are unassigned on the host. They run on that port
inside docker and it's mapped to the same port on the host. (Zookeeper
doesn't need to be available externally, but it was easy and sometimes
it's nice for debugging the test.) A one node Kafka cluster can talk to
itself on localhost and the unique port. CockroachDB also can, but only
from outside docker. And... uh... we're done. \o/

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jul 9, 2018

Build succeeded

@craig craig bot merged commit b94fe87 into cockroachdb:master Jul 9, 2018
@danhhz danhhz deleted the cdc_acceptance branch July 9, 2018 18:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants