
e2e: Extract Docker-specific functionality#8754

Merged
thanethomson merged 19 commits into master from thane/7832-generic-e2e-runner on Jun 29, 2022
Conversation

@thanethomson (Contributor) commented Jun 14, 2022

Extract Docker-specific functionality in the E2E test framework and put it behind an interface that should hopefully, without too much modification, allow us to implement a Digital Ocean-based infrastructure provider.

Aims to partially address #7832.

Practically, the idea is to be able to switch between a Docker- and Digital Ocean-based infrastructure provider by passing a --provider flag to the E2E test runner to tell it which infra provider to use. docker is the default value, so existing invocations of the runner CLI are unchanged.
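The dispatch described above could look something like the following sketch. The InfraProvider interface and the provider implementations here are illustrative names, not the PR's actual identifiers:

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// InfraProvider abstracts the infrastructure backend used by the runner.
// This name and the constructors below are assumptions for illustration.
type InfraProvider interface {
	Name() string
}

type dockerProvider struct{}

func (dockerProvider) Name() string { return "docker" }

type digitalOceanProvider struct{}

func (digitalOceanProvider) Name() string { return "digitalocean" }

// providerFor maps the --provider flag value to a backend implementation.
func providerFor(id string) (InfraProvider, error) {
	switch id {
	case "docker":
		return dockerProvider{}, nil
	case "digitalocean":
		return digitalOceanProvider{}, nil
	default:
		return nil, fmt.Errorf("unknown infrastructure provider %q", id)
	}
}

func main() {
	provider := flag.String("provider", "docker", "infrastructure provider to use")
	flag.Parse()
	p, err := providerFor(*provider)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(p.Name())
}
```

Because the flag defaults to "docker", running the sketch with no arguments selects the Docker backend, mirroring the backward-compatible behavior described above.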

Extract Docker-specific functionality and put it behind an interface
that should hopefully, without too much modification, allow us to
implement a Digital Ocean-based infrastructure provider.

Signed-off-by: Thane Thomson <connect@thanethomson.com>
@williambanfield (Contributor) left a comment

Really happy with the direction this is moving. I definitely think the right first step is to create a re-usable interface that can work with different infrastructure backends. I left some comments geared at possibly simplifying the interfaces as well as a few thoughts on style. IMO, this probably won't be perfect in this PR and will likely need a bit of alteration when the Digital Ocean backend starts being added.

Comment on lines +16 to +17
// Stop attempts to stop the entire testnet.
Stop(ctx context.Context) error
Contributor

Could we be a bit more specific about what it means to stop the entire testnet? In docker, these commands will stop the running instances but leave all of the data in place. This doesn't really generalize to digital ocean, where bringing the node 'down' will also remove all of its data.

Contributor Author

What do you think it should do? From docker-compose help down:

Stops containers and removes containers, networks, volumes, and images
created by `up`.

By default, the only things removed are:

- Containers for services defined in the Compose file
- Networks defined in the `networks` section of the Compose file
- The default network, if one is used

Networks and volumes defined as `external` are never removed.

I'm not sure there can be an equivalent of an "external" volume for DO servers.

The other tricky bit is that the perturbations have this notion of "stopping" a node that purely involves sending it a SIGTERM, without removing any resources.

My initial thought was that we should only ever totally destroy infrastructure (implying wiping nodes' data) during Cleanup. What do you think? The implication is that, for complex tests that rotate in new groups of peers and kill off old ones, we'd end up keeping all VMs running when we may only need some of them.
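For context, the Docker implementation of Stop discussed here amounts to shelling out to docker-compose, where "stop" halts containers but keeps their data while "down" also removes containers and networks. A minimal sketch, with assumed helper names and file paths (not the PR's actual code):

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
)

// composeArgs builds the docker-compose argument list for a subcommand.
// "stop" halts containers but keeps their data; "down" also removes
// containers and networks, which is why Stop vs Cleanup semantics matter.
func composeArgs(composeFile, subcommand string) []string {
	return []string{"-f", composeFile, subcommand}
}

// stopTestnet stops all containers in the testnet without destroying data.
func stopTestnet(ctx context.Context, composeFile string) error {
	cmd := exec.CommandContext(ctx, "docker-compose", composeArgs(composeFile, "stop")...)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("docker-compose stop failed: %w (output: %s)", err, out)
	}
	return nil
}

func main() {
	// Print the argument list rather than invoking Docker, which may not
	// be available here.
	fmt.Println(composeArgs("testnet/docker-compose.yml", "stop"))
}
```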

Comment on lines +39 to +41
// ProvisionNode attempts to provision infrastructure for the given node
// and starts it.
ProvisionNode(ctx context.Context, node *e2e.Node) error
Contributor

If we want the division in our infrastructure to have Provision and Start as different operations, we'll need to adjust the current e2e app docker image that we use.

The up command in docker-compose will start a running instance of the Tendermint node, since the entrypoint for the image actually runs the process. I'm not sure we really need to maintain two different methods, Start and Provision, when Start should really be sufficient. In the DO case, Start can bring the host up if it's down and then kick off the process, whereas in the docker-compose case, Start will just run the image.

The other option, as far as I can tell, is to have a Provision method that is a No-op in the docker case. However, if we're going to do that, I think that this logic should live in the TestnetInfra interface as a Provision method there since we Provision the entire network at once in Terraform, not node-by-node.

Contributor Author

The only reason I differentiated the two was because the perturbations have a notion of restarting a node that involves stopping and starting it without any change to infra provisioning, but I could just introduce a RestartNode function to subsume the two separate operations.

Contributor Author

Actually, looking at it again now, there are 3 distinct categories of operations in use here:

  1. Testnet-wide operations
  2. Node-specific operations (including infrastructure)
  3. Operations on nodes' processes (not touching the underlying infrastructure)

Contributor

Actually, looking at it again now, there are 3 distinct categories of operations in use here: ...

I think the distinction makes a bit more sense, although in the case of docker-compose the start+provision step are intermixed at the moment. Looks like the API has been updated to reflect that, which seems right.


// KillNode attempts to ensure that the given node's process is killed
// immediately using SIGKILL.
KillNode(ctx context.Context, node *e2e.Node) error
Contributor

Suggested change:
- KillNode(ctx context.Context, node *e2e.Node) error
+ Kill(ctx context.Context, node *e2e.Node) error

We can drop the word Node from these. A node is the only parameter of the method and also included in the name of the type.

type NodeInfra interface {
// ProvisionNode attempts to provision infrastructure for the given node
// and starts it.
ProvisionNode(ctx context.Context, node *e2e.Node) error
Contributor

I think the NodeInfra interface methods don't need the word "Node" in the method name.

Contributor Author

The reason I did this is that I'm only separating out the TestnetInfra and NodeInfra for code readability. The alternative here is to have all functions directly on the Infra interface, which may be a better idea here.

If we change the way things are named here we'll have clashes between the two sets of functionality (i.e. we need to be able to Start a testnet, and we also need to be able to Start a node, and the functions can't have the same name).

If it's a problem I'd rather just lump all of the functions together in a single interface.

Contributor

If we change the way things are named here we'll have clashes between the two sets of functionality (i.e. we need to be able to Start a testnet, and we also need to be able to Start a node, and the functions can't have the same name).

I sort of assumed, based on this interface design, that there wouldn't be situations where the same type would need to implement both interfaces. That may not have been a clear assumption, though, and we can figure out a different organization.

I'm still not convinced that the testnet implementation needs access to the internals of the node implementation, except as an optimization. Is this correct?

Contributor Author

I'm still not convinced that the testnet implementation needs access to the internals of the node implementation, except as an optimization. Is this correct?

Unfortunately not: the runner CLI provides commands by which users can manipulate individual nodes. I won't be able to reduce the surface area of the new interface without substantially reducing the functionality currently provided by the testnet runner, and I'm not sure what the unintended consequences of that would be. Perhaps we can set that as a longer-term goal once we've implemented the Digital Ocean-based infra API?

Contributor

the runner CLI provides commands by which users can manipulate individual nodes

This might actually be OK to remove. We should check with @cmwaters, but I don't think that any of these CLI operations are used for anything, except in theory making it easier to avoid reaching into the provider directly (which you can always do, and maybe ought to in these cases).

Contributor

the runner CLI provides commands by which users can manipulate individual nodes

Which commands are these? I thought all commands were for an entire testnet.


// TestnetInfra provides an API for interacting with the infrastructure for an
// entire testnet.
type TestnetInfra interface {
Contributor

Is there ever a reason to have testnet implementations be specific to the provider, and not just have the logs/start/stop be for-loops over lists of node implementations?

@thanethomson (Contributor Author) Jun 14, 2022

All of the common logic is already in the main runner functions (Start, Setup, Cleanup, etc. in the runner's main package). The only logic I've extracted is that which is explicitly Docker-related.

The way the E2E package has been architected is a bit different from the rest of the codebase, in that its style is more functional. For example, the Testnet and Node structs are primarily just data and provide very little functionality, while the runner package provides a bunch of functions that perform operations using them.
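A minimal sketch of that functional style, with plain data structs and free functions operating on them (field and function names are illustrative, not the package's actual ones):

```go
package main

import "fmt"

// Node and Testnet are primarily data, mirroring the style described above.
type Node struct {
	Name string
}

type Testnet struct {
	Name  string
	Nodes []*Node
}

// Start is a free function rather than a method on Testnet: it iterates
// over the data and performs the side effect for each node.
func Start(testnet *Testnet, startNode func(*Node) error) error {
	for _, node := range testnet.Nodes {
		if err := startNode(node); err != nil {
			return fmt.Errorf("starting node %s: %w", node.Name, err)
		}
	}
	return nil
}

func main() {
	tn := &Testnet{
		Name:  "ci",
		Nodes: []*Node{{Name: "validator01"}, {Name: "validator02"}},
	}
	started := 0
	_ = Start(tn, func(n *Node) error { started++; return nil })
	fmt.Println(started)
}
```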

@creachadair left a comment

A few high-level comments but by and large this seems sensible enough.


// Infra provides an API for interacting with testnet- and node-level
// infrastructure.
type Infra interface {


Is there some benefit to separating these interfaces? It doesn't look as if we implement either without the other.

If we do want them to be separate for some reason I missed, I'd suggest making this a concrete type with the others as fields. Among other things, that will allow us to swap them out independently (though we could at that point also just pass them as separate parameters too).

Contributor Author

I've consolidated the interfaces into a single one now.

Comment on lines +25 to +26
// ShowLogs will print all logs for the whole testnet to stdout.
ShowLogs(ctx context.Context) error


Does this need to return an error? Is there anything actionable the caller can do with an error here?

Suggested change:
- // ShowLogs will print all logs for the whole testnet to stdout.
+ // ShowLogs prints all logs for the testnet to stdout.
  ShowLogs(ctx context.Context) error


Separately: I realize Haskell uses "show" for this concept, but I think "print" might be more descriptive. YMMV.

Contributor Author

This just seems like the convention used in the E2E tests. Returned errors are passed up through the stack to the Cobra CLI machinery and left to Cobra to deal with/display.

Comment on lines +55 to +57
// StartNode attempts to start a node's process. Assumes that the node's
// infrastructure has previously been provisioned using ProvisionNode.
StartNode(ctx context.Context, node *e2e.Node) error


Could we require the implementation to report an error if the node wasn't provisioned? Or alternatively, could we require that StartNode call Provision automatically if it hasn't been? (Or maybe even just fold the two together?)

Contributor Author

I've now renamed this to StartNodeProcess - it's a specific bit of functionality required by the perturbation logic where we shouldn't be provisioning any new infrastructure.

@thanethomson thanethomson marked this pull request as draft June 14, 2022 20:29
@thanethomson thanethomson marked this pull request as ready for review June 14, 2022 20:56
Comment on lines +48 to +55
// DisconnectNode modifies the specified node's network configuration such
// that it becomes bidirectionally disconnected from the network (it cannot
// see other nodes, and other nodes cannot see it).
DisconnectNode(ctx context.Context, node *e2e.Node) error

// ConnectNode modifies the specified node's network configuration such
// that it can become bidirectionally connected.
ConnectNode(ctx context.Context, node *e2e.Node) error
@cmwaters (Contributor) Jun 15, 2022

Can you explain how this is achieved? Are we modifying the config.toml or just halting the node?

@williambanfield (Contributor) Jun 15, 2022

In docker, this works by adding the running container to the docker network. This effectively connects it to the other running nodes. The container can be configured to start on the same network as the other nodes, but if it is not then calling connect will put the nodes onto the same network.

Docker creates virtual networks for containers and containers don't run (by default) on the host's network interface (although they can be configured to behave this way).

https://docs.docker.com/engine/reference/commandline/network_connect/
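In CLI terms, the Docker approach described above amounts to building a docker network connect/disconnect invocation per node. A sketch, with made-up network and container names:

```go
package main

import "fmt"

// networkCmd builds the docker CLI invocation that connects or disconnects
// a container from a network, which is how the Docker provider can simulate
// network partitions. Network and container names are illustrative.
func networkCmd(connect bool, network, container string) []string {
	verb := "connect"
	if !connect {
		verb = "disconnect"
	}
	return []string{"docker", "network", verb, network, container}
}

func main() {
	// Disconnect a hypothetical validator from the testnet network.
	fmt.Println(networkCmd(false, "e2e-testnet", "validator01"))
}
```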

Contributor

So this can be used to simulate the network itself failing?

How would this be achieved in DO?

Contributor Author

The easiest way is to modify the node's iptables, cutting off incoming and outgoing connections to/from certain ports to "disconnect" a node and then restoring them to "connect" it.

For more advanced network outage simulation we'd need to install something like Toxiproxy on the nodes and configure it with the networking parameters we'd want.
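A sketch of what the iptables approach could look like, expressed as a helper that builds the rule commands. Port 26656 is Tendermint's default P2P port; in a real implementation these commands would run on the droplet itself, e.g. over SSH. All names here are assumptions:

```go
package main

import (
	"fmt"
	"strings"
)

// iptablesRules returns the iptables commands that would "disconnect" a
// node by dropping traffic on its P2P port, or undo them to "connect" it.
func iptablesRules(port int, disconnect bool) []string {
	action := "-A" // append the rules: disconnect the node
	if !disconnect {
		action = "-D" // delete the rules: reconnect the node
	}
	return []string{
		fmt.Sprintf("iptables %s INPUT -p tcp --dport %d -j DROP", action, port),
		fmt.Sprintf("iptables %s OUTPUT -p tcp --sport %d -j DROP", action, port),
	}
}

func main() {
	fmt.Println(strings.Join(iptablesRules(26656, true), "\n"))
}
```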

Contributor

So this can be used to simulate the network itself failing?

Correct.

How would this be achieved in DO?

A few ways I can think of

  1. Digital Ocean has a concept of a firewall that can be used to isolate nodes from IP ranges. I haven't looked much into it, but it looks like it can do the trick.
  2. If the firewall rules are too cumbersome or don't give us the flexibility we need, the backup is always to use iptables DROP or REJECT rules, depending on what we want to achieve. DROP simulates a slow app that can't keep up with its ACKs; REJECT simulates a node that is completely unreachable behind a firewall (it sends RST).

Digital Ocean does seem to have an idea of a VPC (virtual private cloud networking) that allows nodes to all be on a shared network, but adding/removing nodes to the VPC seems sketchy and bad.
https://docs.digitalocean.com/products/networking/vpc/how-to/migrate-resources/


// TestnetInfra provides an API for manipulating the infrastructure of a
// specific testnet.
type TestnetInfra interface {
@cmwaters (Contributor) Jun 15, 2022

Do we need this interface to include testnet-wide functions? Can't the caller, i.e. the CLI, just loop through these for each of the nodes it has in its manifest.toml?

I guess to elaborate: are Setup, Stop, Pause, etc. offering functionality that the test harness can't achieve itself by calling the other, node-specific methods?

Contributor

It can, although in the digital ocean case that may be way slower for things like spawning nodes.

Contributor Author

As mentioned in #8754 (comment), in this PR all I've done is extract the Docker-specific functionality and hide it behind an interface. It's already the bare minimum interface surface area with the current architecture. For example, you'll notice there's no Start function on the TestnetInfra interface - that's because the Start function in the main package of the runner just uses StartNode in a loop.

Contributor

OK, your call. I just ideally want the interface to be as simple as possible.


Comment on lines +72 to +74
// node's infrastructure has previously been provisioned using
// ProvisionNode.
StartNodeProcess(ctx context.Context, node *e2e.Node) error
Contributor

Since provision doesn't exist, I'm assuming that the only way to use this method is by using it on a node that's had Start called on it before?

Contributor Author

Yes, this method's only used by the perturbation logic at present.

Comment on lines +45 to +46
// StartNode provisions infrastructure for the given node and starts it.
StartNode(ctx context.Context, node *e2e.Node) error
@williambanfield (Contributor) Jun 15, 2022

So, in the case of Digital Ocean, provisioning infrastructure is one of the slow steps, and doing it all at once with a single call (or a small set of calls) to DO is likely going to be pretty advantageous. I think we should add some mechanism to the API to do the provisioning all at once. Perhaps that can happen in Setup; that seems fine to me, but I just wanted to call it out. Perhaps this method will start the node, provisioning any infrastructure if necessary? For example, if we're adding a node late in the test we may need to provision new infrastructure, but for simpler tests we can provision all of the infrastructure up front.

Contributor

This is likely to be the case on any VM-based provider.

I'm also having sort of flashbacks to all of the ways this can go wrong and get aborted or end up leaking nodes at great expense.

In addition to figuring out how this is going to work for DO (presuming that we're just calling out to Terraform, perhaps it's better to think of this as a "Terraform provider" rather than a DO provider), perhaps it makes sense to make sure that our interface design decisions would also make sense for a k8s-based deploy (just as an exercise).

Contributor Author

I think we should add some mechanism to the API to do the provisioning all at once.

The only real way to accomplish this is to rearchitect the testnet runner. I initially started doing this over the weekend but thought everyone may be more amenable to reviewing a smaller change, but I'd be happy to revive my alternative approach as I still have it stashed locally. It's just going to be a bigger PR and will look very different to the current one 🙂

But that approach would involve introducing a higher-level interface:

type Runner interface {
    Start(ctx context.Context) error
    Setup(ctx context.Context) error
    Cleanup(ctx context.Context) error
    // etc. - only testnet-level manipulation and not node- or process-level
}

I'm also having sort of flashbacks to all of the ways this can go wrong and get aborted or end up leaking nodes at great expense.

I don't think we need to be too concerned about this. It's easy to delete a DO "project" and all of its associated resources.

In addition to figuring out how this is going to work for DO (presuming that we're just calling out to Terraform, perhaps it's better to think of this as a "Terraform provider" rather than a DO provider), perhaps it makes sense to make sure that our interface design decisions would also make sense for a k8s-based deploy (just as an exercise).

Having built and operated several k8s clusters the hard way from scratch, I'd be very much against using k8s for any of this testnet infra. I'd say we'd need a really good justification to even think of supporting k8s, and I don't see one at present that provides any advantage over a DO or AWS-based approach (a DO-based approach can be far more easily translated to AWS).

Contributor

As a funky idea, could you just have these methods take in an array of Nodes as opposed to just a single one. Then you could provision them together


@thanethomson thanethomson marked this pull request as draft June 15, 2022 14:51
@williambanfield (Contributor)

Just to follow up on this PR, since there are a number of comments on possible structuring:

This PR lists itself as extracting the Docker-specific functionality. I think I raced ahead towards seeing where this would go with further providers and added comments to that effect, but I think that we are in a good spot to merge this as-is, with the understanding that the interfaces will change as other providers are added. They will need a bit of tweaking for things like managing hosts that come up slowly and starting vs. provisioning a node, but I think we can cross that bridge when we come to it (hopefully in a very near-term next PR).

@thanethomson thanethomson marked this pull request as ready for review June 21, 2022 19:00
@thanethomson (Contributor Author)

They will need a bit of tweaking for things like managing hosts that come up slowly and starting vs. provisioning a node but I think we can cross that bridge when we come to it (hopefully in a very near-term next PR).

Yes, that's definitely one of my next steps in this process. At the very least, this PR disentangles some of the basic testnet execution logic from the Docker/Docker Compose side effects.

@williambanfield (Contributor) left a comment

Minor style / ergonomics changes, but otherwise I'm very excited for this to land.


// TestnetInfra provides an API for provisioning and manipulating
// infrastructure for a Docker-based testnet.
type TestnetInfra struct {
Contributor

The zero value of this type isn't useful so I would make it private.

// Cleanup destroys all infrastructure and removes all generated testnet files.
func Cleanup(ctx context.Context, logger log.Logger, testnet *e2e.Testnet, ti infra.TestnetInfra) error {
Contributor

This only really needs the dir from testnet, we can just pass only that through instead of the whole testnet. IMO, this is a readability improvement because it makes it easier to tell exactly what's needed by the function.

switch providerID {
case "docker":
Contributor

Perhaps I'm not seeing it, but can we set docker as the default value of this flag so it's used even if the flag is not set?

@thanethomson thanethomson merged commit 3bec166 into master Jun 29, 2022
@thanethomson thanethomson deleted the thane/7832-generic-e2e-runner branch June 29, 2022 12:02
tychoish added a commit that referenced this pull request Jul 26, 2022
5 participants