docs/rfc: add testnet RFC #9124

Merged
mark-rushakoff merged 7 commits into main from mr/rfc-testnet
Aug 1, 2022

@mark-rushakoff mark-rushakoff commented Jul 28, 2022

Following several discussions internal to the Tendermint engineering
team, I am posting an RFC discussing the high-level details of the
Tendermint team owning and operating a long-lived testnet in order to
build experience running Tendermint, and to demonstrate that Tendermint
is stable under production workloads.

The outcome of this RFC will be a new track of work to begin building
and maintaining a testnet associated with the main branch of tendermint.
See the "Testnet MVP" section specifically for some of the first
milestones.

Note, I added the RFC where it would live once #9115 is merged to
restore the RFC layout from the v0.36.x branch. docs/rfc/README.md will
need to be updated to include this RFC once #9115 is merged.

This RFC is related to #9078.

{Rendered}

@thanethomson thanethomson left a comment

Sounds like a great idea to me. What're the next steps from here?

We should probably establish when the best time would be to get the MVP up and running. Is this something that'd add value during our Q3 work?

@cason cason left a comment

Great document.

but rather to demonstrate that Tendermint blockchains as a whole can be stable
under a production load.
Of course we will inject faults periodically, but the intent is to observe and prove that
the testnet is resilient to those faults.

From this I derive that we can inject malicious/faulty behavior, but this behavior should not in principle halt or produce irrecoverable errors in the chain?

@mark-rushakoff (Contributor, Author) replied:

I've thought about this for a few minutes and I'm not sure how to add it to the document.

If you knew that running a particular command did something strange on a 3-node testnet, and you were curious what effect it had on a 100-node testnet: I think it would be okay to run that against the main testnet, even with an uncertain risk that it would halt the testnet.

But if you were certain that you could halt the main testnet with a single command, then I would recommend that you not run it -- unless perhaps you were going to confirm that the next software update was capable of recovering from said halt.

This is a kind of fuzzy area that I think is fine to omit from the document for now.

@williambanfield williambanfield left a comment

Thanks for putting this together! All comments I've made are as points of discussion, but I'm very happy with where the doc is so far.

We may eventually discover that there is good reason to run more than one testnet for a branch,
perhaps due to a significant configuration variation.

### Testnet lifecycle

Is the notion that the testnet will run continuously, with each deploy simply re-installing the tendermint code? Or would the goal be to create a fresh Tendermint instance - new DBs, new configs, etc - on every deploy?

If we keep the testnet running continuously, that will somewhat force us to ensure that intermediate versions of unreleased software are compatible with each other. I think that that burden may be a bit higher than we'd like to bear but I'm open to being convinced otherwise.

@mark-rushakoff (Contributor, Author) replied:

I clarified in f6d33c4 that the testnet is intended to be run continuously, but engineers should be very empowered to destroy and recreate the testnet if they think handling an unreleased breaking change is not worth the time.

IMO, if it's a couple-line commit to make an internally breaking change forwards-compatible, and it's five minutes to write that code, I would hope the engineer would take that approach. If it would take at least an hour to write code to handle an unreleased breaking change, I would have no issue with the decision to wipe out the old testnet and recreate it.

- The testnet self-updates following a new commit pushed to Tendermint's `main` branch on GitHub
(there are some omitted steps here, such as CI building appropriate binaries and
somehow notifying the testnet that a new build is available)
- The testnet runs the Tendermint KV store for MVP

KV store seems a-ok to me!

a restart for a new binary
- The testnet operators are notified if any node stops updating blocks,
and by extension if a chain halt occurs
- The testnet has a minimum of 1 full node and 3 validators

I like how small it is to start. Get our feet wet while we build the muscle.

Comment on lines +103 to +106
- The testnet is running directly on VMs or compute instances;
while Kubernetes or other orchestration frameworks may offer many significant advantages,
the Tendermint engineers should not be required to learn those tools in order to
perform basic debugging

I would contend that using kube, in general, should be a non-goal of any such project. Our users almost exclusively do not use it from what I've observed. They run the binary on VMs and we should attempt to mimic what they do as much as possible.

@mark-rushakoff (Contributor, Author) replied:

I've heard that the Cosmos ecosystem is a decent mix where the "smaller" validators tend to run directly on VMs, and the "larger" ones tend to run in containerized environments. But I don't have any hard data to back that up.

If that is accurate, I think there is some value in the long term in running Tendermint nodes in containers.

In either case, I am comfortable beginning without any plans for containerized nodes.

Alternatively, we may discover a good reason to destroy and recreate the `main` testnet
(such as if we build a new application for running the testnet).

### Testnet MVP

Mind defining a few 'observability' goals in this section as well? Basic prom / node-exporter is likely sufficient?

@mark-rushakoff (Contributor, Author) replied:

In f6d33c4 I clarified that MVP has no observability goals other than alerting on a chain halt or unexpected process exits, but I did add some detail to the medium-term goals around observability. Prometheus and node-exporter should get us pretty far, but I think it is important that the application does publish some meaningful metrics too.

Comment on lines +128 to +130
- Chaos engineering has begun being integrated into the testnets
(this could be periodic CPU limiting, deliberate network interference,
deliberate filesystem corruption, etc.)

I don't expect Tendermint to survive FS corruption so I'm not sure this should exist as a medium-term goal.

Comment on lines +125 to +127
- The team has published some form of dashboards that have served well for debugging,
which external parties can copy/modify to their needs
- The team has produced a reference model of a log aggregation stack that external parties can use
@williambanfield williambanfield Jul 29, 2022

I like both of these a lot, but they are basically their own projects. If we're going to do this, I think we should spend a bit of time dotting our i's and crossing our t's. An MVP of these dashboards will probably be a bit sloppy, and ramping up to something releasable will take some polishing. My concern, ultimately, is around how we position any such stack/dashboard. If we communicate that this is the Tendermint stack, we may come to own its use in the wild in a way we don't intend. I think we'd need to make sure to position any such resources as 'things that we use but are very much on the user to tinker with'.

- There is a centralized dashboard to get a quick overview of all testnets,
or at least one centralized dashboard per testnet,
showing TBD basic information
- Testnets include cloud spot instances which periodically and abruptly join and leave the network

Like this idea a lot.

Comment on lines +153 to +154
- Testnets have some manner of continuous profiling,
so that we can produce an apples-to-apples comparison of CPU/memory cost of particular operations

I think the prometheus node exporter can just get us this? Maybe there's more we need tho? If not, we can hit this goal pretty quickly.

@mark-rushakoff (Contributor, Author) replied:

@joeabbey was running an experiment using a fork of some chain, where he was using the Go runtime to collect a profile for some particular ABCI step on every block. I could see providing that as some kind of opt-in configuration, or maybe it would be a configured sample rate, in the eventual custom application for the testnet.

Whether that is something we actually need should become clearer as we approach the long term.

As a rule of thumb, all engineers should be able to get shell access on any given instance
and should have access to the instance's logs.
Little if any further operational skills will be expected.
- The testnets are not intended to be _created_ for one-off experiments.

I'm not sure where, but I think we should endeavor to not do any experimentation on these networks. As soon as we introduce config-skew into the networks, we will reduce confidence in the results of running the testnets considerably.

"Wait, was that the node that you changed the max connections on? Maybe it's nothing" etc.

@mark-rushakoff (Contributor, Author) replied:

f6d33c4 clarifies that it is okay to interact with the blockchain, but configuration should not be modified.

@williambanfield williambanfield commented

Further thought: I think it may be worth clarifying how much engineering time we expect to dedicate to this effort on an ongoing basis. It's nice to want powerful environments like this but if we expect it to take many full time engineers to maintain, we may need to rethink the direction. At the moment, I don't expect that to be the case from what's written here but I do think we should call out how much effort we expect to expend.

@mark-rushakoff mark-rushakoff requested a review from a team July 29, 2022 20:06

@cmwaters cmwaters left a comment

Is this seen as something in parallel with the large scale testnets we will run before every release?

@mark-rushakoff mark-rushakoff commented

> Sounds like a great idea to me. What're the next steps from here?

Thanks for the approval -- I am going to merge this today. I have a bit more research to do, and some discussions to have with some SMEs to guide some of the initial implementation details.

The outcome of the research and discussions will be a new set of issues tracking work towards MVP.

> We should probably establish when the best time would be to get the MVP up and running. Is this something that'd add value during our Q3 work?

I am expecting to start work on this in earnest this week or next. It is too early to provide an estimate for when the MVP will be ready, but it does seem reasonable that this will add value towards the Q3 cycle, as we will necessarily be exercising more code paths, under new conditions that are difficult to replicate in unit tests.

@mark-rushakoff mark-rushakoff commented

> Is this seen as something in parallel with the large scale testnets we will run before every release?

Yes, the testnets in this RFC will be continuously run throughout development. It seems completely reasonable to me to have an additional set of large testnets with varying configuration used as final verification of a release.

@mark-rushakoff mark-rushakoff merged commit b6a515a into main Aug 1, 2022
@mark-rushakoff mark-rushakoff deleted the mr/rfc-testnet branch August 1, 2022 15:33

cmwaters commented Aug 1, 2022

It might have been helpful, then, to have a section on how this differs from the e2e tests, which are run for every PR and every night; i.e., the long-lived testnet ensures upgrade compatibility across releases.

samricotta pushed a commit that referenced this pull request Aug 1, 2022
* docs/rfc: add testnet RFC

* docs/rfc: minor updates to testnet rfc

* docs/rfc: respond to more feedback on testnet RFC

* docs/rfc: add RFC 023 to rfc index
samricotta pushed a commit that referenced this pull request Aug 12, 2022
samricotta pushed a commit that referenced this pull request Aug 16, 2022