Following several discussions internal to the Tendermint engineering team, I am posting an RFC discussing the high-level details of the Tendermint team owning and operating a long-lived testnet in order to build experience running Tendermint, and to demonstrate that Tendermint is stable under production workloads.

The outcome of this RFC will be a new track of work to begin building and maintaining a testnet associated with the main branch of tendermint. See the "Testnet MVP" section specifically for some of the first milestones.

Note, I added the RFC where it would live once #9115 is merged to restore the RFC layout from the v0.36.x branch. docs/rfc/README.md will need to be updated to include this RFC once #9115 is merged.

This RFC is related to #9078.
thanethomson left a comment
Sounds like a great idea to me. What're the next steps from here?
We should probably establish when the best time would be to get the MVP up and running. Is this something that'd add value during our Q3 work?
> but rather to demonstrate that Tendermint blockchains as a whole can be stable
> under a production load.
> Of course we will inject faults periodically, but the intent is to observe and prove that
> the testnet is resilient to those faults.
From this I derive that we can inject malicious/faulty behavior, but this behavior should not in principle halt or produce irrecoverable errors in the chain?
I've thought about this for a few minutes and I'm not sure how to add it to the document.
If you knew that running a particular command did something strange on a 3-node testnet, and you were curious what effect it had on a 100-node testnet: I think it would be okay to run that against the main testnet, even with an uncertain risk that it would halt the testnet.
But if you were certain that you could halt the main testnet with a single command, then I would recommend that you not run it -- unless perhaps you were going to confirm that the next software update was capable of recovering from said halt.
This is a kind of fuzzy area that I think is fine to omit from the document for now.
williambanfield left a comment
Thanks for putting this together! All comments I've made are as points of discussion, but I'm very happy with where the doc is so far.
> We may eventually discover that there is good reason to run more than one testnet for a branch,
> perhaps due to a significant configuration variation.
>
> ### Testnet lifecycle
Is the notion that the testnet will run continuously, with each deploy simply re-installing the tendermint code? Or would the goal be to create a fresh Tendermint instance - new DBs, new configs, etc - on every deploy?
If we keep the testnet running continuously, that will somewhat force us to ensure that intermediate versions of unreleased software are compatible with each other. I think that that burden may be a bit higher than we'd like to bear but I'm open to being convinced otherwise.
I clarified in f6d33c4 that the testnet is intended to be run continuously, but engineers should be very empowered to destroy and recreate the testnet if they think handling an unreleased breaking change is not worth the time.
IMO, if it's a couple-line commit to make an internally breaking change forwards-compatible, and it's five minutes to write that code, I would hope the engineer would take that approach. If it would take at least an hour to write code to handle an unreleased breaking change, I would have no issue with the decision to wipe out the old testnet and recreate it.
> - The testnet self-updates following a new commit pushed to Tendermint's `main` branch on GitHub
>   (there are some omitted steps here, such as CI building appropriate binaries and
>   somehow notifying the testnet that a new build is available)
> - The testnet runs the Tendermint KV store for MVP
KV store seems a-ok to me!
>   a restart for a new binary
> - The testnet operators are notified if any node stops updating blocks,
>   and by extension if a chain halt occurs
> - The testnet has a minimum of 1 full node and 3 validators
I like how small it is to start. Get our feet wet while we build the muscle.
> - The testnet is running directly on VMs or compute instances;
>   while Kubernetes or other orchestration frameworks may offer many significant advantages,
>   the Tendermint engineers should not be required to learn those tools in order to
>   perform basic debugging
I would contend that using kube, in general, should be a non-goal of any such project. Our users almost exclusively do not use it from what I've observed. They run the binary on VMs and we should attempt to mimic what they do as much as possible.
I've heard that the Cosmos ecosystem is a decent mix where the "smaller" validators tend to run directly on VMs, and the "larger" ones tend to run in containerized environments. But I don't have any hard data to back that up.
If that is accurate, I think there is some value in the long term in running Tendermint nodes in containers.
In either case, I am comfortable beginning without any plans for containerized nodes.
> Alternatively, we may discover a good reason to destroy and recreate the `main` testnet
> (such as if we build a new application for running the testnet).
>
> ### Testnet MVP
Mind defining a few 'observability' goals in this section as well? Basic prom / node-exporter is likely sufficient?
In f6d33c4 I clarified that MVP has no observability goals other than alerting on a chain halt or unexpected process exits, but I did add some detail to the medium-term goals around observability. Prometheus and node-exporter should get us pretty far, but I think it is important that the application does publish some meaningful metrics too.
> - Chaos engineering has begun being integrated into the testnets
>   (this could be periodic CPU limiting, deliberate network interference,
>   deliberate filesystem corruption, etc.)
I don't expect Tendermint to survive FS corruption so I'm not sure this should exist as a medium-term goal.
> - The team has published some form of dashboards that have served well for debugging,
>   which external parties can copy/modify to their needs
> - The team has produced a reference model of a log aggregation stack that external parties can use
I like both of these a lot, but they are basically their own projects. If we're going to do this, I think we should spend a bit of time on these, dotting our i's and crossing our t's. An MVP of these dashboards will probably be a bit sloppy, and ramping up to something releasable will take some polishing. My concern, ultimately, is around how we position any such stack/dashboard. If we communicate that this is the Tendermint stack, we may come to own its use in the wild in a way we don't intend. I think we'd need to make sure to position any such resources as 'things that we use but are very much on the user to tinker with'.
> - There is a centralized dashboard to get a quick overview of all testnets,
>   or at least one centralized dashboard per testnet,
>   showing TBD basic information
> - Testnets include cloud spot instances which periodically and abruptly join and leave the network
Like this idea a lot.
> - Testnets have some manner of continuous profiling,
>   so that we can produce an apples-to-apples comparison of CPU/memory cost of particular operations
I think the prometheus node exporter can just get us this? Maybe there's more we need tho? If not, we can hit this goal pretty quickly.
@joeabbey was running an experiment using a fork of some chain, where he was using the Go runtime to collect a profile for some particular ABCI step on every block. I could see providing that as some kind of opt-in configuration, or maybe it would be a configured sample rate, in the eventual custom application for the testnet.
Whether that is something we actually need should become clearer as we approach the long term.
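For reference, a sampled per-step profile along those lines could look roughly like the Go sketch below. This is not the experiment mentioned above: `ShouldProfile`, `ProfileStep`, and the `sampleEvery` setting are hypothetical names, and the only real API it relies on is `pprof.StartCPUProfile` / `pprof.StopCPUProfile` from the standard `runtime/pprof` package.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// ShouldProfile reports whether the block at the given height is sampled.
// A sampleEvery of zero or less disables profiling entirely.
func ShouldProfile(height, sampleEvery int64) bool {
	return sampleEvery > 0 && height%sampleEvery == 0
}

// ProfileStep always runs step. On sampled heights it also captures a CPU
// profile covering the step and returns the raw profile bytes.
func ProfileStep(height, sampleEvery int64, step func()) ([]byte, error) {
	if !ShouldProfile(height, sampleEvery) {
		step()
		return nil, nil
	}
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		// Profiling is best effort (it fails if a profile is already
		// running); the step itself must still execute.
		step()
		return nil, err
	}
	step()
	pprof.StopCPUProfile()
	return buf.Bytes(), nil
}

func main() {
	// Stand-in for some ABCI step; purely illustrative work.
	step := func() {
		sum := 0
		for i := 0; i < 1_000_000; i++ {
			sum += i
		}
		_ = sum
	}
	for height := int64(1); height <= 10; height++ {
		prof, err := ProfileStep(height, 5, step)
		if err != nil {
			fmt.Println("profiling failed at height", height, ":", err)
			continue
		}
		if prof != nil {
			fmt.Println("captured profile at height", height, "bytes:", len(prof))
		}
	}
}
```

A step that takes only a few milliseconds will contain very few CPU samples, so in practice a per-block profile is probably only informative for the heavier steps or when aggregated over many blocks.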
>   As a rule of thumb, all engineers should be able to get shell access on any given instance
>   and should have access to the instance's logs.
>   Little if any further operational skills will be expected.
> - The testnets are not intended to be _created_ for one-off experiments.
I'm not sure where, but I think we should endeavor to not do any experimentation on these networks. As soon as we introduce config-skew into the networks, we will reduce confidence in the results of running the testnets considerably.
"Wait, was that the node that you changed the max connections on? Maybe it's nothing" etc.
f6d33c4 clarifies that it is okay to interact with the blockchain, but configuration should not be modified.
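If we ever want to check for config skew mechanically, one possible approach is to compare a digest of each node's configuration. The Go sketch below is only an illustration under assumptions: `ConfigGroups` and `HasSkew` are hypothetical helpers, and it presumes that legitimately node-specific fields (moniker, peers, keys) have been stripped from the config before hashing.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// ConfigGroups groups node names by the SHA-256 digest of their
// (already normalized) configuration contents.
func ConfigGroups(configs map[string][]byte) map[string][]string {
	groups := make(map[string][]string)
	for node, cfg := range configs {
		sum := sha256.Sum256(cfg)
		key := hex.EncodeToString(sum[:])
		groups[key] = append(groups[key], node)
	}
	// Sort names within each group so the output is deterministic.
	for _, nodes := range groups {
		sort.Strings(nodes)
	}
	return groups
}

// HasSkew reports whether at least two nodes have different configs.
func HasSkew(configs map[string][]byte) bool {
	return len(ConfigGroups(configs)) > 1
}

func main() {
	// Illustrative config fragments, not real node configuration.
	configs := map[string][]byte{
		"validator-0": []byte("max_connections = 40\n"),
		"validator-1": []byte("max_connections = 40\n"),
		"validator-2": []byte("max_connections = 10\n"),
	}
	for digest, nodes := range ConfigGroups(configs) {
		fmt.Println(digest[:12], nodes)
	}
	fmt.Println("config skew detected:", HasSkew(configs))
}
```

Grouping by digest rather than diffing against a single reference node shows which nodes agree with each other, which helps answer the "was that the node you changed?" question.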
Further thought: I think it may be worth clarifying how much engineering time we expect to dedicate to this effort on an ongoing basis. It's nice to want powerful environments like this, but if we expect it to take many full-time engineers to maintain, we may need to rethink the direction. At the moment, I don't expect that to be the case from what's written here, but I do think we should call out how much effort we expect to expend.
Thanks for the approval -- I am going to merge this today. I have a bit more research to do, and some discussions to have with some SMEs to guide some of the initial implementation details. The outcome of the research and discussions will be a new set of issues tracking work towards MVP.
I am expecting to start work on this in earnest this week or next. It is too early to provide an estimate for when the MVP will be ready, but it does seem reasonable that this will add value towards the Q3 cycle, as we will necessarily be exercising more code paths, under new conditions that are difficult to replicate in unit tests.
Yes, the testnets in this RFC will be continuously run throughout development. It seems completely reasonable to me to have an additional set of large testnets with varying configuration used as final verification of a release.
It might have been helpful to have a section on how this differs from the e2e tests, which are run for every PR and every night, i.e. the long-lived testnet ensures upgrade compatibility across releases.
* docs/rfc: add testnet RFC
* docs/rfc: minor updates to testnet rfc
* docs/rfc: respond to more feedback on testnet RFC
* docs/rfc: add RFC 023 to rfc index