QA Process report for v0.37.x (and baseline for v0.34.x)#9499
sergio-mena merged 36 commits into main
Conversation
this is amazing, super good insights!! is there interest in testing:
This isn't planned for this round of QA, because every additional set of parameters represents an additional dimension to the tests, and the cost of testing grows exponentially with each added dimension. We could, however, look into incorporating variations on mempool size and consensus params into future rounds of QA or research.
@marbar3778, further to @thanethomson's reply: when in doubt, we chose to set parameters to their defaults, so the reader can have an idea of what to expect if they don't tweak parameters.
100%, was curious. This work is amazing, thank you for it!!
Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>
Fixed some 'png's that got swapped. Excluded '.*-node-exporter' processes from memory plots
> We can observe there are some very high latencies, towards the end of the test.
> Upon suspicion that they are duplicate transactions, we examined the latencies
> raw file and discovered there are more than 100K duplicate transactions.
Why were there so many duplicate transactions? I thought the scheme we were using with the load generation tool would ensure uniqueness among the transactions by way of the value component of the key/value pairs?
The reason I ask is that this could explain why we had so few valid transactions during the tests, which skews the throughput rate calculation.
The load generator produces unique transactions; the proof is that the v0.37 run doesn't contain any duplicate TXs. My impression is that somehow some transactions are gossiped around and keep on being accepted into blocks. I know there is a protection built into the mempool to prevent this, but I guess it is not working all the time.
My gut feeling is that there is a bug that was triggered in the v0.34 run, but not in the v0.37 run. My next step is to look into @jmalicevic's test to see if the reason for what she saw is also duplicate TXs. If that is the case, I will add some info to the issue she wrote; otherwise, I'll create another issue.
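For reference, checking a run for duplicates can be done by counting repeated hashes in the loadtime "raw" CSV. The sketch below is hedged: the `tx_hash` column name and the file layout are assumptions for illustration, not the tool's documented format.

```python
# Sketch only: count duplicate transactions in a loadtime "raw" CSV.
# The column name 'tx_hash' is an assumed/hypothetical CSV layout.
import csv
from collections import Counter

def count_duplicate_txs(path):
    with open(path, newline='') as f:
        hashes = [row['tx_hash'] for row in csv.DictReader(f)]
    counts = Counter(hashes)
    # A hash seen n times contributes n - 1 duplicates.
    return sum(n - 1 for n in counts.values() if n > 1)
```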
Could it be that the same issue that affects the 200 node network at high load also affected this test? I.e. that the mempool cache is not large enough and so duplicate transactions end up being submitted.
I think so. I added an extra sentence to line 204.
Signed-off-by: Thane Thomson <connect@thanethomson.com>
Update the latency vs throughput script to rather generate plots from the "raw" CSV output from the loadtime reporting tool as opposed to the separated CSV files from the experimental method. Also update the relevant documentation, and regenerate the images from the raw CSV data (resulting in pretty much the same plots as the previous ones). Signed-off-by: Thane Thomson <connect@thanethomson.com>
Signed-off-by: Thane Thomson <connect@thanethomson.com>
Adjust experiment start time to be more accurate and re-plot latency vs throughput Signed-off-by: Thane Thomson <connect@thanethomson.com>
> The plot of latencies can be used as a baseline to compare with other releases.
> ## Prometheus Metrics on the Chosen Experiment
> The following plot summarizes average latencies versus overall throughputs
@thanethomson, do you agree to move line 88 below your paragraph (and make it plural)? I believe the plot you have introduced should also be used in the comparison against the baseline.
> * (non-critical, not fixed) [\#9537] - With the default mempool cache setting,
>   duplicated transactions are not rejected when gossiped and eventually flood
>   all mempools. The 200 node testnets were thus run with a value of 200000 (as
>   opposed to the default 10000).
Thanks for tidying this up :-)
```python
avg_latency = exp['total_latencies'] / exp['tx_count']
exp_start_time = exp['block_time_min'] - exp['block_time_min_duration']
exp_duration = exp['block_time_max'] - exp_start_time
throughput_rate = exp['tx_count'] / exp_duration
```
The rest of the text uses "transactions per minute". Shall we change this to tx/min, or the rest of the text to txs/s?
What's the usual convention when measuring throughput for such systems? I thought it was always txs/s? (cc @cason)
There is no real convention here, although in our papers, and in most papers I am aware of, we use transactions per second, using base-10 multipliers if the numbers are too high (which is not the case here). If the convention is to adopt the international SI, we should use seconds.
I saw that the text uses blocks per minute, which makes more sense since we don't have more than a block, actually not even one block, per second. I saw that in the text we use transactions per minute (e.g., 4154) to keep this correspondence with the blocks (e.g., 45), but I think we have enough transactions per second (e.g., 69.2) for this metric to make sense.
In summary, it is always about readability: we can reason better with 45 blocks/min than with 0.75 blocks/s, and with 69.2 tx/s than 4154 tx/min. In particular, I think that we reason better about 4.154 Ktx/min than 4154 tx/min. The problem here is that it is not natural, or straightforward, to divide by 60 to see the correspondence between the two metrics...
Yes, the timescales here do make more intuitive sense when thinking in minutes as opposed to seconds. Should I regenerate the latency vs throughput graphs using txs/min instead?
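Either unit is a constant factor away from the other, so regenerating the plots would just scale the rate series. A minimal sketch of the conversion (the function name is illustrative, not from the script):

```python
# Sketch: the same throughput expressed in tx/s or tx/min.
# exp_duration_s is in seconds, matching the script excerpt quoted above.
def throughput(tx_count, exp_duration_s, per_minute=False):
    rate = tx_count / exp_duration_s  # tx/s
    return rate * 60.0 if per_minute else rate
```

With the numbers from this discussion, 4154 transactions over a 60 s window is roughly 69.2 tx/s, i.e. 4154 tx/min.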
```python
# Sort stats for each number of connections in order of increasing
# throughput rate, and then extract average latencies and throughput rates
# as separate data series.
conns = sorted(stats.keys())
```
Aren't we doing the same in line 52?
Yes, but returning the sorted connection numbers would be a micro optimization that may hamper readability. Right now we're sorting an array of 3 integers here.
I'm actually not sure why we need to sort these at all. We already have the data for throughput vs latency correlated, so what does doing this add?
@thanethomson's script is plotting with lines, unlike the latencies plot. So I imagine you need to sort so that the lines are properly drawn. In the case of the connections loop, I understand it's for the legend to show the proper order
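To make that concrete, here is a hedged sketch (not the script's actual code) of why line plots need ordered points: matplotlib connects points in the order they are given, so unsorted (throughput, latency) pairs would draw a zig-zag rather than a curve.

```python
# Sketch: order (throughput, latency) points before drawing them as a line.
def sort_series(points):
    """points: iterable of (throughput_rate, avg_latency) pairs."""
    ordered = sorted(points, key=lambda p: p[0])
    xs = [p[0] for p in ordered]
    ys = [p[1] for p in ordered]
    return xs, ys  # suitable for plt.plot(xs, ys)
```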
sergio-mena left a comment
I just reviewed @thanethomson's part.
I can't hit Approve as I would be approving my own commits, but I "Approve" Thane's part.
Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>
scripts/qa/reporting/README.md (Outdated)
```
# This will generate a plot in a PNG file called 'tm034.png' in the current
# directory based on the reporting tool CSV output in the "raw.csv" file. The
# default title of the plot is overridden below.
```
Not sure I understand what this line means, @thanethomson?
In the example below, is the -t flag used to override the default name of the plot?
The default title text at the top of the plot is Tendermint latency vs throughput (defined here).
If you want to override that and customize the title text (mostly a cosmetic thing), then use the -t flag.
"Title" is a specific term as defined by matplotlib, the plotting library I'm using: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.title.html
Right, it's not clear to me which thing in the below text is overriding the title. I think that clarifying that the -t flag is doing that would be helpful.
It should hopefully be clearer now.
Also, one could just run `./latency_throughput.py --help` to get more information on script usage.
```python
# Compute average latency and throughput rate for each experiment
for exp_id, exp in experiments.items():
    conns = exp['connections']
    avg_latency = exp['total_latencies'] / exp['tx_count']
```
avg_latency is helpful for understanding how the majority of transactions fared when executed on the network. I think 95th percentile latency would also be useful to plot to understand how the least lucky transactions did in these scenarios.
I think percentiles are useful information too, but they are not as trivial to extract as the average.
Regarding that, I see little use for the plots presenting all latencies. I would replace them with a CDF of latencies, which provides very useful information about the latency distribution. To produce a CDF we need to order all latencies by their "value", and normalize the transaction count into the 1-100 range.
This is true. We have the full data set and it should be straightforward enough to find a python library to perform the computation. I'm fine leaving this as average, I think we've put in enough alteration to this report and it's helpful enough in its current form.
Yeah grouping the data into percentiles should be easy enough. We can look at doing that for the next round of QA perhaps. Speaking of which, we should probably put together a "laundry list" issue keeping track of all the ways in which we'd want to improve the next round of QA.
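For a future round, both suggestions are doable with the standard library alone. This is a sketch under the assumption that the raw latencies are available as a flat list; it uses the nearest-rank percentile method and emits the CDF as sorted (value, cumulative %) pairs:

```python
# Sketch: 95th-percentile latency (nearest-rank) and an empirical CDF,
# assuming 'latencies' is a flat list of per-transaction latencies.
def p95(latencies):
    s = sorted(latencies)
    rank = (95 * len(s) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return s[rank - 1]

def cdf_points(latencies):
    """Return (latency, cumulative percentage) pairs, ready to plot."""
    s = sorted(latencies)
    n = len(s)
    return [(v, 100.0 * (i + 1) / n) for i, v in enumerate(s)]
```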
Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>
williambanfield left a comment
What's presented here looks good. We definitely have room for improvement in future releases, but for a first go at generating this report, I think we have a great baseline.
It looks like @cason has a few comments left as well.
Yes, but none of them should prevent merging this PR.
Signed-off-by: Thane Thomson <connect@thanethomson.com>
* 1st version. 200 nodes. Missing rotating node
* Small fixes
* Addressed @jmalicevic's comment
* Explain in method how to set the tmint version to test. Improve result section
* 1st version of how to run the 'rotating node' testnet
* Apply suggestions from @williambanfield Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>
* Addressed @williambanfield's comments
* Added reference to Unix load metric
* Added total TXs
* Fixed some 'png's that got swapped. Excluded '.*-node-exporter' processes from memory plots
* Report for rotating node
* Addressed remaining comments from @williambanfield
* Cosmetic
* Addressed some of @thanethomson's comments
* Re-executed the 200 node tests and updated the corresponding sections of the report
* Ignore Python virtualenv directories Signed-off-by: Thane Thomson <connect@thanethomson.com>
* Add latency vs throughput script Signed-off-by: Thane Thomson <connect@thanethomson.com>
* Add README for latency vs throughput script Signed-off-by: Thane Thomson <connect@thanethomson.com>
* Fix local links to folders Signed-off-by: Thane Thomson <connect@thanethomson.com>
* v034: only have one level-1 heading Signed-off-by: Thane Thomson <connect@thanethomson.com>
* Adjust headings Signed-off-by: Thane Thomson <connect@thanethomson.com>
* v0.37.x: add links to issues/PRs Signed-off-by: Thane Thomson <connect@thanethomson.com>
* v0.37.x: add note about bug being present in v0.34 Signed-off-by: Thane Thomson <connect@thanethomson.com>
* method: adjust heading depths Signed-off-by: Thane Thomson <connect@thanethomson.com>
* Show data points on latency vs throughput plot Signed-off-by: Thane Thomson <connect@thanethomson.com>
* Add latency vs throughput plots Signed-off-by: Thane Thomson <connect@thanethomson.com>
* Correct mentioning of v0.34.21 and add heading Signed-off-by: Thane Thomson <connect@thanethomson.com>
* Refactor latency vs throughput script: update the latency vs throughput script to rather generate plots from the "raw" CSV output from the loadtime reporting tool as opposed to the separated CSV files from the experimental method. Also update the relevant documentation, and regenerate the images from the raw CSV data (resulting in pretty much the same plots as the previous ones). Signed-off-by: Thane Thomson <connect@thanethomson.com>
* Remove unused default duration const Signed-off-by: Thane Thomson <connect@thanethomson.com>
* Adjust experiment start time to be more accurate and re-plot latency vs throughput Signed-off-by: Thane Thomson <connect@thanethomson.com>
* Addressed @williambanfield's comments
* Apply suggestions from code review Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>
* Apply suggestions from code review Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>
* scripts: Update latency vs throughput readme for clarity Signed-off-by: Thane Thomson <connect@thanethomson.com>

Signed-off-by: Thane Thomson <connect@thanethomson.com>
Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>
Co-authored-by: Thane Thomson <connect@thanethomson.com>
(cherry picked from commit b06e1ce)
(#9578) QA Process report for v0.37.x (and baseline for v0.34.x) (#9499)
* Remove v037 dir
* Removed reference to v0.37 testnets
Co-authored-by: Sergio Mena <sergio@informal.systems>
…x-rc1
* release/v0.37.0-rc1:
  QA Process report for v0.37.x (and baseline for v0.34.x) (tendermint#9499) (tendermint#9577)
  Fix TX payload for DO testnets (tendermint#9540) (tendermint#9542)
  blocksync: retry requests after timeout (backport tendermint#9518) (tendermint#9533)
  Extend the load report tool to include transactions' hashes (tendermint#9509) (tendermint#9513)
  build(deps): Bump styfle/cancel-workflow-action from 0.10.0 to 0.10.1 (tendermint#9502)
  build(deps): Bump actions/stale from 5 to 6 (tendermint#9494)
  loadtime: add block time to the data point (tendermint#9484) (tendermint#9489)
  config: Add missing storage section when generating config (tendermint#9483) (tendermint#9487)
  Sync Vote.Verify() in spec with implementation (tendermint#9466) (tendermint#9476)
  fix spec (tendermint#9467) (tendermint#9469)
  metrics: fix panic because of absent prometheus label (tendermint#9455) (tendermint#9474)
  Ensure Dockerfile stages use consistent Go version (backport tendermint#9462) (tendermint#9472)
  build(deps): Bump slackapi/slack-github-action from 1.21.0 to 1.22.0 (tendermint#9432)
  build(deps): Bump bufbuild/buf-setup-action from 1.7.0 to 1.8.0 (tendermint#9453)
  state: restore previous error message (tendermint#9435) (tendermint#9440)
  build(deps): Bump gonum.org/v1/gonum from 0.11.0 to 0.12.0 (tendermint#9411)
  docs: Update ADRs for v0.37 (tendermint#9399) (tendermint#9418)
  build(deps): Bump github.com/spf13/viper from 1.12.0 to 1.13.0 (tendermint#9410)
  build(deps): Bump github.com/lib/pq from 1.10.6 to 1.10.7 (tendermint#9409)
  feat: support HTTPS inside websocket (tendermint#9416) (tendermint#9422)
  Removed unused param (tendermint#9394)
  test: generate uuid on startup for load tool (tendermint#9383) (tendermint#9392)
  add redirect links (tendermint#9385) (tendermint#9389)
  refactor: migrate to cosmos/gogoproto (backport tendermint#9356) (tendermint#9381)
  cmd: print all versions of tendermint and its sub protocols (tendermint#9329) (tendermint#9386)
  Add missing changes changelog files (backport tendermint#9376) (tendermint#9382)
  add separated runs by UUID (backport tendermint#9367) (tendermint#9379)
  spec: abci++ cleanup for v0.37 (backport tendermint#9288) (tendermint#9374)
  ci: Remove "(WARNING: BETA SOFTWARE)" tagline from all upcoming releases (tendermint#9371) (tendermint#9372)
  Update rpc client header (tendermint#9276) (tendermint#9349)
  ci: Pre-release workflows (backport tendermint#9366) (tendermint#9368)
  test: add the loadtime report tool (tendermint#9351) (tendermint#9364)
  Update Tendermint version to v0.37.0 (tendermint#9354)
  test: add the loadtime tool (tendermint#9342) (tendermint#9357)

# Conflicts:
#   version/version.go
Closes #9123 and #9333
(rendered)
This PR provides the QA report for both v0.34.x (as a baseline) and v0.37.x.
PR checklist
- [ ] `CHANGELOG_PENDING.md` updated, or no changelog entry needed
- [ ] Updated relevant documentation (`docs/`) and code comments, or no documentation updates needed