Skip to content

test: add the loadtime report tool#9351

Merged
williambanfield merged 17 commits intomainfrom
wb/issue-9331
Sep 2, 2022
Merged

test: add the loadtime report tool#9351
williambanfield merged 17 commits intomainfrom
wb/issue-9331

Conversation

@williambanfield
Copy link
Contributor

@williambanfield williambanfield commented Sep 1, 2022

This pull request adds the report tool and modifies the loadtime libraries to better support its use.
It moves the previously named loadtime tool to be called just load, built under test/loadtime/cmd/load/. The new report tool is built from test/loadtime/cmd/report. These commands cannot easily be made subcommands in the same binary because load makes use of the tm-load-test framework, which provides flags and subcommands of its own. Making both subcommands of the same binary would require surgery to that code base and likely is not worth it.

The report tool works by opening a handle to the Tendermint blockstore and reading all of the transactions out of the block. It calculates the latency of each transaction by differencing the timestamp listed in the transaction Payload and the timestamp of the block. It uses these latencies to calculate a set of metrics including the minimum, maximum, average and standard deviation value.

The tool currently treats all of the transactions it finds as being part of the same experimental run, but minor additional work is needed to separate transactions by experimental variables (Connections / Size / Rate as listed in the Payload data). See: #9352

Sample Output

./build/report --database-type goleveldb --data-dir ~/.tendermint/data
Total Valid Tx: 9000
Total Invalid Tx: 0
Minimum Latency: 351.103223ms
Maximum Latency: 1.040886707s
Average Latency: 703.895594ms
Standard Deviation: 214.400818ms

closes: #9331

PR checklist

  • Tests written/updated, or no tests needed
  • CHANGELOG_PENDING.md updated, or no changelog entry needed
  • Updated relevant documentation (docs/) and code comments, or no
    documentation updates needed

@williambanfield williambanfield marked this pull request as ready for review September 1, 2022 22:02
@williambanfield williambanfield requested a review from a team September 1, 2022 22:02
pdc <- payloadData{err: err}
continue
}
l := b.bt.Sub(p.Time.AsTime())
Copy link
Contributor

@sergio-mena sergio-mena Sep 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We agreed that, for this scheme to work well, we'll need to configure NTP on nodes involved.
In order to (try to) catch the fact that clocks are not synchronized, could you add some (rough) sanity checks here? For instance:

  • the resulting duration should be positive
  • the resulting duration should be less than, say, 120 seconds

These could also be done by inspecting the results provided by the report (say, the script that calls report). Not sure what's best here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the resulting duration should be positive

I think it may make sense to just report a count of negative durations. I doubt we want to error in those cases since the data is still perhaps interesting. Does this seem reasonable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking a bit more about this, probably a single flag in the report (set to true if any delta is negative or ridiculously big) should suffice.
If we have evidence the the clocks aren't synchronized (because we forgot, or NTP daemon crashed, or was accidentally stopped in one of the machines), the simplest course of action to me would be to repeat the test from scratch with a good NTP setup, rather than trying to extract info from a report with skewed measurements.

In short, a count of negative (or very big) durations would be good, but we can also have just a general flag ("clocks likely not in sync").

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ridiculously big

We have the Min and Max so users can determine this on their own.

but we can also have just a general flag ("clocks likely not in sync").

I'd prefer to not try to be much smarter than we need to be at the possible expense of clarity. Adding such a flag is doing the work of interpreting the data for the user which I don't think this tool should be doing. IMO, it should report the data in a manner that helps the user draw conclusions about what happened during the experimental run. It should be clear enough from the outputs given (large number of negative durations, or huge values for Max and Min) that the clocks are not in sync.

Copy link
Contributor

@sergio-mena sergio-mena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool job!

@williambanfield williambanfield added S:backport-to-v0.34.x Tell mergify to backport the PR to v0.34.x S:backport-to-v0.37.x Tell mergify to backport the PR to v0.37.x labels Sep 2, 2022
@williambanfield williambanfield merged commit 8655080 into main Sep 2, 2022
@williambanfield williambanfield deleted the wb/issue-9331 branch September 2, 2022 17:32
mergify bot pushed a commit that referenced this pull request Sep 2, 2022
This pull request adds the report tool and modifies the loadtime libraries to better support its use.

(cherry picked from commit 8655080)
mergify bot pushed a commit that referenced this pull request Sep 2, 2022
This pull request adds the report tool and modifies the loadtime libraries to better support its use.

(cherry picked from commit 8655080)
williambanfield added a commit that referenced this pull request Sep 2, 2022
This pull request adds the report tool and modifies the loadtime libraries to better support its use.

(cherry picked from commit 8655080)

Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>
williambanfield added a commit that referenced this pull request Sep 2, 2022
* test: add the loadtime report tool (#9351)

This pull request adds the report tool and modifies the loadtime libraries to better support its use.

(cherry picked from commit 8655080)

* add nolint

Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>
Co-authored-by: William Banfield <wbanfield@gmail.com>
james-chf added a commit to heliaxdev/tendermint that referenced this pull request Nov 25, 2022
…x-rc1

* release/v0.37.0-rc1:
  QA Process report for v0.37.x (and baseline for v0.34.x) (tendermint#9499) (tendermint#9577)
  Fix TX payload for DO testnets (tendermint#9540) (tendermint#9542)
  blocksync: retry requests after timeout (backport tendermint#9518) (tendermint#9533)
  Extend the load report tool to include transactions' hashes (tendermint#9509) (tendermint#9513)
  build(deps): Bump styfle/cancel-workflow-action from 0.10.0 to 0.10.1 (tendermint#9502)
  build(deps): Bump actions/stale from 5 to 6 (tendermint#9494)
  loadtime: add block time to the data point (tendermint#9484) (tendermint#9489)
  config: Add missing storage section when generating config (tendermint#9483) (tendermint#9487)
  Sync Vote.Verify() in spec with implementation (tendermint#9466) (tendermint#9476)
  fix spec (tendermint#9467) (tendermint#9469)
  metrics: fix panic because of absent prometheus label (tendermint#9455) (tendermint#9474)
  Ensure Dockerfile stages use consistent Go version (backport tendermint#9462) (tendermint#9472)
  build(deps): Bump slackapi/slack-github-action from 1.21.0 to 1.22.0 (tendermint#9432)
  build(deps): Bump bufbuild/buf-setup-action from 1.7.0 to 1.8.0 (tendermint#9453)
  state: restore previous error message (tendermint#9435) (tendermint#9440)
  build(deps): Bump gonum.org/v1/gonum from 0.11.0 to 0.12.0 (tendermint#9411)
  docs: Update ADRs for v0.37 (tendermint#9399) (tendermint#9418)
  build(deps): Bump github.com/spf13/viper from 1.12.0 to 1.13.0 (tendermint#9410)
  build(deps): Bump github.com/lib/pq from 1.10.6 to 1.10.7 (tendermint#9409)
  feat: support HTTPS inside websocket (tendermint#9416) (tendermint#9422)
  Removed unused param (tendermint#9394)
  test: generate uuid on startup for load tool (tendermint#9383) (tendermint#9392)
  add redirect links (tendermint#9385) (tendermint#9389)
  refactor: migrate to cosmos/gogoproto (backport tendermint#9356) (tendermint#9381)
  cmd: print all versions of tendermint and its sub protocols  (tendermint#9329) (tendermint#9386)
  Add missing changes changelog files (backport tendermint#9376) (tendermint#9382)
  add separated runs by UUID (backport tendermint#9367) (tendermint#9379)
  spec: abci++ cleanup for v0.37 (backport tendermint#9288) (tendermint#9374)
  ci: Remove "(WARNING: BETA SOFTWARE)" tagline from all upcoming releases (tendermint#9371) (tendermint#9372)
  Update rpc client header (tendermint#9276) (tendermint#9349)
  ci: Pre-release workflows (backport tendermint#9366) (tendermint#9368)
  test: add the loadtime report tool (tendermint#9351) (tendermint#9364)
  Update Tendermint version to v0.37.0 (tendermint#9354)
  test: add the loadtime tool (tendermint#9342) (tendermint#9357)

# Conflicts:
#	version/version.go
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

S:backport-to-v0.34.x Tell mergify to backport the PR to v0.34.x S:backport-to-v0.37.x Tell mergify to backport the PR to v0.37.x

Projects

Status: Done/Merged

Development

Successfully merging this pull request may close these issues.

qa: Tool to estimate transaction latency from block data for kvstore app

2 participants