Skip to content

release: 21.1 microbenchmark regression investigation and sign-off #62212

@nvb

Description

@nvb

This time, split out roughly by team! Of the 568 Go microbenchmarks in the cockroachdb/cockroach repository, a large handful have statistically significant regressions between the release-20.2 branch and the master branch.

We should determine the cause of each of these regressions and decide on whether they are deliberate or accidental and warrant further investigation.

Most of this process was automated with benchdiff. This tool can also be helpful to explore specific regressions here using a command like:

go get github.com/nvanbenschoten/benchdiff
benchdiff --new=master --old=release-20.2 --count=10 --run=<your_benchmark> <your_package>

Or, to automate the process of bisecting a regression, the tool can be used with a command like:

export test_name='BenchmarkEntryCacheClearTo'
export pkg_name='./pkg/kv/kvserver/raftentry'
export threshold=0.1
export base_sha=18b6937745c339264314413de61500e58253d9ee # near release-20.2, but easier to compile

# careful with this! It's an aggressive clean.
echo 'git clean -e benchdiff -e post-checkout.sh -dxff' >> post-checkout.sh
echo 'make buildshort' >> post-checkout.sh

git bisect start master ${base_sha}
git bisect run benchdiff --old=${base_sha} --post-checkout='bash post-checkout.sh' --run=${test_name} --count=20 --threshold=${threshold} ${pkg_name}

SQL

Packages:

./pkg/bench/...
./pkg/ccl/logictestccl
./pkg/ccl/oidcccl
./pkg/ccl/partitionccl
./pkg/ccl/sqlproxyccl
./pkg/col/...
./pkg/config/...
./pkg/sql/...
./pkg/sqlmigrations/...

Spreadsheet

Regressions (check to sign off):

  • KV/Update/SQL, @nvanbenschoten, bisected to 5910ab4, test changed, unclear why allocs/op changed, but 👍
  • Distinct/Ordered/..., @yuzefovich, bisected to 76968ff, the benchmark changed slightly (the source incurs more allocations now), yet the throughput has actually improved
  • Distinct/Unordered/..., @yuzefovich signed off because the regression is expected and is only on a small subset of cases; bisected on the first try to 76968ff, on the second try to d9f7bcb. The true cause is actually 11e6312.
  • TableReader/rows=16, @postamar + @ajwerner, bisected to 15ee387, this one looks real and serious, over a 25% reduction in throughput. After 15ee387 is addressed, there still remains 10% hit or so because of cee3bcf which is discussed elsewhere. @yuzefovich signed off
  • Router/MIRROR, not an actual regression, the difference is explained by TAGS=crdb_test being added if the benchmark binary built via make test
  • Append/bytes/AppendWithSel/NullProbability=0.2, can't reproduce
  • MergeJoiner, not reproducible
  • JoinReader/reqOrdering=false @yuzefovich signed off, same as TableReader/rows=16 above.
  • LogicalProjOp/OR,useSel=true,hasNulls=true, @yuzefovich signed off since this benchmark is known to be a bit flaky. The absolute speed is still very high, not worth digging in.
  • RowChannelPipeline/length=4, no regression, same as Router/MIRROR above
  • SortAll/rows=16, can't reproduce
  • AllSpooler, can't reproduce
  • SQL/MultinodeCockroach/InterleavedSelect/count=10, @yuzefovich saw 20% hit, out of which 6% or so can be attributed to using the vectorized engine, and the remaining difference likely has the same root cause as TableReader/rows=16 the significant increase in allocs is due to the vectorized engine and is expected, but the usage of vec engine doesn't actually reduce the throughput. Still there is about 6-8% hit when comparing against 20.2. @yuzefovich signed off since it doesn't seem worth looking into.
  • Select2/Cockroach @yuzefovich is investigating. The vectorized engine is responsible for 29% hit in throughput. The perf hit is reduced to 13% by colexec: optimize CASE, AND, and OR projection operators #62925, the remaining difference is caused by the fact the CASE, AND/OR logical operations have higher overhead in the vectorized engine, and this benchmark is basically the worst case for it. @yuzefovich signed off.
  • VirtualTableQueries/select_crdb_internal.tables_with_1_fk @rafiss investigated and found no culprit. since this test is new in 21.1 (but got backported to 20.2) it's not that easy to bisect. also, the roundtrips stat did decrease, which is more important since this is I/O bound not CPU/mem bound.
  • EndToEnd/tpcc-new-order/EndToEnd @rytaft bisected to sql: decrease vectorize_row_count_threshold to 0 #55713, "decrease vectorize_row_count_threshold to 0".
  • IsDatumEqual @rytaft investigated, and this is not a regression. If anything, looks like a small improvement.
  • WideTableIgnoreColumns/Cockroach, actually an improvement, likely same as Router/MIRROR above
  • Scan/Cockroach/count=1000/limit=10 @yuzefovich signed off. The 2x or so hit is explained by the switch to the vectorized engine by default for everything, and it is mitigated substantially by sql: hint scan batch size by expected row count #62282. The remaining difference is likely explained by TableReader/rows=16 above. mitigated by colfetcher: use the limit hint to size the batch #63020
  • Planning @rytaft investigated, at least half of the regression is due to using the vectorized engine. Couldn't pin the remainder of the regression to a single commit.
  • VecSkipScan/Bench_scan_with_skip, actually an improvement, likely same as Router/MIRROR above
  • AlterTableAddCheckConstraint/alter_table_add_3_check_constraints
  • NameResolution/MultinodeCockroach @rafiss investigated; bisected to sql: decrease vectorize_row_count_threshold to 0 #55713 (vectorize_row_count_threshold=0). seems expected, and also doesn't matter since the test is supposed to be testing name resolution.
  • SQL/Cockroach/InsertFK/count=1000/nFks=1 @RaduBerinde bisected to *: configure the use of separated intents in a cluster #59829, more info in there.
  • Phases/kv-read-const @mgartner bisected to opt: use safer struct initialization pattern #58386, mitigated partially by opt: speed up new init pattern #62767
  • IndexConstraints @mgartner bisected to opt/idxconstraint: recognize stored column expressions #55867 and opt: use safer struct initialization pattern #58386, regression is minimal and unavoidable
  • many more across time/op, alloc/op, allocs/op, and speed. roundtrips looks good though!

Bulk IO

Packages:

./pkg/blobs
./pkg/ccl/backupccl # not run, https://github.com/cockroachdb/cockroach/issues/62083
./pkg/ccl/baseccl
./pkg/ccl/cliccl
./pkg/ccl/importccl # not run, https://github.com/cockroachdb/cockroach/issues/62085
./pkg/jobs/...

Spreadsheet

Regressions (check to sign off):
None, though the two most important packages here are skipped.

KV

Packages:

./pkg/ccl/kvccl/...
./pkg/ccl/serverccl
./pkg/gossip/...
./pkg/kv/...
./pkg/kv/kvserver/gc # not run, too slow
./pkg/roachpb
./pkg/rpc/...
./pkg/server/...

Spreadsheet

Regressions (check to sign off):

Storage

Packages:

./pkg/ccl/storageccl/...
./pkg/storage/...

Spreadsheet

Regressions (check to sign off):

  • PebbleMapIteration
  • ScanDecodeKeyValue/getTs=true: @nvanbenschoten, bisected to 11abdda, 1ns difference due to new switch case in DecodeKey, 👍

Misc

Packages:

./pkg/base
./pkg/ccl/utilccl/...
./pkg/ccl/workloadccl/...
./pkg/cli/...
./pkg/clusterversion
./pkg/cmd/...
./pkg/internal/...
./pkg/security
./pkg/settings
./pkg/testutils/...
./pkg/ts/...
./pkg/util/...
./pkg/workload/...

Spreadsheet

Regressions (check to sign off):

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-benchmarkingC-performancePerf of queries or internals. Solution not expected to change functional behavior.GA-blocker

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions