
release-22.1: spanconfig,kv: merge adjacent ranges with identical configs #80811

Merged
irfansharif merged 5 commits into release-22.1 from blathers/backport-release-22.1-79700 on May 3, 2022

Conversation


blathers-crl bot commented Apr 29, 2022

Backport 5/5 commits from #79700 on behalf of @irfansharif.

/cc @cockroachdb/release


spanconfig,kv: merge adjacent ranges with identical configs

Fixes #72389.
Fixes #66063 (gated behind a cluster setting).

This should drastically reduce the total number of ranges in the system,
especially when running with a large number of tables and/or tenants. To
understand what the new set of split points is, consider the following
test snippet:

    exec-sql tenant=11
    CREATE DATABASE db;
    CREATE TABLE db.t0();
    CREATE TABLE db.t1();
    CREATE TABLE db.t2();
    CREATE TABLE db.t3();
    CREATE TABLE db.t4();
    CREATE TABLE db.t5();
    CREATE TABLE db.t6();
    CREATE TABLE db.t7();
    CREATE TABLE db.t8();
    CREATE TABLE db.t9();
    ALTER TABLE db.t5 CONFIGURE ZONE using num_replicas = 42;
    ----

    # If installing a unique zone config for a table in the middle, we
    # should observe three splits (before, the table itself, and after).

    diff offset=48
    ----
    --- gossiped system config span (legacy)
    +++ span config infrastructure (current)
    ...
     /Tenant/10                                 database system (tenant)
     /Tenant/11                                 database system (tenant)
    +/Tenant/11/Table/106                       range default
    +/Tenant/11/Table/111                       num_replicas=42
    +/Tenant/11/Table/112                       range default

This PR introduces two cluster settings to selectively opt into this
optimization: spanconfig.{tenant,host}_coalesce_adjacent.enabled
(defaulting to true and false, respectively; see the sketch after the
list of options below). We also don't coalesce system table ranges on
the host tenant. We had a few implementation choices here:

(a) Down in KV, augment the spanconfig.Store to coalesce the in-memory
    state when updating entries. On every update, we'd check if the span
    we're writing to has the same config as the preceding and/or
    succeeding one, and if so, write out a larger span instead. We
    previously prototyped a form of this in #68491.

    Pros:
    - reduced memory footprint of spanconfig.Store
    - read path (computing splits) stays fast -- just have to look up
      the right span from the backing interval tree
    Cons:
    - uses more persisted storage than necessary
    - difficult to disable the optimization dynamically (though still
      possible -- we'd effectively have to restart the KVSubscriber and
      populate in-memory span config state using a store that
      does/doesn't coalesce configs)
    - difficult to implement; our original prototype did not have #73150,
      which is important for reducing reconciliation round trips

(b) Same as (a), but coalesce configs up in the spanconfig.Store
    maintained in the reconciler itself.

    Pros:
    - reduced storage use (both persisted and in-memory)
    - read path (computing splits) stays fast -- just have to look up
      the right span from the backing interval tree
    Cons:
    - very difficult to disable the optimization dynamically (each
      tenant process would need to be involved)
    - most difficult to implement

(c) Same as (a), but through another API on the spanconfig.Store
    interface that accepts only a single update at a time and does not
    generate deltas (not something we need down in KV). This removes the
    implementation complexity.

(d) Keep the contents of system.span_configurations and the in-memory
    state of spanconfig.Stores as they are today, uncoalesced. When
    determining split points, iterate through adjacent configs within
    the provided key bounds and see whether we can ignore certain
    split keys.

    Pros:
    - easiest to implement
    - easy to disable the optimization dynamically, e.g. through a
      cluster setting
    Cons:
    - uses more storage (persisted and in-memory) than necessary
    - read path (computing splits) is more expensive when iterating
      through adjacent configs
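
Since easy dynamic control is the main draw of (d), opting in or out is
just a matter of toggling the settings named above. A minimal sketch,
assuming CockroachDB's standard SET CLUSTER SETTING syntax (the setting
names are the ones this PR introduces):

    -- Opt the host tenant's ranges into coalescing (defaults to false):
    SET CLUSTER SETTING spanconfig.host_coalesce_adjacent.enabled = true;
    -- Opt secondary tenants' ranges out of coalescing (defaults to true):
    SET CLUSTER SETTING spanconfig.tenant_coalesce_adjacent.enabled = false;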

This PR implements option (d). For a sense of how slow (d) is in
practice, here's a benchmark that scans over varying numbers of entries
(10k, 100k, 1M):

    $ dev bench pkg/spanconfig/spanconfigstore -f=BenchmarkStoreComputeSplitKey -v
    BenchmarkStoreComputeSplitKey
    BenchmarkStoreComputeSplitKey/num-entries=10000
    BenchmarkStoreComputeSplitKey/num-entries=10000-10               1166842 ns/op
    BenchmarkStoreComputeSplitKey/num-entries=100000
    BenchmarkStoreComputeSplitKey/num-entries=100000-10             12273375 ns/op
    BenchmarkStoreComputeSplitKey/num-entries=1000000
    BenchmarkStoreComputeSplitKey/num-entries=1000000-10           140591766 ns/op
    PASS

It's possible that in the future we rework this in favor of (c).
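
For intuition, here's a minimal, self-contained sketch of what (d)'s
read path boils down to. The types and names are illustrative stand-ins,
not the actual spanconfig.Store code:

    package main

    import "fmt"

    // entry is an illustrative stand-in for a span config record: a start
    // key and an opaque config (equal strings model identical configs).
    type entry struct {
        startKey string
        config   string
    }

    // computeSplitKey walks entries ordered by start key and returns the
    // first start key after `start` whose config differs from its
    // predecessor's. Adjacent entries with identical configs are thereby
    // coalesced: no split key is emitted between them.
    func computeSplitKey(entries []entry, start string) (string, bool) {
        for i := 1; i < len(entries); i++ {
            if entries[i].startKey <= start {
                continue
            }
            if entries[i].config != entries[i-1].config {
                return entries[i].startKey, true
            }
        }
        return "", false
    }

    func main() {
        entries := []entry{
            {"/Table/106", "range default"},
            {"/Table/107", "range default"},
            {"/Table/110", "range default"},
            {"/Table/111", "num_replicas=42"},
            {"/Table/112", "range default"},
        }
        key, ok := computeSplitKey(entries, "/Table/106")
        fmt.Println(key, ok) // "/Table/111 true": 106..110 coalesce into one range
    }

This also makes the cost model above concrete: in the worst case the
scan visits every adjacent entry within the key bounds, which is what
the benchmark measures.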


spanconfig/store: use templatized btree impl instead

    $ dev bench pkg/spanconfig/spanconfigstore -f=BenchmarkStoreComputeSplitKey -v
    BenchmarkStoreComputeSplitKey
    BenchmarkStoreComputeSplitKey/num-entries=10000
    BenchmarkStoreComputeSplitKey/num-entries=10000-10                431093 ns/op
    BenchmarkStoreComputeSplitKey/num-entries=100000
    BenchmarkStoreComputeSplitKey/num-entries=100000-10              4308200 ns/op
    BenchmarkStoreComputeSplitKey/num-entries=1000000
    BenchmarkStoreComputeSplitKey/num-entries=1000000-10            43827373 ns/op
    PASS

    $ benchstat old.txt new.txt # from previous commit
    name                                         old time/op  new time/op  delta
    StoreComputeSplitKey/num-entries=10000-10    1.17ms ± 0%  0.43ms ± 0%   ~     (p=1.000 n=1+1)
    StoreComputeSplitKey/num-entries=100000-10   12.4ms ± 0%   4.3ms ± 0%   ~     (p=1.000 n=1+1)
    StoreComputeSplitKey/num-entries=1000000-10   136ms ± 0%    44ms ± 0%   ~     (p=1.000 n=1+1)

(The "~" deltas only mean benchstat can't establish statistical
significance from single runs (n=1+1); the raw numbers show a roughly 3x
improvement.)

spanconfig/store: intern span configs

- Avoid the expensive proto.Equal() when computing split keys;
- Reduce memory overhead of the data structure

    $ dev bench pkg/spanconfig/spanconfigstore -f=BenchmarkStoreComputeSplitKey -v
    BenchmarkStoreComputeSplitKey
    BenchmarkStoreComputeSplitKey/num-entries=10000
    BenchmarkStoreComputeSplitKey/num-entries=10000-10                 90323 ns/op
    BenchmarkStoreComputeSplitKey/num-entries=100000
    BenchmarkStoreComputeSplitKey/num-entries=100000-10               915936 ns/op
    BenchmarkStoreComputeSplitKey/num-entries=1000000
    BenchmarkStoreComputeSplitKey/num-entries=1000000-10             9575781 ns/op

    $ benchstat old.txt new.txt # from previous commit
    name                                         old time/op  new time/op  delta
    StoreComputeSplitKey/num-entries=10000-10     431µs ± 0%    90µs ± 0%   ~     (p=1.000 n=1+1)
    StoreComputeSplitKey/num-entries=100000-10   4.31ms ± 0%  0.92ms ± 0%   ~     (p=1.000 n=1+1)
    StoreComputeSplitKey/num-entries=1000000-10  43.8ms ± 0%   9.6ms ± 0%   ~     (p=1.000 n=1+1)
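
A minimal sketch of the interning scheme, with an illustrative stand-in
type for the span config proto (the real code interns the actual config
protos): each distinct config gets one canonical pointer, so the
hot-path equality check becomes a pointer comparison instead of
proto.Equal:

    package main

    import "fmt"

    // config is an illustrative stand-in for a span config proto; its
    // serialized form (here, the string itself) acts as the intern key.
    type config string

    // interner hands out one canonical pointer per distinct config, so
    // "same config?" reduces to a pointer comparison rather than a
    // structural comparison (proto.Equal in the real code).
    type interner struct {
        canonical map[config]*config
    }

    func newInterner() *interner {
        return &interner{canonical: make(map[config]*config)}
    }

    func (in *interner) intern(c config) *config {
        if ptr, ok := in.canonical[c]; ok {
            return ptr
        }
        in.canonical[c] = &c
        return &c
    }

    func main() {
        in := newInterner()
        a := in.intern("num_replicas=42")
        b := in.intern("num_replicas=42")
        c := in.intern("range default")
        fmt.Println(a == b, a == c) // true false: pointer equality suffices
    }

Sharing one canonical instance per distinct config is also where the
reduced memory overhead comes from, since in practice most entries carry
one of a handful of configs.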

Release note: None
Release justification: high benefit change to existing functionality
(affecting only multi-tenant clusters).


blathers-crl bot requested reviews from several teams as code owners April 29, 2022 21:52

blathers-crl bot commented Apr 29, 2022

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

blathers-crl bot added the blathers-backport (This is a backport that Blathers created automatically.) and O-robot (Originated from a bot.) labels Apr 29, 2022
cockroach-teamcity (Member) commented

This change is Reviewable

ajwerner (Contributor) left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani, @irfansharif, and @nvanbenschoten)

@irfansharif irfansharif merged commit 8e92d07 into release-22.1 May 3, 2022
@irfansharif irfansharif deleted the blathers/backport-release-22.1-79700 branch May 3, 2022 04:05