
release-22.1: spanconfig,kv: merge adjacent ranges with identical configs #80811

Merged
irfansharif merged 5 commits into release-22.1 from blathers/backport-release-22.1-79700 on May 3, 2022

Conversation


blathers-crl bot commented Apr 29, 2022

Backport 5/5 commits from #79700 on behalf of @irfansharif.

/cc @cockroachdb/release


spanconfig,kv: merge adjacent ranges with identical configs

Fixes #72389.
Fixes #66063 (gated behind a cluster setting).

This should drastically reduce the total number of ranges in the system,
especially when running with a large number of tables and/or tenants. To
understand what the new set of split points is, consider the following
test snippet:

    exec-sql tenant=11
    CREATE DATABASE db;
    CREATE TABLE db.t0();
    CREATE TABLE db.t1();
    CREATE TABLE db.t2();
    CREATE TABLE db.t3();
    CREATE TABLE db.t4();
    CREATE TABLE db.t5();
    CREATE TABLE db.t6();
    CREATE TABLE db.t7();
    CREATE TABLE db.t8();
    CREATE TABLE db.t9();
    ALTER TABLE db.t5 CONFIGURE ZONE using num_replicas = 42;
    ----

    # If installing a unique zone config for a table in the middle, we
    # should observe three splits (before, the table itself, and after).

    diff offset=48
    ----
    --- gossiped system config span (legacy)
    +++ span config infrastructure (current)
    ...
     /Tenant/10                                 database system (tenant)
     /Tenant/11                                 database system (tenant)
    +/Tenant/11/Table/106                       range default
    +/Tenant/11/Table/111                       num_replicas=42
    +/Tenant/11/Table/112                       range default

This PR introduces two cluster settings to selectively opt into this
optimization: spanconfig.{tenant,host}_coalesce_adjacent.enabled
(defaulting to true and false, respectively; see the sketch after the
list of options below). We also don't coalesce system table ranges on
the host tenant. We had a few implementation choices here:

(a) Down in KV, augment the spanconfig.Store to coalesce the in-memory
    state when updating entries. On every update, we'd check if the span
    we're writing to has the same config as the preceding and/or
    succeeding one, and if so, write out a larger span instead. We
    previously prototyped a form of this in #68491.

    Pros:
    - reduced memory footprint of spanconfig.Store
    - read path (computing splits) stays fast -- just have to look up
      the right span from the backing interval tree
    Cons:
    - uses more persisted storage than necessary
    - difficult to disable the optimization dynamically (though still
      possible -- we'd effectively have to restart the KVSubscriber and
      populate in-memory span config state using a store that
      does/doesn't coalesce configs)
    - difficult to implement; our original prototype did not have #73150,
      which is important for reducing reconciliation round trips

(b) Same as (a), but coalesce configs up in the spanconfig.Store
    maintained in the reconciler itself.

    Pros:
    - reduced storage use (both persisted and in-memory)
    - read path (computing splits) stays fast -- just have to look up
      the right span from the backing interval tree
    Cons:
    - very difficult to disable the optimization dynamically (each
      tenant process would need to be involved)
    - most difficult to implement

(c) Same as (a), but through another API on the spanconfig.Store
    interface that accepts only a single update at a time and does not
    generate deltas (not something we need down in KV). This removes the
    implementation complexity.

(d) Keep the contents of system.span_configurations and the in-memory
    state of spanconfig.Stores as they are today, uncoalesced. When
    determining split points, iterate through adjacent configs within
    the provided key bounds and see whether we can ignore certain
    split keys.

    Pros:
    - easiest to implement
    - easy to disable the optimization dynamically, e.g. through a
      cluster setting
    Cons:
    - uses more storage (persisted and in-memory) than necessary
    - read path (computing splits) is more expensive when iterating
      through adjacent configs
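
Since easy dynamic control is the main draw of (d), opting in or out is
just a matter of toggling the settings named above. A minimal sketch,
assuming CockroachDB's standard SET CLUSTER SETTING syntax (the setting
names are the ones this PR introduces):

    -- Opt the host tenant's ranges into coalescing (defaults to false):
    SET CLUSTER SETTING spanconfig.host_coalesce_adjacent.enabled = true;
    -- Opt secondary tenants' ranges out of coalescing (defaults to true):
    SET CLUSTER SETTING spanconfig.tenant_coalesce_adjacent.enabled = false;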

This PR implements option (d). For a sense of how slow (d) is in
practice, here's a benchmark that scans over varying numbers of entries
(10k, 100k, 1M):

    $ dev bench pkg/spanconfig/spanconfigstore -f=BenchmarkStoreComputeSplitKey -v
    BenchmarkStoreComputeSplitKey
    BenchmarkStoreComputeSplitKey/num-entries=10000
    BenchmarkStoreComputeSplitKey/num-entries=10000-10               1166842 ns/op
    BenchmarkStoreComputeSplitKey/num-entries=100000
    BenchmarkStoreComputeSplitKey/num-entries=100000-10             12273375 ns/op
    BenchmarkStoreComputeSplitKey/num-entries=1000000
    BenchmarkStoreComputeSplitKey/num-entries=1000000-10           140591766 ns/op
    PASS

It's possible that in the future we rework this in favor of (c).
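
For intuition, here's a minimal, self-contained sketch of what (d)'s
read path boils down to. The types and names are illustrative stand-ins,
not the actual spanconfig.Store code:

    package main

    import "fmt"

    // entry is an illustrative stand-in for a span config record: a start
    // key and an opaque config (equal strings model identical configs).
    type entry struct {
        startKey string
        config   string
    }

    // computeSplitKey walks entries ordered by start key and returns the
    // first start key after `start` whose config differs from its
    // predecessor's. Adjacent entries with identical configs are thereby
    // coalesced: no split key is emitted between them.
    func computeSplitKey(entries []entry, start string) (string, bool) {
        for i := 1; i < len(entries); i++ {
            if entries[i].startKey <= start {
                continue
            }
            if entries[i].config != entries[i-1].config {
                return entries[i].startKey, true
            }
        }
        return "", false
    }

    func main() {
        entries := []entry{
            {"/Table/106", "range default"},
            {"/Table/107", "range default"},
            {"/Table/110", "range default"},
            {"/Table/111", "num_replicas=42"},
            {"/Table/112", "range default"},
        }
        key, ok := computeSplitKey(entries, "/Table/106")
        fmt.Println(key, ok) // "/Table/111 true": 106..110 coalesce into one range
    }

This also makes the cost model above concrete: in the worst case the
scan visits every adjacent entry within the key bounds, which is what
the benchmark measures.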


spanconfig/store: use templatized btree impl instead

    $ dev bench pkg/spanconfig/spanconfigstore -f=BenchmarkStoreComputeSplitKey -v
    BenchmarkStoreComputeSplitKey
    BenchmarkStoreComputeSplitKey/num-entries=10000
    BenchmarkStoreComputeSplitKey/num-entries=10000-10                431093 ns/op
    BenchmarkStoreComputeSplitKey/num-entries=100000
    BenchmarkStoreComputeSplitKey/num-entries=100000-10              4308200 ns/op
    BenchmarkStoreComputeSplitKey/num-entries=1000000
    BenchmarkStoreComputeSplitKey/num-entries=1000000-10            43827373 ns/op
    PASS

    $ benchstat old.txt new.txt # from previous commit
    name                                         old time/op  new time/op  delta
    StoreComputeSplitKey/num-entries=10000-10    1.17ms ± 0%  0.43ms ± 0%   ~     (p=1.000 n=1+1)
    StoreComputeSplitKey/num-entries=100000-10   12.4ms ± 0%   4.3ms ± 0%   ~     (p=1.000 n=1+1)
    StoreComputeSplitKey/num-entries=1000000-10   136ms ± 0%    44ms ± 0%   ~     (p=1.000 n=1+1)

(The "~" deltas only mean benchstat can't establish statistical
significance from single runs (n=1+1); the raw numbers show a roughly 3x
improvement.)

spanconfig/store: intern span configs

- Avoid the expensive proto.Equal() when computing split keys;
- Reduce memory overhead of the data structure

    $ dev bench pkg/spanconfig/spanconfigstore -f=BenchmarkStoreComputeSplitKey -v
    BenchmarkStoreComputeSplitKey
    BenchmarkStoreComputeSplitKey/num-entries=10000
    BenchmarkStoreComputeSplitKey/num-entries=10000-10                 90323 ns/op
    BenchmarkStoreComputeSplitKey/num-entries=100000
    BenchmarkStoreComputeSplitKey/num-entries=100000-10               915936 ns/op
    BenchmarkStoreComputeSplitKey/num-entries=1000000
    BenchmarkStoreComputeSplitKey/num-entries=1000000-10             9575781 ns/op

    $ benchstat old.txt new.txt # from previous commit
    name                                         old time/op  new time/op  delta
    StoreComputeSplitKey/num-entries=10000-10     431µs ± 0%    90µs ± 0%   ~     (p=1.000 n=1+1)
    StoreComputeSplitKey/num-entries=100000-10   4.31ms ± 0%  0.92ms ± 0%   ~     (p=1.000 n=1+1)
    StoreComputeSplitKey/num-entries=1000000-10  43.8ms ± 0%   9.6ms ± 0%   ~     (p=1.000 n=1+1)
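
A minimal sketch of the interning scheme, with an illustrative stand-in
type for the span config proto (the real code interns the actual config
protos): each distinct config gets one canonical pointer, so the
hot-path equality check becomes a pointer comparison instead of
proto.Equal:

    package main

    import "fmt"

    // config is an illustrative stand-in for a span config proto; its
    // serialized form (here, the string itself) acts as the intern key.
    type config string

    // interner hands out one canonical pointer per distinct config, so
    // "same config?" reduces to a pointer comparison rather than a
    // structural comparison (proto.Equal in the real code).
    type interner struct {
        canonical map[config]*config
    }

    func newInterner() *interner {
        return &interner{canonical: make(map[config]*config)}
    }

    func (in *interner) intern(c config) *config {
        if ptr, ok := in.canonical[c]; ok {
            return ptr
        }
        in.canonical[c] = &c
        return &c
    }

    func main() {
        in := newInterner()
        a := in.intern("num_replicas=42")
        b := in.intern("num_replicas=42")
        c := in.intern("range default")
        fmt.Println(a == b, a == c) // true false: pointer equality suffices
    }

Sharing one canonical instance per distinct config is also where the
reduced memory overhead comes from, since in practice most entries carry
one of a handful of configs.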

Release note: None
Release justification: high benefit change to existing functionality
(affecting only multi-tenant clusters).


blathers-crl bot requested reviews from several teams as code owners April 29, 2022 21:52

blathers-crl bot commented Apr 29, 2022

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

blathers-crl bot added the blathers-backport (This is a backport that Blathers created automatically.) and O-robot (Originated from a bot.) labels Apr 29, 2022
cockroach-teamcity (Member) commented

This change is Reviewable

ajwerner (Contributor) left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani, @irfansharif, and @nvanbenschoten)

@irfansharif irfansharif merged commit 8e92d07 into release-22.1 May 3, 2022
@irfansharif irfansharif deleted the blathers/backport-release-22.1-79700 branch May 3, 2022 04:05