Skip to content

kv: re-enable time-bound iterators for RefreshRange request#74628

Merged
craig[bot] merged 3 commits intocockroachdb:masterfrom
nvb:nvanbenschoten/refreshTBI
Jan 29, 2022
Merged

kv: re-enable time-bound iterators for RefreshRange request#74628
craig[bot] merged 3 commits intocockroachdb:masterfrom
nvb:nvanbenschoten/refreshTBI

Conversation

@nvb
Copy link
Copy Markdown
Contributor

@nvb nvb commented Jan 10, 2022

Closes #53348. The other two requests mentioned in that issue
(ResolveIntentRange and EndTxn) would no longer benefit from time-bound
iterators because, thanks to b5213fd, they no longer scan the MVCC keyspace.

Transaction refreshing is a form of an optimistic concurrency control validation
phase. Before a transaction can commit, if it will be committing at a timestamp
higher than its original timestamp, it issues point and ranged refresh requests
to the key spans it had previous read. The refresh requests scan a span of keys
and determine whether any new values have been written since the transaction
originally read the keys.

The use of time-bound iterators is an important optimization for ranged refresh
operations because we expect very few new writes between the time that a
transaction originally reads and the time that it refreshes.

Without this optimization, each refresh was redoing all of a transaction's reads
at their original cost. This effectively doubled the cost of reads for
transactions that had to refresh (or worse for those that refreshed multiple
times). With this optimization, refreshing a span of keys is expected to be
significantly cheaper than the original scan over that span of keys, because it
can ignore most files in the lower levels of the LSM.

RefreshRange requests were originally built to use time-bound iterators.
However, this optimization was disabled in 1eb3b2a due to concerns about
correctness. Since then, then correctness concerns have been addressed and
we have begun using time-bound iterators in a handful of places.

This commit re-enables time-bound iterators for RefreshRange requests. It does
so by using MVCCIncrementalIterator, which was enhanced to support additional
"intent policies" in 87c7f11. This commit uses the "emit" intent policy so that
RefreshRange will observe all values and all intents in the refresh time
window.


Microbenchmarks:

name                                                      old time/op    new time/op    delta
RefreshRange/linear-keys/refresh_window=[95.00,99.00]-10     230ms ± 1%       0ms ± 1%   -99.99%  (p=0.000 n=9+9)
RefreshRange/linear-keys/refresh_window=[75.00,99.00]-10     185ms ± 1%       0ms ± 1%   -99.99%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[75.00,95.00]-10     185ms ± 1%       0ms ± 1%   -99.99%  (p=0.000 n=9+10)
RefreshRange/linear-keys/refresh_window=[50.00,75.00]-10     123ms ± 1%       0ms ± 2%   -99.99%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[50.00,95.00]-10     123ms ± 1%       0ms ± 2%   -99.99%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[50.00,99.00]-10     123ms ± 0%       0ms ± 1%   -99.99%  (p=0.000 n=10+8)
RefreshRange/linear-keys/refresh_window=[99.00,99.00]-10     240ms ± 1%       0ms ± 1%   -99.96%  (p=0.000 n=9+10)
RefreshRange/linear-keys/refresh_window=[95.00,95.00]-10     237ms ± 1%       0ms ± 2%   -99.96%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[75.00,75.00]-10     224ms ± 0%       0ms ± 1%   -99.95%  (p=0.000 n=9+9)
RefreshRange/linear-keys/refresh_window=[50.00,50.00]-10     207ms ± 1%       0ms ± 3%   -99.95%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[0.00,0.00]-10       174ms ± 1%       0ms ± 1%   -99.93%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[0.00,0.00]-10       189ms ± 1%       0ms ± 1%   -99.86%  (p=0.000 n=9+9)
RefreshRange/mixed-case/refresh_window=[0.00,0.00]-10        184ms ± 0%       0ms ± 0%   -99.85%  (p=0.000 n=8+9)
RefreshRange/mixed-case/refresh_window=[95.00,95.00]-10      252ms ± 0%       1ms ± 2%   -99.70%  (p=0.000 n=8+10)
RefreshRange/random-keys/refresh_window=[0.00,50.00]-10      412µs ± 1%      13µs ± 2%   -96.83%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[0.00,50.00]-10       413µs ± 1%      13µs ± 1%   -96.78%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[0.00,75.00]-10       292µs ± 1%      13µs ± 1%   -95.43%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[0.00,95.00]-10      245µs ± 1%      13µs ± 2%   -94.64%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[50.00,95.00]-10     245µs ± 0%      14µs ± 1%   -94.49%  (p=0.000 n=9+10)
RefreshRange/random-keys/refresh_window=[0.00,99.00]-10      238µs ± 1%      13µs ± 1%   -94.48%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[95.00,99.00]-10      237µs ± 1%      13µs ± 2%   -94.42%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[50.00,99.00]-10     237µs ± 2%      14µs ± 1%   -94.29%  (p=0.000 n=9+10)
RefreshRange/mixed-case/refresh_window=[50.00,75.00]-10      292µs ± 1%      17µs ± 1%   -94.07%  (p=0.000 n=10+9)
RefreshRange/linear-keys/refresh_window=[0.00,75.00]-10      225µs ± 2%      14µs ± 1%   -94.00%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[0.00,99.00]-10      224µs ± 1%      13µs ± 1%   -93.99%  (p=0.000 n=10+8)
RefreshRange/linear-keys/refresh_window=[0.00,95.00]-10      224µs ± 1%      14µs ± 1%   -93.95%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[75.00,95.00]-10     244µs ± 0%      15µs ± 1%   -93.86%  (p=0.000 n=7+10)
RefreshRange/mixed-case/refresh_window=[0.00,99.00]-10       237µs ± 1%      15µs ± 1%   -93.82%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[0.00,50.00]-10      224µs ± 1%      14µs ± 1%   -93.76%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[0.00,95.00]-10       244µs ± 1%      15µs ± 1%   -93.75%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[75.00,99.00]-10     238µs ± 1%      15µs ± 1%   -93.72%  (p=0.000 n=9+10)
RefreshRange/mixed-case/refresh_window=[75.00,99.00]-10      236µs ± 1%      15µs ± 2%   -93.64%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[50.00,99.00]-10      236µs ± 1%      15µs ± 1%   -93.63%  (p=0.000 n=10+9)
RefreshRange/mixed-case/refresh_window=[50.00,95.00]-10      244µs ± 0%      16µs ± 1%   -93.58%  (p=0.000 n=10+9)
RefreshRange/mixed-case/refresh_window=[75.00,95.00]-10      244µs ± 1%      16µs ± 0%   -93.55%  (p=0.000 n=10+8)
RefreshRange/random-keys/refresh_window=[0.00,75.00]-10      287µs ± 1%      19µs ± 1%   -93.20%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[95.00,99.00]-10     237µs ± 1%      17µs ± 1%   -92.69%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[50.00,75.00]-10     288µs ± 2%      23µs ± 1%   -91.98%  (p=0.000 n=9+10)
RefreshRange/mixed-case/refresh_window=[99.00,99.00]-10      255ms ± 1%     122ms ± 1%   -52.02%  (p=0.000 n=9+9)
RefreshRange/random-keys/refresh_window=[75.00,75.00]-10     242ms ± 1%     152ms ± 1%   -37.02%  (p=0.000 n=10+9)
RefreshRange/random-keys/refresh_window=[99.00,99.00]-10     259ms ± 0%     354ms ± 1%   +36.73%  (p=0.000 n=7+9)
RefreshRange/random-keys/refresh_window=[95.00,95.00]-10     256ms ± 1%     353ms ± 1%   +37.65%  (p=0.000 n=10+9)
RefreshRange/mixed-case/refresh_window=[75.00,75.00]-10      242ms ± 0%     398ms ± 1%   +64.38%  (p=0.000 n=9+10)
RefreshRange/mixed-case/refresh_window=[50.00,50.00]-10      227ms ± 0%     392ms ± 1%   +72.65%  (p=0.000 n=9+10)
RefreshRange/random-keys/refresh_window=[50.00,50.00]-10     229ms ± 1%     512ms ± 1%  +123.45%  (p=0.000 n=9+9)

name                                                      old alloc/op   new alloc/op   delta
RefreshRange/linear-keys/refresh_window=[99.00,99.00]-10     195MB ± 0%       0MB ± 0%  -100.00%  (p=0.000 n=9+10)
RefreshRange/linear-keys/refresh_window=[95.00,95.00]-10     188MB ± 0%       0MB ± 0%  -100.00%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[95.00,95.00]-10      188MB ± 0%       0MB ± 1%  -100.00%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[75.00,75.00]-10     148MB ± 0%       0MB ± 0%  -100.00%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[50.00,50.00]-10    98.8MB ± 0%     0.0MB ± 0%  -100.00%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[99.00,99.00]-10      195MB ± 0%       0MB ± 3%  -100.00%  (p=0.000 n=9+8)
RefreshRange/linear-keys/refresh_window=[95.00,99.00]-10     188MB ± 0%       0MB ± 0%  -100.00%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[75.00,95.00]-10     148MB ± 0%       0MB ± 0%   -99.99%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[75.00,99.00]-10     148MB ± 0%       0MB ± 0%   -99.99%  (p=0.000 n=9+10)
RefreshRange/linear-keys/refresh_window=[50.00,75.00]-10    99.0MB ± 0%     0.0MB ± 0%   -99.99%  (p=0.000 n=9+10)
RefreshRange/linear-keys/refresh_window=[50.00,95.00]-10    99.0MB ± 0%     0.0MB ± 0%   -99.99%  (p=0.000 n=8+9)
RefreshRange/linear-keys/refresh_window=[50.00,99.00]-10    99.0MB ± 0%     0.0MB ± 0%   -99.99%  (p=0.000 n=9+10)
RefreshRange/mixed-case/refresh_window=[75.00,75.00]-10      148MB ± 0%       0MB ±29%   -99.99%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[75.00,75.00]-10     148MB ± 0%       0MB ± 6%   -99.98%  (p=0.000 n=10+9)
RefreshRange/mixed-case/refresh_window=[50.00,50.00]-10     98.7MB ± 0%     0.0MB ± 1%   -99.98%  (p=0.000 n=10+8)
RefreshRange/random-keys/refresh_window=[99.00,99.00]-10     195MB ± 0%       0MB ± 4%   -99.97%  (p=0.000 n=9+8)
RefreshRange/random-keys/refresh_window=[95.00,95.00]-10     188MB ± 0%       0MB ± 3%   -99.97%  (p=0.000 n=10+8)
RefreshRange/random-keys/refresh_window=[50.00,50.00]-10    98.7MB ± 0%     0.1MB ±12%   -99.92%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[0.00,0.00]-10      41.9kB ± 5%     1.1kB ± 0%   -97.27%  (p=0.000 n=9+9)
RefreshRange/linear-keys/refresh_window=[0.00,50.00]-10      208kB ± 0%       8kB ± 0%   -96.04%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[95.00,99.00]-10      208kB ± 0%       8kB ± 0%   -96.03%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[95.00,99.00]-10     208kB ± 0%       8kB ± 0%   -96.03%  (p=0.000 n=10+9)
RefreshRange/mixed-case/refresh_window=[50.00,75.00]-10      208kB ± 0%       8kB ± 0%   -96.03%  (p=0.000 n=8+10)
RefreshRange/mixed-case/refresh_window=[0.00,75.00]-10       208kB ± 0%       8kB ± 0%   -96.03%  (p=0.000 n=10+9)
RefreshRange/mixed-case/refresh_window=[0.00,50.00]-10       207kB ± 0%       8kB ± 0%   -96.03%  (p=0.000 n=9+10)
RefreshRange/linear-keys/refresh_window=[0.00,95.00]-10      208kB ± 0%       8kB ± 0%   -96.03%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[0.00,50.00]-10      207kB ± 0%       8kB ± 0%   -96.03%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[0.00,75.00]-10      208kB ± 0%       8kB ± 0%   -96.02%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[0.00,99.00]-10      208kB ± 0%       8kB ± 0%   -96.02%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[0.00,99.00]-10      208kB ± 0%       8kB ± 0%   -96.02%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[50.00,99.00]-10     208kB ± 0%       8kB ± 0%   -96.02%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[50.00,95.00]-10     208kB ± 0%       8kB ± 0%   -96.02%  (p=0.000 n=9+10)
RefreshRange/random-keys/refresh_window=[75.00,99.00]-10     208kB ± 0%       8kB ± 0%   -96.02%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[0.00,95.00]-10      208kB ± 0%       8kB ± 0%   -96.02%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[75.00,95.00]-10     208kB ± 0%       8kB ± 0%   -96.02%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[0.00,75.00]-10      207kB ± 0%       8kB ± 0%   -96.02%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[50.00,75.00]-10     207kB ± 0%       8kB ± 0%   -96.02%  (p=0.000 n=9+10)
RefreshRange/mixed-case/refresh_window=[0.00,99.00]-10       208kB ± 0%       8kB ± 0%   -96.01%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[50.00,95.00]-10      208kB ± 0%       8kB ± 0%   -96.01%  (p=0.000 n=9+10)
RefreshRange/mixed-case/refresh_window=[75.00,99.00]-10      208kB ± 0%       8kB ± 0%   -96.01%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[0.00,95.00]-10       208kB ± 0%       8kB ± 0%   -96.01%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[50.00,99.00]-10      208kB ± 0%       8kB ± 0%   -96.01%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[75.00,95.00]-10      208kB ± 0%       8kB ± 0%   -96.01%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[0.00,0.00]-10      29.3kB ± 4%     1.2kB ± 0%   -95.95%  (p=0.000 n=9+10)
RefreshRange/mixed-case/refresh_window=[0.00,0.00]-10       9.07kB ±25%    1.13kB ± 0%   -87.51%  (p=0.000 n=10+10)

name                                                      old allocs/op  new allocs/op  delta
RefreshRange/linear-keys/refresh_window=[99.00,99.00]-10     18.8k ± 0%      0.0k ± 0%   -99.91%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[95.00,95.00]-10     18.1k ± 0%      0.0k ± 0%   -99.91%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[95.00,95.00]-10      17.5k ± 0%      0.0k ± 0%   -99.90%  (p=0.000 n=9+10)
RefreshRange/linear-keys/refresh_window=[75.00,75.00]-10     14.4k ± 0%      0.0k ± 0%   -99.88%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[50.00,50.00]-10     9.81k ± 0%     0.02k ± 0%   -99.83%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[95.00,99.00]-10     18.2k ± 0%      0.1k ± 0%   -99.69%  (p=0.000 n=9+10)
RefreshRange/linear-keys/refresh_window=[75.00,95.00]-10     14.4k ± 0%      0.1k ± 0%   -99.61%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[75.00,99.00]-10     14.4k ± 0%      0.1k ± 0%   -99.61%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[99.00,99.00]-10      18.2k ± 0%      0.1k ± 2%   -99.51%  (p=0.000 n=10+8)
RefreshRange/linear-keys/refresh_window=[50.00,95.00]-10     9.62k ± 0%     0.06k ± 0%   -99.40%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[50.00,75.00]-10     9.62k ± 0%     0.06k ± 0%   -99.40%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[50.00,99.00]-10     9.62k ± 0%     0.06k ± 0%   -99.40%  (p=0.000 n=9+10)
RefreshRange/mixed-case/refresh_window=[75.00,75.00]-10      13.8k ± 0%      0.2k ±28%   -98.85%  (p=0.000 n=9+10)
RefreshRange/mixed-case/refresh_window=[50.00,50.00]-10      9.25k ± 0%     0.17k ± 6%   -98.11%  (p=0.000 n=10+9)
RefreshRange/random-keys/refresh_window=[75.00,75.00]-10     14.2k ± 0%      0.3k ± 8%   -97.89%  (p=0.000 n=10+9)
RefreshRange/linear-keys/refresh_window=[0.00,0.00]-10         535 ± 3%        17 ± 0%   -96.82%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[99.00,99.00]-10     18.6k ± 0%      0.8k ± 3%   -95.56%  (p=0.000 n=10+8)
RefreshRange/random-keys/refresh_window=[95.00,95.00]-10     17.9k ± 0%      0.8k ± 2%   -95.34%  (p=0.000 n=10+8)
RefreshRange/random-keys/refresh_window=[0.00,0.00]-10         351 ± 3%        18 ± 0%   -94.88%  (p=0.000 n=9+10)
RefreshRange/random-keys/refresh_window=[50.00,50.00]-10     9.59k ± 0%     0.95k ±12%   -90.05%  (p=0.000 n=9+10)
RefreshRange/mixed-case/refresh_window=[0.00,0.00]-10         73.7 ±10%      17.0 ± 0%   -76.93%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[0.00,50.00]-10       80.0 ± 0%      56.0 ± 0%   -30.00%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[0.00,50.00]-10       79.0 ± 0%      56.0 ± 0%   -29.11%  (p=0.000 n=9+10)
RefreshRange/random-keys/refresh_window=[95.00,99.00]-10      79.0 ± 0%      56.0 ± 0%   -29.11%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[0.00,50.00]-10        79.0 ± 0%      56.0 ± 0%   -29.11%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[0.00,75.00]-10        79.0 ± 0%      56.0 ± 0%   -29.11%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[50.00,75.00]-10       79.0 ± 0%      56.0 ± 0%   -29.11%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[95.00,99.00]-10       79.0 ± 0%      56.0 ± 0%   -29.11%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[0.00,75.00]-10       80.0 ± 0%      58.0 ± 0%   -27.50%  (p=0.000 n=9+10)
RefreshRange/linear-keys/refresh_window=[0.00,95.00]-10       80.0 ± 0%      58.0 ± 0%   -27.50%  (p=0.002 n=8+10)
RefreshRange/linear-keys/refresh_window=[0.00,99.00]-10       79.7 ± 1%      58.0 ± 0%   -27.23%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[0.00,75.00]-10       79.0 ± 0%      58.0 ± 0%   -26.58%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[0.00,95.00]-10       79.0 ± 0%      58.0 ± 0%   -26.58%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[0.00,99.00]-10       79.0 ± 0%      58.0 ± 0%   -26.58%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[50.00,75.00]-10      79.0 ± 0%      58.0 ± 0%   -26.58%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[50.00,95.00]-10      79.0 ± 0%      58.0 ± 0%   -26.58%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[50.00,99.00]-10      79.0 ± 0%      58.0 ± 0%   -26.58%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[75.00,95.00]-10      79.0 ± 0%      58.0 ± 0%   -26.58%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[75.00,99.00]-10      79.0 ± 0%      58.0 ± 0%   -26.58%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[0.00,95.00]-10        79.0 ± 0%      60.0 ± 0%   -24.05%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[0.00,99.00]-10        79.0 ± 0%      60.0 ± 0%   -24.05%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[50.00,95.00]-10       79.0 ± 0%      60.0 ± 0%   -24.05%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[50.00,99.00]-10       79.0 ± 0%      60.0 ± 0%   -24.05%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[75.00,95.00]-10       79.0 ± 0%      60.0 ± 0%   -24.05%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[75.00,99.00]-10       79.0 ± 0%      60.0 ± 0%   -24.05%  (p=0.000 n=10+10)

Release note (performance improvement): transaction read refresh operations
performed during optimistic concurrency control's validation phase now use a
time-bound file filter when scanning the LSM tree. This allows these operations
to avoid scanning files that contain no keys written since the transaction
originally performed its reads.

@nvb nvb requested review from a team as code owners January 10, 2022 16:30
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@nvb nvb force-pushed the nvanbenschoten/refreshTBI branch 3 times, most recently from fdc7045 to 42ba882 Compare January 10, 2022 20:41
Copy link
Copy Markdown
Collaborator

@stevendanna stevendanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r1, 4 of 4 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lidorcarmel, @nvanbenschoten, and @sumeerbhola)

a discussion (no related file):
Cool to see more code making use of TBI.



-- commits, line 102 at r4:
I suppose that the intuition between these cases getting worse is that these are cases that don't have any short circuiting so we are mostly paying for the overhead of the

Code quote:

  RefreshRange/random-keys/refresh_window=[99.00,99.00]-10     259ms ± 0%     354ms ± 1%   +36.73%  (p=0.000 n=7+9)
  RefreshRange/random-keys/refresh_window=[95.00,95.00]-10     256ms ± 1%     353ms ± 1%   +37.65%  (p=0.000 n=10+9)
  RefreshRange/mixed-case/refresh_window=[75.00,75.00]-10      242ms ± 0%     398ms ± 1%   +64.38%  (p=0.000 n=9+10)
  RefreshRange/mixed-case/refresh_window=[50.00,50.00]-10      227ms ± 0%     392ms ± 1%   +72.65%  (p=0.000 n=9+10)
  RefreshRange/random-keys/refresh_window=[50.00,50.00]-10     229ms ± 1%     512ms ± 1%  +123.45%  (p=0.000 n=9+9)

pkg/kv/kvserver/batcheval/cmd_refresh_range_bench_test.go, line 83 at r4 (raw file):

		// We return a read only engine to prevent read-based
		// compactions after the initial data generation.
		"mixed-case": {

I kinda wonder about the usefulness of this case for these tests given the short-circuiting behaviour and the block property collectors. Looking at the post-TBI timings, it looks rather similar in performance to the random-keys case.

Copy link
Copy Markdown
Contributor

@lidorcarmel lidorcarmel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 5 files at r1, 3 of 4 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)


pkg/kv/kvserver/batcheval/cmd_refresh_range.go, line 34 at r4 (raw file):

	"kv.refresh_range.time_bound_iterators.enabled",
	"use time-bound iterators when performing ranged transaction refreshes",
	util.ConstantWithMetamorphicTestBool("kv.refresh_range.time_bound_iterators_enabled", true),

nit - no idea whether this string should match the one 2 lines above - should it be _enabled or .enabled?

Code quote:

_

pkg/kv/kvserver/batcheval/cmd_refresh_range.go, line 121 at r4 (raw file):

				iter.SeekGE(storage.MVCCKey{
					Key:       key.Key,
					Timestamp: meta.Timestamp.ToTimestamp().Prev(),

sorry I can't see why this works (though I do understand it works..) - we found a key (intent) AA at time 7 and now we will SeekGE starting from key AA time 6? I know I'm missing something basic, maybe the comment can clear it a bit?

Copy link
Copy Markdown
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice results!

Reviewed 3 of 5 files at r1, 4 of 4 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @stevendanna)


pkg/kv/kvserver/batcheval/cmd_refresh_range.go, line 32 at r4 (raw file):

var refreshRangeTBIEnabled = settings.RegisterBoolSetting(
	settings.SystemOnly,
	"kv.refresh_range.time_bound_iterators.enabled",

Does this cluster setting exist because we have a few microbenchmarks that show a performance regression?


pkg/kv/kvserver/batcheval/cmd_refresh_range.go, line 64 at r4 (raw file):

		storage.MVCCScanOptions{
			Inconsistent: true,
			Tombstones:   true,

What if gc has run and cleaned up the tombstone? Will we incorrectly think that nothing has changed since we've not done the real check that the readable value is what we read previously?


pkg/kv/kvserver/batcheval/cmd_refresh_range.go, line 115 at r4 (raw file):

				return errors.Wrapf(err, "unmarshaling mvcc meta: %v", key)
			}
			if meta.Txn.ID == txnID {

do we need to ignore an inline meta too?


pkg/kv/kvserver/batcheval/cmd_refresh_range.go, line 121 at r4 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

sorry I can't see why this works (though I do understand it works..) - we found a key (intent) AA at time 7 and now we will SeekGE starting from key AA time 6? I know I'm missing something basic, maybe the comment can clear it a bit?

Say the sequence of keys is AA@0, AA@7, AA@4. We have found AA@0 (the intents appear as 0 timestamped keys). SeekGE(AA@6) will go to AA@4.

@nvanbenschoten can we do iter.Next, validate that iter is not exhausted and the timestamp is as expected, and then do another iter.Next and continue? Despite the recent SeekGE optimization added by Bilal it is likely slightly cheaper to do Next given that we are interleaving many underlying iterators (both MVCCIncrementalIterator and intentInterleavingIter).


pkg/storage/mvcc_incremental_iterator.go, line 150 at r4 (raw file):

	// EndTime]. Note that if {Min,Max}TimestampHints are specified in
	// IterOptions, the timestamp hints interval should include the start and end
	// time.

thanks for cleaning up the stale comments.


pkg/kv/kvserver/batcheval/cmd_refresh_range_bench_test.go, line 83 at r4 (raw file):

Previously, stevendanna (Steven Danna) wrote…

I kinda wonder about the usefulness of this case for these tests given the short-circuiting behaviour and the block property collectors. Looking at the post-TBI timings, it looks rather similar in performance to the random-keys case.

Regarding block property collectors, I think it made little difference to the CatchUpScan workload because we are either in a regime where sstables are nicely segmented or we end up with each block containing a wide time span of keys. Don't know if that is representative.

More a question for @stevendanna -- should we abstract out and share code from here and CatchUpScan so we can more easily add more workloads, by doing so once.


pkg/kv/kvserver/batcheval/cmd_refresh_range_test.go, line 259 at r4 (raw file):

		// RefreshTo is inclusive, so expect error on collision.
		{ts1, ts2, true},
		// RefreshTo is exclusive, so expect no error on collision.

RefreshFrom

nvb added 3 commits January 27, 2022 10:58
We were previously treating `RefreshRequest` and `RefreshRangeRequest`'s
`RefreshFrom` timestamp as an inclusive lower bound. However, this was not
necessary because a refresh will assign the timestamp of the previous read
as the `RefreshFrom` timestamp. This timestamp must have been used to bump
the timestamp cache during the previous read, so it is not possible for a
value to have been written at the `RefreshFrom` timestamp _after_ the prior
read. This means that if there is a value at the `RefreshFrom` timestamp,
it must have been included in the prior read's result set.

This is a very minor optimization, as it avoids refresh failures in a few
edge cases (see tests). More importantly, it sets us up to use time-bound
iterators for `RefreshRangeRequest` again — cockroachdb#53348.
Closes cockroachdb#53348. The other two requests mentioned in that issue
(`ResolveIntentRange` and `EndTxn`) would no longer benefit from time-bound
iterators because, thanks to b5213fd, they no longer scan the MVCC keyspace.

Transaction refreshing is a form of an optimistic concurrency control validation
phase. Before a transaction can commit, if it will be committing at a timestamp
higher than its original timestamp, it issues point and ranged refresh requests
to the key spans it had previous read. The refresh requests scan a span of keys
and determine whether any new values have been written since the transaction
originally read the keys.

The use of time-bound iterators is an important optimization for ranged refresh
operations because we expect very few new writes between the time that a
transaction originally reads and the time that it refreshes.

Without this optimization, each refresh was redoing all of a transaction's reads
at their original cost. This effectively doubled the cost of reads for
transactions that had to refresh (or worse for those that refreshed multiple
times). With this optimization, refreshing a span of keys is expected to be
significantly cheaper than the original scan over that span of keys, because it
can ignore most files in the lower levels of the LSM.

RefreshRange requests were originally built to use time-bound iterators.
However, this optimization was disabled in 1eb3b2a due to concerns about
correctness. Since then, then correctness concerns have been addressed and
we have begun using time-bound iterators in a handful of places.

This commit re-enables time-bound iterators for `RefreshRange` requests. It does
so by using `MVCCIncrementalIterator`, which was enhanced to support additional
"intent policies" in 87c7f11. This commit uses the "emit" intent policy so that
`RefreshRange` will observe all values and all intents in the refresh time
window.

----

Microbenchmarks:
```
name                                                      old time/op    new time/op    delta
RefreshRange/linear-keys/refresh_window=[95.00,99.00]-10     230ms ± 1%       0ms ± 1%   -99.99%  (p=0.000 n=9+9)
RefreshRange/linear-keys/refresh_window=[75.00,99.00]-10     185ms ± 1%       0ms ± 1%   -99.99%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[75.00,95.00]-10     185ms ± 1%       0ms ± 1%   -99.99%  (p=0.000 n=9+10)
RefreshRange/linear-keys/refresh_window=[50.00,75.00]-10     123ms ± 1%       0ms ± 2%   -99.99%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[50.00,95.00]-10     123ms ± 1%       0ms ± 2%   -99.99%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[50.00,99.00]-10     123ms ± 0%       0ms ± 1%   -99.99%  (p=0.000 n=10+8)
RefreshRange/linear-keys/refresh_window=[99.00,99.00]-10     240ms ± 1%       0ms ± 1%   -99.96%  (p=0.000 n=9+10)
RefreshRange/linear-keys/refresh_window=[95.00,95.00]-10     237ms ± 1%       0ms ± 2%   -99.96%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[75.00,75.00]-10     224ms ± 0%       0ms ± 1%   -99.95%  (p=0.000 n=9+9)
RefreshRange/linear-keys/refresh_window=[50.00,50.00]-10     207ms ± 1%       0ms ± 3%   -99.95%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[0.00,0.00]-10       174ms ± 1%       0ms ± 1%   -99.93%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[0.00,0.00]-10       189ms ± 1%       0ms ± 1%   -99.86%  (p=0.000 n=9+9)
RefreshRange/mixed-case/refresh_window=[0.00,0.00]-10        184ms ± 0%       0ms ± 0%   -99.85%  (p=0.000 n=8+9)
RefreshRange/mixed-case/refresh_window=[95.00,95.00]-10      252ms ± 0%       1ms ± 2%   -99.70%  (p=0.000 n=8+10)
RefreshRange/random-keys/refresh_window=[0.00,50.00]-10      412µs ± 1%      13µs ± 2%   -96.83%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[0.00,50.00]-10       413µs ± 1%      13µs ± 1%   -96.78%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[0.00,75.00]-10       292µs ± 1%      13µs ± 1%   -95.43%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[0.00,95.00]-10      245µs ± 1%      13µs ± 2%   -94.64%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[50.00,95.00]-10     245µs ± 0%      14µs ± 1%   -94.49%  (p=0.000 n=9+10)
RefreshRange/random-keys/refresh_window=[0.00,99.00]-10      238µs ± 1%      13µs ± 1%   -94.48%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[95.00,99.00]-10      237µs ± 1%      13µs ± 2%   -94.42%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[50.00,99.00]-10     237µs ± 2%      14µs ± 1%   -94.29%  (p=0.000 n=9+10)
RefreshRange/mixed-case/refresh_window=[50.00,75.00]-10      292µs ± 1%      17µs ± 1%   -94.07%  (p=0.000 n=10+9)
RefreshRange/linear-keys/refresh_window=[0.00,75.00]-10      225µs ± 2%      14µs ± 1%   -94.00%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[0.00,99.00]-10      224µs ± 1%      13µs ± 1%   -93.99%  (p=0.000 n=10+8)
RefreshRange/linear-keys/refresh_window=[0.00,95.00]-10      224µs ± 1%      14µs ± 1%   -93.95%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[75.00,95.00]-10     244µs ± 0%      15µs ± 1%   -93.86%  (p=0.000 n=7+10)
RefreshRange/mixed-case/refresh_window=[0.00,99.00]-10       237µs ± 1%      15µs ± 1%   -93.82%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[0.00,50.00]-10      224µs ± 1%      14µs ± 1%   -93.76%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[0.00,95.00]-10       244µs ± 1%      15µs ± 1%   -93.75%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[75.00,99.00]-10     238µs ± 1%      15µs ± 1%   -93.72%  (p=0.000 n=9+10)
RefreshRange/mixed-case/refresh_window=[75.00,99.00]-10      236µs ± 1%      15µs ± 2%   -93.64%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[50.00,99.00]-10      236µs ± 1%      15µs ± 1%   -93.63%  (p=0.000 n=10+9)
RefreshRange/mixed-case/refresh_window=[50.00,95.00]-10      244µs ± 0%      16µs ± 1%   -93.58%  (p=0.000 n=10+9)
RefreshRange/mixed-case/refresh_window=[75.00,95.00]-10      244µs ± 1%      16µs ± 0%   -93.55%  (p=0.000 n=10+8)
RefreshRange/random-keys/refresh_window=[0.00,75.00]-10      287µs ± 1%      19µs ± 1%   -93.20%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[95.00,99.00]-10     237µs ± 1%      17µs ± 1%   -92.69%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[50.00,75.00]-10     288µs ± 2%      23µs ± 1%   -91.98%  (p=0.000 n=9+10)
RefreshRange/mixed-case/refresh_window=[99.00,99.00]-10      255ms ± 1%     122ms ± 1%   -52.02%  (p=0.000 n=9+9)
RefreshRange/random-keys/refresh_window=[75.00,75.00]-10     242ms ± 1%     152ms ± 1%   -37.02%  (p=0.000 n=10+9)
RefreshRange/random-keys/refresh_window=[99.00,99.00]-10     259ms ± 0%     354ms ± 1%   +36.73%  (p=0.000 n=7+9)
RefreshRange/random-keys/refresh_window=[95.00,95.00]-10     256ms ± 1%     353ms ± 1%   +37.65%  (p=0.000 n=10+9)
RefreshRange/mixed-case/refresh_window=[75.00,75.00]-10      242ms ± 0%     398ms ± 1%   +64.38%  (p=0.000 n=9+10)
RefreshRange/mixed-case/refresh_window=[50.00,50.00]-10      227ms ± 0%     392ms ± 1%   +72.65%  (p=0.000 n=9+10)
RefreshRange/random-keys/refresh_window=[50.00,50.00]-10     229ms ± 1%     512ms ± 1%  +123.45%  (p=0.000 n=9+9)

name                                                      old alloc/op   new alloc/op   delta
RefreshRange/linear-keys/refresh_window=[99.00,99.00]-10     195MB ± 0%       0MB ± 0%  -100.00%  (p=0.000 n=9+10)
RefreshRange/linear-keys/refresh_window=[95.00,95.00]-10     188MB ± 0%       0MB ± 0%  -100.00%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[95.00,95.00]-10      188MB ± 0%       0MB ± 1%  -100.00%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[75.00,75.00]-10     148MB ± 0%       0MB ± 0%  -100.00%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[50.00,50.00]-10    98.8MB ± 0%     0.0MB ± 0%  -100.00%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[99.00,99.00]-10      195MB ± 0%       0MB ± 3%  -100.00%  (p=0.000 n=9+8)
RefreshRange/linear-keys/refresh_window=[95.00,99.00]-10     188MB ± 0%       0MB ± 0%  -100.00%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[75.00,95.00]-10     148MB ± 0%       0MB ± 0%   -99.99%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[75.00,99.00]-10     148MB ± 0%       0MB ± 0%   -99.99%  (p=0.000 n=9+10)
RefreshRange/linear-keys/refresh_window=[50.00,75.00]-10    99.0MB ± 0%     0.0MB ± 0%   -99.99%  (p=0.000 n=9+10)
RefreshRange/linear-keys/refresh_window=[50.00,95.00]-10    99.0MB ± 0%     0.0MB ± 0%   -99.99%  (p=0.000 n=8+9)
RefreshRange/linear-keys/refresh_window=[50.00,99.00]-10    99.0MB ± 0%     0.0MB ± 0%   -99.99%  (p=0.000 n=9+10)
RefreshRange/mixed-case/refresh_window=[75.00,75.00]-10      148MB ± 0%       0MB ±29%   -99.99%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[75.00,75.00]-10     148MB ± 0%       0MB ± 6%   -99.98%  (p=0.000 n=10+9)
RefreshRange/mixed-case/refresh_window=[50.00,50.00]-10     98.7MB ± 0%     0.0MB ± 1%   -99.98%  (p=0.000 n=10+8)
RefreshRange/random-keys/refresh_window=[99.00,99.00]-10     195MB ± 0%       0MB ± 4%   -99.97%  (p=0.000 n=9+8)
RefreshRange/random-keys/refresh_window=[95.00,95.00]-10     188MB ± 0%       0MB ± 3%   -99.97%  (p=0.000 n=10+8)
RefreshRange/random-keys/refresh_window=[50.00,50.00]-10    98.7MB ± 0%     0.1MB ±12%   -99.92%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[0.00,0.00]-10      41.9kB ± 5%     1.1kB ± 0%   -97.27%  (p=0.000 n=9+9)
RefreshRange/linear-keys/refresh_window=[0.00,50.00]-10      208kB ± 0%       8kB ± 0%   -96.04%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[95.00,99.00]-10      208kB ± 0%       8kB ± 0%   -96.03%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[95.00,99.00]-10     208kB ± 0%       8kB ± 0%   -96.03%  (p=0.000 n=10+9)
RefreshRange/mixed-case/refresh_window=[50.00,75.00]-10      208kB ± 0%       8kB ± 0%   -96.03%  (p=0.000 n=8+10)
RefreshRange/mixed-case/refresh_window=[0.00,75.00]-10       208kB ± 0%       8kB ± 0%   -96.03%  (p=0.000 n=10+9)
RefreshRange/mixed-case/refresh_window=[0.00,50.00]-10       207kB ± 0%       8kB ± 0%   -96.03%  (p=0.000 n=9+10)
RefreshRange/linear-keys/refresh_window=[0.00,95.00]-10      208kB ± 0%       8kB ± 0%   -96.03%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[0.00,50.00]-10      207kB ± 0%       8kB ± 0%   -96.03%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[0.00,75.00]-10      208kB ± 0%       8kB ± 0%   -96.02%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[0.00,99.00]-10      208kB ± 0%       8kB ± 0%   -96.02%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[0.00,99.00]-10      208kB ± 0%       8kB ± 0%   -96.02%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[50.00,99.00]-10     208kB ± 0%       8kB ± 0%   -96.02%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[50.00,95.00]-10     208kB ± 0%       8kB ± 0%   -96.02%  (p=0.000 n=9+10)
RefreshRange/random-keys/refresh_window=[75.00,99.00]-10     208kB ± 0%       8kB ± 0%   -96.02%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[0.00,95.00]-10      208kB ± 0%       8kB ± 0%   -96.02%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[75.00,95.00]-10     208kB ± 0%       8kB ± 0%   -96.02%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[0.00,75.00]-10      207kB ± 0%       8kB ± 0%   -96.02%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[50.00,75.00]-10     207kB ± 0%       8kB ± 0%   -96.02%  (p=0.000 n=9+10)
RefreshRange/mixed-case/refresh_window=[0.00,99.00]-10       208kB ± 0%       8kB ± 0%   -96.01%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[50.00,95.00]-10      208kB ± 0%       8kB ± 0%   -96.01%  (p=0.000 n=9+10)
RefreshRange/mixed-case/refresh_window=[75.00,99.00]-10      208kB ± 0%       8kB ± 0%   -96.01%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[0.00,95.00]-10       208kB ± 0%       8kB ± 0%   -96.01%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[50.00,99.00]-10      208kB ± 0%       8kB ± 0%   -96.01%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[75.00,95.00]-10      208kB ± 0%       8kB ± 0%   -96.01%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[0.00,0.00]-10      29.3kB ± 4%     1.2kB ± 0%   -95.95%  (p=0.000 n=9+10)
RefreshRange/mixed-case/refresh_window=[0.00,0.00]-10       9.07kB ±25%    1.13kB ± 0%   -87.51%  (p=0.000 n=10+10)

name                                                      old allocs/op  new allocs/op  delta
RefreshRange/linear-keys/refresh_window=[99.00,99.00]-10     18.8k ± 0%      0.0k ± 0%   -99.91%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[95.00,95.00]-10     18.1k ± 0%      0.0k ± 0%   -99.91%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[95.00,95.00]-10      17.5k ± 0%      0.0k ± 0%   -99.90%  (p=0.000 n=9+10)
RefreshRange/linear-keys/refresh_window=[75.00,75.00]-10     14.4k ± 0%      0.0k ± 0%   -99.88%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[50.00,50.00]-10     9.81k ± 0%     0.02k ± 0%   -99.83%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[95.00,99.00]-10     18.2k ± 0%      0.1k ± 0%   -99.69%  (p=0.000 n=9+10)
RefreshRange/linear-keys/refresh_window=[75.00,95.00]-10     14.4k ± 0%      0.1k ± 0%   -99.61%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[75.00,99.00]-10     14.4k ± 0%      0.1k ± 0%   -99.61%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[99.00,99.00]-10      18.2k ± 0%      0.1k ± 2%   -99.51%  (p=0.000 n=10+8)
RefreshRange/linear-keys/refresh_window=[50.00,95.00]-10     9.62k ± 0%     0.06k ± 0%   -99.40%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[50.00,75.00]-10     9.62k ± 0%     0.06k ± 0%   -99.40%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[50.00,99.00]-10     9.62k ± 0%     0.06k ± 0%   -99.40%  (p=0.000 n=9+10)
RefreshRange/mixed-case/refresh_window=[75.00,75.00]-10      13.8k ± 0%      0.2k ±28%   -98.85%  (p=0.000 n=9+10)
RefreshRange/mixed-case/refresh_window=[50.00,50.00]-10      9.25k ± 0%     0.17k ± 6%   -98.11%  (p=0.000 n=10+9)
RefreshRange/random-keys/refresh_window=[75.00,75.00]-10     14.2k ± 0%      0.3k ± 8%   -97.89%  (p=0.000 n=10+9)
RefreshRange/linear-keys/refresh_window=[0.00,0.00]-10         535 ± 3%        17 ± 0%   -96.82%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[99.00,99.00]-10     18.6k ± 0%      0.8k ± 3%   -95.56%  (p=0.000 n=10+8)
RefreshRange/random-keys/refresh_window=[95.00,95.00]-10     17.9k ± 0%      0.8k ± 2%   -95.34%  (p=0.000 n=10+8)
RefreshRange/random-keys/refresh_window=[0.00,0.00]-10         351 ± 3%        18 ± 0%   -94.88%  (p=0.000 n=9+10)
RefreshRange/random-keys/refresh_window=[50.00,50.00]-10     9.59k ± 0%     0.95k ±12%   -90.05%  (p=0.000 n=9+10)
RefreshRange/mixed-case/refresh_window=[0.00,0.00]-10         73.7 ±10%      17.0 ± 0%   -76.93%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[0.00,50.00]-10       80.0 ± 0%      56.0 ± 0%   -30.00%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[0.00,50.00]-10       79.0 ± 0%      56.0 ± 0%   -29.11%  (p=0.000 n=9+10)
RefreshRange/random-keys/refresh_window=[95.00,99.00]-10      79.0 ± 0%      56.0 ± 0%   -29.11%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[0.00,50.00]-10        79.0 ± 0%      56.0 ± 0%   -29.11%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[0.00,75.00]-10        79.0 ± 0%      56.0 ± 0%   -29.11%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[50.00,75.00]-10       79.0 ± 0%      56.0 ± 0%   -29.11%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[95.00,99.00]-10       79.0 ± 0%      56.0 ± 0%   -29.11%  (p=0.000 n=10+10)
RefreshRange/linear-keys/refresh_window=[0.00,75.00]-10       80.0 ± 0%      58.0 ± 0%   -27.50%  (p=0.000 n=9+10)
RefreshRange/linear-keys/refresh_window=[0.00,95.00]-10       80.0 ± 0%      58.0 ± 0%   -27.50%  (p=0.002 n=8+10)
RefreshRange/linear-keys/refresh_window=[0.00,99.00]-10       79.7 ± 1%      58.0 ± 0%   -27.23%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[0.00,75.00]-10       79.0 ± 0%      58.0 ± 0%   -26.58%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[0.00,95.00]-10       79.0 ± 0%      58.0 ± 0%   -26.58%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[0.00,99.00]-10       79.0 ± 0%      58.0 ± 0%   -26.58%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[50.00,75.00]-10      79.0 ± 0%      58.0 ± 0%   -26.58%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[50.00,95.00]-10      79.0 ± 0%      58.0 ± 0%   -26.58%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[50.00,99.00]-10      79.0 ± 0%      58.0 ± 0%   -26.58%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[75.00,95.00]-10      79.0 ± 0%      58.0 ± 0%   -26.58%  (p=0.000 n=10+10)
RefreshRange/random-keys/refresh_window=[75.00,99.00]-10      79.0 ± 0%      58.0 ± 0%   -26.58%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[0.00,95.00]-10        79.0 ± 0%      60.0 ± 0%   -24.05%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[0.00,99.00]-10        79.0 ± 0%      60.0 ± 0%   -24.05%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[50.00,95.00]-10       79.0 ± 0%      60.0 ± 0%   -24.05%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[50.00,99.00]-10       79.0 ± 0%      60.0 ± 0%   -24.05%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[75.00,95.00]-10       79.0 ± 0%      60.0 ± 0%   -24.05%  (p=0.000 n=10+10)
RefreshRange/mixed-case/refresh_window=[75.00,99.00]-10       79.0 ± 0%      60.0 ± 0%   -24.05%  (p=0.000 n=10+10)
```

----

Release note (performance improvement): transaction read refresh operations
performed during optimistic concurrency control's validation phase now use a
time-bound file filter when scanning the LSM tree. This allows these operations
to avoid scanning files that contain no keys written since the transaction
originally performed its reads.
This commit updates logic in `refreshRange` and in `CatchUpIterator.CatchUpScan`
to iterate twice (`Iteartor.Next`) instead of seeking (`Iteartor.Seek`) when it finds
an MVCCMetadata key for an intent and wants to seek past the intent's provisional
value.

> Despite the recent `SeekGE` optimization it is likely slightly cheaper to do
> Next given that we are interleaving many underlying iterators (both
> MVCCIncrementalIterator and intentInterleavingIter).
@nvb nvb force-pushed the nvanbenschoten/refreshTBI branch from 42ba882 to fc3ea91 Compare January 27, 2022 17:26
Copy link
Copy Markdown
Contributor Author

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lidorcarmel, @stevendanna, and @sumeerbhola)


-- commits, line 102 at r4:

Previously, stevendanna (Steven Danna) wrote…

I suppose that the intuition between these cases getting worse is that these are cases that don't have any short circuiting so we are mostly paying for the overhead of the

Right, these are cases without any short-circuiting and also without any ability to filter out SSTs. So this is demonstrating the overhead of the MVCCIncrementalIterator.

One thing that's interesting to note is that this PR improved cases with short-circuiting even if they have no ability to filter out SSTs. For instance, it improved RefreshRange/random-keys/refresh_window=[50.00,75.00]. The reason for that is because the old use of MVCCIterate scanned 1000 keys at a time (maxKeysPerScan) before detecting the collision. See #66828. With this PR, there's no iterator readahead, so we short-circuit much earlier.


pkg/kv/kvserver/batcheval/cmd_refresh_range.go, line 32 at r4 (raw file):

Previously, sumeerbhola wrote…

Does this cluster setting exist because we have a few microbenchmarks that show a performance regression?

It wasn't for a specific reason like that. This is just to give us a knob to disable the optimization if we find performance/correctness issues with it in practice. We have a similar knob for the other cases where we use a time-bound iterator.


pkg/kv/kvserver/batcheval/cmd_refresh_range.go, line 34 at r4 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

nit - no idea whether this string should match the one 2 lines above - should it be _enabled or .enabled?

Done.


pkg/kv/kvserver/batcheval/cmd_refresh_range.go, line 64 at r4 (raw file):

Previously, sumeerbhola wrote…

What if gc has run and cleaned up the tombstone? Will we incorrectly think that nothing has changed since we've not done the real check that the readable value is what we read previously?

That's interesting. Your question is not specific to the time-bound iterator optimization, right?

Here's an example that demonstrates the hazard:

txn1 reads k1@10
txn2 deletes (tombstones) k1@15
gc @ 20 clears [k1@10, k1@15]
txn1 refreshes @ 25, sees no value between (10, 25], refresh successful when it should not be

This implies that we should be checking the GC threshold against the RefreshFrom timestamp of the RefreshRequest, and not the RefreshTo timestamp. The way to do this would be to add cases for RefreshRange and RefreshRangeRequest to BatchRequest.EarliestActiveTimestamp. I'll open an issue.


pkg/kv/kvserver/batcheval/cmd_refresh_range.go, line 115 at r4 (raw file):

Previously, sumeerbhola wrote…

do we need to ignore an inline meta too?

Done.


pkg/kv/kvserver/batcheval/cmd_refresh_range.go, line 121 at r4 (raw file):

Previously, sumeerbhola wrote…

Say the sequence of keys is AA@0, AA@7, AA@4. We have found AA@0 (the intents appear as 0 timestamped keys). SeekGE(AA@6) will go to AA@4.

@nvanbenschoten can we do iter.Next, validate that iter is not exhausted and the timestamp is as expected, and then do another iter.Next and continue? Despite the recent SeekGE optimization added by Bilal it is likely slightly cheaper to do Next given that we are interleaving many underlying iterators (both MVCCIncrementalIterator and intentInterleavingIter).

This logic was adapted from CatchUpIterator, so I made the change in both places in a separate commit.


pkg/kv/kvserver/batcheval/cmd_refresh_range_bench_test.go, line 83 at r4 (raw file):

should we abstract out and share code from here and CatchUpScan so we can more easily add more workloads, by doing so once.

This would have been very valuable while writing this PR. For now, I'll leave this so that the shared code remains identical instead of deviating slightly, but I'd be happy to see the shared code and the earlier code from pkg/storage that it was derived from unified.


pkg/kv/kvserver/batcheval/cmd_refresh_range_test.go, line 259 at r4 (raw file):

Previously, sumeerbhola wrote…

RefreshFrom

Done.

nvb added a commit to nvb/cockroach that referenced this pull request Jan 27, 2022
Noticed by Sumeer in cockroachdb#74628.

A Refresh request needs to observe all MVCC versions between its
exclusive RefreshFrom time and its inclusive RefreshTo time. If it were
to permit MVCC GC between these times then it could miss conflicts that
should cause the refresh to fail. This could in turn lead to violations
of serializability. For example:

```
txn1 reads value k1@10
txn2 deletes (tombstones) k1@15
mvcc gc @ 20 clears versions k1@10 and k1@15
txn1 refreshes @ 25, sees no value between (10, 25], refresh successful
```

In the example, the refresh erroneously succeeds because the request is
permitted to evaluate after part of the MVCC history it needs to read
has been GCed. By considering the RefreshFrom time to be the earliest
active timestamp of the request, we avoid this hazard. Instead of being
allowed to evaluate, the refresh request in the example would have hit
a BatchTimestampBeforeGCError.
Copy link
Copy Markdown
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 5 of 5 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @lidorcarmel, @nvanbenschoten, @stevendanna, and @sumeerbhola)


pkg/kv/kvserver/batcheval/cmd_refresh_range.go, line 32 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It wasn't for a specific reason like that. This is just to give us a knob to disable the optimization if we find performance/correctness issues with it in practice. We have a similar knob for the other cases where we use a time-bound iterator.

ack


pkg/kv/kvserver/batcheval/cmd_refresh_range.go, line 64 at r4 (raw file):

That's interesting. Your question is not specific to the time-bound iterator optimization, right?

yep. The general case I was thinking of was anything that allows for updates in (refreshFrom, refreshTo] to be deleted (including the case that there was a tombstone written at > refreshTo). IIUC, your solution will fix that.

@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented Jan 28, 2022

bors r=sumeerbhola

@nvb nvb dismissed lidorcarmel’s stale review January 28, 2022 23:53

Addressed comments.

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jan 29, 2022

Build failed (retrying...):

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jan 29, 2022

Build failed (retrying...):

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jan 29, 2022

Build succeeded:

@nvb nvb deleted the nvanbenschoten/refreshTBI branch January 29, 2022 16:16
nvb added a commit to nvb/cockroach that referenced this pull request Jan 29, 2022
Noticed by Sumeer in cockroachdb#74628.

A Refresh request needs to observe all MVCC versions between its
exclusive RefreshFrom time and its inclusive RefreshTo time. If it were
to permit MVCC GC between these times then it could miss conflicts that
should cause the refresh to fail. This could in turn lead to violations
of serializability. For example:

```
txn1 reads value k1@10
txn2 deletes (tombstones) k1@15
mvcc gc @ 20 clears versions k1@10 and k1@15
txn1 refreshes @ 25, sees no value between (10, 25], refresh successful
```

In the example, the refresh erroneously succeeds because the request is
permitted to evaluate after part of the MVCC history it needs to read
has been GCed. By considering the RefreshFrom time to be the earliest
active timestamp of the request, we avoid this hazard. Instead of being
allowed to evaluate, the refresh request in the example would have hit
a BatchTimestampBeforeGCError.
craig bot pushed a commit that referenced this pull request Jan 29, 2022
72296: kvserver: rebalance ranges to minimize QPS delta among stores  r=aayushshah15 a=aayushshah15

kvserver: rebalance ranges to minimize QPS delta among stores

This commit fixes the regression(s) introduced by
#65379 where we observed replica
thrashing in various workloads (#70396 and #71244).

The following is a description of the differences between the QPS based
rebalancing scheme used in the previous implementation of the store rebalancer
(release-21.2 and before) and the "new" implementation (22.1 and beyond).

**lease rebalancing**
***release 21.2 and before***
QPS based lease rebalancing in CRDB 21.2 considers the overall cluster level
average QPS and computes underfull and overfull thresholds based off of this
average. For each range that the local store has a lease for, the store
rebalancer goroutine checks whether transferring said range's lease away will
bring the local store's QPS below the underfull threshold. If so, it ignores
the range and moves on to the next one. Otherwise, it iterates through the
stores of all the non-leaseholder voting replicas (in ascending order of their
QPS) and checks whether it would be reasonable to transfer the lease away to
such a store. It ensures that the receiving store would not become overfull
after the lease transfer. It checks that the receiving store doesn't have a
replica that's lagging behind the current leaseholder. It checks that the
receiving store is not in violation of lease preferences. Finally, it ensures
that the lease is not on the local store because of access locality
considerations (i.e. because of follow-the-workload).

All of this was bespoke logic that lived in the store rebalancer (using none of
the Allocator's machinery).

***master and this commit***
In #65379, we moved this decision making into the Allocator by adding a new
mode in `Allocator.TransferLeaseTarget` that tries to determine whether
transferring the lease to another voting replica would reduce the qps delta
between the hottest and the coldest stores in the replica set. This commit adds
some padding to this logic by ensuring that the qps difference between the
store relinquishing the lease and the store receiving the lease is at least
200qps. Furthermore, it ensures that the store receiving the lease won't become
significantly hotter than the current leaseholder.

**replica rebalancing**
***release 21.2 and before***
QPS replica rebalancing in CRDB <=21.2 works similarly to the lease rebalancing
logic. We first compute a cluster level QPS average, overfull and underfull
thresholds. Based on these thresholds we try to move replicas away from
overfull stores and onto stores that are underfull, all while ensuring that the
receiving stores would not become overfull after the rebalance. A critical
assumption that the store rebalancer made (and still does, in the approach
implemented by this commit) is that follower replicas serve the same traffic as
the leaseholder.

***master and this commit***
The approach implemented by #65379 and refined by this commit tries to leverage
machinery in the Allocator that makes rebalancing decisions that converge load
based statistics per equivalence class. Previously, this machinery was only
used for range count based replica rebalancing (performed by the
`replicateQueue`) but not for qps-based rebalancing. This commit implements a
similar approach to what we do now for lease rebalancing, which is to determine
whether a rebalance action would reduce the qps delta between the hottest and
the coldest store in the equivalence class. This commit adds some safeguards
around this logic by ensuring that the store relinquishing the replica and the
store receiving it differ by at least 200 qps. Furthermore, it ensures that the
replica rebalance would not significantly switch the relative dispositions of
the two stores.

An important thing to note with the 21.2 implementation of the store rebalancer
is that it was making all of its decisions based on cluster-level QPS averages.
This behaves poorly in heterogenously sized / loaded clusters where some
localities are designed to receive more traffic than others. In such clusters,
heavily loaded localities can always be considered "overfull". This usually
means that all stores in such localities would be above the "overfull"
threshold in the cluster. The logic described above would effectively not do
anything since there are no underfull stores to move replicas to.

**Manual testing**
This patch has been stress tested with the follower reads roachtests (~250 iterations of 
`follower-reads/survival=region/locality=global/reads=strong` and 100 iterations of 
`follower-reads/survival=zone/locality=regional/reads=exact-staleness`). It has also been 
stress tested with the `rebalance/by-load` roachtests (100 iterations for both `..leases` and 
`..replicas` tests). I also manually ran a TPCC 10K run with a small ramp (something we
know triggers #31135) a few times and
saw average QPS converge among stores fairly quickly.
![tpcc-with-low-ramp](https://user-images.githubusercontent.com/10788754/149742518-981825f4-6812-41c1-8320-519caafda9c1.png)
  

Release note (performance improvement): A set of bugs that rendered QPS-based
lease and replica rebalancing in CRDB 21.2 and prior ineffective under
heterogenously loaded cluster localities has been fixed. Additionally a
limitation which prevented CRDB from effectively alleviating extreme QPS hotspots
from nodes has also been fixed.


75624: kv: compare MVCC GC threshold against Refresh{Range}Request.RefreshFrom r=nvanbenschoten a=nvanbenschoten

Noticed by Sumeer in #74628.

A Refresh request needs to observe all MVCC versions between its
exclusive RefreshFrom time and its inclusive RefreshTo time. If it were
to permit MVCC GC between these times then it could miss conflicts that
should cause the refresh to fail. This could in turn lead to violations
of serializability. For example:

```
txn1 reads value k1@10
txn2 deletes (tombstones) k1@15
mvcc gc @ 20 clears versions k1@10 and k1@15
txn1 refreshes @ 25, sees no value between (10, 25], refresh successful
```

In the example, the refresh erroneously succeeds because the request is
permitted to evaluate after part of the MVCC history it needs to read
has been GCed. By considering the RefreshFrom time to be the earliest
active timestamp of the request, we avoid this hazard. Instead of being
allowed to evaluate, the refresh request in the example would have hit
a BatchTimestampBeforeGCError.

Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

kv: re-enable time-bound iterators for KV requests

5 participants