Skip to content

Conversation

@inkel
Copy link
Contributor

@inkel inkel commented Mar 23, 2021

Hello. While profiling another tool (Grafana's Tanka) I've found that 13% of memory profiling was spent at ast.IdentifierSet.ToOrderedSlice. After checking the code I've found a small change that drastically improves not only memory consumption in a stable way (only 2 allocations per call to ToOrderSlice, regardless the size of the output slice) but drastically improves the performance of this and ToSlice methods.

I've added some benchmarks to test this and found the following: with the code at current master I get these numbers:

goos: darwin
goarch: amd64
pkg: github.com/google/go-jsonnet/ast
cpu: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
BenchmarkToOrderedSlice/1_unique_identifiers-8           6445634               175.3 ns/op            40 B/op          2 allocs/op
BenchmarkToOrderedSlice/10_unique_identifiers-8           871934              1356 ns/op             520 B/op          6 allocs/op
BenchmarkToOrderedSlice/100_unique_identifiers-8           60790             19291 ns/op            4104 B/op          9 allocs/op
BenchmarkToOrderedSlice/1000_unique_identifiers-8           3883            260751 ns/op           32785 B/op         12 allocs/op
BenchmarkToOrderedSlice/10000_unique_identifiers-8           343           3523511 ns/op          826040 B/op         21 allocs/op
BenchmarkToOrderedSlice/100000_unique_identifiers-8           22          49308257 ns/op         9248112 B/op         31 allocs/op
BenchmarkToSlice/1_unique_identifiers-8                 11676297               101.8 ns/op            16 B/op          1 allocs/op
BenchmarkToSlice/10_unique_identifiers-8                 1467850               837.8 ns/op           496 B/op          5 allocs/op
BenchmarkToSlice/100_unique_identifiers-8                 245740              4848 ns/op            4080 B/op          8 allocs/op
BenchmarkToSlice/1000_unique_identifiers-8                 31818             38298 ns/op           32752 B/op         11 allocs/op
BenchmarkToSlice/10000_unique_identifiers-8                 1701            620139 ns/op          825977 B/op         20 allocs/op
BenchmarkToSlice/100000_unique_identifiers-8                 138          10201763 ns/op         9247347 B/op         30 allocs/op
PASS
ok      github.com/google/go-jsonnet/ast        18.370s

As you can see, the number of allocations and bytes per operation depends on the size of the underlying map.

With the changes proposed in this PR, I'm getting these numbers instead:

goos: darwin
goarch: amd64
pkg: github.com/google/go-jsonnet/ast
cpu: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
BenchmarkToOrderedSlice/1_unique_identifiers-8           7902704               149.7 ns/op            40 B/op          2 allocs/op
BenchmarkToOrderedSlice/10_unique_identifiers-8          1627670               738.4 ns/op           184 B/op          2 allocs/op
BenchmarkToOrderedSlice/100_unique_identifiers-8           75212             15804 ns/op            1816 B/op          2 allocs/op
BenchmarkToOrderedSlice/1000_unique_identifiers-8           5390            220327 ns/op           16413 B/op          2 allocs/op
BenchmarkToOrderedSlice/10000_unique_identifiers-8           426           2853913 ns/op          163883 B/op          2 allocs/op
BenchmarkToOrderedSlice/100000_unique_identifiers-8           32          37886116 ns/op         1606168 B/op          2 allocs/op
BenchmarkToSlice/1_unique_identifiers-8                 14197999                83.31 ns/op           16 B/op          1 allocs/op
BenchmarkToSlice/10_unique_identifiers-8                 4179153               291.0 ns/op           160 B/op          1 allocs/op
BenchmarkToSlice/100_unique_identifiers-8                 499405              2372 ns/op            1792 B/op          1 allocs/op
BenchmarkToSlice/1000_unique_identifiers-8                 47860             25045 ns/op           16384 B/op          1 allocs/op
BenchmarkToSlice/10000_unique_identifiers-8                 5781            222580 ns/op          163842 B/op          1 allocs/op
BenchmarkToSlice/100000_unique_identifiers-8                 532           2284904 ns/op         1605634 B/op          1 allocs/op
PASS
ok      github.com/google/go-jsonnet/ast        17.753s

Performant was improved in all places by orders of magnitude, as reported by benchstat:

name                                        old time/op    new time/op    delta
ToOrderedSlice/1_unique_identifiers-8          200ns _ 9%     149ns _ 2%  -25.30%  (p=0.000 n=19+20)
ToOrderedSlice/10_unique_identifiers-8        1.52_s _16%    0.73_s _ 4%  -51.86%  (p=0.000 n=19+17)
ToOrderedSlice/100_unique_identifiers-8       22.2_s _13%    15.8_s _ 2%  -28.64%  (p=0.000 n=19+20)
ToOrderedSlice/1000_unique_identifiers-8       280_s _ 7%     222_s _ 1%  -20.72%  (p=0.000 n=18+19)
ToOrderedSlice/10000_unique_identifiers-8     3.90ms _10%    2.86ms _ 1%  -26.85%  (p=0.000 n=18+19)
ToOrderedSlice/100000_unique_identifiers-8    61.7ms _20%    38.4ms _ 2%  -37.80%  (p=0.000 n=18+20)
ToSlice/1_unique_identifiers-8                 115ns _ 7%      83ns _ 1%  -27.91%  (p=0.000 n=18+17)
ToSlice/10_unique_identifiers-8                940ns _15%     293ns _ 1%  -68.84%  (p=0.000 n=18+18)
ToSlice/100_unique_identifiers-8              5.48_s _17%    2.40_s _ 6%  -56.23%  (p=0.000 n=19+19)
ToSlice/1000_unique_identifiers-8             44.2_s _16%    25.3_s _ 7%  -42.92%  (p=0.000 n=19+20)
ToSlice/10000_unique_identifiers-8             737_s _17%     222_s _ 2%  -69.83%  (p=0.000 n=19+19)
ToSlice/100000_unique_identifiers-8           10.3ms _13%     2.2ms _ 1%  -78.52%  (p=0.000 n=18+18)

name                                        old alloc/op   new alloc/op   delta
ToOrderedSlice/1_unique_identifiers-8          40.0B _ 0%     40.0B _ 0%     ~     (all equal)
ToOrderedSlice/10_unique_identifiers-8          520B _ 0%      184B _ 0%  -64.62%  (p=0.000 n=20+20)
ToOrderedSlice/100_unique_identifiers-8       4.10kB _ 0%    1.82kB _ 0%  -55.75%  (p=0.000 n=20+20)
ToOrderedSlice/1000_unique_identifiers-8      32.8kB _ 0%    16.4kB _ 0%  -49.93%  (p=0.000 n=18+18)
ToOrderedSlice/10000_unique_identifiers-8      826kB _ 0%     164kB _ 0%  -80.16%  (p=0.000 n=16+17)
ToOrderedSlice/100000_unique_identifiers-8    9.25MB _ 0%    1.61MB _ 0%  -82.63%  (p=0.000 n=18+17)
ToSlice/1_unique_identifiers-8                 16.0B _ 0%     16.0B _ 0%     ~     (all equal)
ToSlice/10_unique_identifiers-8                 496B _ 0%      160B _ 0%  -67.74%  (p=0.000 n=20+20)
ToSlice/100_unique_identifiers-8              4.08kB _ 0%    1.79kB _ 0%  -56.08%  (p=0.000 n=20+20)
ToSlice/1000_unique_identifiers-8             32.8kB _ 0%    16.4kB _ 0%  -49.98%  (p=0.000 n=20+20)
ToSlice/10000_unique_identifiers-8             826kB _ 0%     164kB _ 0%  -80.16%  (p=0.000 n=20+20)
ToSlice/100000_unique_identifiers-8           9.25MB _ 0%    1.61MB _ 0%  -82.64%  (p=0.000 n=20+20)

name                                        old allocs/op  new allocs/op  delta
ToOrderedSlice/1_unique_identifiers-8           2.00 _ 0%      2.00 _ 0%     ~     (all equal)
ToOrderedSlice/10_unique_identifiers-8          6.00 _ 0%      2.00 _ 0%  -66.67%  (p=0.000 n=20+20)
ToOrderedSlice/100_unique_identifiers-8         9.00 _ 0%      2.00 _ 0%  -77.78%  (p=0.000 n=20+20)
ToOrderedSlice/1000_unique_identifiers-8        12.0 _ 0%       2.0 _ 0%  -83.33%  (p=0.000 n=20+20)
ToOrderedSlice/10000_unique_identifiers-8       21.0 _ 0%       2.0 _ 0%  -90.48%  (p=0.000 n=20+20)
ToOrderedSlice/100000_unique_identifiers-8      31.0 _ 0%       2.0 _ 0%  -93.55%  (p=0.000 n=20+20)
ToSlice/1_unique_identifiers-8                  1.00 _ 0%      1.00 _ 0%     ~     (all equal)
ToSlice/10_unique_identifiers-8                 5.00 _ 0%      1.00 _ 0%  -80.00%  (p=0.000 n=20+20)
ToSlice/100_unique_identifiers-8                8.00 _ 0%      1.00 _ 0%  -87.50%  (p=0.000 n=20+20)
ToSlice/1000_unique_identifiers-8               11.0 _ 0%       1.0 _ 0%  -90.91%  (p=0.000 n=20+20)
ToSlice/10000_unique_identifiers-8              20.0 _ 0%       1.0 _ 0%  -95.00%  (p=0.000 n=20+20)
ToSlice/100000_unique_identifiers-8             30.0 _ 0%       1.0 _ 0%  -96.67%  (p=0.000 n=20+20)

I recompiled my local Tanka using these improvements and now when profiling output this method isn't listed anymore, not even in the top 10 output.

@google-cla google-cla bot added the cla: yes label Mar 23, 2021
inkel added 2 commits March 23, 2021 14:28
This will be useful later on when implementing the performance
improvements to compare how much we gain.

Signed-off-by: Leandro López <leandro.lopez@grafana.com>
We know for sure how big the resulting slice will be, so instead of
having Go perform checks and reallocations whenever the slice grows
too small by using `append` we calculate the resulting size earlier
and directly index each element into the output slice. This result in
only two allocations all the time, no matter how big the resulting
slice will be. With it also come big performance improvements:

name                                        old time/op    new time/op    delta
ToOrderedSlice/1_unique_identifiers-8          173ns ±10%     150ns ± 4%  -13.16%  (p=0.000 n=10+9)
ToOrderedSlice/10_unique_identifiers-8        1.24µs ± 6%    0.51µs ± 5%  -58.95%  (p=0.000 n=9+9)
ToOrderedSlice/100_unique_identifiers-8       20.4µs ±20%     4.0µs ± 5%  -80.25%  (p=0.000 n=10+9)
ToOrderedSlice/1000_unique_identifiers-8       253µs ± 9%      40µs ± 5%  -84.19%  (p=0.000 n=10+10)
ToOrderedSlice/10000_unique_identifiers-8     3.44ms ± 9%    0.37ms ± 5%  -89.32%  (p=0.000 n=9+9)
ToOrderedSlice/100000_unique_identifiers-8    48.6ms ±13%     3.6ms ± 4%  -92.49%  (p=0.000 n=10+10)

name                                        old alloc/op   new alloc/op   delta
ToOrderedSlice/1_unique_identifiers-8          40.0B ± 0%     40.0B ± 0%     ~     (all equal)
ToOrderedSlice/10_unique_identifiers-8          520B ± 0%      184B ± 0%  -64.62%  (p=0.000 n=10+10)
ToOrderedSlice/100_unique_identifiers-8       4.10kB ± 0%    1.82kB ± 0%  -55.75%  (p=0.000 n=10+10)
ToOrderedSlice/1000_unique_identifiers-8      32.8kB ± 0%    16.4kB ± 0%  -49.95%  (p=0.000 n=10+10)
ToOrderedSlice/10000_unique_identifiers-8      826kB ± 0%     164kB ± 0%  -80.16%  (p=0.000 n=10+10)
ToOrderedSlice/100000_unique_identifiers-8    9.25MB ± 0%    1.61MB ± 0%  -82.64%  (p=0.000 n=10+10)

name                                        old allocs/op  new allocs/op  delta
ToOrderedSlice/1_unique_identifiers-8           2.00 ± 0%      2.00 ± 0%     ~     (all equal)
ToOrderedSlice/10_unique_identifiers-8          6.00 ± 0%      2.00 ± 0%  -66.67%  (p=0.000 n=10+10)
ToOrderedSlice/100_unique_identifiers-8         9.00 ± 0%      2.00 ± 0%  -77.78%  (p=0.000 n=10+10)
ToOrderedSlice/1000_unique_identifiers-8        12.0 ± 0%       2.0 ± 0%  -83.33%  (p=0.000 n=10+10)
ToOrderedSlice/10000_unique_identifiers-8       21.0 ± 0%       2.0 ± 0%  -90.48%  (p=0.000 n=10+10)
ToOrderedSlice/100000_unique_identifiers-8      31.0 ± 0%       2.0 ± 0%  -93.55%  (p=0.000 n=10+10)

For ToSLice we use the same changes except the sort operation. In this
case the results are even more impressive: allocations are stable at
just 1 per operation, regardless of set size, and performance is
orders of magnitude faster:

name                                 old time/op    new time/op    delta
ToSlice/1_unique_identifiers-8         83.0ns ± 1%    92.4ns ± 0%    +11.37%  (p=0.000 n=9+8)
ToSlice/10_unique_identifiers-8         295ns ± 4%     749ns ± 3%   +154.21%  (p=0.000 n=10+10)
ToSlice/100_unique_identifiers-8       2.33µs ± 2%    4.28µs ± 2%    +83.62%  (p=0.000 n=9+8)
ToSlice/1000_unique_identifiers-8      25.4µs ±10%    34.7µs ± 1%    +36.51%  (p=0.000 n=10+10)
ToSlice/10000_unique_identifiers-8      215µs ± 2%     543µs ± 1%   +152.15%  (p=0.000 n=9+10)
ToSlice/100000_unique_identifiers-8    2.05ms ± 2%    7.17ms ± 1%   +249.53%  (p=0.000 n=9+9)

name                                 old alloc/op   new alloc/op   delta
ToSlice/1_unique_identifiers-8          16.0B ± 0%     16.0B ± 0%       ~     (all equal)
ToSlice/10_unique_identifiers-8          160B ± 0%      496B ± 0%   +210.00%  (p=0.000 n=10+10)
ToSlice/100_unique_identifiers-8       1.79kB ± 0%    4.08kB ± 0%   +127.68%  (p=0.000 n=10+10)
ToSlice/1000_unique_identifiers-8      16.4kB ± 0%    32.8kB ± 0%    +99.90%  (p=0.000 n=10+10)
ToSlice/10000_unique_identifiers-8      164kB ± 0%     826kB ± 0%   +404.12%  (p=0.000 n=8+10)
ToSlice/100000_unique_identifiers-8    1.61MB ± 0%    9.25MB ± 0%   +475.93%  (p=0.000 n=10+10)

name                                 old allocs/op  new allocs/op  delta
ToSlice/1_unique_identifiers-8           1.00 ± 0%      1.00 ± 0%       ~     (all equal)
ToSlice/10_unique_identifiers-8          1.00 ± 0%      5.00 ± 0%   +400.00%  (p=0.000 n=10+10)
ToSlice/100_unique_identifiers-8         1.00 ± 0%      8.00 ± 0%   +700.00%  (p=0.000 n=10+10)
ToSlice/1000_unique_identifiers-8        1.00 ± 0%     11.00 ± 0%  +1000.00%  (p=0.000 n=10+10)
ToSlice/10000_unique_identifiers-8       1.00 ± 0%     20.00 ± 0%  +1900.00%  (p=0.000 n=10+10)
ToSlice/100000_unique_identifiers-8      1.00 ± 0%     30.00 ± 0%  +2900.00%  (p=0.000 n=10+10)

Signed-off-by: Leandro López <leandro.lopez@grafana.com>
@inkel inkel force-pushed the identifierset-performance-improvements branch from aaf82f5 to e7d711f Compare March 23, 2021 17:29
@sbarzowski
Copy link
Contributor

Nice!

@sbarzowski sbarzowski merged commit eeca44c into google:master Mar 25, 2021
@inkel inkel deleted the identifierset-performance-improvements branch March 25, 2021 15:03
@inkel
Copy link
Contributor Author

inkel commented Mar 25, 2021

Thank you @sbarzowski !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants