IdentifierSet performance improvements #525

inkel · 2021-03-23T17:26:14Z

Hello. While profiling another tool (Grafana's Tanka) I've found that 13% of memory profiling was spent at ast.IdentifierSet.ToOrderedSlice. After checking the code I've found a small change that drastically improves not only memory consumption in a stable way (only 2 allocations per call to ToOrderSlice, regardless the size of the output slice) but drastically improves the performance of this and ToSlice methods.

I've added some benchmarks to test this and found the following: with the code at current master I get these numbers:

goos: darwin
goarch: amd64
pkg: github.com/google/go-jsonnet/ast
cpu: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
BenchmarkToOrderedSlice/1_unique_identifiers-8           6445634               175.3 ns/op            40 B/op          2 allocs/op
BenchmarkToOrderedSlice/10_unique_identifiers-8           871934              1356 ns/op             520 B/op          6 allocs/op
BenchmarkToOrderedSlice/100_unique_identifiers-8           60790             19291 ns/op            4104 B/op          9 allocs/op
BenchmarkToOrderedSlice/1000_unique_identifiers-8           3883            260751 ns/op           32785 B/op         12 allocs/op
BenchmarkToOrderedSlice/10000_unique_identifiers-8           343           3523511 ns/op          826040 B/op         21 allocs/op
BenchmarkToOrderedSlice/100000_unique_identifiers-8           22          49308257 ns/op         9248112 B/op         31 allocs/op
BenchmarkToSlice/1_unique_identifiers-8                 11676297               101.8 ns/op            16 B/op          1 allocs/op
BenchmarkToSlice/10_unique_identifiers-8                 1467850               837.8 ns/op           496 B/op          5 allocs/op
BenchmarkToSlice/100_unique_identifiers-8                 245740              4848 ns/op            4080 B/op          8 allocs/op
BenchmarkToSlice/1000_unique_identifiers-8                 31818             38298 ns/op           32752 B/op         11 allocs/op
BenchmarkToSlice/10000_unique_identifiers-8                 1701            620139 ns/op          825977 B/op         20 allocs/op
BenchmarkToSlice/100000_unique_identifiers-8                 138          10201763 ns/op         9247347 B/op         30 allocs/op
PASS
ok      github.com/google/go-jsonnet/ast        18.370s

As you can see, the number of allocations and bytes per operation depends on the size of the underlying map.

With the changes proposed in this PR, I'm getting these numbers instead:

goos: darwin
goarch: amd64
pkg: github.com/google/go-jsonnet/ast
cpu: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
BenchmarkToOrderedSlice/1_unique_identifiers-8           7902704               149.7 ns/op            40 B/op          2 allocs/op
BenchmarkToOrderedSlice/10_unique_identifiers-8          1627670               738.4 ns/op           184 B/op          2 allocs/op
BenchmarkToOrderedSlice/100_unique_identifiers-8           75212             15804 ns/op            1816 B/op          2 allocs/op
BenchmarkToOrderedSlice/1000_unique_identifiers-8           5390            220327 ns/op           16413 B/op          2 allocs/op
BenchmarkToOrderedSlice/10000_unique_identifiers-8           426           2853913 ns/op          163883 B/op          2 allocs/op
BenchmarkToOrderedSlice/100000_unique_identifiers-8           32          37886116 ns/op         1606168 B/op          2 allocs/op
BenchmarkToSlice/1_unique_identifiers-8                 14197999                83.31 ns/op           16 B/op          1 allocs/op
BenchmarkToSlice/10_unique_identifiers-8                 4179153               291.0 ns/op           160 B/op          1 allocs/op
BenchmarkToSlice/100_unique_identifiers-8                 499405              2372 ns/op            1792 B/op          1 allocs/op
BenchmarkToSlice/1000_unique_identifiers-8                 47860             25045 ns/op           16384 B/op          1 allocs/op
BenchmarkToSlice/10000_unique_identifiers-8                 5781            222580 ns/op          163842 B/op          1 allocs/op
BenchmarkToSlice/100000_unique_identifiers-8                 532           2284904 ns/op         1605634 B/op          1 allocs/op
PASS
ok      github.com/google/go-jsonnet/ast        17.753s

Performant was improved in all places by orders of magnitude, as reported by benchstat:

name                                        old time/op    new time/op    delta
ToOrderedSlice/1_unique_identifiers-8          200ns _ 9%     149ns _ 2%  -25.30%  (p=0.000 n=19+20)
ToOrderedSlice/10_unique_identifiers-8        1.52_s _16%    0.73_s _ 4%  -51.86%  (p=0.000 n=19+17)
ToOrderedSlice/100_unique_identifiers-8       22.2_s _13%    15.8_s _ 2%  -28.64%  (p=0.000 n=19+20)
ToOrderedSlice/1000_unique_identifiers-8       280_s _ 7%     222_s _ 1%  -20.72%  (p=0.000 n=18+19)
ToOrderedSlice/10000_unique_identifiers-8     3.90ms _10%    2.86ms _ 1%  -26.85%  (p=0.000 n=18+19)
ToOrderedSlice/100000_unique_identifiers-8    61.7ms _20%    38.4ms _ 2%  -37.80%  (p=0.000 n=18+20)
ToSlice/1_unique_identifiers-8                 115ns _ 7%      83ns _ 1%  -27.91%  (p=0.000 n=18+17)
ToSlice/10_unique_identifiers-8                940ns _15%     293ns _ 1%  -68.84%  (p=0.000 n=18+18)
ToSlice/100_unique_identifiers-8              5.48_s _17%    2.40_s _ 6%  -56.23%  (p=0.000 n=19+19)
ToSlice/1000_unique_identifiers-8             44.2_s _16%    25.3_s _ 7%  -42.92%  (p=0.000 n=19+20)
ToSlice/10000_unique_identifiers-8             737_s _17%     222_s _ 2%  -69.83%  (p=0.000 n=19+19)
ToSlice/100000_unique_identifiers-8           10.3ms _13%     2.2ms _ 1%  -78.52%  (p=0.000 n=18+18)

name                                        old alloc/op   new alloc/op   delta
ToOrderedSlice/1_unique_identifiers-8          40.0B _ 0%     40.0B _ 0%     ~     (all equal)
ToOrderedSlice/10_unique_identifiers-8          520B _ 0%      184B _ 0%  -64.62%  (p=0.000 n=20+20)
ToOrderedSlice/100_unique_identifiers-8       4.10kB _ 0%    1.82kB _ 0%  -55.75%  (p=0.000 n=20+20)
ToOrderedSlice/1000_unique_identifiers-8      32.8kB _ 0%    16.4kB _ 0%  -49.93%  (p=0.000 n=18+18)
ToOrderedSlice/10000_unique_identifiers-8      826kB _ 0%     164kB _ 0%  -80.16%  (p=0.000 n=16+17)
ToOrderedSlice/100000_unique_identifiers-8    9.25MB _ 0%    1.61MB _ 0%  -82.63%  (p=0.000 n=18+17)
ToSlice/1_unique_identifiers-8                 16.0B _ 0%     16.0B _ 0%     ~     (all equal)
ToSlice/10_unique_identifiers-8                 496B _ 0%      160B _ 0%  -67.74%  (p=0.000 n=20+20)
ToSlice/100_unique_identifiers-8              4.08kB _ 0%    1.79kB _ 0%  -56.08%  (p=0.000 n=20+20)
ToSlice/1000_unique_identifiers-8             32.8kB _ 0%    16.4kB _ 0%  -49.98%  (p=0.000 n=20+20)
ToSlice/10000_unique_identifiers-8             826kB _ 0%     164kB _ 0%  -80.16%  (p=0.000 n=20+20)
ToSlice/100000_unique_identifiers-8           9.25MB _ 0%    1.61MB _ 0%  -82.64%  (p=0.000 n=20+20)

name                                        old allocs/op  new allocs/op  delta
ToOrderedSlice/1_unique_identifiers-8           2.00 _ 0%      2.00 _ 0%     ~     (all equal)
ToOrderedSlice/10_unique_identifiers-8          6.00 _ 0%      2.00 _ 0%  -66.67%  (p=0.000 n=20+20)
ToOrderedSlice/100_unique_identifiers-8         9.00 _ 0%      2.00 _ 0%  -77.78%  (p=0.000 n=20+20)
ToOrderedSlice/1000_unique_identifiers-8        12.0 _ 0%       2.0 _ 0%  -83.33%  (p=0.000 n=20+20)
ToOrderedSlice/10000_unique_identifiers-8       21.0 _ 0%       2.0 _ 0%  -90.48%  (p=0.000 n=20+20)
ToOrderedSlice/100000_unique_identifiers-8      31.0 _ 0%       2.0 _ 0%  -93.55%  (p=0.000 n=20+20)
ToSlice/1_unique_identifiers-8                  1.00 _ 0%      1.00 _ 0%     ~     (all equal)
ToSlice/10_unique_identifiers-8                 5.00 _ 0%      1.00 _ 0%  -80.00%  (p=0.000 n=20+20)
ToSlice/100_unique_identifiers-8                8.00 _ 0%      1.00 _ 0%  -87.50%  (p=0.000 n=20+20)
ToSlice/1000_unique_identifiers-8               11.0 _ 0%       1.0 _ 0%  -90.91%  (p=0.000 n=20+20)
ToSlice/10000_unique_identifiers-8              20.0 _ 0%       1.0 _ 0%  -95.00%  (p=0.000 n=20+20)
ToSlice/100000_unique_identifiers-8             30.0 _ 0%       1.0 _ 0%  -96.67%  (p=0.000 n=20+20)

I recompiled my local Tanka using these improvements and now when profiling output this method isn't listed anymore, not even in the top 10 output.

This will be useful later on when implementing the performance improvements to compare how much we gain. Signed-off-by: Leandro López <leandro.lopez@grafana.com>

We know for sure how big the resulting slice will be, so instead of having Go perform checks and reallocations whenever the slice grows too small by using `append` we calculate the resulting size earlier and directly index each element into the output slice. This result in only two allocations all the time, no matter how big the resulting slice will be. With it also come big performance improvements: name old time/op new time/op delta ToOrderedSlice/1_unique_identifiers-8 173ns ±10% 150ns ± 4% -13.16% (p=0.000 n=10+9) ToOrderedSlice/10_unique_identifiers-8 1.24µs ± 6% 0.51µs ± 5% -58.95% (p=0.000 n=9+9) ToOrderedSlice/100_unique_identifiers-8 20.4µs ±20% 4.0µs ± 5% -80.25% (p=0.000 n=10+9) ToOrderedSlice/1000_unique_identifiers-8 253µs ± 9% 40µs ± 5% -84.19% (p=0.000 n=10+10) ToOrderedSlice/10000_unique_identifiers-8 3.44ms ± 9% 0.37ms ± 5% -89.32% (p=0.000 n=9+9) ToOrderedSlice/100000_unique_identifiers-8 48.6ms ±13% 3.6ms ± 4% -92.49% (p=0.000 n=10+10) name old alloc/op new alloc/op delta ToOrderedSlice/1_unique_identifiers-8 40.0B ± 0% 40.0B ± 0% ~ (all equal) ToOrderedSlice/10_unique_identifiers-8 520B ± 0% 184B ± 0% -64.62% (p=0.000 n=10+10) ToOrderedSlice/100_unique_identifiers-8 4.10kB ± 0% 1.82kB ± 0% -55.75% (p=0.000 n=10+10) ToOrderedSlice/1000_unique_identifiers-8 32.8kB ± 0% 16.4kB ± 0% -49.95% (p=0.000 n=10+10) ToOrderedSlice/10000_unique_identifiers-8 826kB ± 0% 164kB ± 0% -80.16% (p=0.000 n=10+10) ToOrderedSlice/100000_unique_identifiers-8 9.25MB ± 0% 1.61MB ± 0% -82.64% (p=0.000 n=10+10) name old allocs/op new allocs/op delta ToOrderedSlice/1_unique_identifiers-8 2.00 ± 0% 2.00 ± 0% ~ (all equal) ToOrderedSlice/10_unique_identifiers-8 6.00 ± 0% 2.00 ± 0% -66.67% (p=0.000 n=10+10) ToOrderedSlice/100_unique_identifiers-8 9.00 ± 0% 2.00 ± 0% -77.78% (p=0.000 n=10+10) ToOrderedSlice/1000_unique_identifiers-8 12.0 ± 0% 2.0 ± 0% -83.33% (p=0.000 n=10+10) ToOrderedSlice/10000_unique_identifiers-8 21.0 ± 0% 2.0 ± 0% -90.48% (p=0.000 n=10+10) ToOrderedSlice/100000_unique_identifiers-8 31.0 ± 0% 2.0 ± 0% -93.55% (p=0.000 n=10+10) For ToSLice we use the same changes except the sort operation. In this case the results are even more impressive: allocations are stable at just 1 per operation, regardless of set size, and performance is orders of magnitude faster: name old time/op new time/op delta ToSlice/1_unique_identifiers-8 83.0ns ± 1% 92.4ns ± 0% +11.37% (p=0.000 n=9+8) ToSlice/10_unique_identifiers-8 295ns ± 4% 749ns ± 3% +154.21% (p=0.000 n=10+10) ToSlice/100_unique_identifiers-8 2.33µs ± 2% 4.28µs ± 2% +83.62% (p=0.000 n=9+8) ToSlice/1000_unique_identifiers-8 25.4µs ±10% 34.7µs ± 1% +36.51% (p=0.000 n=10+10) ToSlice/10000_unique_identifiers-8 215µs ± 2% 543µs ± 1% +152.15% (p=0.000 n=9+10) ToSlice/100000_unique_identifiers-8 2.05ms ± 2% 7.17ms ± 1% +249.53% (p=0.000 n=9+9) name old alloc/op new alloc/op delta ToSlice/1_unique_identifiers-8 16.0B ± 0% 16.0B ± 0% ~ (all equal) ToSlice/10_unique_identifiers-8 160B ± 0% 496B ± 0% +210.00% (p=0.000 n=10+10) ToSlice/100_unique_identifiers-8 1.79kB ± 0% 4.08kB ± 0% +127.68% (p=0.000 n=10+10) ToSlice/1000_unique_identifiers-8 16.4kB ± 0% 32.8kB ± 0% +99.90% (p=0.000 n=10+10) ToSlice/10000_unique_identifiers-8 164kB ± 0% 826kB ± 0% +404.12% (p=0.000 n=8+10) ToSlice/100000_unique_identifiers-8 1.61MB ± 0% 9.25MB ± 0% +475.93% (p=0.000 n=10+10) name old allocs/op new allocs/op delta ToSlice/1_unique_identifiers-8 1.00 ± 0% 1.00 ± 0% ~ (all equal) ToSlice/10_unique_identifiers-8 1.00 ± 0% 5.00 ± 0% +400.00% (p=0.000 n=10+10) ToSlice/100_unique_identifiers-8 1.00 ± 0% 8.00 ± 0% +700.00% (p=0.000 n=10+10) ToSlice/1000_unique_identifiers-8 1.00 ± 0% 11.00 ± 0% +1000.00% (p=0.000 n=10+10) ToSlice/10000_unique_identifiers-8 1.00 ± 0% 20.00 ± 0% +1900.00% (p=0.000 n=10+10) ToSlice/100000_unique_identifiers-8 1.00 ± 0% 30.00 ± 0% +2900.00% (p=0.000 n=10+10) Signed-off-by: Leandro López <leandro.lopez@grafana.com>

sbarzowski · 2021-03-25T14:44:04Z

Nice!

inkel · 2021-03-25T15:04:19Z

Thank you @sbarzowski !

google-cla bot added the cla: yes label Mar 23, 2021

inkel added 2 commits March 23, 2021 14:28

Add IdentifierSet.ToOrderedSlice and ToSlice benchmarks

e4b3c17

This will be useful later on when implementing the performance improvements to compare how much we gain. Signed-off-by: Leandro López <leandro.lopez@grafana.com>

inkel force-pushed the identifierset-performance-improvements branch from aaf82f5 to e7d711f Compare March 23, 2021 17:29

sbarzowski merged commit eeca44c into google:master Mar 25, 2021

inkel deleted the identifierset-performance-improvements branch March 25, 2021 15:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

IdentifierSet performance improvements #525

IdentifierSet performance improvements #525

Uh oh!

inkel commented Mar 23, 2021 •

edited

Loading

Uh oh!

sbarzowski commented Mar 25, 2021

Uh oh!

inkel commented Mar 25, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

IdentifierSet performance improvements #525

IdentifierSet performance improvements #525

Uh oh!

Conversation

inkel commented Mar 23, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sbarzowski commented Mar 25, 2021

Uh oh!

inkel commented Mar 25, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

inkel commented Mar 23, 2021 •

edited

Loading