Optimize SlidingTimeWindowMovingAverages sumBuckets #4936
Conversation
The `SlidingTimeWindowMovingAverages` `sumBuckets()` method can be optimized to use indexed list access and avoid allocations: it currently allocates a `LongAdder` as well as one or two streams per call. If the 1, 5, or 15 minute rates are read on hot paths, these are unnecessarily expensive allocations on top of less optimized computation. We can avoid the allocations entirely and accumulate the sum directly into a `long` via direct indexed list access. See upstream PR dropwizard/metrics#4936
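As a minimal sketch of the idea (not the actual PR diff; the helper names and the simple modulo wraparound are illustrative assumptions), the stream-based accumulation versus the allocation-free indexed accumulation might look like:

```java
import java.util.List;
import java.util.concurrent.atomic.LongAdder;

final class SumBucketsSketch {

    private SumBucketsSketch() {}

    // Original style: allocates a LongAdder plus stream machinery on every call
    static long sumViaStream(List<LongAdder> buckets, int start, int count) {
        LongAdder result = new LongAdder();
        buckets.stream().skip(start).limit(count).forEach(bucket -> result.add(bucket.sum()));
        return result.sum();
    }

    // Optimized style: accumulate directly into a long via indexed access,
    // wrapping around the circular bucket list without allocating anything
    static long sumViaIndex(List<LongAdder> buckets, int start, int count) {
        long sum = 0L;
        for (int i = 0; i < count; i++) {
            sum += buckets.get((start + i) % buckets.size()).sum();
        }
        return sum;
    }
}
```

The indexed variant also handles the wraparound of the circular window in a single loop, which is why the original needs "one or two streams" (a second stream when the window spans the end of the list).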
```diff
  * time window (i.e. 15 minutes, see TIME_WINDOW_DURATION_MINUTES)
  */
-private ArrayList<LongAdder> buckets;
+private final List<LongAdder> buckets;
```
This is a fixed size and could just be an array.
```diff
  * time window (i.e. 15 minutes, see TIME_WINDOW_DURATION_MINUTES)
  */
-private ArrayList<LongAdder> buckets;
+private final LongAdder[] buckets;
```
A separate optimization would be to avoid the `LongAdder`s entirely and use an `AtomicLongArray`.
```diff
-private final LongAdder[] buckets;
+private final AtomicLongArray buckets;
```
Using `AtomicLongArray` cuts the optimized `getM1Rate` from ~60ns to ~45ns, but there is additional overhead on the concurrent update side, so I'm going to hold off on that change.
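The trade-off weighed here can be sketched as below (a hypothetical illustration, not the PR's code: with `AtomicLongArray`, every writer to a bucket CAS-contends on the same slot, whereas `LongAdder` stripes contended updates across internal cells):

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Sketch of the AtomicLongArray-backed variant considered above.
// Reads avoid per-call allocation, but concurrent writers to one
// bucket all contend on a single array slot.
final class AtomicLongArrayBuckets {
    private final AtomicLongArray buckets;

    AtomicLongArrayBuckets(int size) {
        this.buckets = new AtomicLongArray(size);
    }

    // Update side: a single CAS loop per increment; under write
    // contention this is slower than LongAdder's striped cells
    void record(int bucketIndex) {
        buckets.getAndIncrement(bucketIndex);
    }

    // Read side: plain volatile reads summed into a long, no allocation
    long sum(int start, int count) {
        long sum = 0L;
        for (int i = 0; i < count; i++) {
            sum += buckets.get((start + i) % buckets.length());
        }
        return sum;
    }
}
```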
```
Benchmark                          (recordings)                                    (type)  Mode  Cnt     Score    Error  Units
MovingAverageBenchmarks.getM1Rate            10           SlidingTimeWindowMovingAverages  avgt   20  1366.241 ± 24.341  ns/op
MovingAverageBenchmarks.getM1Rate            10  OptimizedSlidingTimeWindowMovingAverages  avgt   20    43.180 ±  0.153  ns/op
MovingAverageBenchmarks.getM1Rate          1000           SlidingTimeWindowMovingAverages  avgt   20  1377.545 ± 65.159  ns/op
MovingAverageBenchmarks.getM1Rate          1000  OptimizedSlidingTimeWindowMovingAverages  avgt   20    43.967 ±  2.058  ns/op
```
This reverts commit 29b0664.
Thanks for opening this PR @schlosna! The reasoning about fewer allocations and more performant array indexing makes sense, and the benchmarks show a clear performance improvement. It does not seem to reduce code readability either. I also checked whether `LongAdder` does anything special for long overflow and didn't see anything, so I believe this change is behavior-equivalent. Requesting review from maintainer @joschi
The `SlidingTimeWindowMovingAverages` `sumBuckets()` method can be optimized to use indexed list access and avoid allocations: it currently allocates a `LongAdder` as well as one or two streams per call. If the 1, 5, or 15 minute rates are read on hot paths, these are unnecessarily expensive allocations on top of less optimized computation. We can avoid the allocations entirely and accumulate the sum directly into a `long` via direct indexed list access.

[MovingAverageBenchmarks](https://github.com/palantir/tritium/blob/davids/OptimizedSlidingTimeWindowMovingAverages/tritium-jmh/src/jmh/java/com/palantir/tritium/microbenchmarks/MovingAverageBenchmarks.java) demonstrates the difference between the implementations in both execution time (22x faster) and allocations:

```
Benchmark                                             (recordings)                                    (type)  Mode  Cnt     Score     Error  Units
MovingAverageBenchmarks.getM1Rate                               10           SlidingTimeWindowMovingAverages  avgt   20  1364.969 ±  12.040  ns/op
MovingAverageBenchmarks.getM1Rate:gc.alloc.rate.norm            10           SlidingTimeWindowMovingAverages  avgt   20   688.037 ±   0.001   B/op
MovingAverageBenchmarks.getM1Rate                               10  OptimizedSlidingTimeWindowMovingAverages  avgt   20    59.182 ±   0.343  ns/op
MovingAverageBenchmarks.getM1Rate:gc.alloc.rate.norm            10  OptimizedSlidingTimeWindowMovingAverages  avgt   20     0.002 ±   0.001   B/op
MovingAverageBenchmarks.getM1Rate                             1000           SlidingTimeWindowMovingAverages  avgt   20  1401.864 ± 107.134  ns/op
MovingAverageBenchmarks.getM1Rate:gc.alloc.rate.norm          1000           SlidingTimeWindowMovingAverages  avgt   20   688.038 ±   0.003   B/op
MovingAverageBenchmarks.getM1Rate                             1000  OptimizedSlidingTimeWindowMovingAverages  avgt   20    61.157 ±   1.393  ns/op
MovingAverageBenchmarks.getM1Rate:gc.alloc.rate.norm          1000  OptimizedSlidingTimeWindowMovingAverages  avgt   20     0.002 ±   0.001   B/op
```
Co-authored-by: David Schlosnagle <davids@palantir.com>