crypto: Proof of Concept for iterative version of SimpleHashFromByteSlices (#2611) #3530
Conversation
(tendermint#2611) had suggested that an iterative version of SimpleHashFromByteSlice would be faster, presumably because we can envision some overhead accumulating from stack frames and function calls. Additionally, a recursive algorithm risks hitting the stack limit and causing a stack overflow should the tree be too large.

Provided here is an iterative alternative, a simple test to assert correctness, and a benchmark. On the performance side, there appears to be no overall difference:

```
BenchmarkSimpleHashAlternatives/recursive-4   20000   77677 ns/op
BenchmarkSimpleHashAlternatives/iterative-4   20000   76802 ns/op
```

On the surface it might seem that the additional overhead is due to the different allocation patterns of the implementations. The recursive version uses a single `[][]byte` slice which it then re-slices at each level of the tree. The iterative version reproduces the `[][]byte` once within the function and then rewrites sub-slices of that array at each level of the tree. Experimenting by modifying the code to simply calculate the hash and not store the result shows little to no difference in performance.

These preliminary results suggest:

1. The performance of the current implementation is pretty good
2. Go has low overhead for recursive functions
3. The performance of the SimpleHashFromByteSlice routine is dominated by the actual hashing of data

Although this work is in no way exhaustive, point 3 suggests that optimizations of this routine would need to take an alternative approach to make significant improvements on the current performance. Finally, considering that the recursive implementation is easier to read, it might not be worthwhile to switch to a less intuitive implementation for so little benefit.
Codecov Report
@@ Coverage Diff @@
## develop #3530 +/- ##
=========================================
+ Coverage 64.09% 64.2% +0.1%
=========================================
Files 213 213
Lines 17359 17387 +28
=========================================
+ Hits 11127 11164 +37
+ Misses 5304 5303 -1
+ Partials 928 920 -8
Cool analysis 🍪 Thanks for sharing! If you want to go deeper, it may be interesting to compare machine code for these two versions.
Agree here 👍
Very cool indeed! Thanks for the extensive write-up, too. Referencing #3313, which suggests there is a memory leak in crypto/merkle. It might be worth revisiting that issue with the iterative code, too. It might also be interesting to experiment with different numbers of elements (currently 100) and see whether there are any differences with much larger (or very small) numbers of items.
Thanks for the great feedback 🙇 Regarding the tree size, I had hoped that a bigger tree (and deeper stacks) would produce performance gains for the iterative version, but that did not turn out to be the case (up to tree size 1,000,000). Also, I edited the PR description to include CPU profiles, which seem to provide some evidence that the hashing itself dominates.
xla
left a comment
👍
🐙
💃
Great contribution, thanks so much for diving into this and sharing your findings.
…lices (tendermint#2611) (tendermint#3530)
* re-add slice re-writing
* [crypto] Document SimpleHashFromByteSlicesIterative
…ort tendermint#3530) (tendermint#3559) Running `make all` would cause an error: `make: *** No rule to make target 'check', needed by 'all'. Stop.` The `check` target has already been removed, so let's delete it.
This is an automatic backport of pull request tendermint#3530 done by [Mergify](https://mergify.com). Co-authored-by: Halimao <1065621723@qq.com>
…ort tendermint#3530) (tendermint#3561) Running `make all` would cause an error: `make: *** No rule to make target 'check', needed by 'all'. Stop.` The `check` target has already been removed, so let's delete it.
This is an automatic backport of pull request tendermint#3530 done by [Mergify](https://mergify.com). Co-authored-by: Halimao <1065621723@qq.com> Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
(#2611) had suggested that an iterative version of
SimpleHashFromByteSlice would be faster, presumably because
we can envision some overhead accumulating from stack
frames and function calls. Additionally, a recursive algorithm risks
hitting the stack limit and causing a stack overflow should the tree
be too large.
Provided here is an iterative alternative, a simple test to assert
correctness and a benchmark. On the performance side, there appears to
be no overall difference:

```
BenchmarkSimpleHashAlternatives/recursive-4   20000   77677 ns/op
BenchmarkSimpleHashAlternatives/iterative-4   20000   76802 ns/op
```
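A benchmark of this shape can be reproduced with Go's `testing.Benchmark` helper. The sketch below is an assumption about the harness, not the PR's actual benchmark file: it uses a single placeholder `hashAll` function where the real benchmark would call the recursive and iterative implementations, and it reuses the PR's item count of 100.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"testing"
)

// hashAll is a placeholder standing in for either implementation; the
// real benchmark would call SimpleHashFromByteSlices and its iterative
// counterpart here.
func hashAll(items [][]byte) [32]byte {
	h := sha256.New()
	for _, it := range items {
		h.Write(it)
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

func main() {
	// The PR's benchmark used 100 items.
	items := make([][]byte, 100)
	for i := range items {
		items[i] = []byte{byte(i)}
	}
	for _, name := range []string{"recursive", "iterative"} {
		r := testing.Benchmark(func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				hashAll(items) // swap in the variant under test here
			}
		})
		fmt.Printf("%s: %s\n", name, r)
	}
}
```

Running the two variants through the same `testing.Benchmark` call keeps the measurement conditions identical, which is what makes the ns/op figures in the table directly comparable.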
On the surface it might seem that the additional overhead is due to
the different allocation patterns of the implementations. The recursive
version uses a single `[][]byte` slice which it then re-slices at each
level of the tree. The iterative version reproduces the `[][]byte` once
within the function and then rewrites sub-slices of that array at each
level of the tree. Experimenting by modifying the code to simply
calculate the hash and not store the result shows little to no
difference in performance.
These preliminary results suggest:

1. The performance of the current implementation is pretty good
2. Go has low overhead for recursive functions
3. The performance of the SimpleHashFromByteSlice routine is dominated
by the actual hashing of data
Although this work is in no way exhaustive, point #3 suggests that
optimization of this routine would need to take an alternative
approach to make significant improvements on the current performance.
Finally, considering that the recursive implementation is easier to
read, it might not be worthwhile to switch to a less intuitive
implementation for so little benefit.
Follow up based on comments:
Indeed! I'm not actually familiar with machine code, so I just did a quick profile to get a sense of where the CPU was spending its time.
Which generated the following profiles:
recursive_benchmark.cpu.prof.zip
iterative_benchmark.cpu.prof.zip
Taking a peek at the CPU profiles, the top items look similar, substantiating the hypothesis that the performance is dominated by the actual hashing of data. I'm definitely not an expert on profiling Go code, so I threw the details in here in case anyone with more expertise wants to chime in.