crypto: Proof of Concept for iterative version of SimpleHashFromByteSlices (#2611)#3530

Merged
xla merged 3 commits into tendermint:develop from brapse:2611-simple-hash-from-byte-slice-iterative
Apr 18, 2019
Conversation

@brapse (Contributor) commented Apr 4, 2019

(#2611) had suggested that an iterative version of
SimpleHashFromByteSlice would be faster, presumably because
we can envision some overhead accumulating from stack
frames and function calls. Additionally, a recursive algorithm risks
hitting the stack limit and causing a stack overflow should the tree
be too large.

Provided here is an iterative alternative, a simple test to assert
correctness and a benchmark. On the performance side, there appears to
be no overall difference:

```
BenchmarkSimpleHashAlternatives/recursive-4                20000 77677 ns/op
BenchmarkSimpleHashAlternatives/iterative-4                20000 76802 ns/op
```

On the surface it might seem that any difference in overhead is due to
the different allocation patterns of the implementations. The recursive
version uses a single `[][]byte` slice, which it re-slices at each level of the tree.
The iterative version copies the `[][]byte` once within the function and
then rewrites sub-slices of that array at each level of the tree.

Experimenting by modifying the code to simply calculate the
hash and not store the result shows little to no difference in performance.

These preliminary results suggest:

  1. The performance of the current implementation is pretty good
  2. Go has low overhead for recursive functions
  3. The performance of the SimpleHashFromByteSlice routine is dominated
    by the actual hashing of data

Although this work is in no way exhaustive, point #3 suggests that
optimization of this routine would need to take an alternative
approach to make significant improvements over the current performance.

Finally, considering that the recursive implementation is easier to
read, it might not be worthwhile to switch to a less intuitive
implementation for so little benefit.

Follow up based on comments:

If you want to go deeper, it may be interesting to compare machine code for these two versions.

Indeed! I'm not actually familiar with machine code, so I just did a quick profile to get a sense of where the CPU was spending its time.

```
package merkle

import (
	"log"
	"os"
	"runtime/pprof"
	"testing"

	"github.com/tendermint/tendermint/crypto/tmhash"
	cmn "github.com/tendermint/tendermint/libs/common"
)

// testItem is defined elsewhere in this package's tests.

func BenchmarkSimpleHashAlternatives(b *testing.B) {
	total := 10000

	items := make([][]byte, total)
	for i := 0; i < total; i++ {
		items[i] = testItem(cmn.RandBytes(tmhash.Size))
	}

	// Profile the recursive implementation into its own file.
	cpuprof, err := os.Create("recursive_benchmark.cpu.prof")
	if err != nil {
		log.Fatal("could not create CPU profile: ", err)
	}
	if err := pprof.StartCPUProfile(cpuprof); err != nil {
		log.Fatal("could not start CPU profile: ", err)
	}
	b.ResetTimer()
	b.Run("recursive", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			_ = SimpleHashFromByteSlices(items)
		}
	})
	pprof.StopCPUProfile()
	cpuprof.Close()

	// Profile the iterative implementation into a separate file.
	cpuprof2, err := os.Create("iterative_benchmark.cpu.prof")
	if err != nil {
		log.Fatal("could not create CPU profile: ", err)
	}
	if err := pprof.StartCPUProfile(cpuprof2); err != nil {
		log.Fatal("could not start CPU profile: ", err)
	}
	b.ResetTimer()
	b.Run("iterative", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			_ = SimpleHashFromByteSlicesIterative(items)
		}
	})
	pprof.StopCPUProfile()
	cpuprof2.Close()
}
```

Which generated the following profiles:
recursive_benchmark.cpu.prof.zip

iterative_benchmark.cpu.prof.zip

Taking a peek at the CPU profiles, the top items look similar, substantiating the hypothesis that the performance is dominated by the actual hashing of data. I'm definitely not an expert on profiling Go code, so I've included the details here in case anyone with more expertise wants to chime in.

```
$ go tool pprof -text crypto/merkle/iterative_benchmark.cpu.prof | head -n 20
Type: cpu
Time: Apr 11, 2019 at 6:44pm (CEST)
Duration: 2.65s, Total samples = 2.28s (86.03%)
Showing nodes accounting for 2.23s, 97.81% of 2.28s total
Dropped 16 nodes (cum <= 0.01s)
      flat  flat%   sum%        cum   cum%
     0.77s 33.77% 33.77%      0.77s 33.77%  crypto/sha256.block
     0.74s 32.46% 66.23%      0.74s 32.46%  runtime.kevent
     0.11s  4.82% 71.05%      0.11s  4.82%  runtime.pthread_cond_timedwait_relative_np
     0.07s  3.07% 74.12%      0.18s  7.89%  runtime.mallocgc
     0.06s  2.63% 76.75%      0.06s  2.63%  runtime.memmove
     0.06s  2.63% 79.39%      0.06s  2.63%  runtime.pthread_cond_signal
     0.06s  2.63% 82.02%      0.06s  2.63%  runtime.pthread_cond_wait
     0.05s  2.19% 84.21%      0.05s  2.19%  runtime.nextFreeFast (inline)
     0.04s  1.75% 85.96%      0.83s 36.40%  crypto/sha256.(*digest).Write
     0.04s  1.75% 87.72%      0.19s  8.33%  runtime.growslice
     0.03s  1.32% 89.04%      0.71s 31.14%  github.com/tendermint/tendermint/crypto/merkle.innerHash
     0.02s  0.88% 89.91%      0.56s 24.56%  crypto/sha256.(*digest).checkSum
     0.02s  0.88% 90.79%      0.96s 42.11%  github.com/tendermint/tendermint/crypto/tmhash.Sum
     0.02s  0.88% 91.67%      0.02s  0.88%  runtime.findObject
```

```
$ go tool pprof -text crypto/merkle/recursive_benchmark.cpu.prof | head -n 20
Type: cpu
Time: Apr 11, 2019 at 6:44pm (CEST)
Duration: 2.55s, Total samples = 2.32s (91.08%)
Showing nodes accounting for 2.24s, 96.55% of 2.32s total
Dropped 33 nodes (cum <= 0.01s)
      flat  flat%   sum%        cum   cum%
     0.60s 25.86% 25.86%      0.60s 25.86%  crypto/sha256.block
     0.51s 21.98% 47.84%      0.51s 21.98%  runtime.kevent
     0.47s 20.26% 68.10%      0.57s 24.57%  runtime.stkbucket
     0.13s  5.60% 73.71%      0.13s  5.60%  runtime.memmove
     0.10s  4.31% 78.02%      0.10s  4.31%  runtime.pthread_cond_wait
     0.08s  3.45% 81.47%      0.75s 32.33%  runtime.mallocgc
     0.07s  3.02% 84.48%      0.07s  3.02%  runtime.pthread_cond_signal
     0.04s  1.72% 86.21%      0.61s 26.29%  runtime.mProf_Malloc
     0.04s  1.72% 87.93%      0.04s  1.72%  runtime.pthread_cond_timedwait_relative_np
     0.04s  1.72% 89.66%      0.04s  1.72%  runtime.usleep
     0.03s  1.29% 90.95%      0.64s 27.59%  crypto/sha256.(*digest).Write
     0.03s  1.29% 92.24%      0.69s 29.74%  crypto/sha256.Sum256
     0.02s  0.86% 93.10%      0.62s 26.72%  runtime.growslice
     0.02s  0.86% 93.97%      0.02s  0.86%  runtime.memclrNoHeapPointers
```

```
$ go tool pprof -text -diff_base=crypto/merkle/recursive_benchmark.cpu.prof  crypto/merkle/iterative_benchmark.cpu.prof | head -n 20
Type: cpu
Time: Apr 11, 2019 at 6:44pm (CEST)
Duration: 5.20s, Total samples = 2.32s (44.64%)
Showing nodes accounting for -0.04s, 1.72% of 2.32s total
      flat  flat%   sum%        cum   cum%
    -0.47s 20.26% 20.26%     -0.57s 24.57%  runtime.stkbucket
     0.23s  9.91% 10.34%      0.23s  9.91%  runtime.kevent
     0.17s  7.33%  3.02%      0.17s  7.33%  crypto/sha256.block
    -0.07s  3.02%  6.03%     -0.07s  3.02%  runtime.memmove
     0.07s  3.02%  3.02%      0.07s  3.02%  runtime.pthread_cond_timedwait_relative_np
    -0.04s  1.72%  4.74%     -0.61s 26.29%  runtime.mProf_Malloc
     0.04s  1.72%  3.02%      0.04s  1.72%  runtime.nextFreeFast (inline)
    -0.04s  1.72%  4.74%     -0.04s  1.72%  runtime.pthread_cond_wait
    -0.04s  1.72%  6.47%     -0.04s  1.72%  runtime.usleep
    -0.02s  0.86%  7.33%      0.18s  7.76%  crypto/sha256.Sum256
     0.02s  0.86%  6.47%     -0.23s  9.91%  github.com/tendermint/tendermint/crypto/merkle.innerHash
     0.02s  0.86%  5.60%      0.08s  3.45%  github.com/tendermint/tendermint/crypto/tmhash.Sum
     0.02s  0.86%  4.74%      0.02s  0.86%  runtime.findObject
     0.02s  0.86%  3.88%     -0.43s 18.53%  runtime.growslice
     0.02s  0.86%  3.02%      0.02s  0.86%  runtime.nanotime
```
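
As an aside, the same kind of profiles can be produced without the manual `pprof.StartCPUProfile` calls by using the testing package's built-in flags. This is a sketch assuming the benchmark above lives in `crypto/merkle`; since `-cpuprofile` writes a single file per run, each sub-benchmark is selected with `-bench` in its own invocation (the output filenames here are arbitrary):

```
go test ./crypto/merkle -run='^$' -bench='BenchmarkSimpleHashAlternatives/recursive' -cpuprofile=recursive_benchmark.cpu.prof
go test ./crypto/merkle -run='^$' -bench='BenchmarkSimpleHashAlternatives/iterative' -cpuprofile=iterative_benchmark.cpu.prof
go tool pprof -text recursive_benchmark.cpu.prof | head -n 20
```

`-run='^$'` skips ordinary tests so only the benchmark contributes samples to the profile.
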
  • Updated all relevant documentation in docs
  • Updated all code comments where relevant
  • Wrote tests
  • Updated CHANGELOG_PENDING.md

@brapse brapse requested review from ebuchman, melekes and xla as code owners April 4, 2019 13:09
@codecov-io commented Apr 4, 2019

Codecov Report

Merging #3530 into develop will increase coverage by 0.1%.
The diff coverage is 91.3%.

```
@@            Coverage Diff            @@
##           develop   #3530     +/-   ##
=========================================
+ Coverage    64.09%   64.2%   +0.1%     
=========================================
  Files          213     213             
  Lines        17359   17387     +28     
=========================================
+ Hits         11127   11164     +37     
+ Misses        5304    5303      -1     
+ Partials       928     920      -8
```
| Impacted Files | Coverage Δ |
|---|---|
| crypto/merkle/simple_tree.go | 77.08% <91.3%> (+13.08%) ⬆️ |
| p2p/pex/pex_reactor.go | 79.55% <0%> (-1.75%) ⬇️ |
| p2p/netaddress.go | 69.54% <0%> (-1.15%) ⬇️ |
| rpc/client/httpclient.go | 65.62% <0%> (-0.9%) ⬇️ |
| p2p/transport.go | 84.27% <0%> (-0.54%) ⬇️ |
| consensus/reactor.go | 71.07% <0%> (-0.48%) ⬇️ |
| p2p/peer.go | 60.4% <0%> (-0.15%) ⬇️ |
| node/node.go | 63.81% <0%> (-0.08%) ⬇️ |
| p2p/pex/file.go | 78.04% <0%> (ø) ⬆️ |
| p2p/switch.go | 65.28% <0%> (ø) ⬆️ |

... and 9 more

@melekes (Contributor) commented Apr 8, 2019

Cool analysis 🍪 Thanks for sharing!

If you want to go deeper, it may be interesting to compare machine code for these two versions.

Finally, considering that the recursive implementation is easier to read, it might not be worthwhile to switch to a less intuitive implementation for so little benefit.

Agree here 👍

@liamsi (Contributor) commented Apr 8, 2019

Very cool indeed! Thanks for the extensive write-up, too.

Referencing #3313, which suggests there is a memory leak in crypto/merkle. It might be worth revisiting that issue with the iterative code, too. It might also be interesting to experiment with different numbers of elements (currently 100) and see if there are any differences with much larger (or very small) numbers of items.

@xla xla changed the title Proof of Concept: Iterative version of SimpleHashFromByteSlices (#2611) crypto: Proof of Concept for iterative version of SimpleHashFromByteSlices (#2611) Apr 12, 2019
@brapse (Contributor, Author) commented Apr 12, 2019

Thanks for the great feedback 🙇

Regarding the tree size, I had hoped that a bigger tree (and deeper stacks) would produce performance gains for the iterative version, but that did not turn out to be the case (up to tree size 1,000,000).

Also, I edited the PR description to include CPU profiles, which seem to provide some evidence that the crypto/sha256.block routine does indeed dominate the overall performance.

@xla (Contributor) left a comment
👍 :octocat: 🐙 :shipit: 💃

Great contribution, thanks so much for diving into this and sharing your findings.

@xla xla merged commit 671c5c9 into tendermint:develop Apr 18, 2019
@melekes mentioned this pull request May 7, 2019
brapse added a commit to brapse/tendermint that referenced this pull request Jun 5, 2019
…lices (tendermint#2611) (tendermint#3530)


* re-add slice re-writing
* [crypto] Document SimpleHashFromByteSlicesIterative
iKapitonau pushed a commit to scrtlabs/tendermint that referenced this pull request Sep 30, 2024
…ort tendermint#3530) (tendermint#3559)

Running `make all` would cause an error: `make: *** No rule to make
target 'check', needed by 'all'. Stop.`

The `check` target has already been removed, so let's delete it.

This is an automatic backport of pull request tendermint#3530 done by [Mergify](https://mergify.com).

Co-authored-by: Halimao <1065621723@qq.com>
cboh4 pushed a commit to scrtlabs/tendermint that referenced this pull request Apr 7, 2025
…ort tendermint#3530) (tendermint#3561)

Running `make all` would cause an error: `make: *** No rule to make
target 'check', needed by 'all'. Stop.`

The `check` target has already been removed, so let's delete it.

This is an automatic backport of pull request tendermint#3530 done by [Mergify](https://mergify.com).

---------

Co-authored-by: Halimao <1065621723@qq.com>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
