Java Code Optimization for Hedera's Protobuf Serialization
My process to achieve a 5.6x performance improvement
When you’re building distributed financial systems that process thousands of transactions per second, every microsecond counts. This is the story of my first collaborative open source contribution: a performance optimization to Hedera Hashgraph’s Protocol Buffer implementation that achieved a 5.6x speedup.
Here’s what I learned about protocol-level optimization in the process.
Why Serialization Performance Matters
Hedera Hashgraph is a public distributed ledger that uses a novel consensus algorithm based on gossip and virtual voting. At its core, the network relies on efficient serialization and deserialization of protocol messages to maintain its high throughput.
Briefly, what distinguishes Hedera from a blockchain is that the hashgraph does not bundle transactions into a linear chain of blocks. This approach is lighter weight than Ethereum’s and keeps transaction costs low on the HBAR network.
The repository I contributed to, PBJ (Protocol Buffers for Java), is Hedera’s custom implementation optimized for their specific performance requirements. PBJ is designed for the unique demands of consensus protocols: minimal allocation overhead, predictable performance characteristics, and efficient memory access patterns.
The specific issue I tackled (#205) involved the DirectBufferedData.contains() method, which searches for byte patterns within buffered data. This operation happens frequently during message parsing and validation, making it a critical path for overall system performance.
The Investigation: Finding the Bottleneck
My first step was building comprehensive benchmarks with JMH (Java Microbenchmark Harness) to understand the performance characteristics across different pattern sizes. The real-world workload includes everything from small 4-byte patterns to larger 256-byte sequences.
The original implementation used a straightforward byte-by-byte comparison:
for (int i = 0; i <= length - patternLength; i++) {
    boolean match = true;
    for (int j = 0; j < patternLength; j++) {
        if (getByte(offset + i + j) != pattern[j]) {
            match = false;
            break;
        }
    }
    if (match) return true;
}
While simple and correct, this approach left performance on the table for larger patterns. Each byte access involved a bounds check and an individual memory read, rather than leveraging the bulk memory operations available in modern JVMs (Java Virtual Machines).
Running the initial benchmarks revealed the opportunity: for patterns larger than a few bytes, we could achieve significant speedups by using bulk comparison operations instead of byte-by-byte iteration.
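To make the bulk-versus-byte-by-byte idea concrete, here is a self-contained sketch. PBJ’s real code reads through DirectBufferedData’s accessors and uses its UnsafeUtils helper; in this stand-in, a plain byte[] and the range-based Arrays.equals overload (Java 9+, which the JVM vectorizes internally) play those roles, so the class and method names here are my own illustration, not PBJ’s API.

```java
import java.util.Arrays;

// Illustrative stand-in: a plain byte[] replaces DirectBufferedData's buffer,
// and Arrays.equals replaces UnsafeUtils.equals.
public class ContainsDemo {

    // Byte-by-byte scan, mirroring the shape of the original implementation.
    static boolean containsNaive(byte[] data, byte[] pattern) {
        for (int i = 0; i <= data.length - pattern.length; i++) {
            boolean match = true;
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    match = false;
                    break;
                }
            }
            if (match) return true;
        }
        return false;
    }

    // Bulk scan: the range-based Arrays.equals (Java 9+) compares a whole
    // region per candidate offset instead of one byte at a time.
    static boolean containsBulk(byte[] data, byte[] pattern) {
        for (int i = 0; i <= data.length - pattern.length; i++) {
            if (Arrays.equals(data, i, i + pattern.length, pattern, 0, pattern.length)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] data = "the quick brown fox jumps over the lazy dog".getBytes();
        System.out.println(containsNaive(data, "lazy".getBytes())); // true
        System.out.println(containsBulk(data, "lazy".getBytes()));  // true
        System.out.println(containsBulk(data, "cat".getBytes()));   // false
    }
}
```

Both methods return the same answers; the difference is purely in how much work each candidate offset costs, which is exactly what the benchmarks measure.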
The Solution: Bulk Memory Operations
The optimization involved two key changes:
1. Implementing bulk comparison using UnsafeUtils
For larger patterns, I replaced the byte-by-byte loop with bulk memory operations:
private boolean bulkContains(final long offset, final long length, final byte[] pattern) {
    final int patternLength = pattern.length;
    for (long i = 0; i <= length - patternLength; i++) {
        if (UnsafeUtils.equals(buffer, offset + i, pattern, 0, patternLength)) {
            return true;
        }
    }
    return false;
}
UnsafeUtils.equals() is a utility method that performs optimized bulk memory comparison, taking advantage of CPU-level support for comparing whole memory regions at once.
2. Finding the optimal threshold
The tricky part was determining when to use bulk operations versus byte-by-byte comparison. Small patterns actually perform worse with bulk operations due to method call overhead and array allocation.
I built a threshold testing utility (DirectBufferedDataThresholdTest) that measured performance across different cutoff points: 4, 8, 16, 32, and 64 bytes.
Initial testing with an 8-byte threshold showed a regression for small patterns. After comprehensive analysis, I settled on 32 bytes as the optimal threshold:
Patterns ≤32 bytes: Use the original byte-by-byte comparison (no allocation overhead)
Patterns >32 bytes: Use bulk operations (significant speedup)
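Put together, the dispatch looks roughly like the sketch below. The BULK_THRESHOLD constant name is my own, and Arrays.equals again stands in for UnsafeUtils.equals; the actual PBJ code operates on DirectBufferedData’s backing buffer rather than a plain array.

```java
import java.util.Arrays;

// Sketch of the threshold dispatch. BULK_THRESHOLD and the use of
// Arrays.equals in place of UnsafeUtils.equals are illustrative assumptions.
public class ThresholdContains {

    static final int BULK_THRESHOLD = 32; // cutoff found via benchmarking

    static boolean contains(byte[] data, byte[] pattern) {
        if (pattern.length == 0) return true;
        if (pattern.length <= BULK_THRESHOLD) {
            // Small patterns: byte-by-byte avoids per-call overhead.
            for (int i = 0; i <= data.length - pattern.length; i++) {
                boolean match = true;
                for (int j = 0; j < pattern.length; j++) {
                    if (data[i + j] != pattern[j]) { match = false; break; }
                }
                if (match) return true;
            }
            return false;
        }
        // Large patterns: bulk region comparison per candidate offset.
        for (int i = 0; i <= data.length - pattern.length; i++) {
            if (Arrays.equals(data, i, i + pattern.length, pattern, 0, pattern.length)) {
                return true;
            }
        }
        return false;
    }
}
```

The two branches are behaviorally identical, so the threshold is a pure performance knob: it can be re-tuned later if the workload’s pattern-size distribution shifts.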
Results: 5.6x Improvement for Large Patterns
The final benchmarks showed significant improvements:
For small patterns, we maintained baseline performance with no regression. For larger patterns (the target use case), we achieved substantial speedups that compound across the thousands of serialization operations happening per second in a production Hedera network.
The pull request was reviewed by the core team and merged into main, validating that the optimization addressed a real performance concern in production systems.
Lessons Learned: Performance Optimization in Practice
This contribution reinforced several key principles about protocol-level optimization:
1. Measure, don’t assume
The 8-byte threshold seemed reasonable initially, but comprehensive benchmarking revealed it caused regressions. Only by testing across the full range of realistic inputs did the 32-byte threshold emerge as optimal.
2. Performance optimization involves trade-offs
There’s no single “best” implementation. The optimal approach depends on the distribution of your actual workload. The 32-byte threshold balances the common case (smaller patterns) against the high-impact case (larger patterns).
3. Benchmarking methodology matters
JMH’s warmup periods, iteration counts, and garbage collection controls are critical for accurate measurements. Without proper methodology, you’re optimizing based on noise rather than signal.
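For reference, a typical JMH setup for this kind of measurement looks like the configuration sketch below. The annotation values and class names are illustrative, not the exact settings from my benchmark suite, and the code needs the org.openjdk.jmh dependency to compile.

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// Illustrative JMH configuration (requires the org.openjdk.jmh:jmh-core
// dependency); values shown are examples, not the PBJ suite's exact settings.
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1)        // let the JIT compile hot paths first
@Measurement(iterations = 5, time = 1)   // measure only after warmup
@Fork(value = 1, jvmArgs = {"-Xms2g", "-Xmx2g"}) // fixed heap to tame GC noise
public class ContainsBenchmark {

    @Param({"4", "16", "32", "64", "256"}) // pattern sizes from the workload
    int patternLength;

    byte[] data;
    byte[] pattern;

    @Setup
    public void setup() {
        data = new byte[4096];
        for (int i = 0; i < data.length; i++) data[i] = (byte) (i * 31);
        // place the pattern at the end so the scan does real work
        pattern = new byte[patternLength];
        System.arraycopy(data, data.length - patternLength, pattern, 0, patternLength);
    }

    @Benchmark
    public boolean containsNaive() {
        for (int i = 0; i <= data.length - pattern.length; i++) {
            boolean match = true;
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) { match = false; break; }
            }
            if (match) return true;
        }
        return false;
    }
}
```

The @Param axis is what surfaces threshold regressions: each pattern size gets its own warmed-up, forked measurement rather than one averaged number.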
4. Code review improves the solution
The Hedera team’s feedback during review, particularly around the small array regression, led me to the threshold analysis that made the final solution robust. Open source collaboration works.
Why This Matters
This optimization is now part of Hedera’s core protocol infrastructure, improving performance for every node in the network. While a 5.6x speedup on a single operation might seem narrow in scope, these improvements compound across the system.
More broadly, this work exemplifies the kind of engineering I’m drawn to: finding performance bottlenecks through systematic measurement, implementing solutions that balance trade-offs, and validating impact through rigorous testing. It’s the same mindset I applied to distributed SCADA systems for power grids. Whether you’re coordinating consensus in a blockchain or managing real-time control in electrical infrastructure, performance and reliability are paramount.
The full implementation is available in PR #605 on Hedera’s PBJ repository. (PR stands for pull request, the mechanism GitHub uses for code review.)
I’m looking forward to learning more about Hedera, blockchain protocols, and benchmarking, and I’m excited to keep building.
About me: I’m a distributed systems engineer transitioning from mission-critical SCADA infrastructure into blockchain protocol engineering. Previously spent 9 years building high-availability systems for electrical utilities at AspenTech. Currently contributing to Hedera Hashgraph and exploring protocol engineering opportunities. Connect with me on LinkedIn.