perf(levm): optimize PUSH opcode handlers#6390
Conversation
Replace U256::from_big_endian (runtime slice) with u256_from_big_endian_const (const-generic array) so the compiler knows N at monomorphization time, enabling compile-time offset computation and better auto-vectorization of the byte copy for every PUSHn variant.
…-limb values Store each U256 limb individually instead of using a struct assignment or copy_nonoverlapping. This prevents LLVM from grouping limbs[1..3] into a [24 x i8] stack alloca that then requires a memset + memcpy round-trip to reach the EVM stack slot. After the fix, LLVM sees all 4 limbs as independent i64 scalars and can prove that the upper limbs are zero constants for PUSH1-PUSH31, reducing the EVM stack write from 5 memory ops (3 spills + 2 reloads) to 2 stp instructions using the hardware zero register. Before (PUSH1 fast path): str x9, [x11] ; limb[0] ldur q0, [sp, #8] ; reload zeros from frame stur q0, [x11, #8] ; limbs[1..2] ldr x9, [sp, #24] ; reload zeros from frame str x9, [x11, #24] ; limb[3] After (PUSH1 fast path): stp x9, xzr, [x11] ; limb[0] + limb[1] stp xzr, xzr, [x11, #16] ; limb[2] + limb[3]
🤖 Kimi Code ReviewThe changes are performance optimizations for EVM stack operations and PUSH opcode handling. No security vulnerabilities or consensus-critical bugs detected.
|
Lines of code reportTotal lines added: Detailed view |
🤖 Claude Code ReviewHere is my review of PR 6390: PR Review:
|
🤖 Codex Code ReviewFindings
Residual Risk
Verification
Automated review by OpenAI Codex · gpt-5.4 · custom prompt |
Benchmark Results ComparisonNo significant difference was registered for any benchmark run. Detailed ResultsBenchmark Results: BubbleSort
Benchmark Results: ERC20Approval
Benchmark Results: ERC20Mint
Benchmark Results: ERC20Transfer
Benchmark Results: Factorial
Benchmark Results: FactorialRecursive
Benchmark Results: Fibonacci
Benchmark Results: FibonacciRecursive
Benchmark Results: ManyHashes
Benchmark Results: MstoreBench
Benchmark Results: Push
Benchmark Results: SstoreBench_no_opt
|
🤖 Kimi Code ReviewThe changes look correct and represent legitimate micro-optimizations for EVM execution. A few specific observations:
Consensus & Security
Suggestion Overall: LGTM - These are well-documented, targeted optimizations that maintain correctness. Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt |
Greptile SummaryThis PR applies two assembly-level micro-optimizations to the PUSH1–PUSH32 opcode handlers in LEVM: switching to a const-generic Key observations:
Confidence Score: 4/5
|
| Filename | Overview |
|---|---|
| crates/vm/levm/src/opcode_handlers/push.rs | Switches to const-generic u256_from_big_endian_const for better LLVM codegen. Also silently fixes a correctness bug in the truncated-bytecode edge case (partial PUSH literal at end-of-code now zero-pads correctly on the right instead of the left), but this fix is undocumented in the PR description. |
| crates/vm/levm/src/call_frame.rs | Replaces copy_nonoverlapping with explicit per-limb assignments in Stack::push. Semantically equivalent; helps LLVM treat the four u64 limbs as independent scalars, eliminating the alloca spill/reload round-trip for PUSH1–PUSH31 where upper limbs are constant zeros. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["OpPushHandler<N>::eval"] --> B["Read literal_offset from bytecode"]
B --> C{bytecode.get\nliteral_offset..}
C -->|"Some(data)\ndata.len() >= N"| D["buf = [0u8; N]\nbuf.copy_from_slice(&data[..N])\nu256_from_big_endian_const(buf)"]
C -->|"Some(data)\ndata.len() < N\n(truncated bytecode)"| E["bytes = [0u8; N]\nbytes[..data.len()].copy_from_slice(data)\nu256_from_big_endian_const(bytes)\n✅ correct: zero-pad on right"]
C -->|None| F["U256::zero()"]
D --> G["Stack::push(value)"]
E --> G
F --> G
G --> H["get_unchecked_mut(next_offset)"]
H --> I["slot.0[0] = value.0[0]\nslot.0[1] = value.0[1]\nslot.0[2] = value.0[2]\nslot.0[3] = value.0[3]"]
I --> J["self.offset = next_offset"]
J --> K["OpcodeResult::Continue"]
style E fill:#d4edda,stroke:#28a745
style D fill:#cfe2ff,stroke:#0d6efd
Prompt To Fix All With AI
This is a comment left during a code review.
Path: crates/vm/levm/src/opcode_handlers/push.rs
Line: 52-56
Comment:
**Silent behavioral change in truncated-bytecode path**
This arm is reached when `data.len() < N` (the PUSH literal extends beyond the end of bytecode). The old code and the new code behave **differently** here, and that difference isn't noted in the PR description.
**Old code:**
```rust
let mut bytes = [0; 32]; // 32-byte buffer
bytes[..data.len()].copy_from_slice(data); // data placed at MSB end
U256::from_big_endian(&bytes)
// → data is treated as the most-significant bytes of a 32-byte word
// → PUSH4 with [0x01, 0x02] remaining → 0x0102_0000…0000 (huge)
```
**New code:**
```rust
let mut bytes = [0u8; N]; // N-byte buffer (e.g. 4 for PUSH4)
bytes[..data.len()].copy_from_slice(data); // data placed at MSB end of the N-byte window
u256_from_big_endian_const(bytes)
// → pads [data | zeros] to 32 bytes at the LSB side
// → PUSH4 with [0x01, 0x02] remaining → 0x01020000 ✓
```
The new behaviour is **correct** per the EVM specification (missing bytes are treated as `0x00` appended to the right of the literal), while the old code was effectively shifting the partial literal to the most-significant word position, producing a wildly wrong value.
This is a correctness fix and should be explicitly called out in the PR description so that reviewers and bisect tooling don't accidentally revert it. It would also be worth adding a targeted unit test for this edge case (e.g. a contract whose bytecode ends mid-`PUSH4`) to lock in the correct behaviour.
How can I resolve this? If you propose a fix, please make it concise.Last reviewed commit: "perf(levm): eliminat..."
| Some(data) => { | ||
| let mut bytes = [0u8; N]; | ||
| bytes[..data.len()].copy_from_slice(data); | ||
| u256_from_big_endian_const(bytes) | ||
| } |
There was a problem hiding this comment.
Silent behavioral change in truncated-bytecode path
This arm is reached when data.len() < N (the PUSH literal extends beyond the end of bytecode). The old code and the new code behave differently here, and that difference isn't noted in the PR description.
Old code:
let mut bytes = [0; 32]; // 32-byte buffer
bytes[..data.len()].copy_from_slice(data); // data placed at MSB end
U256::from_big_endian(&bytes)
// → data is treated as the most-significant bytes of a 32-byte word
// → PUSH4 with [0x01, 0x02] remaining → 0x0102_0000…0000 (huge)New code:
let mut bytes = [0u8; N]; // N-byte buffer (e.g. 4 for PUSH4)
bytes[..data.len()].copy_from_slice(data); // data placed at MSB end of the N-byte window
u256_from_big_endian_const(bytes)
// → pads [data | zeros] to 32 bytes at the LSB side
// → PUSH4 with [0x01, 0x02] remaining → 0x01020000 ✓The new behaviour is correct per the EVM specification (missing bytes are treated as 0x00 appended to the right of the literal), while the old code was effectively shifting the partial literal to the most-significant word position, producing a wildly wrong value.
This is a correctness fix and should be explicitly called out in the PR description so that reviewers and bisect tooling don't accidentally revert it. It would also be worth adding a targeted unit test for this edge case (e.g. a contract whose bytecode ends mid-PUSH4) to lock in the correct behaviour.
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/vm/levm/src/opcode_handlers/push.rs
Line: 52-56
Comment:
**Silent behavioral change in truncated-bytecode path**
This arm is reached when `data.len() < N` (the PUSH literal extends beyond the end of bytecode). The old code and the new code behave **differently** here, and that difference isn't noted in the PR description.
**Old code:**
```rust
let mut bytes = [0; 32]; // 32-byte buffer
bytes[..data.len()].copy_from_slice(data); // data placed at MSB end
U256::from_big_endian(&bytes)
// → data is treated as the most-significant bytes of a 32-byte word
// → PUSH4 with [0x01, 0x02] remaining → 0x0102_0000…0000 (huge)
```
**New code:**
```rust
let mut bytes = [0u8; N]; // N-byte buffer (e.g. 4 for PUSH4)
bytes[..data.len()].copy_from_slice(data); // data placed at MSB end of the N-byte window
u256_from_big_endian_const(bytes)
// → pads [data | zeros] to 32 bytes at the LSB side
// → PUSH4 with [0x01, 0x02] remaining → 0x01020000 ✓
```
The new behaviour is **correct** per the EVM specification (missing bytes are treated as `0x00` appended to the right of the literal), while the old code was effectively shifting the partial literal to the most-significant word position, producing a wildly wrong value.
This is a correctness fix and should be explicitly called out in the PR description so that reviewers and bisect tooling don't accidentally revert it. It would also be worth adding a targeted unit test for this edge case (e.g. a contract whose bytecode ends mid-`PUSH4`) to lock in the correct behaviour.
How can I resolve this? If you propose a fix, please make it concise.
🤖 Claude Code ReviewNow I have enough context to write a thorough review. Code Review:
|
| Item | Severity | Notes |
|---|---|---|
| Bug fix in partial-PUSH bytes not documented | Medium | Silently fixes incorrect EVM spec behavior; needs a test and PR note |
Missing #[inline] on u256_from_big_endian_const |
Medium | Required for the stated optimization to reliably materialize |
Individual limb stores in Stack::push |
Good | Correct, well-commented, sound LLVM argument |
| Code readability | Improved | Separating value computation from stack.push() is cleaner |
The functional correctness of the partial-bytes case and the #[inline] gap are the two items worth addressing before merge.
Automated review by Claude (Anthropic) · sonnet · custom prompt
🤖 Codex Code Review
I did not find any blocking correctness, security, gas-accounting, or memory-safety issues in the touched code. The manual limb stores in call_frame.rs are behaviorally equivalent to the previous I could not run the Rust tests in this environment because Automated review by OpenAI Codex · gpt-5.4 · custom prompt |
| // Store each limb individually so LLVM treats them as 4 independent i64 scalars. | ||
| // This prevents LLVM from grouping limbs[1..3] into a [24 x i8] alloca that would | ||
| // then need a memset + memcpy round-trip for values with known-zero upper limbs | ||
| // (e.g. PUSH1-PUSH31), allowing it to emit direct zero stores instead. |
There was a problem hiding this comment.
nit: PUSH25-PUSH31 don't have known-zero upper limbs
Benchmark Block Execution Results Comparison Against Main
|
## Summary This PR applies two complementary assembly-level optimizations to the PUSH1–PUSH32 opcode handlers, identified by analyzing the generated aarch64 and x86_64 assembly output. ### Change 1: Use const-generic big-endian conversion (`push.rs`) Replaces `U256::from_big_endian(&data[..N])` (runtime slice) with `u256_from_big_endian_const::<N>(buf)` (const-generic fixed-size array). Because `N` is a compile-time constant at each monomorphized `OpPushHandler<N>`, the compiler can: - Compute the padding offset `32 - N` at compile time - Operate on a fixed-size `[u8; N]` buffer instead of a runtime slice, enabling better autovectorization of the byte copy ### Change 2: Eliminate stack-frame spill in `Stack::push` (`call_frame.rs`) Replaces the previous `copy_nonoverlapping` (and later direct struct assignment) with individual stores for each of the 4 U256 limbs. **Root cause of the spill:** LLVM's SROA pass decomposes a `U256` value into `limb[0]` (scalar i64) and `limbs[1..3]` (a `[24 x i8]` stack alloca). A struct assignment or `copy_nonoverlapping` still uses `memcpy` for the alloca portion, causing a `memset(alloca, 0) + memcpy(alloca → slot)` round-trip. By storing each limb individually, LLVM treats all 4 as independent i64 scalars, proves the upper limbs are zero constants for PUSH1–PUSH31, and eliminates the alloca entirely. **Before** (PUSH1 fast path, 5 memory ops): ```asm str x9, [x11] ; limb[0] ldur q0, [sp, #8] ; reload zeros from frame ← wasted stur q0, [x11, #8] ; limbs[1..2] ldr x9, [sp, #24] ; reload zeros from frame ← wasted str x9, [x11, #24] ; limb[3] ``` **After** (PUSH1 fast path, 2 memory ops, no stack frame): ```asm stp x9, xzr, [x11] ; limb[0] + limb[1] stp xzr, xzr, [x11, #16] ; limb[2] + limb[3] ``` The same spill/reload pattern was confirmed on x86_64 (using `xorps`/`movaps`/`movq` through the stack frame). ## Test plan - [x] `cargo check -p ethrex-levm` - [x] `cargo test -p ethrex-levm --release` - [x] `cargo test -p ethrex --release`
Summary
This PR applies two complementary assembly-level optimizations to the PUSH1–PUSH32 opcode handlers, identified by analyzing the generated aarch64 and x86_64 assembly output.
Change 1: Use const-generic big-endian conversion (
push.rs)Replaces
U256::from_big_endian(&data[..N])(runtime slice) withu256_from_big_endian_const::<N>(buf)(const-generic fixed-size array).Because
Nis a compile-time constant at each monomorphizedOpPushHandler<N>, the compiler can:32 - Nat compile time[u8; N]buffer instead of a runtime slice, enabling better autovectorization of the byte copyChange 2: Eliminate stack-frame spill in
Stack::push(call_frame.rs)Replaces the previous
copy_nonoverlapping(and later direct struct assignment) with individual stores for each of the 4 U256 limbs.Root cause of the spill: LLVM's SROA pass decomposes a
U256value intolimb[0](scalar i64) andlimbs[1..3](a[24 x i8]stack alloca). A struct assignment orcopy_nonoverlappingstill usesmemcpyfor the alloca portion, causing amemset(alloca, 0) + memcpy(alloca → slot)round-trip.By storing each limb individually, LLVM treats all 4 as independent i64 scalars, proves the upper limbs are zero constants for PUSH1–PUSH31, and eliminates the alloca entirely.
Before (PUSH1 fast path, 5 memory ops):
After (PUSH1 fast path, 2 memory ops, no stack frame):
The same spill/reload pattern was confirmed on x86_64 (using
xorps/movaps/movqthrough the stack frame).Test plan
cargo check -p ethrex-levmcargo test -p ethrex-levm --releasecargo test -p ethrex --release