Skip to content

perf(levm): optimize PUSH opcode handlers#6390

Merged
azteca1998 merged 3 commits into
mainfrom
levm-push-opcode-opt
Mar 26, 2026
Merged

perf(levm): optimize PUSH opcode handlers#6390
azteca1998 merged 3 commits into
mainfrom
levm-push-opcode-opt

Conversation

@azteca1998

Copy link
Copy Markdown
Contributor

Summary

This PR applies two complementary assembly-level optimizations to the PUSH1–PUSH32 opcode handlers, identified by analyzing the generated aarch64 and x86_64 assembly output.

Change 1: Use const-generic big-endian conversion (push.rs)

Replaces U256::from_big_endian(&data[..N]) (runtime slice) with u256_from_big_endian_const::<N>(buf) (const-generic fixed-size array).

Because N is a compile-time constant at each monomorphized OpPushHandler<N>, the compiler can:

  • Compute the padding offset 32 - N at compile time
  • Operate on a fixed-size [u8; N] buffer instead of a runtime slice, enabling better autovectorization of the byte copy

Change 2: Eliminate stack-frame spill in Stack::push (call_frame.rs)

Replaces the previous copy_nonoverlapping (and later direct struct assignment) with individual stores for each of the 4 U256 limbs.

Root cause of the spill: LLVM's SROA pass decomposes a U256 value into limb[0] (scalar i64) and limbs[1..3] (a [24 x i8] stack alloca). A struct assignment or copy_nonoverlapping still uses memcpy for the alloca portion, causing a memset(alloca, 0) + memcpy(alloca → slot) round-trip.

By storing each limb individually, LLVM treats all 4 as independent i64 scalars, proves the upper limbs are zero constants for PUSH1–PUSH31, and eliminates the alloca entirely.

Before (PUSH1 fast path, 5 memory ops):

str   x9, [x11]            ; limb[0]
ldur  q0, [sp, #8]         ; reload zeros from frame  ← wasted
stur  q0, [x11, #8]        ; limbs[1..2]
ldr   x9, [sp, #24]        ; reload zeros from frame  ← wasted
str   x9, [x11, #24]       ; limb[3]

After (PUSH1 fast path, 2 memory ops, no stack frame):

stp   x9, xzr, [x11]       ; limb[0] + limb[1]
stp   xzr, xzr, [x11, #16] ; limb[2] + limb[3]

The same spill/reload pattern was confirmed on x86_64 (using xorps/movaps/movq through the stack frame).

Test plan

  • cargo check -p ethrex-levm
  • cargo test -p ethrex-levm --release
  • cargo test -p ethrex --release

Replace U256::from_big_endian (runtime slice) with u256_from_big_endian_const
(const-generic array) so the compiler knows N at monomorphization time,
enabling compile-time offset computation and better auto-vectorization of
the byte copy for every PUSHn variant.
…-limb values

Store each U256 limb individually instead of using a struct assignment or
copy_nonoverlapping. This prevents LLVM from grouping limbs[1..3] into a
[24 x i8] stack alloca that then requires a memset + memcpy round-trip to
reach the EVM stack slot.

After the fix, LLVM sees all 4 limbs as independent i64 scalars and can
prove that the upper limbs are zero constants for PUSH1-PUSH31, reducing
the EVM stack write from 5 memory ops (3 spills + 2 reloads) to 2 stp
instructions using the hardware zero register.

Before (PUSH1 fast path):
  str   x9, [x11]           ; limb[0]
  ldur  q0, [sp, #8]        ; reload zeros from frame
  stur  q0, [x11, #8]       ; limbs[1..2]
  ldr   x9, [sp, #24]       ; reload zeros from frame
  str   x9, [x11, #24]      ; limb[3]

After (PUSH1 fast path):
  stp   x9, xzr, [x11]      ; limb[0] + limb[1]
  stp   xzr, xzr, [x11, #16] ; limb[2] + limb[3]
@github-actions github-actions Bot added levm Lambda EVM implementation performance Block execution throughput and performance in general labels Mar 19, 2026
@github-actions

Copy link
Copy Markdown

🤖 Kimi Code Review

The changes are performance optimizations for EVM stack operations and PUSH opcode handling. No security vulnerabilities or consensus-critical bugs detected.

crates/vm/levm/src/call_frame.rs

Lines 91-101: The switch from ptr::copy_nonoverlapping to individual limb assignments is a valid LLVM optimization. The comment correctly explains the rationale—avoiding alloca coalescing that prevents zero-store elision for small PUSH values.

Safety: The unsafe block remains sound. The next_offset check at line 90 ensures the index is in bounds, and get_unchecked_mut is used appropriately.

Suggestion: Add a debug assertion documenting the U256 layout assumption (4 × u64 limbs), since the code now depends on the specific internal representation:

debug_assert_eq!(std::mem::size_of::<U256>(), 32);
debug_assert_eq!(U64_PER_U256, 4);

crates/vm/levm/src/opcode_handlers/push.rs

Lines 44-60: The refactor to use u256_from_big_endian_const is correct. The padding logic for partial bytecodes (when data.len() < N) properly places available bytes at the most significant positions (big-endian), leaving zeros in the least significant bytes—matching EVM consensus rules.

Note: The intermediate stack arrays ([0u8; N]) introduce a copy that wasn't present in the slice-based U256::from_big_endian call, but this is necessary for the const-generic API. Since N ≤ 32, the stack pressure is negligible.

Style: Consider consistent variable naming—buf in the first arm vs bytes in the second.

Import: Verify that u256_from_big_endian_const is marked #[inline] in ethrex_common to ensure the const-generic optimization actually propagates through the call boundary.

Overall: ACK. The changes are idiomatic Rust optimizations that preserve EVM semantics.


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@github-actions

Copy link
Copy Markdown

Lines of code report

Total lines added: 0
Total lines removed: 1
Total lines changed: 1

Detailed view
+---------------------------------------------------+-------+------+
| File                                              | Lines | Diff |
+---------------------------------------------------+-------+------+
| ethrex/crates/vm/levm/src/opcode_handlers/push.rs | 48    | -1   |
+---------------------------------------------------+-------+------+

@github-actions

Copy link
Copy Markdown

🤖 Claude Code Review

Here is my review of PR 6390:


PR Review: perf(levm): optimize PUSH opcode handlers

Overall Assessment

Well-reasoned performance optimization with solid assembly-level justification. The changes are small, targeted, and the PR description is thorough. One correctness issue is worth calling out explicitly.


Correctness: Hidden Bug Fix in Partial-Data Case

The push.rs change silently fixes a real correctness bug in the Some(data) if data.len() < N arm. This is worth highlighting because it's not mentioned in the PR description.

Old behavior (push.rs, pre-patch Some(data) arm):

let mut bytes = [0; 32];
bytes[..data.len()].copy_from_slice(data);
U256::from_big_endian(&bytes)   // 32-byte buffer, data at MSB side

For PUSH4 with 2 remaining bytes [0xAB, 0xCD]:

  • bytes = [0xAB, 0xCD, 0x00, ..., 0x00] (32 bytes)
  • U256::from_big_endian gives 0xABCD * 2^240 — completely wrong

New behavior (correct EVM semantics — missing bytes are zero-padded on the LSB side):

let mut bytes = [0u8; N];       // N-byte buffer
bytes[..data.len()].copy_from_slice(data);
u256_from_big_endian_const(bytes) // pads to 32 by prepending zeros

For the same PUSH4 with [0xAB, 0xCD]:

  • bytes = [0xAB, 0xCD, 0x00, 0x00] (4 bytes)
  • After padding in u256_from_big_endian_const: [0x00×28, 0xAB, 0xCD, 0x00, 0x00]
  • Result: 0xABCD0000 — correct per Yellow Paper (zero-pad trailing)

This is an edge case (PUSH at the very end of bytecode), but it's spec-defined behavior. Consider documenting this fix explicitly in the PR description and/or as a comment.


call_frame.rs — Limb-by-Limb Store

// call_frame.rs:97-101
let slot = self.values.get_unchecked_mut(next_offset);
slot.0[0] = value.0[0];
slot.0[1] = value.0[1];
slot.0[2] = value.0[2];
slot.0[3] = value.0[3];
  • Semantically identical to the prior copy_nonoverlapping (4 × u64 = 32 bytes either way).
  • The unsafe justification is unchanged and remains valid — next_offset is already bounds-checked by checked_sub.
  • The assembly improvement described in the PR (2 stores vs. 5 for PUSH1) is plausible and consistent with LLVM SROA behavior.
  • No issues here.

push.rs — Normal Case Intermediate Copy

// push.rs:48-50
let mut buf = [0u8; N];
buf.copy_from_slice(&data[..N]);
u256_from_big_endian_const(buf)

The intermediate buf copy is intentional — it gives LLVM a fixed-size [u8; N] array rather than a slice, enabling compile-time constant folding of 32 - N and better autovectorization. No issues. The #[expect(clippy::indexing_slicing)] suppression is correctly placed on the guarded arm.


Minor Nits

  • u256_from_big_endian_const in crates/common/utils.rs:27 has no #[inline] attribute. Since OpPushHandler<N>::eval is #[inline(always)] and this is a small, monomorphized function, the compiler will almost certainly inline it — but an explicit #[inline] would be defensive and consistent with the hot-path intent.

  • The #[expect(clippy::indexing_slicing)] attribute moved from wrapping the whole stack.push(...) call to just the inner match arm. This is strictly more precise. No issue, minor style improvement.


Summary

The optimization is correct and well-motivated. The assembly walkthrough in the description is exactly the right level of justification for unsafe and layout-aware code. The main recommendation is to explicitly document the partial-data correctness fix — it is a behavioral change in a spec-defined code path, and reviewers/bisectors will want to know.


Automated review by Claude (Anthropic) · sonnet · custom prompt

@github-actions

Copy link
Copy Markdown

🤖 Codex Code Review

Findings

  • No correctness, security, or gas-accounting issues stood out in this diff. The stack write change in call_frame.rs is behavior-preserving, and the PUSHn refactor in push.rs still matches EVM semantics for both full and truncated immediates.

Residual Risk

  • I would still add a regression test for truncated PUSHn bytecode at the decode path in push.rs. That case is consensus-sensitive, and this PR changes the helper used there. A simple case like PUSH2 0x12 at EOF should assert the pushed value is 0x1200, not 0x12 or a left-shifted 256-bit artifact.

Verification

  • I could not run cargo test in this environment because rustup attempted to create temp files under a read-only filesystem.

Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@github-actions

github-actions Bot commented Mar 19, 2026

Copy link
Copy Markdown

Benchmark Results Comparison

No significant difference was registered for any benchmark run.

Detailed Results

Benchmark Results: BubbleSort

Command Mean [s] Min [s] Max [s] Relative
main_revm_BubbleSort 2.963 ± 0.027 2.939 3.033 1.10 ± 0.01
main_levm_BubbleSort 2.718 ± 0.052 2.681 2.853 1.01 ± 0.02
pr_revm_BubbleSort 2.972 ± 0.022 2.947 3.028 1.10 ± 0.01
pr_levm_BubbleSort 2.703 ± 0.019 2.681 2.745 1.00

Benchmark Results: ERC20Approval

Command Mean [ms] Min [ms] Max [ms] Relative
main_revm_ERC20Approval 981.8 ± 13.9 970.4 1017.9 1.01 ± 0.01
main_levm_ERC20Approval 1033.6 ± 16.8 1021.6 1078.4 1.06 ± 0.02
pr_revm_ERC20Approval 972.0 ± 2.6 969.1 976.3 1.00
pr_levm_ERC20Approval 1023.7 ± 3.9 1016.5 1028.6 1.05 ± 0.00

Benchmark Results: ERC20Mint

Command Mean [ms] Min [ms] Max [ms] Relative
main_revm_ERC20Mint 131.2 ± 0.7 130.1 132.4 1.00 ± 0.01
main_levm_ERC20Mint 149.0 ± 0.6 148.4 150.1 1.14 ± 0.01
pr_revm_ERC20Mint 131.0 ± 0.6 129.9 131.8 1.00
pr_levm_ERC20Mint 149.4 ± 0.8 148.2 151.1 1.14 ± 0.01

Benchmark Results: ERC20Transfer

Command Mean [ms] Min [ms] Max [ms] Relative
main_revm_ERC20Transfer 230.6 ± 1.0 229.3 232.2 1.00
main_levm_ERC20Transfer 253.7 ± 2.0 251.8 257.8 1.10 ± 0.01
pr_revm_ERC20Transfer 231.2 ± 2.0 229.1 234.5 1.00 ± 0.01
pr_levm_ERC20Transfer 254.8 ± 1.7 252.6 258.1 1.10 ± 0.01

Benchmark Results: Factorial

Command Mean [ms] Min [ms] Max [ms] Relative
main_revm_Factorial 224.6 ± 3.3 222.8 233.8 1.00 ± 0.02
main_levm_Factorial 267.0 ± 3.0 263.1 272.3 1.19 ± 0.01
pr_revm_Factorial 223.7 ± 0.7 222.6 225.0 1.00
pr_levm_Factorial 259.7 ± 1.2 258.2 261.5 1.16 ± 0.01

Benchmark Results: FactorialRecursive

Command Mean [s] Min [s] Max [s] Relative
main_revm_FactorialRecursive 1.574 ± 0.044 1.471 1.620 1.00
main_levm_FactorialRecursive 1.646 ± 0.011 1.632 1.663 1.05 ± 0.03
pr_revm_FactorialRecursive 1.607 ± 0.045 1.546 1.668 1.02 ± 0.04
pr_levm_FactorialRecursive 1.594 ± 0.008 1.586 1.608 1.01 ± 0.03

Benchmark Results: Fibonacci

Command Mean [ms] Min [ms] Max [ms] Relative
main_revm_Fibonacci 203.9 ± 2.3 199.0 208.1 1.00 ± 0.01
main_levm_Fibonacci 230.2 ± 2.3 227.9 235.6 1.13 ± 0.01
pr_revm_Fibonacci 203.5 ± 1.0 201.5 204.9 1.00
pr_levm_Fibonacci 232.7 ± 19.3 225.1 287.3 1.14 ± 0.10

Benchmark Results: FibonacciRecursive

Command Mean [ms] Min [ms] Max [ms] Relative
main_revm_FibonacciRecursive 835.2 ± 7.5 818.5 842.5 1.18 ± 0.05
main_levm_FibonacciRecursive 708.3 ± 30.0 690.8 792.5 1.00
pr_revm_FibonacciRecursive 833.9 ± 11.6 820.6 852.7 1.18 ± 0.05
pr_levm_FibonacciRecursive 717.2 ± 31.6 691.7 805.1 1.01 ± 0.06

Benchmark Results: ManyHashes

Command Mean [ms] Min [ms] Max [ms] Relative
main_revm_ManyHashes 8.4 ± 0.1 8.3 8.7 1.01 ± 0.02
main_levm_ManyHashes 9.9 ± 0.0 9.8 9.9 1.17 ± 0.01
pr_revm_ManyHashes 8.4 ± 0.1 8.3 8.5 1.00
pr_levm_ManyHashes 9.9 ± 0.1 9.8 10.0 1.18 ± 0.01

Benchmark Results: MstoreBench

Command Mean [ms] Min [ms] Max [ms] Relative
main_revm_MstoreBench 264.5 ± 3.9 260.9 274.6 1.13 ± 0.02
main_levm_MstoreBench 233.6 ± 0.9 232.7 235.8 1.00
pr_revm_MstoreBench 268.3 ± 18.8 260.6 321.7 1.15 ± 0.08
pr_levm_MstoreBench 234.5 ± 1.0 232.6 235.8 1.00 ± 0.01

Benchmark Results: Push

Command Mean [ms] Min [ms] Max [ms] Relative
main_revm_Push 287.4 ± 1.4 285.4 290.6 1.03 ± 0.01
main_levm_Push 279.0 ± 3.8 276.1 287.3 1.00 ± 0.02
pr_revm_Push 287.7 ± 3.4 283.2 293.6 1.04 ± 0.02
pr_levm_Push 277.9 ± 2.7 275.2 283.9 1.00

Benchmark Results: SstoreBench_no_opt

Command Mean [ms] Min [ms] Max [ms] Relative
main_revm_SstoreBench_no_opt 167.9 ± 2.0 164.4 171.3 1.69 ± 0.02
main_levm_SstoreBench_no_opt 99.4 ± 0.3 99.0 99.8 1.00
pr_revm_SstoreBench_no_opt 164.6 ± 3.9 160.6 169.3 1.66 ± 0.04
pr_levm_SstoreBench_no_opt 99.6 ± 0.8 98.9 101.0 1.00 ± 0.01

@azteca1998 azteca1998 marked this pull request as ready for review March 19, 2026 21:20
@azteca1998 azteca1998 requested a review from a team as a code owner March 19, 2026 21:20
@github-actions

Copy link
Copy Markdown

🤖 Kimi Code Review

The changes look correct and represent legitimate micro-optimizations for EVM execution. A few specific observations:

crates/vm/levm/src/call_frame.rs (lines 91-102)

  • The manual limb assignment is a valid optimization to help LLVM scalarize the U256 store, avoiding memset/memcpy overhead for small PUSH values. The unsafe block is properly justified and bounded.
  • Ensure that ethrex_common::U256 guarantees a stable internal representation with field 0 as [u64; 4] (standard for primitive-types U256, but worth confirming this crate doesn't use a wrapper that hides the field).

crates/vm/levm/src/opcode_handlers/push.rs (lines 10, 44-60)

  • The switch to u256_from_big_endian_const correctly handles the big-endian byte ordering required by PUSH opcodes.
  • The padding logic for truncated bytecode (lines 54-58) correctly zero-fills the remaining bytes when data.len() < N, maintaining consensus compatibility with out-of-bounds reads returning 0.
  • Minor nit: In the first match arm (lines 48-51), the array initialization [0u8; N] is technically redundant since you immediately overwrite all N bytes with copy_from_slice. You could use unsafe { *(data[..N].as_ptr() as *const [u8; N]) } or data[..N].try_into().unwrap() since the length is already verified, but the current approach is safe and clear.

Consensus & Security

  • No consensus issues detected. The behavior for PUSH operations with insufficient bytecode remains correct (zero-padding).
  • No new panic vectors introduced; all indexing is guarded by length checks or uses const-generic arrays.

Suggestion
Consider adding a debug assertion or comment in push.rs confirming that N is within 1..=32 (since PUSH0 is handled separately), though the const-generic bounds likely enforce this at compile time.

Overall: LGTM - These are well-documented, targeted optimizations that maintain correctness.


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@greptile-apps

greptile-apps Bot commented Mar 19, 2026

Copy link
Copy Markdown

Greptile Summary

This PR applies two assembly-level micro-optimizations to the PUSH1–PUSH32 opcode handlers in LEVM: switching to a const-generic u256_from_big_endian_const so LLVM can operate on fixed-size [u8; N] arrays instead of runtime slices, and replacing copy_nonoverlapping with four explicit limb assignments in Stack::push so LLVM treats all limbs as independent i64 scalars and eliminates the alloca spill/reload pattern for values with known-zero upper limbs. Both changes are well-motivated by the generated assembly analysis in the PR description and are semantically correct for the common code path.

Key observations:

  • The call_frame.rs change is a pure mechanical substitution — copy_nonoverlapping of 4 u64s is exactly equivalent to four individual assignments, and the safety invariant (get_unchecked_mut is guarded by the prior checked_sub on the offset) is unchanged.
  • The push.rs change also silently fixes a correctness bug in the truncated-bytecode edge case (Some(data) arm where data.len() < N). The old code placed partial literal bytes at the most-significant end of a 32-byte buffer, producing a wildly inflated value. The new code correctly zero-pads on the right within the N-byte window, matching EVM spec behaviour (missing bytes treated as 0x00 appended to the right of the literal). This fix is not documented in the PR description and has no dedicated test.
  • u256_from_big_endian_const is a const-generic function, so Rust monomorphizes it per calling crate — the lack of an explicit #[inline] attribute is not a concern here.

Confidence Score: 4/5

  • Safe to merge; the hot-path is functionally identical to before and the one behavioral change is a correctness improvement per EVM spec.
  • Both changed files are small and well-reasoned. The call_frame.rs change is a straightforward mechanical substitution with no semantic difference. The push.rs change is correct and also incidentally fixes a long-standing edge-case bug (truncated PUSH literal). The only risk is the undocumented behavioral change for truncated bytecodes, but the new behavior is more spec-compliant and existing tests pass. A dedicated test for the truncated-bytecode path would increase confidence to 5.
  • crates/vm/levm/src/opcode_handlers/push.rs — contains an undocumented correctness fix in the Some(data) arm that should be documented and tested.

Important Files Changed

Filename Overview
crates/vm/levm/src/opcode_handlers/push.rs Switches to const-generic u256_from_big_endian_const for better LLVM codegen. Also silently fixes a correctness bug in the truncated-bytecode edge case (partial PUSH literal at end-of-code now zero-pads correctly on the right instead of the left), but this fix is undocumented in the PR description.
crates/vm/levm/src/call_frame.rs Replaces copy_nonoverlapping with explicit per-limb assignments in Stack::push. Semantically equivalent; helps LLVM treat the four u64 limbs as independent scalars, eliminating the alloca spill/reload round-trip for PUSH1–PUSH31 where upper limbs are constant zeros.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["OpPushHandler&lt;N&gt;::eval"] --> B["Read literal_offset from bytecode"]
    B --> C{bytecode.get\nliteral_offset..}
    C -->|"Some(data)\ndata.len() >= N"| D["buf = [0u8; N]\nbuf.copy_from_slice(&data[..N])\nu256_from_big_endian_const(buf)"]
    C -->|"Some(data)\ndata.len() < N\n(truncated bytecode)"| E["bytes = [0u8; N]\nbytes[..data.len()].copy_from_slice(data)\nu256_from_big_endian_const(bytes)\n✅ correct: zero-pad on right"]
    C -->|None| F["U256::zero()"]
    D --> G["Stack::push(value)"]
    E --> G
    F --> G
    G --> H["get_unchecked_mut(next_offset)"]
    H --> I["slot.0[0] = value.0[0]\nslot.0[1] = value.0[1]\nslot.0[2] = value.0[2]\nslot.0[3] = value.0[3]"]
    I --> J["self.offset = next_offset"]
    J --> K["OpcodeResult::Continue"]

    style E fill:#d4edda,stroke:#28a745
    style D fill:#cfe2ff,stroke:#0d6efd
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: crates/vm/levm/src/opcode_handlers/push.rs
Line: 52-56

Comment:
**Silent behavioral change in truncated-bytecode path**

This arm is reached when `data.len() < N` (the PUSH literal extends beyond the end of bytecode). The old code and the new code behave **differently** here, and that difference isn't noted in the PR description.

**Old code:**
```rust
let mut bytes = [0; 32];          // 32-byte buffer
bytes[..data.len()].copy_from_slice(data);  // data placed at MSB end
U256::from_big_endian(&bytes)
// → data is treated as the most-significant bytes of a 32-byte word
// → PUSH4 with [0x01, 0x02] remaining → 0x0102_0000…0000 (huge)
```

**New code:**
```rust
let mut bytes = [0u8; N];         // N-byte buffer  (e.g. 4 for PUSH4)
bytes[..data.len()].copy_from_slice(data);  // data placed at MSB end of the N-byte window
u256_from_big_endian_const(bytes)
// → pads [data | zeros] to 32 bytes at the LSB side
// → PUSH4 with [0x01, 0x02] remaining → 0x01020000 ✓
```

The new behaviour is **correct** per the EVM specification (missing bytes are treated as `0x00` appended to the right of the literal), while the old code was effectively shifting the partial literal to the most-significant word position, producing a wildly wrong value.

This is a correctness fix and should be explicitly called out in the PR description so that reviewers and bisect tooling don't accidentally revert it. It would also be worth adding a targeted unit test for this edge case (e.g. a contract whose bytecode ends mid-`PUSH4`) to lock in the correct behaviour.

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: "perf(levm): eliminat..."

Comment on lines +52 to +56
Some(data) => {
let mut bytes = [0u8; N];
bytes[..data.len()].copy_from_slice(data);
u256_from_big_endian_const(bytes)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Silent behavioral change in truncated-bytecode path

This arm is reached when data.len() < N (the PUSH literal extends beyond the end of bytecode). The old code and the new code behave differently here, and that difference isn't noted in the PR description.

Old code:

let mut bytes = [0; 32];          // 32-byte buffer
bytes[..data.len()].copy_from_slice(data);  // data placed at MSB end
U256::from_big_endian(&bytes)
// → data is treated as the most-significant bytes of a 32-byte word
// → PUSH4 with [0x01, 0x02] remaining → 0x0102_0000…0000 (huge)

New code:

let mut bytes = [0u8; N];         // N-byte buffer  (e.g. 4 for PUSH4)
bytes[..data.len()].copy_from_slice(data);  // data placed at MSB end of the N-byte window
u256_from_big_endian_const(bytes)
// → pads [data | zeros] to 32 bytes at the LSB side
// → PUSH4 with [0x01, 0x02] remaining → 0x01020000 ✓

The new behaviour is correct per the EVM specification (missing bytes are treated as 0x00 appended to the right of the literal), while the old code was effectively shifting the partial literal to the most-significant word position, producing a wildly wrong value.

This is a correctness fix and should be explicitly called out in the PR description so that reviewers and bisect tooling don't accidentally revert it. It would also be worth adding a targeted unit test for this edge case (e.g. a contract whose bytecode ends mid-PUSH4) to lock in the correct behaviour.

Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/vm/levm/src/opcode_handlers/push.rs
Line: 52-56

Comment:
**Silent behavioral change in truncated-bytecode path**

This arm is reached when `data.len() < N` (the PUSH literal extends beyond the end of bytecode). The old code and the new code behave **differently** here, and that difference isn't noted in the PR description.

**Old code:**
```rust
let mut bytes = [0; 32];          // 32-byte buffer
bytes[..data.len()].copy_from_slice(data);  // data placed at MSB end
U256::from_big_endian(&bytes)
// → data is treated as the most-significant bytes of a 32-byte word
// → PUSH4 with [0x01, 0x02] remaining → 0x0102_0000…0000 (huge)
```

**New code:**
```rust
let mut bytes = [0u8; N];         // N-byte buffer  (e.g. 4 for PUSH4)
bytes[..data.len()].copy_from_slice(data);  // data placed at MSB end of the N-byte window
u256_from_big_endian_const(bytes)
// → pads [data | zeros] to 32 bytes at the LSB side
// → PUSH4 with [0x01, 0x02] remaining → 0x01020000 ✓
```

The new behaviour is **correct** per the EVM specification (missing bytes are treated as `0x00` appended to the right of the literal), while the old code was effectively shifting the partial literal to the most-significant word position, producing a wildly wrong value.

This is a correctness fix and should be explicitly called out in the PR description so that reviewers and bisect tooling don't accidentally revert it. It would also be worth adding a targeted unit test for this edge case (e.g. a contract whose bytecode ends mid-`PUSH4`) to lock in the correct behaviour.

How can I resolve this? If you propose a fix, please make it concise.

@github-actions

Copy link
Copy Markdown

🤖 Claude Code Review

Now I have enough context to write a thorough review.


Code Review: perf(levm): optimize PUSH opcode handlers

Overview

This PR applies two assembly-level optimizations to PUSH opcode handlers. The changes are small but the correctness analysis reveals an important detail beyond the stated performance goal.


push.rs — Change 1: const-generic big-endian conversion

Silent bug fix in the partial-bytes branch

The most significant change here is correctness, not performance. The Some(data) arm (partial bytecode at end-of-code) was previously:

// OLD — INCORRECT per EVM spec
let mut bytes = [0; 32];
bytes[..data.len()].copy_from_slice(data);
U256::from_big_endian(&bytes)

For PUSH4 with 2 remaining bytes [0xAB, 0xCD], this produced:

bytes = [0xAB, 0xCD, 0x00, ..., 0x00]  ← 32 bytes, data at MSB end
from_big_endian → 0xABCD * 2^240       ← WRONG

The EVM spec requires zero-padding on the right (LSB side), producing 0xABCD_0000. The new code:

// NEW — CORRECT
let mut bytes = [0u8; N];              // N=4 → 4 bytes
bytes[..data.len()].copy_from_slice(data); // [0xAB, 0xCD, 0x00, 0x00]
u256_from_big_endian_const(bytes)      // → 0xABCD_0000 ✓

This bug fix should be explicitly documented in the PR description. It affects any PUSH opcode that spans the end of a contract's bytecode (an edge case, but valid per spec). Tests covering this scenario are warranted.


Missing #[inline] on u256_from_big_endian_const

The claimed optimization — LLVM seeing zero upper limbs for PUSH1–PUSH31 and eliding stores — only works if the function body is inlined into the PUSH handler at the LLVM IR level. Without #[inline], cross-crate calls may not be inlined even with monomorphization, leaving the intermediate 32-byte padded alloca in place:

// crates/common/utils.rs:27 — missing #[inline]
pub fn u256_from_big_endian_const<const N: usize>(slice: [u8; N]) -> U256 {
    let mut padded = [0u8; 32];  // ← this alloca is what we're trying to eliminate
    padded[32 - N..32].copy_from_slice(&slice);
    ...
}

Adding #[inline] (or #[inline(always)], consistent with u256_to_big_endian at line 46 of the same file) is needed to guarantee the optimization materializes in non-LTO builds.


Redundant copy in the happy path

The normal case (data.len() >= N) now performs two copies:

  • data[..N] → buf (in push.rs:49)
  • buf → padded[32-N..] (inside u256_from_big_endian_const)

With inlining these should fold into one, but it's worth confirming with cargo asm that the optimization fires as intended.


call_frame.rs — Change 2: eliminate stack-frame spill in Stack::push

// Before
std::ptr::copy_nonoverlapping(value.0.as_ptr(), slot.as_mut_ptr(), U64_PER_U256);

// After
slot.0[0] = value.0[0];
slot.0[1] = value.0[1];
slot.0[2] = value.0[2];
slot.0[3] = value.0[3];

The change is correct and functionally equivalent. The LLVM SROA argument is sound: a copy_nonoverlapping of 4 × u64 generates a memcpy IR node that LLVM's SROA pass may bundle with the U256 alloca into a [24 x i8] aggregate, preventing scalar reasoning about individual limbs. Individual stores let LLVM track each limb as an independent i64 scalar.

Minor: the comment is clear and thorough. One note — this is an LLVM implementation detail that could theoretically regress across compiler versions. A brief mention that this was validated against a specific rustc/LLVM version (or a reference to the benchmark/asm output) would help future maintainers assess whether the optimization is still firing.

Safety: Within the unsafe block, slot.0[0..3] are compile-time constant indices into a [u64; 4] array; bounds checks are elided by the compiler. No new unsafety is introduced.


Summary

Item Severity Notes
Bug fix in partial-PUSH bytes not documented Medium Silently fixes incorrect EVM spec behavior; needs a test and PR note
Missing #[inline] on u256_from_big_endian_const Medium Required for the stated optimization to reliably materialize
Individual limb stores in Stack::push Good Correct, well-commented, sound LLVM argument
Code readability Improved Separating value computation from stack.push() is cleaner

The functional correctness of the partial-bytes case and the #[inline] gap are the two items worth addressing before merge.


Automated review by Claude (Anthropic) · sonnet · custom prompt

@github-actions

Copy link
Copy Markdown

🤖 Codex Code Review

  1. Non-blocking: push.rs changes a consensus-sensitive path, but I could not find a regression test for truncated PUSHn immediates at EOF. Please add cases like PUSH2 0x01 <EOF> => 0x0100 and PUSH32 0x01 <EOF> => 0x01 << 248 to lock in the padding semantics.

I did not find any blocking correctness, security, gas-accounting, or memory-safety issues in the touched code. The manual limb stores in call_frame.rs are behaviorally equivalent to the previous copy_nonoverlapping, and the new u256_from_big_endian_const path in push.rs preserves the existing big-endian decoding for both full and short immediates.

I could not run the Rust tests in this environment because cargo needs to write under ~/.cargo/~/.rustup and fetch dependencies, which is blocked here.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

// Store each limb individually so LLVM treats them as 4 independent i64 scalars.
// This prevents LLVM from grouping limbs[1..3] into a [24 x i8] alloca that would
// then need a memset + memcpy round-trip for values with known-zero upper limbs
// (e.g. PUSH1-PUSH31), allowing it to emit direct zero stores instead.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: PUSH25-PUSH31 don't have known-zero upper limbs

@github-project-automation github-project-automation Bot moved this to In Review in ethrex_l1 Mar 24, 2026
@azteca1998 azteca1998 enabled auto-merge March 26, 2026 11:39
@github-actions

Copy link
Copy Markdown

Benchmark Block Execution Results Comparison Against Main

Command Mean [s] Min [s] Max [s] Relative
base 61.665 ± 0.214 61.513 62.226 1.01 ± 0.01
head 61.043 ± 0.235 60.779 61.617 1.00

@azteca1998 azteca1998 added this pull request to the merge queue Mar 26, 2026
Merged via the queue into main with commit 3d673ac Mar 26, 2026
83 of 85 checks passed
@azteca1998 azteca1998 deleted the levm-push-opcode-opt branch March 26, 2026 13:29
@github-project-automation github-project-automation Bot moved this from Todo to Done in ethrex_performance Mar 26, 2026
@github-project-automation github-project-automation Bot moved this from In Review to Done in ethrex_l1 Mar 26, 2026
edg-l pushed a commit that referenced this pull request Mar 26, 2026
## Summary

This PR applies two complementary assembly-level optimizations to the
PUSH1–PUSH32 opcode handlers, identified by analyzing the generated
aarch64 and x86_64 assembly output.

### Change 1: Use const-generic big-endian conversion (`push.rs`)

Replaces `U256::from_big_endian(&data[..N])` (runtime slice) with
`u256_from_big_endian_const::<N>(buf)` (const-generic fixed-size array).

Because `N` is a compile-time constant at each monomorphized
`OpPushHandler<N>`, the compiler can:
- Compute the padding offset `32 - N` at compile time
- Operate on a fixed-size `[u8; N]` buffer instead of a runtime slice,
enabling better autovectorization of the byte copy

### Change 2: Eliminate stack-frame spill in `Stack::push`
(`call_frame.rs`)

Replaces the previous `copy_nonoverlapping` (and later direct struct
assignment) with individual stores for each of the 4 U256 limbs.

**Root cause of the spill:** LLVM's SROA pass decomposes a `U256` value
into `limb[0]` (scalar i64) and `limbs[1..3]` (a `[24 x i8]` stack
alloca). A struct assignment or `copy_nonoverlapping` still uses
`memcpy` for the alloca portion, causing a `memset(alloca, 0) +
memcpy(alloca → slot)` round-trip.

By storing each limb individually, LLVM treats all 4 as independent i64
scalars, proves the upper limbs are zero constants for PUSH1–PUSH31, and
eliminates the alloca entirely.

**Before** (PUSH1 fast path, 5 memory ops):
```asm
str   x9, [x11]            ; limb[0]
ldur  q0, [sp, #8]         ; reload zeros from frame  ← wasted
stur  q0, [x11, #8]        ; limbs[1..2]
ldr   x9, [sp, #24]        ; reload zeros from frame  ← wasted
str   x9, [x11, #24]       ; limb[3]
```

**After** (PUSH1 fast path, 2 memory ops, no stack frame):
```asm
stp   x9, xzr, [x11]       ; limb[0] + limb[1]
stp   xzr, xzr, [x11, #16] ; limb[2] + limb[3]
```

The same spill/reload pattern was confirmed on x86_64 (using
`xorps`/`movaps`/`movq` through the stack frame).

## Test plan

- [x] `cargo check -p ethrex-levm`
- [x] `cargo test -p ethrex-levm --release`
- [x] `cargo test -p ethrex --release`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

levm Lambda EVM implementation performance Block execution throughput and performance in general

Projects

Status: Done
Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants