perf(levm): optimize PUSH opcode handlers by azteca1998 · Pull Request #6390 · lambdaclass/ethrex

azteca1998 · 2026-03-19T18:08:31Z

Summary

This PR applies two complementary assembly-level optimizations to the PUSH1–PUSH32 opcode handlers, identified by analyzing the generated aarch64 and x86_64 assembly output.

Change 1: Use const-generic big-endian conversion (`push.rs`)

Replaces U256::from_big_endian(&data[..N]) (runtime slice) with u256_from_big_endian_const::<N>(buf) (const-generic fixed-size array).

Because N is a compile-time constant at each monomorphized OpPushHandler<N>, the compiler can:

Compute the padding offset 32 - N at compile time
Operate on a fixed-size [u8; N] buffer instead of a runtime slice, enabling better autovectorization of the byte copy

Change 2: Eliminate stack-frame spill in `Stack::push` (`call_frame.rs`)

Replaces the previous copy_nonoverlapping (and later direct struct assignment) with individual stores for each of the 4 U256 limbs.

Root cause of the spill: LLVM's SROA pass decomposes a U256 value into limb[0] (scalar i64) and limbs[1..3] (a [24 x i8] stack alloca). A struct assignment or copy_nonoverlapping still uses memcpy for the alloca portion, causing a memset(alloca, 0) + memcpy(alloca → slot) round-trip.

By storing each limb individually, LLVM treats all 4 as independent i64 scalars, proves the upper limbs are zero constants for PUSH1–PUSH31, and eliminates the alloca entirely.

Before (PUSH1 fast path, 5 memory ops):

str   x9, [x11]            ; limb[0]
ldur  q0, [sp, #8]         ; reload zeros from frame  ← wasted
stur  q0, [x11, #8]        ; limbs[1..2]
ldr   x9, [sp, #24]        ; reload zeros from frame  ← wasted
str   x9, [x11, #24]       ; limb[3]

After (PUSH1 fast path, 2 memory ops, no stack frame):

stp   x9, xzr, [x11]       ; limb[0] + limb[1]
stp   xzr, xzr, [x11, #16] ; limb[2] + limb[3]

The same spill/reload pattern was confirmed on x86_64 (using xorps/movaps/movq through the stack frame).

Test plan

cargo check -p ethrex-levm
cargo test -p ethrex-levm --release
cargo test -p ethrex --release

Replace U256::from_big_endian (runtime slice) with u256_from_big_endian_const (const-generic array) so the compiler knows N at monomorphization time, enabling compile-time offset computation and better auto-vectorization of the byte copy for every PUSHn variant.

…-limb values Store each U256 limb individually instead of using a struct assignment or copy_nonoverlapping. This prevents LLVM from grouping limbs[1..3] into a [24 x i8] stack alloca that then requires a memset + memcpy round-trip to reach the EVM stack slot. After the fix, LLVM sees all 4 limbs as independent i64 scalars and can prove that the upper limbs are zero constants for PUSH1-PUSH31, reducing the EVM stack write from 5 memory ops (3 spills + 2 reloads) to 2 stp instructions using the hardware zero register. Before (PUSH1 fast path): str x9, [x11] ; limb[0] ldur q0, [sp, #8] ; reload zeros from frame stur q0, [x11, #8] ; limbs[1..2] ldr x9, [sp, #24] ; reload zeros from frame str x9, [x11, #24] ; limb[3] After (PUSH1 fast path): stp x9, xzr, [x11] ; limb[0] + limb[1] stp xzr, xzr, [x11, #16] ; limb[2] + limb[3]

github-actions · 2026-03-19T18:09:42Z

🤖 Kimi Code Review

The changes are performance optimizations for EVM stack operations and PUSH opcode handling. No security vulnerabilities or consensus-critical bugs detected.

`crates/vm/levm/src/call_frame.rs`

Lines 91-101: The switch from ptr::copy_nonoverlapping to individual limb assignments is a valid LLVM optimization. The comment correctly explains the rationale—avoiding alloca coalescing that prevents zero-store elision for small PUSH values.

Safety: The unsafe block remains sound. The next_offset check at line 90 ensures the index is in bounds, and get_unchecked_mut is used appropriately.

Suggestion: Add a debug assertion documenting the U256 layout assumption (4 × u64 limbs), since the code now depends on the specific internal representation:

debug_assert_eq!(std::mem::size_of::<U256>(), 32);
debug_assert_eq!(U64_PER_U256, 4);

`crates/vm/levm/src/opcode_handlers/push.rs`

Lines 44-60: The refactor to use u256_from_big_endian_const is correct. The padding logic for partial bytecodes (when data.len() < N) properly places available bytes at the most significant positions (big-endian), leaving zeros in the least significant bytes—matching EVM consensus rules.

Note: The intermediate stack arrays ([0u8; N]) introduce a copy that wasn't present in the slice-based U256::from_big_endian call, but this is necessary for the const-generic API. Since N ≤ 32, the stack pressure is negligible.

Style: Consider consistent variable naming—buf in the first arm vs bytes in the second.

Import: Verify that u256_from_big_endian_const is marked #[inline] in ethrex_common to ensure the const-generic optimization actually propagates through the call boundary.

Overall: ACK. The changes are idiomatic Rust optimizations that preserve EVM semantics.

Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

github-actions · 2026-03-19T18:11:21Z

Lines of code report

Total lines added: 0
Total lines removed: 1
Total lines changed: 1

Detailed view

+---------------------------------------------------+-------+------+
| File                                              | Lines | Diff |
+---------------------------------------------------+-------+------+
| ethrex/crates/vm/levm/src/opcode_handlers/push.rs | 48    | -1   |
+---------------------------------------------------+-------+------+

github-actions · 2026-03-19T18:11:51Z

🤖 Claude Code Review

Here is my review of PR 6390:

PR Review: `perf(levm): optimize PUSH opcode handlers`

Overall Assessment

Well-reasoned performance optimization with solid assembly-level justification. The changes are small, targeted, and the PR description is thorough. One correctness issue is worth calling out explicitly.

Correctness: Hidden Bug Fix in Partial-Data Case

The push.rs change silently fixes a real correctness bug in the Some(data) if data.len() < N arm. This is worth highlighting because it's not mentioned in the PR description.

Old behavior (push.rs, pre-patch Some(data) arm):

let mut bytes = [0; 32];
bytes[..data.len()].copy_from_slice(data);
U256::from_big_endian(&bytes)   // 32-byte buffer, data at MSB side

For PUSH4 with 2 remaining bytes [0xAB, 0xCD]:

bytes = [0xAB, 0xCD, 0x00, ..., 0x00] (32 bytes)
U256::from_big_endian gives 0xABCD * 2^240 — completely wrong

New behavior (correct EVM semantics — missing bytes are zero-padded on the LSB side):

let mut bytes = [0u8; N];       // N-byte buffer
bytes[..data.len()].copy_from_slice(data);
u256_from_big_endian_const(bytes) // pads to 32 by prepending zeros

For the same PUSH4 with [0xAB, 0xCD]:

bytes = [0xAB, 0xCD, 0x00, 0x00] (4 bytes)
After padding in u256_from_big_endian_const: [0x00×28, 0xAB, 0xCD, 0x00, 0x00]
Result: 0xABCD0000 — correct per Yellow Paper (zero-pad trailing)

This is an edge case (PUSH at the very end of bytecode), but it's spec-defined behavior. Consider documenting this fix explicitly in the PR description and/or as a comment.

`call_frame.rs` — Limb-by-Limb Store

// call_frame.rs:97-101
let slot = self.values.get_unchecked_mut(next_offset);
slot.0[0] = value.0[0];
slot.0[1] = value.0[1];
slot.0[2] = value.0[2];
slot.0[3] = value.0[3];

Semantically identical to the prior copy_nonoverlapping (4 × u64 = 32 bytes either way).
The unsafe justification is unchanged and remains valid — next_offset is already bounds-checked by checked_sub.
The assembly improvement described in the PR (2 stores vs. 5 for PUSH1) is plausible and consistent with LLVM SROA behavior.
No issues here.

`push.rs` — Normal Case Intermediate Copy

// push.rs:48-50
let mut buf = [0u8; N];
buf.copy_from_slice(&data[..N]);
u256_from_big_endian_const(buf)

The intermediate buf copy is intentional — it gives LLVM a fixed-size [u8; N] array rather than a slice, enabling compile-time constant folding of 32 - N and better autovectorization. No issues. The #[expect(clippy::indexing_slicing)] suppression is correctly placed on the guarded arm.

Minor Nits

u256_from_big_endian_const in crates/common/utils.rs:27 has no #[inline] attribute. Since OpPushHandler<N>::eval is #[inline(always)] and this is a small, monomorphized function, the compiler will almost certainly inline it — but an explicit #[inline] would be defensive and consistent with the hot-path intent.
The #[expect(clippy::indexing_slicing)] attribute moved from wrapping the whole stack.push(...) call to just the inner match arm. This is strictly more precise. No issue, minor style improvement.

Summary

The optimization is correct and well-motivated. The assembly walkthrough in the description is exactly the right level of justification for unsafe and layout-aware code. The main recommendation is to explicitly document the partial-data correctness fix — it is a behavioral change in a spec-defined code path, and reviewers/bisectors will want to know.

Automated review by Claude (Anthropic) · sonnet · custom prompt

github-actions · 2026-03-19T18:14:39Z

🤖 Codex Code Review

Findings

No correctness, security, or gas-accounting issues stood out in this diff. The stack write change in call_frame.rs is behavior-preserving, and the PUSHn refactor in push.rs still matches EVM semantics for both full and truncated immediates.

Residual Risk

I would still add a regression test for truncated PUSHn bytecode at the decode path in push.rs. That case is consensus-sensitive, and this PR changes the helper used there. A simple case like PUSH2 0x12 at EOF should assert the pushed value is 0x1200, not 0x12 or a left-shifted 256-bit artifact.

Verification

I could not run cargo test in this environment because rustup attempted to create temp files under a read-only filesystem.

Automated review by OpenAI Codex · gpt-5.4 · custom prompt

github-actions · 2026-03-19T18:22:20Z

Benchmark Results Comparison

No significant difference was registered for any benchmark run.

Detailed Results

Benchmark Results: BubbleSort

Command	Mean [s]	Min [s]	Max [s]	Relative
`main_revm_BubbleSort`	2.963 ± 0.027	2.939	3.033	1.10 ± 0.01
`main_levm_BubbleSort`	2.718 ± 0.052	2.681	2.853	1.01 ± 0.02
`pr_revm_BubbleSort`	2.972 ± 0.022	2.947	3.028	1.10 ± 0.01
`pr_levm_BubbleSort`	2.703 ± 0.019	2.681	2.745	1.00

Benchmark Results: ERC20Approval

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`main_revm_ERC20Approval`	981.8 ± 13.9	970.4	1017.9	1.01 ± 0.01
`main_levm_ERC20Approval`	1033.6 ± 16.8	1021.6	1078.4	1.06 ± 0.02
`pr_revm_ERC20Approval`	972.0 ± 2.6	969.1	976.3	1.00
`pr_levm_ERC20Approval`	1023.7 ± 3.9	1016.5	1028.6	1.05 ± 0.00

Benchmark Results: ERC20Mint

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`main_revm_ERC20Mint`	131.2 ± 0.7	130.1	132.4	1.00 ± 0.01
`main_levm_ERC20Mint`	149.0 ± 0.6	148.4	150.1	1.14 ± 0.01
`pr_revm_ERC20Mint`	131.0 ± 0.6	129.9	131.8	1.00
`pr_levm_ERC20Mint`	149.4 ± 0.8	148.2	151.1	1.14 ± 0.01

Benchmark Results: ERC20Transfer

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`main_revm_ERC20Transfer`	230.6 ± 1.0	229.3	232.2	1.00
`main_levm_ERC20Transfer`	253.7 ± 2.0	251.8	257.8	1.10 ± 0.01
`pr_revm_ERC20Transfer`	231.2 ± 2.0	229.1	234.5	1.00 ± 0.01
`pr_levm_ERC20Transfer`	254.8 ± 1.7	252.6	258.1	1.10 ± 0.01

Benchmark Results: Factorial

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`main_revm_Factorial`	224.6 ± 3.3	222.8	233.8	1.00 ± 0.02
`main_levm_Factorial`	267.0 ± 3.0	263.1	272.3	1.19 ± 0.01
`pr_revm_Factorial`	223.7 ± 0.7	222.6	225.0	1.00
`pr_levm_Factorial`	259.7 ± 1.2	258.2	261.5	1.16 ± 0.01

Benchmark Results: FactorialRecursive

Command	Mean [s]	Min [s]	Max [s]	Relative
`main_revm_FactorialRecursive`	1.574 ± 0.044	1.471	1.620	1.00
`main_levm_FactorialRecursive`	1.646 ± 0.011	1.632	1.663	1.05 ± 0.03
`pr_revm_FactorialRecursive`	1.607 ± 0.045	1.546	1.668	1.02 ± 0.04
`pr_levm_FactorialRecursive`	1.594 ± 0.008	1.586	1.608	1.01 ± 0.03

Benchmark Results: Fibonacci

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`main_revm_Fibonacci`	203.9 ± 2.3	199.0	208.1	1.00 ± 0.01
`main_levm_Fibonacci`	230.2 ± 2.3	227.9	235.6	1.13 ± 0.01
`pr_revm_Fibonacci`	203.5 ± 1.0	201.5	204.9	1.00
`pr_levm_Fibonacci`	232.7 ± 19.3	225.1	287.3	1.14 ± 0.10

Benchmark Results: FibonacciRecursive

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`main_revm_FibonacciRecursive`	835.2 ± 7.5	818.5	842.5	1.18 ± 0.05
`main_levm_FibonacciRecursive`	708.3 ± 30.0	690.8	792.5	1.00
`pr_revm_FibonacciRecursive`	833.9 ± 11.6	820.6	852.7	1.18 ± 0.05
`pr_levm_FibonacciRecursive`	717.2 ± 31.6	691.7	805.1	1.01 ± 0.06

Benchmark Results: ManyHashes

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`main_revm_ManyHashes`	8.4 ± 0.1	8.3	8.7	1.01 ± 0.02
`main_levm_ManyHashes`	9.9 ± 0.0	9.8	9.9	1.17 ± 0.01
`pr_revm_ManyHashes`	8.4 ± 0.1	8.3	8.5	1.00
`pr_levm_ManyHashes`	9.9 ± 0.1	9.8	10.0	1.18 ± 0.01

Benchmark Results: MstoreBench

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`main_revm_MstoreBench`	264.5 ± 3.9	260.9	274.6	1.13 ± 0.02
`main_levm_MstoreBench`	233.6 ± 0.9	232.7	235.8	1.00
`pr_revm_MstoreBench`	268.3 ± 18.8	260.6	321.7	1.15 ± 0.08
`pr_levm_MstoreBench`	234.5 ± 1.0	232.6	235.8	1.00 ± 0.01

Benchmark Results: Push

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`main_revm_Push`	287.4 ± 1.4	285.4	290.6	1.03 ± 0.01
`main_levm_Push`	279.0 ± 3.8	276.1	287.3	1.00 ± 0.02
`pr_revm_Push`	287.7 ± 3.4	283.2	293.6	1.04 ± 0.02
`pr_levm_Push`	277.9 ± 2.7	275.2	283.9	1.00

Benchmark Results: SstoreBench_no_opt

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`main_revm_SstoreBench_no_opt`	167.9 ± 2.0	164.4	171.3	1.69 ± 0.02
`main_levm_SstoreBench_no_opt`	99.4 ± 0.3	99.0	99.8	1.00
`pr_revm_SstoreBench_no_opt`	164.6 ± 3.9	160.6	169.3	1.66 ± 0.04
`pr_levm_SstoreBench_no_opt`	99.6 ± 0.8	98.9	101.0	1.00 ± 0.01

github-actions · 2026-03-19T21:21:47Z

🤖 Kimi Code Review

The changes look correct and represent legitimate micro-optimizations for EVM execution. A few specific observations:

crates/vm/levm/src/call_frame.rs (lines 91-102)

The manual limb assignment is a valid optimization to help LLVM scalarize the U256 store, avoiding memset/memcpy overhead for small PUSH values. The unsafe block is properly justified and bounded.
Ensure that ethrex_common::U256 guarantees a stable internal representation with field 0 as [u64; 4] (standard for primitive-types U256, but worth confirming this crate doesn't use a wrapper that hides the field).

crates/vm/levm/src/opcode_handlers/push.rs (lines 10, 44-60)

The switch to u256_from_big_endian_const correctly handles the big-endian byte ordering required by PUSH opcodes.
The padding logic for truncated bytecode (lines 54-58) correctly zero-fills the remaining bytes when data.len() < N, maintaining consensus compatibility with out-of-bounds reads returning 0.
Minor nit: In the first match arm (lines 48-51), the array initialization [0u8; N] is technically redundant since you immediately overwrite all N bytes with copy_from_slice. You could use unsafe { *(data[..N].as_ptr() as *const [u8; N]) } or data[..N].try_into().unwrap() since the length is already verified, but the current approach is safe and clear.

Consensus & Security

No consensus issues detected. The behavior for PUSH operations with insufficient bytecode remains correct (zero-padding).
No new panic vectors introduced; all indexing is guarded by length checks or uses const-generic arrays.

Suggestion
Consider adding a debug assertion or comment in push.rs confirming that N is within 1..=32 (since PUSH0 is handled separately), though the const-generic bounds likely enforce this at compile time.

Overall: LGTM - These are well-documented, targeted optimizations that maintain correctness.

Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

greptile-apps · 2026-03-19T21:24:37Z

Greptile Summary

This PR applies two assembly-level micro-optimizations to the PUSH1–PUSH32 opcode handlers in LEVM: switching to a const-generic u256_from_big_endian_const so LLVM can operate on fixed-size [u8; N] arrays instead of runtime slices, and replacing copy_nonoverlapping with four explicit limb assignments in Stack::push so LLVM treats all limbs as independent i64 scalars and eliminates the alloca spill/reload pattern for values with known-zero upper limbs. Both changes are well-motivated by the generated assembly analysis in the PR description and are semantically correct for the common code path.

Key observations:

The call_frame.rs change is a pure mechanical substitution — copy_nonoverlapping of 4 u64s is exactly equivalent to four individual assignments, and the safety invariant (get_unchecked_mut is guarded by the prior checked_sub on the offset) is unchanged.
The push.rs change also silently fixes a correctness bug in the truncated-bytecode edge case (Some(data) arm where data.len() < N). The old code placed partial literal bytes at the most-significant end of a 32-byte buffer, producing a wildly inflated value. The new code correctly zero-pads on the right within the N-byte window, matching EVM spec behaviour (missing bytes treated as 0x00 appended to the right of the literal). This fix is not documented in the PR description and has no dedicated test.
u256_from_big_endian_const is a const-generic function, so Rust monomorphizes it per calling crate — the lack of an explicit #[inline] attribute is not a concern here.

Confidence Score: 4/5

Safe to merge; the hot-path is functionally identical to before and the one behavioral change is a correctness improvement per EVM spec.
Both changed files are small and well-reasoned. The call_frame.rs change is a straightforward mechanical substitution with no semantic difference. The push.rs change is correct and also incidentally fixes a long-standing edge-case bug (truncated PUSH literal). The only risk is the undocumented behavioral change for truncated bytecodes, but the new behavior is more spec-compliant and existing tests pass. A dedicated test for the truncated-bytecode path would increase confidence to 5.
crates/vm/levm/src/opcode_handlers/push.rs — contains an undocumented correctness fix in the Some(data) arm that should be documented and tested.

Important Files Changed

Filename	Overview
crates/vm/levm/src/opcode_handlers/push.rs	Switches to const-generic `u256_from_big_endian_const` for better LLVM codegen. Also silently fixes a correctness bug in the truncated-bytecode edge case (partial PUSH literal at end-of-code now zero-pads correctly on the right instead of the left), but this fix is undocumented in the PR description.
crates/vm/levm/src/call_frame.rs	Replaces `copy_nonoverlapping` with explicit per-limb assignments in `Stack::push`. Semantically equivalent; helps LLVM treat the four `u64` limbs as independent scalars, eliminating the alloca spill/reload round-trip for PUSH1–PUSH31 where upper limbs are constant zeros.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["OpPushHandler&lt;N&gt;::eval"] --> B["Read literal_offset from bytecode"]
    B --> C{bytecode.get\nliteral_offset..}
    C -->|"Some(data)\ndata.len() >= N"| D["buf = [0u8; N]\nbuf.copy_from_slice(&data[..N])\nu256_from_big_endian_const(buf)"]
    C -->|"Some(data)\ndata.len() < N\n(truncated bytecode)"| E["bytes = [0u8; N]\nbytes[..data.len()].copy_from_slice(data)\nu256_from_big_endian_const(bytes)\n✅ correct: zero-pad on right"]
    C -->|None| F["U256::zero()"]
    D --> G["Stack::push(value)"]
    E --> G
    F --> G
    G --> H["get_unchecked_mut(next_offset)"]
    H --> I["slot.0[0] = value.0[0]\nslot.0[1] = value.0[1]\nslot.0[2] = value.0[2]\nslot.0[3] = value.0[3]"]
    I --> J["self.offset = next_offset"]
    J --> K["OpcodeResult::Continue"]

    style E fill:#d4edda,stroke:#28a745
    style D fill:#cfe2ff,stroke:#0d6efd

Prompt To Fix All With AI

This is a comment left during a code review.
Path: crates/vm/levm/src/opcode_handlers/push.rs
Line: 52-56

Comment:
**Silent behavioral change in truncated-bytecode path**

This arm is reached when `data.len() < N` (the PUSH literal extends beyond the end of bytecode). The old code and the new code behave **differently** here, and that difference isn't noted in the PR description.

**Old code:**
```rust
let mut bytes = [0; 32];          // 32-byte buffer
bytes[..data.len()].copy_from_slice(data);  // data placed at MSB end
U256::from_big_endian(&bytes)
// → data is treated as the most-significant bytes of a 32-byte word
// → PUSH4 with [0x01, 0x02] remaining → 0x0102_0000…0000 (huge)
```

**New code:**
```rust
let mut bytes = [0u8; N];         // N-byte buffer  (e.g. 4 for PUSH4)
bytes[..data.len()].copy_from_slice(data);  // data placed at MSB end of the N-byte window
u256_from_big_endian_const(bytes)
// → pads [data | zeros] to 32 bytes at the LSB side
// → PUSH4 with [0x01, 0x02] remaining → 0x01020000 ✓
```

The new behaviour is **correct** per the EVM specification (missing bytes are treated as `0x00` appended to the right of the literal), while the old code was effectively shifting the partial literal to the most-significant word position, producing a wildly wrong value.

This is a correctness fix and should be explicitly called out in the PR description so that reviewers and bisect tooling don't accidentally revert it. It would also be worth adding a targeted unit test for this edge case (e.g. a contract whose bytecode ends mid-`PUSH4`) to lock in the correct behaviour.

How can I resolve this? If you propose a fix, please make it concise.

_{Last reviewed commit: "perf(levm): eliminat..."}

greptile-apps · 2026-03-19T21:24:41Z

+            Some(data) => {
+                let mut bytes = [0u8; N];
+                bytes[..data.len()].copy_from_slice(data);
+                u256_from_big_endian_const(bytes)
+            }


Silent behavioral change in truncated-bytecode path

This arm is reached when data.len() < N (the PUSH literal extends beyond the end of bytecode). The old code and the new code behave differently here, and that difference isn't noted in the PR description.

Old code:

let mut bytes = [0; 32]; // 32-byte buffer bytes[..data.len()].copy_from_slice(data); // data placed at MSB end U256::from_big_endian(&bytes) // → data is treated as the most-significant bytes of a 32-byte word // → PUSH4 with [0x01, 0x02] remaining → 0x0102_0000…0000 (huge)

New code:

let mut bytes = [0u8; N]; // N-byte buffer (e.g. 4 for PUSH4) bytes[..data.len()].copy_from_slice(data); // data placed at MSB end of the N-byte window u256_from_big_endian_const(bytes) // → pads [data | zeros] to 32 bytes at the LSB side // → PUSH4 with [0x01, 0x02] remaining → 0x01020000 ✓

The new behaviour is correct per the EVM specification (missing bytes are treated as 0x00 appended to the right of the literal), while the old code was effectively shifting the partial literal to the most-significant word position, producing a wildly wrong value.

This is a correctness fix and should be explicitly called out in the PR description so that reviewers and bisect tooling don't accidentally revert it. It would also be worth adding a targeted unit test for this edge case (e.g. a contract whose bytecode ends mid-PUSH4) to lock in the correct behaviour.

Prompt To Fix With AI

This is a comment left during a code review. Path: crates/vm/levm/src/opcode_handlers/push.rs Line: 52-56 Comment: **Silent behavioral change in truncated-bytecode path** This arm is reached when `data.len() < N` (the PUSH literal extends beyond the end of bytecode). The old code and the new code behave **differently** here, and that difference isn't noted in the PR description. **Old code:** ```rust let mut bytes = [0; 32]; // 32-byte buffer bytes[..data.len()].copy_from_slice(data); // data placed at MSB end U256::from_big_endian(&bytes) // → data is treated as the most-significant bytes of a 32-byte word // → PUSH4 with [0x01, 0x02] remaining → 0x0102_0000…0000 (huge) ``` **New code:** ```rust let mut bytes = [0u8; N]; // N-byte buffer (e.g. 4 for PUSH4) bytes[..data.len()].copy_from_slice(data); // data placed at MSB end of the N-byte window u256_from_big_endian_const(bytes) // → pads [data | zeros] to 32 bytes at the LSB side // → PUSH4 with [0x01, 0x02] remaining → 0x01020000 ✓ ``` The new behaviour is **correct** per the EVM specification (missing bytes are treated as `0x00` appended to the right of the literal), while the old code was effectively shifting the partial literal to the most-significant word position, producing a wildly wrong value. This is a correctness fix and should be explicitly called out in the PR description so that reviewers and bisect tooling don't accidentally revert it. It would also be worth adding a targeted unit test for this edge case (e.g. a contract whose bytecode ends mid-`PUSH4`) to lock in the correct behaviour. How can I resolve this? If you propose a fix, please make it concise.

github-actions · 2026-03-19T21:25:11Z

🤖 Claude Code Review

Now I have enough context to write a thorough review.

Code Review: `perf(levm): optimize PUSH opcode handlers`

Overview

This PR applies two assembly-level optimizations to PUSH opcode handlers. The changes are small but the correctness analysis reveals an important detail beyond the stated performance goal.

`push.rs` — Change 1: const-generic big-endian conversion

Silent bug fix in the partial-bytes branch

The most significant change here is correctness, not performance. The Some(data) arm (partial bytecode at end-of-code) was previously:

// OLD — INCORRECT per EVM spec
let mut bytes = [0; 32];
bytes[..data.len()].copy_from_slice(data);
U256::from_big_endian(&bytes)

For PUSH4 with 2 remaining bytes [0xAB, 0xCD], this produced:

bytes = [0xAB, 0xCD, 0x00, ..., 0x00]  ← 32 bytes, data at MSB end
from_big_endian → 0xABCD * 2^240       ← WRONG

The EVM spec requires zero-padding on the right (LSB side), producing 0xABCD_0000. The new code:

// NEW — CORRECT
let mut bytes = [0u8; N];              // N=4 → 4 bytes
bytes[..data.len()].copy_from_slice(data); // [0xAB, 0xCD, 0x00, 0x00]
u256_from_big_endian_const(bytes)      // → 0xABCD_0000 ✓

This bug fix should be explicitly documented in the PR description. It affects any PUSH opcode that spans the end of a contract's bytecode (an edge case, but valid per spec). Tests covering this scenario are warranted.

Missing #[inline] on u256_from_big_endian_const

The claimed optimization — LLVM seeing zero upper limbs for PUSH1–PUSH31 and eliding stores — only works if the function body is inlined into the PUSH handler at the LLVM IR level. Without #[inline], cross-crate calls may not be inlined even with monomorphization, leaving the intermediate 32-byte padded alloca in place:

// crates/common/utils.rs:27 — missing #[inline]
pub fn u256_from_big_endian_const<const N: usize>(slice: [u8; N]) -> U256 {
    let mut padded = [0u8; 32];  // ← this alloca is what we're trying to eliminate
    padded[32 - N..32].copy_from_slice(&slice);
    ...
}

Adding #[inline] (or #[inline(always)], consistent with u256_to_big_endian at line 46 of the same file) is needed to guarantee the optimization materializes in non-LTO builds.

Redundant copy in the happy path

The normal case (data.len() >= N) now performs two copies:

data[..N] → buf (in push.rs:49)
buf → padded[32-N..] (inside u256_from_big_endian_const)

With inlining these should fold into one, but it's worth confirming with cargo asm that the optimization fires as intended.

`call_frame.rs` — Change 2: eliminate stack-frame spill in `Stack::push`

// Before
std::ptr::copy_nonoverlapping(value.0.as_ptr(), slot.as_mut_ptr(), U64_PER_U256);

// After
slot.0[0] = value.0[0];
slot.0[1] = value.0[1];
slot.0[2] = value.0[2];
slot.0[3] = value.0[3];

The change is correct and functionally equivalent. The LLVM SROA argument is sound: a copy_nonoverlapping of 4 × u64 generates a memcpy IR node that LLVM's SROA pass may bundle with the U256 alloca into a [24 x i8] aggregate, preventing scalar reasoning about individual limbs. Individual stores let LLVM track each limb as an independent i64 scalar.

Minor: the comment is clear and thorough. One note — this is an LLVM implementation detail that could theoretically regress across compiler versions. A brief mention that this was validated against a specific rustc/LLVM version (or a reference to the benchmark/asm output) would help future maintainers assess whether the optimization is still firing.

Safety: Within the unsafe block, slot.0[0..3] are compile-time constant indices into a [u64; 4] array; bounds checks are elided by the compiler. No new unsafety is introduced.

Summary

Item	Severity	Notes
Bug fix in partial-PUSH bytes not documented	Medium	Silently fixes incorrect EVM spec behavior; needs a test and PR note
Missing `#[inline]` on `u256_from_big_endian_const`	Medium	Required for the stated optimization to reliably materialize
Individual limb stores in `Stack::push`	Good	Correct, well-commented, sound LLVM argument
Code readability	Improved	Separating value computation from `stack.push()` is cleaner

The functional correctness of the partial-bytes case and the #[inline] gap are the two items worth addressing before merge.

Automated review by Claude (Anthropic) · sonnet · custom prompt

github-actions · 2026-03-19T21:27:23Z

🤖 Codex Code Review

Non-blocking: push.rs changes a consensus-sensitive path, but I could not find a regression test for truncated PUSHn immediates at EOF. Please add cases like PUSH2 0x01 <EOF> => 0x0100 and PUSH32 0x01 <EOF> => 0x01 << 248 to lock in the padding semantics.

I did not find any blocking correctness, security, gas-accounting, or memory-safety issues in the touched code. The manual limb stores in call_frame.rs are behaviorally equivalent to the previous copy_nonoverlapping, and the new u256_from_big_endian_const path in push.rs preserves the existing big-endian decoding for both full and short immediates.

I could not run the Rust tests in this environment because cargo needs to write under ~/.cargo/~/.rustup and fetch dependencies, which is blocked here.

Automated review by OpenAI Codex · gpt-5.4 · custom prompt

iovoid · 2026-03-20T14:26:56Z

+        // Store each limb individually so LLVM treats them as 4 independent i64 scalars.
+        // This prevents LLVM from grouping limbs[1..3] into a [24 x i8] alloca that would
+        // then need a memset + memcpy round-trip for values with known-zero upper limbs
+        // (e.g. PUSH1-PUSH31), allowing it to emit direct zero stores instead.


nit: PUSH25-PUSH31 don't have known-zero upper limbs

github-actions · 2026-03-26T12:23:14Z

Benchmark Block Execution Results Comparison Against Main

Command	Mean [s]	Min [s]	Max [s]	Relative
`base`	61.665 ± 0.214	61.513	62.226	1.01 ± 0.01
`head`	61.043 ± 0.235	60.779	61.617	1.00

## Summary This PR applies two complementary assembly-level optimizations to the PUSH1–PUSH32 opcode handlers, identified by analyzing the generated aarch64 and x86_64 assembly output. ### Change 1: Use const-generic big-endian conversion (`push.rs`) Replaces `U256::from_big_endian(&data[..N])` (runtime slice) with `u256_from_big_endian_const::<N>(buf)` (const-generic fixed-size array). Because `N` is a compile-time constant at each monomorphized `OpPushHandler<N>`, the compiler can: - Compute the padding offset `32 - N` at compile time - Operate on a fixed-size `[u8; N]` buffer instead of a runtime slice, enabling better autovectorization of the byte copy ### Change 2: Eliminate stack-frame spill in `Stack::push` (`call_frame.rs`) Replaces the previous `copy_nonoverlapping` (and later direct struct assignment) with individual stores for each of the 4 U256 limbs. **Root cause of the spill:** LLVM's SROA pass decomposes a `U256` value into `limb[0]` (scalar i64) and `limbs[1..3]` (a `[24 x i8]` stack alloca). A struct assignment or `copy_nonoverlapping` still uses `memcpy` for the alloca portion, causing a `memset(alloca, 0) + memcpy(alloca → slot)` round-trip. By storing each limb individually, LLVM treats all 4 as independent i64 scalars, proves the upper limbs are zero constants for PUSH1–PUSH31, and eliminates the alloca entirely. **Before** (PUSH1 fast path, 5 memory ops): ```asm str x9, [x11] ; limb[0] ldur q0, [sp, #8] ; reload zeros from frame ← wasted stur q0, [x11, #8] ; limbs[1..2] ldr x9, [sp, #24] ; reload zeros from frame ← wasted str x9, [x11, #24] ; limb[3] ``` **After** (PUSH1 fast path, 2 memory ops, no stack frame): ```asm stp x9, xzr, [x11] ; limb[0] + limb[1] stp xzr, xzr, [x11, #16] ; limb[2] + limb[3] ``` The same spill/reload pattern was confirmed on x86_64 (using `xorps`/`movaps`/`movq` through the stack frame). ## Test plan - [x] `cargo check -p ethrex-levm` - [x] `cargo test -p ethrex-levm --release` - [x] `cargo test -p ethrex --release`

azteca1998 added 2 commits March 19, 2026 18:35

github-actions Bot assigned azteca1998 Mar 19, 2026

github-actions Bot added levm Lambda EVM implementation performance Block execution throughput and performance in general labels Mar 19, 2026

github-project-automation Bot added this to ethrex_l1 and ethrex_performance Mar 19, 2026

github-project-automation Bot moved this to Todo in ethrex_performance Mar 19, 2026

azteca1998 marked this pull request as ready for review March 19, 2026 21:20

azteca1998 requested a review from a team as a code owner March 19, 2026 21:20

greptile-apps Bot reviewed Mar 19, 2026

View reviewed changes

iovoid approved these changes Mar 20, 2026

View reviewed changes

ElFantasma approved these changes Mar 20, 2026

View reviewed changes

edg-l approved these changes Mar 24, 2026

View reviewed changes

github-project-automation Bot moved this to In Review in ethrex_l1 Mar 24, 2026

chore: update CHANGELOG for PR #6390

6d3efd8

azteca1998 enabled auto-merge March 26, 2026 11:39

azteca1998 added this pull request to the merge queue Mar 26, 2026

Merged via the queue into main with commit 3d673ac Mar 26, 2026
83 of 85 checks passed

azteca1998 deleted the levm-push-opcode-opt branch March 26, 2026 13:29

github-project-automation Bot moved this from Todo to Done in ethrex_performance Mar 26, 2026

github-project-automation Bot moved this from In Review to Done in ethrex_l1 Mar 26, 2026

Conversation

azteca1998 commented Mar 19, 2026

Summary

Change 1: Use const-generic big-endian conversion (push.rs)

Change 2: Eliminate stack-frame spill in Stack::push (call_frame.rs)

Test plan

Uh oh!

github-actions Bot commented Mar 19, 2026

🤖 Kimi Code Review

crates/vm/levm/src/call_frame.rs

crates/vm/levm/src/opcode_handlers/push.rs

Uh oh!

github-actions Bot commented Mar 19, 2026

Lines of code report

Uh oh!

github-actions Bot commented Mar 19, 2026

🤖 Claude Code Review

PR Review: perf(levm): optimize PUSH opcode handlers

Overall Assessment

Correctness: Hidden Bug Fix in Partial-Data Case

call_frame.rs — Limb-by-Limb Store

push.rs — Normal Case Intermediate Copy

Minor Nits

Summary

Uh oh!

github-actions Bot commented Mar 19, 2026

🤖 Codex Code Review

Uh oh!

github-actions Bot commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark Results Comparison

Benchmark Results: BubbleSort

Benchmark Results: ERC20Approval

Benchmark Results: ERC20Mint

Benchmark Results: ERC20Transfer

Benchmark Results: Factorial

Benchmark Results: FactorialRecursive

Benchmark Results: Fibonacci

Benchmark Results: FibonacciRecursive

Benchmark Results: ManyHashes

Benchmark Results: MstoreBench

Benchmark Results: Push

Benchmark Results: SstoreBench_no_opt

Uh oh!

github-actions Bot commented Mar 19, 2026

🤖 Kimi Code Review

Uh oh!

greptile-apps Bot commented Mar 19, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot Mar 19, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Mar 19, 2026

🤖 Claude Code Review

Code Review: perf(levm): optimize PUSH opcode handlers

Overview

push.rs — Change 1: const-generic big-endian conversion

call_frame.rs — Change 2: eliminate stack-frame spill in Stack::push

Summary

Uh oh!

github-actions Bot commented Mar 19, 2026

🤖 Codex Code Review

Uh oh!

iovoid Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Mar 26, 2026

Benchmark Block Execution Results Comparison Against Main

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Change 1: Use const-generic big-endian conversion (`push.rs`)

Change 2: Eliminate stack-frame spill in `Stack::push` (`call_frame.rs`)

`crates/vm/levm/src/call_frame.rs`

`crates/vm/levm/src/opcode_handlers/push.rs`

PR Review: `perf(levm): optimize PUSH opcode handlers`

`call_frame.rs` — Limb-by-Limb Store

`push.rs` — Normal Case Intermediate Copy

github-actions Bot commented Mar 19, 2026 •

edited

Loading

Code Review: `perf(levm): optimize PUSH opcode handlers`

`push.rs` — Change 1: const-generic big-endian conversion

`call_frame.rs` — Change 2: eliminate stack-frame spill in `Stack::push`