build: add Dockerfile. #8
Merged
juanbono approved these changes on Jun 5, 2024
This was referenced Feb 6, 2026
azteca1998 added a commit that referenced this pull request on Mar 19, 2026
…-limb values

Store each U256 limb individually instead of using a struct assignment or copy_nonoverlapping. This prevents LLVM from grouping limbs[1..3] into a [24 x i8] stack alloca that then requires a memset + memcpy round-trip to reach the EVM stack slot.

After the fix, LLVM sees all 4 limbs as independent i64 scalars and can prove that the upper limbs are zero constants for PUSH1-PUSH31. This reduces the EVM stack write from 5 memory ops (3 stores + 2 reloads) to 2 stp instructions that use the hardware zero register.

Before (PUSH1 fast path):
str x9, [x11] ; limb[0]
ldur q0, [sp, #8] ; reload zeros from frame
stur q0, [x11, #8] ; limbs[1..2]
ldr x9, [sp, #24] ; reload zeros from frame
str x9, [x11, #24] ; limb[3]

After (PUSH1 fast path):
stp x9, xzr, [x11] ; limb[0] + limb[1]
stp xzr, xzr, [x11, #16] ; limb[2] + limb[3]
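A minimal Rust sketch of the per-limb store pattern described in this commit. `U256`, `Stack`, `slots`, `len`, `with_capacity`, and `top` are hypothetical stand-ins (four little-endian `u64` limbs, `limbs[0]` least significant), not ethrex's real types. The snippet only shows the source-level shape of the change; whether LLVM actually drops the alloca has to be confirmed in the emitted assembly, as the commit did.

```rust
/// Hypothetical 256-bit word: four little-endian u64 limbs,
/// limbs[0] being the least significant.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Default)]
pub struct U256(pub [u64; 4]);

/// Hypothetical fixed-capacity EVM stack (field names are illustrative).
pub struct Stack {
    slots: Vec<U256>,
    len: usize,
}

impl Stack {
    pub fn with_capacity(cap: usize) -> Self {
        Stack { slots: vec![U256::default(); cap], len: 0 }
    }

    /// Writes each limb with its own store instead of `*slot = value`
    /// or `copy_nonoverlapping`. Semantically identical; per the commit
    /// message, the benefit is in how LLVM lowers the four scalar stores.
    pub fn push(&mut self, value: U256) -> Result<(), &'static str> {
        let slot = self.slots.get_mut(self.len).ok_or("stack overflow")?;
        slot.0[0] = value.0[0];
        slot.0[1] = value.0[1];
        slot.0[2] = value.0[2];
        slot.0[3] = value.0[3];
        self.len += 1;
        Ok(())
    }

    /// Returns the most recently pushed word, if any.
    pub fn top(&self) -> Option<&U256> {
        self.len.checked_sub(1).map(|i| &self.slots[i])
    }
}
```

For a PUSH1 value only `limbs[0]` is non-zero, e.g. `stack.push(U256([0x2a, 0, 0, 0]))`; that is the case where the upper three stores can become stores of the zero register.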
github-merge-queue bot pushed a commit that referenced this pull request on Mar 26, 2026
## Summary

This PR applies two complementary assembly-level optimizations to the PUSH1–PUSH32 opcode handlers, identified by analyzing the generated aarch64 and x86_64 assembly output.

### Change 1: Use const-generic big-endian conversion (`push.rs`)

Replaces `U256::from_big_endian(&data[..N])` (runtime slice) with `u256_from_big_endian_const::<N>(buf)` (const-generic fixed-size array). Because `N` is a compile-time constant at each monomorphized `OpPushHandler<N>`, the compiler can:

- Compute the padding offset `32 - N` at compile time
- Operate on a fixed-size `[u8; N]` buffer instead of a runtime slice, enabling better autovectorization of the byte copy

### Change 2: Eliminate stack-frame spill in `Stack::push` (`call_frame.rs`)

Replaces the previous `copy_nonoverlapping` (and later direct struct assignment) with individual stores for each of the 4 U256 limbs.

**Root cause of the spill:** LLVM's SROA pass decomposes a `U256` value into `limb[0]` (scalar i64) and `limbs[1..3]` (a `[24 x i8]` stack alloca). A struct assignment or `copy_nonoverlapping` still uses `memcpy` for the alloca portion, causing a `memset(alloca, 0) + memcpy(alloca → slot)` round-trip. By storing each limb individually, LLVM treats all 4 as independent i64 scalars, proves the upper limbs are zero constants for PUSH1–PUSH31, and eliminates the alloca entirely.

**Before** (PUSH1 fast path, 5 memory ops):

```asm
str  x9, [x11]        ; limb[0]
ldur q0, [sp, #8]     ; reload zeros from frame ← wasted
stur q0, [x11, #8]    ; limbs[1..2]
ldr  x9, [sp, #24]    ; reload zeros from frame ← wasted
str  x9, [x11, #24]   ; limb[3]
```

**After** (PUSH1 fast path, 2 memory ops, no stack frame):

```asm
stp x9, xzr, [x11]         ; limb[0] + limb[1]
stp xzr, xzr, [x11, #16]   ; limb[2] + limb[3]
```

The same spill/reload pattern was confirmed on x86_64 (using `xorps`/`movaps`/`movq` through the stack frame).

## Test plan

- [x] `cargo check -p ethrex-levm`
- [x] `cargo test -p ethrex-levm --release`
- [x] `cargo test -p ethrex --release`
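To illustrate Change 1, here is a sketch of what a const-generic decoder in the spirit of `u256_from_big_endian_const::<N>` could look like. The body is an assumption, not ethrex's code: it returns raw `[u64; 4]` limbs (little-endian order, `limbs[0]` least significant) instead of a `U256`, and `read_push_immediate` is a hypothetical helper showing how a handler might turn the bytecode slice into a fixed-size `[u8; N]` before decoding.

```rust
/// Hypothetical const-generic big-endian decoder. Because N is fixed per
/// monomorphization, `32 - N` and the copy length are compile-time constants.
pub fn u256_from_big_endian_const<const N: usize>(buf: [u8; N]) -> [u64; 4] {
    assert!(N <= 32, "PUSH immediates are at most 32 bytes");
    // Left-pad the N big-endian bytes into a 32-byte word.
    let mut padded = [0u8; 32];
    padded[32 - N..].copy_from_slice(&buf);
    let mut limbs = [0u64; 4];
    for (i, limb) in limbs.iter_mut().enumerate() {
        // limbs[0] (least significant) comes from the last 8 bytes.
        let start = (3 - i) * 8;
        let mut chunk = [0u8; 8];
        chunk.copy_from_slice(&padded[start..start + 8]);
        *limb = u64::from_be_bytes(chunk);
    }
    limbs
}

/// Hypothetical handler helper: copies the N-byte immediate after the opcode
/// at `pc` into a fixed-size array. Real EVM semantics zero-pad an immediate
/// that runs past the end of the code; this sketch returns None instead.
pub fn read_push_immediate<const N: usize>(code: &[u8], pc: usize) -> Option<[u64; 4]> {
    let bytes = code.get(pc + 1..pc + 1 + N)?;
    let mut buf = [0u8; N];
    buf.copy_from_slice(bytes);
    Some(u256_from_big_endian_const::<N>(buf))
}
```

For example, the bytecode `[0x60, 0x2a]` (PUSH1 0x2a) decoded with `read_push_immediate::<1>(&code, 0)` yields limbs `[0x2a, 0, 0, 0]`.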
edg-l pushed a commit that referenced this pull request on Mar 26, 2026
edg-l added a commit that referenced this pull request on May 5, 2026
Run #8 still failed at switch_block+2 even with the cache-aware backend that landed in f10fb7f: each post-switch block that read an account modified at a prior post-switch block (but not the latest) got the MPT-base nonce, off by 1.

Root cause: `BinaryMerkleizer::new` started from `BinaryTrieState::new()` (empty), ignoring `_parent_root` and the `provider` (marked dead_code). By contrast, the symmetric `MptMerkleizer::new` opens 16 shard workers, each rooted at `parent_state_root`, and lazy-loads via the provider, so the merkleizer's trie at the new root is the FULL post-parent state. The binary side was producing diffs from empty, so the in-memory trie at root R(N) contained paths only to accounts modified at block N. Read-path gates that walk `state.trie_get` (notably `stem_has_basic_data`) returned false for any account not in the latest block, and `TransitionBackend::account` fell through to the MPT base.

Make BinaryMerkleizer mirror MptMerkleizer:

- `BinaryTrieProvider::open_state()`: new trait method returning a `BinaryTrieState`. The default impl returns `BinaryTrieState::new()` (empty, for tests + `EmptyBinaryTrieProvider` + genesis bootstrap).
- `StoreBinaryTrieProvider::open_state` overrides it: it opens against `CacheAwareTrieBackend`, so the trie is rooted at the live binary head (cache + disk), matching MPT's `MptTrieWrapper(state_root, trie_cache, db, last_written)` pattern.
- `BinaryMerkleizer::{new,new_bal}` open via `provider.open_state()` instead of `BinaryTrieState::new()`. The `provider` field loses its `#[allow(dead_code)]`; it is load-bearing now.
- `EmptyTrieBackend` added to `binary-trie::db` for symmetry with `EmptyBinaryTrieProvider` (used by future test paths that want the default no-op `open_state`).

After this fix, each block's merkleizer trie contains parent_state + this block's writes, so `state.trie_get` works for cross-block reads. Layer-cache + on-disk fallback flows identically to the MPT pipeline.
ethrex-binary-trie 143/143, ethrex-storage 49/49, ethrex-blockchain 7/7 all pass. fmt + clippy clean on touched code.
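A simplified sketch of the provider pattern this commit describes, with an in-memory `HashMap` standing in for the real trie. The type and method names (`BinaryTrieState`, `trie_get`, `BinaryTrieProvider`, `open_state`, `EmptyBinaryTrieProvider`, `StoreBinaryTrieProvider`, `BinaryMerkleizer`) come from the commit message, but every field, signature, and body below is hypothetical; for instance, the real `new` also takes a parent root and keeps the provider as a field, and `apply_write` and `head` are invented for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for `BinaryTrieState` (key -> value).
#[derive(Clone, Debug, Default)]
pub struct BinaryTrieState {
    entries: HashMap<Vec<u8>, Vec<u8>>,
}

impl BinaryTrieState {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn trie_get(&self, key: &[u8]) -> Option<&Vec<u8>> {
        self.entries.get(key)
    }

    pub fn insert(&mut self, key: Vec<u8>, value: Vec<u8>) {
        self.entries.insert(key, value);
    }
}

pub trait BinaryTrieProvider {
    /// Default: an empty trie, as for tests, the empty provider, and genesis.
    fn open_state(&self) -> BinaryTrieState {
        BinaryTrieState::new()
    }
}

/// Relies on the default (empty) `open_state`.
pub struct EmptyBinaryTrieProvider;
impl BinaryTrieProvider for EmptyBinaryTrieProvider {}

/// Hypothetical store-backed provider; `head` stands in for the trie rooted
/// at the live binary head (cache + disk in the real code).
pub struct StoreBinaryTrieProvider {
    pub head: BinaryTrieState,
}

impl BinaryTrieProvider for StoreBinaryTrieProvider {
    fn open_state(&self) -> BinaryTrieState {
        self.head.clone()
    }
}

pub struct BinaryMerkleizer {
    state: BinaryTrieState,
}

impl BinaryMerkleizer {
    /// Starts from the provider's view of the parent state rather than
    /// `BinaryTrieState::new()`, so accounts untouched in this block stay readable.
    pub fn new<P: BinaryTrieProvider>(provider: &P) -> Self {
        BinaryMerkleizer { state: provider.open_state() }
    }

    pub fn apply_write(&mut self, key: Vec<u8>, value: Vec<u8>) {
        self.state.insert(key, value);
    }

    pub fn state(&self) -> &BinaryTrieState {
        &self.state
    }
}
```

With `EmptyBinaryTrieProvider`, the merkleizer only ever sees the current block's writes, which reproduces the pre-fix behavior; with a store-backed provider, a lookup of an account written in an earlier block still succeeds.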
Description
tracing

Closes #7