ARM64 Voodoo recompiler by skiretic#6887
Merged
* Add ARM64 Voodoo JIT codegen scaffolding (Phase 1)
Create vid_voodoo_codegen_arm64.h with:
- voodoo_arm64_data_t struct mirroring x86-64 layout
- 292 ARM64 instruction encoding macros (GPR + NEON)
- Struct offset constants for JIT field access
- NEON lookup tables (alookup, aminuslookup, bilinear_lookup)
- voodoo_codegen_init/close for executable memory management
- voodoo_get_block with W^X toggle + I-cache flush
- voodoo_generate with prologue/epilogue (save/restore callee-saved GPRs x19-x28, FP/LR, NEON d8-d13; load pinned constants; pixel loop skeleton with x-coordinate increment and loop branch)
Guard changes:
- vid_voodoo_render.h: add __aarch64__/_M_ARM64 to NO_CODEGEN gate
- vid_voodoo_render.c: add ARM64 include path for new codegen header
Also add planning docs, build script, and changelog.
* Add compile-time static assertions for all struct offset constants
Verify all 50 STATE_* and PARAMS_* offset constants against actual offsetof() values using _Static_assert. This catches any layout differences between assumed and actual struct layouts at compile time. All assertions pass on ARM64 (Apple Silicon, LP64).
* Update checklist: mark Phase 1 items complete (build verified)
All scaffolding items done except runtime test. Build passes with all struct offset assertions verified at compile time.
* Mark agent verification checklist item complete
* Fix missing stdint.h include in ARM64 codegen header
Add #include <stdint.h> to vid_voodoo_codegen_arm64.h to resolve uint8_t, uint16_t, uint32_t type errors. Build now passes cleanly.
* Update changelog with Phase 1 build fix
Document stdint.h fix and build verification.
* Phase 1 complete: runtime test passed
- Add clean-build-and-sign.sh script for full clean builds
- Mark Phase 1 runtime test as complete in checklist
- Document runtime test success in changelog
Runtime test verified: emulator launches, Voodoo initializes, rendering falls through to interpreter (as expected - pixel pipeline not yet implemented).
Phase 1 is now 100% complete. Ready for Phase 2.
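The compile-time offset checks described in the Phase 1 commits can be illustrated with a minimal sketch. The struct `demo_state_t` and the `DEMO_STATE_*` constants below are hypothetical stand-ins, not the real `STATE_*`/`PARAMS_*` definitions; the point is only the pattern of pairing each hand-written offset with an `offsetof()` check.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the real state struct; the field names and
 * offsets are illustrative only. */
typedef struct demo_state_t {
    uint32_t x;   /* expected at offset 0 */
    uint32_t lod; /* expected at offset 4 */
    int64_t  w;   /* expected at offset 8 */
} demo_state_t;

/* Hand-written offsets, as a JIT would bake into load/store encodings. */
#define DEMO_STATE_x   0
#define DEMO_STATE_lod 4
#define DEMO_STATE_w   8

/* If the compiler lays the struct out differently, the build fails here
 * instead of the generated code silently reading the wrong field. */
_Static_assert(offsetof(demo_state_t, x)   == DEMO_STATE_x,   "x offset mismatch");
_Static_assert(offsetof(demo_state_t, lod) == DEMO_STATE_lod, "lod offset mismatch");
_Static_assert(offsetof(demo_state_t, w)   == DEMO_STATE_w,   "w offset mismatch");
```

`_Static_assert` requires C11, and `offsetof()` comes from `<stddef.h>`.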
* Phase 2: Implement pixel loop, stipple test, W-depth, Z-depth, and depth test
Add the core pixel pipeline loop structure to the ARM64 Voodoo JIT:
- Stipple test: both pattern stipple (bit lookup from real_y/x position) and rotating stipple (ROR + TBZ on bit 31)
- Tiled X calculation for tiled framebuffer modes
- W-depth computation using CLZ (ARM64 equivalent of x86 BSR) with proper clamping to 0..0xFFFF
- Z-buffer depth from state->z with SAR 12 and signed clamping
- Depth bias (zaColor addition with 16-bit mask)
- All 8 DEPTHOP modes: NEVER (immediate RET), LESSTHAN, EQUAL, LESSTHANEQUAL, GREATERTHAN, NOTEQUAL, GREATERTHANEQUAL, ALWAYS
- Per-pixel state increments: ib/ig/ir/ia via NEON 4xS32 ADD/SUB, z via GPR, tmu0/tmu1 s/t via NEON 2xD64 ADD/SUB, tmu0/tmu1 w and global w via GPR 64-bit ADD/SUB
- Pixel and texel counter increments
- Forward branch patching macros (PATCH_FORWARD_BCOND, PATCH_FORWARD_B, PATCH_FORWARD_TBxZ, PATCH_FORWARD_CBxZ)
- ARM64 bitmask immediate macros (AND_BITMASK, ANDS_BITMASK, ORR_BITMASK, TST_BITMASK, AND_MASK convenience wrappers)
- Texture fetch placeholder calls for Phase 3 integration
The depth test uses unsigned comparison (depth values are 0..0xFFFF) with appropriate ARM64 condition codes: CS for >=, HI for >, LS for <=, CC for <.
* Add test VM launch helper script
Create scripts/test-with-vm.sh to launch 86Box with the Windows 98 Low End test VM (configured with Voodoo card). Usage: ./scripts/test-with-vm.sh
* Phase 2 complete: mark checklist and update changelog
Runtime test passed:
- Quake 3: black screen (expected)
- 3DMark 99: gray screen (expected)
- No crashes, depth pipeline executes correctly
Color/texture pipeline not yet implemented (Phase 3-4).
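As a scalar reference for the eight DEPTHOP modes in the Phase 2 commit, the following sketch expresses the unsigned depth comparison in plain C. The enum name, its numeric values (assumed to follow the order listed in the commit), and the operand order (incoming depth compared against the stored depth) are assumptions for illustration, not taken from the 86Box sources.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative enum; values assumed to follow the order in the commit text. */
typedef enum {
    DEMO_DEPTHOP_NEVER = 0,
    DEMO_DEPTHOP_LESSTHAN,
    DEMO_DEPTHOP_EQUAL,
    DEMO_DEPTHOP_LESSTHANEQUAL,
    DEMO_DEPTHOP_GREATERTHAN,
    DEMO_DEPTHOP_NOTEQUAL,
    DEMO_DEPTHOP_GREATERTHANEQUAL,
    DEMO_DEPTHOP_ALWAYS
} demo_depthop_t;

/* Returns true if the pixel passes. Both depths are unsigned 16-bit, which
 * is why the JIT picks unsigned condition codes after CMP new, old. */
static bool demo_depth_test(demo_depthop_t op, uint16_t new_depth, uint16_t old_depth)
{
    switch (op) {
    case DEMO_DEPTHOP_NEVER:            return false;
    case DEMO_DEPTHOP_LESSTHAN:         return new_depth <  old_depth; /* CC */
    case DEMO_DEPTHOP_EQUAL:            return new_depth == old_depth; /* EQ */
    case DEMO_DEPTHOP_LESSTHANEQUAL:    return new_depth <= old_depth; /* LS */
    case DEMO_DEPTHOP_GREATERTHAN:      return new_depth >  old_depth; /* HI */
    case DEMO_DEPTHOP_NOTEQUAL:         return new_depth != old_depth; /* NE */
    case DEMO_DEPTHOP_GREATERTHANEQUAL: return new_depth >= old_depth; /* CS */
    case DEMO_DEPTHOP_ALWAYS:           return true;
    }
    return false;
}
```

Using `uint16_t` for both operands matters: a signed comparison would misorder depths at or above 0x8000 if they were ever sign-extended.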
Implement codegen_texture_fetch() and TMU combine paths for ARM64:
- Perspective-correct W division using SDIV (replaces x86 IDIV)
- LOD calculation via CLZ (replaces BSR, inverted: 63-CLZ)
- Point-sampled texture lookup with clamp/wrap S/T
- Bilinear filtered 4-tap blend using NEON (UXTL+MUL+ADD+EXT+USHR+SQXTUN)
- Mirror S/T via TBZ+MVN (replaces TEST+JZ+NOT)
- TMU0-only, TMU1-passthrough, and dual-TMU combine paths
- Dual-TMU tc_mselect/tc_add/tc_invert for RGB + tca_* for alpha
- trexInit1 override path
- Upstream bug at x86 line 1303 (0x8E) NOT ported; correct ADD used
Bug fixes discovered during review:
- Bilinear LDR D addressing: added LSL w4, w4, #2 before 64-bit texel pair loads to convert texel index to byte offset (ARM64 LDR Dt has no LSL #2 option, unlike x86 MOVQ with *4 scaling)
- LOD mantissa shift: changed LSR_REG to LSR_REG_X (64-bit) since the W reciprocal after LSL #8 can exceed 32 bits
- AND_BITMASK for 0xF0 mask: corrected from (N=0,immr=24,imms=27) which was an invalid encoding to (N=0,immr=28,imms=3) per ARM64 logical immediate rules
- Added #include <stddef.h> for offsetof() in _Static_assert checks
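The AND_BITMASK fix above can be checked by decoding the logical-immediate fields by hand. The sketch below handles only the subset relevant here (N=0 with a 32-bit element, meaning imms in 0..30); the function names are hypothetical. In that subset the immediate is a run of imms+1 one bits rotated right by immr.

```c
#include <stdint.h>

/* Rotate a 32-bit value right by r bits (r taken modulo 32). */
static uint32_t demo_ror32(uint32_t v, unsigned r)
{
    r &= 31u;
    return r ? (v >> r) | (v << (32u - r)) : v;
}

/* Decode an A64 logical immediate for N=0 with a 32-bit element size.
 * Returns 0 for field values outside this subset; 0 is never a valid
 * bitmask immediate, so it is usable as an error marker in this sketch. */
static uint32_t demo_decode_bitmask32(unsigned immr, unsigned imms)
{
    if (immr > 31u || imms > 30u)
        return 0;
    uint32_t ones = (1u << (imms + 1u)) - 1u; /* run of imms+1 one bits */
    return demo_ror32(ones, immr);
}
```

With these rules, (immr=28, imms=3) rotates 0x0000000F right by 28 and yields 0xF0, matching the corrected encoding. The earlier pair (immr=24, imms=27) decodes to 0xFFFFFF0F under this rule, so it did not express the intended mask.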
* Phase 4: Color/alpha combine pipeline for ARM64 Voodoo JIT
Implement the complete color and alpha combine stages translating x86-64 codegen lines ~1689-2228 to ARM64/NEON instructions.
Color select pipeline:
- CC_LOCALSELECT_ITER_RGB, CC_LOCALSELECT_TEX, CC_LOCALSELECT_COLOR1
- Local color select override via tex_a bit 7 (TBZ/TBNZ branching)
Chroma key test:
- Compare selected RGB source against params->chromaKey (24-bit mask)
- Skip pixel on match using CBZ forward branch
Alpha pipeline:
- Alpha select: A_SEL_ITER_A (with CLAMP), A_SEL_TEX, A_SEL_COLOR1
- CCA local select: ITER_A, COLOR0, ITER_Z (with CLAMP)
- Alpha mask test via TBZ on bit 0
- Full CCA combine: zero_other, sub_clocal, mselect (ZERO/ALOCAL/AOTHER/ALOCAL2/TEX), reverse_blend, multiply+shift, add, clamp, invert_output
Color combine pipeline:
- cc_zero_other, cc_sub_clocal using NEON 4x16 arithmetic
- cc_mselect: ZERO, CLOCAL, AOTHER, ALOCAL, TEX, TEXRGB
- Reverse blend (XOR with 0xFF + add 1)
- Signed multiply via SMULL+SSHR+SQXTN (3 insns vs 5 on SSE2)
- cc_add (add clocal back)
- SQXTUN pack + cc_invert_output
- Result saved to v13 for fog stage
Fix skip position patching: chroma uses CBZ (PATCH_FORWARD_CBxZ), alpha mask uses TBZ (PATCH_FORWARD_TBxZ).
* Update checklist: mark Phase 4 color/alpha combine items complete
* Add JIT validation logging for ARM64 Voodoo codegen
Add rate-limited diagnostic logging at three critical JIT pipeline points to verify code generation and execution during testing:
1. Cache HIT (first 20 occurrences) - logs block reuse with mode params
2. Code GENERATE (unlimited) - logs every JIT compilation with full config
3. Code EXECUTE (first 50 occurrences) - logs JIT dispatch with coordinates
All logging is gated behind VOODOO_JIT_DEBUG / VOODOO_JIT_DEBUG_EXEC defines (set to 1) and uses pclog() for output to the 86Box log file. Set to 0 to disable before release.
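The chroma key test from the Phase 4 commit reduces to one XOR and one mask, which is why a single CBZ can decide whether to skip the pixel. A minimal scalar sketch, with a hypothetical function name and an assumed 0x00RRGGBB layout:

```c
#include <stdbool.h>
#include <stdint.h>

/* True when the low 24 bits (RGB) of the source equal the key. The top
 * byte is ignored, mirroring the 24-bit mask mentioned in the commit.
 * A zero result after XOR+AND is what a CBZ would branch on. */
static bool demo_chroma_key_match(uint32_t src_rgb, uint32_t chroma_key)
{
    return ((src_rgb ^ chroma_key) & 0x00FFFFFFu) == 0;
}
```

A caller would skip the pixel's framebuffer write whenever this returns true.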
* Fix Phase 1 stack frame size and comment
Validation found that the prologue comment claimed d14/d15 were saved at SP-144, but the actual code only saves d8-d13 (3 NEON register pairs). Since v14/v15 are never used in the generated code, this reduces the frame size from 144 to 128 bytes, saving 16 bytes per JIT call.
Changes:
- Remove misleading "SP-144: d14, d15" comment
- Reduce frame size from 144 to 128 bytes (prologue and epilogue)
- Stack remains 16-byte aligned (128 = 8 × 16)
Addresses finding from voodoo-arch validation of Phase 1.
* Update changelog: Phase 4 validation and frame size fix
* Update checklist: mark validation complete for phases 1-4
All 4 phases validated against official 3dfx specifications:
- Phase 1: ARM64 ABI compliance verified, frame size optimized
- Phase 2: All depth test modes validated
- Phase 3: Texture pipeline validated, upstream bug fix verified
- Phase 4: Color/alpha combine validated, ARM64 improvements noted
* Phase 5+6: Complete Voodoo ARM64 pixel pipeline
Implements the full remaining pixel rendering pipeline:
**Phase 5 (Fog + Alpha Test + Blend):**
- Fog: FOG_CONSTANT, FOG_ADD, FOG_MULT modes
- Fog sources: w_depth table lookup, Z, alpha, W
- Alpha test: all 8 AFUNC comparison modes (NEVER, LESSTHAN, EQUAL,
LESSTHANEQUAL, GREATERTHAN, NOTEQUAL, GREATERTHANEQUAL, ALWAYS)
- Alpha blend: dest_afunc (9 modes: AZERO, ASRC_ALPHA, A_COLOR,
ADST_ALPHA, AONE, AOMSRC_ALPHA, AOM_COLOR, AOMDST_ALPHA,
ACOLORBEFOREFOG)
- Alpha blend: src_afunc (9 modes: AZERO, ASRC_ALPHA, A_COLOR,
ADST_ALPHA, AONE, AOMSRC_ALPHA, AOM_COLOR, AOMDST_ALPHA, ASATURATE)
**Phase 6 (Framebuffer Write + Dithering):**
- Dithering: 4x4 and 2x2 dither pattern support
- RGB565 pack with dither or direct shift-and-mask
- Framebuffer write: linear and tiled addressing modes
- Depth write: alpha-buffer and non-alpha-buffer paths
- Per-pixel state increments: dRdX, dGdX, dBdX, dAdX, dZdX, dSdX,
dTdX, dWdX (for both TMUs + global W)
- Pixel and texel counter updates
Total addition: 677 lines of ARM64 codegen completing the full
scanline rasterizer from prologue through per-pixel write-back.
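For the "direct shift-and-mask" RGB565 path in Phase 6, the scalar equivalent is short enough to state in full. This is a sketch of the undithered case only; the function name is hypothetical and the dithered path (which uses lookup tables) is not shown.

```c
#include <stdint.h>

/* Pack 8-bit R, G, B into RGB565 by dropping the low bits of each channel:
 * red keeps its top 5 bits, green its top 6, blue its top 5. */
static uint16_t demo_pack_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
```

For example, pure red (255, 0, 0) packs to 0xF800 and pure green (0, 255, 0) to 0x07E0.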
* Add JIT debug logging runtime toggle to Voodoo card settings
Replace compile-time #define VOODOO_JIT_DEBUG with a CONFIG_BINARY UI
checkbox ("JIT Debug Logging", default OFF). When enabled, opens
<vm_dir>/voodoo_jit.log and writes all JIT GENERATE/EXECUTE/HIT
diagnostics there via fprintf. When disabled, no file is created and
no logging overhead. The toggle is purely observational and never
affects the JIT-vs-interpreter control flow.
* Add debugging docs, test script, and gitignore updates
* Fix Voodoo 2 detection failure when Dynamic Recompiler is enabled
Move jit_debug config entry outside the #ifndef NO_CODEGEN guard so the
voodoo_config[] array structure is stable regardless of codegen state.
Previously, both recompiler and jit_debug entries were conditionally
compiled, causing config field index misalignment when loading VMs saved
with a different codegen setting.
* Move jit_debug config outside NO_CODEGEN guard in Banshee configs
Apply the same fix from the voodoo_config[] array (a1163d6) to
banshee_sgram_config[], banshee_sgram_16mbonly_config[], and
banshee_sdram_config[]. The jit_debug entry is now unconditionally
present in all four device config arrays for consistency.
* Fix Voodoo 2 non-perspective texture alignment bug + add JIT verify mode
ARM64 STR_X to STATE_tex_s (offset 188) silently encoded as offset 184
due to unsigned-offset 8-byte alignment truncation. The non-perspective
texture path wrote tex_s to the wrong location, causing every pixel to
sample from texture column S=0. Fixed by using STR_W (4-byte aligned).
Also adds JIT verification mode (jit_debug=2) that runs both JIT and
interpreter per scanline and compares pixel output for debugging.
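The STR_X truncation described above follows from how the unsigned-offset form stores its immediate: the byte offset is divided by the access size, so an offset that is not a multiple of that size loses its low bits. The sketch below reproduces the arithmetic and shows one way an encoder could refuse such offsets. The helper names are hypothetical; the opcode bases 0xF9000000 (STR Xt) and 0xB9000000 (STR Wt) are the standard A64 unsigned-offset encodings.

```c
#include <stdint.h>

/* Byte offset actually addressed when imm12 is computed as
 * offset >> size_log2 and the remainder is silently dropped. */
static uint32_t demo_effective_offset(uint32_t offset, unsigned size_log2)
{
    return (offset >> size_log2) << size_log2;
}

/* STR Xt, [Xn, #offset]: offset must be a multiple of 8 and at most 32760.
 * Returns 0 (not a valid STR encoding) when the offset is not encodable. */
static uint32_t demo_str_x_uoff(unsigned rt, unsigned rn, uint32_t offset)
{
    if ((offset & 7u) != 0 || (offset >> 3) > 4095u)
        return 0;
    return 0xF9000000u | ((offset >> 3) << 10) | ((rn & 31u) << 5) | (rt & 31u);
}

/* STR Wt, [Xn, #offset]: offset must be a multiple of 4 and at most 16380. */
static uint32_t demo_str_w_uoff(unsigned rt, unsigned rn, uint32_t offset)
{
    if ((offset & 3u) != 0 || (offset >> 2) > 4095u)
        return 0;
    return 0xB9000000u | ((offset >> 2) << 10) | ((rn & 31u) << 5) | (rt & 31u);
}
```

With offset 188, the 8-byte form lands on 184 (188 / 8 = 23, 23 × 8 = 184), while the 4-byte form encodes 188 exactly (188 / 4 = 47).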
* Support generic ARM64 (Linux, Windows) for Voodoo JIT codegen
The ARM64 Voodoo JIT was already mostly portable: W^X calls were behind __APPLE__ guards, and the NO_CODEGEN gate already included _M_ARM64. The only missing piece was a Windows ARM64 I-cache flush path using FlushInstructionCache instead of GCC/Clang's __clear_cache builtin.
* Add comprehensive testing and technical documentation
- Add TESTING-GUIDE.md: User-facing build/test guide for macOS ARM64
  - Prerequisites, build instructions, VM setup
  - Voodoo configuration (Dynamic Recompiler toggle, JIT debug logging)
  - Testing matrix, what to look for, issue reporting
- Add ARM64-CODEGEN-TECHNICAL.md: Deep technical reference
  - Architecture overview, register allocation, encoding macros
  - Complete pipeline phases walkthrough (prologue through epilogue)
  - Key differences from x86-64, known issues, maintenance guide
- Archive old planning/debug docs to voodoo-arm64-port/archive/
  - Moved 11 working docs to archive (checklist, debug sessions, etc.)
  - Keeps main directory clean with only user-facing documentation
* Remove hardcoded username from VM paths
- Replace /Users/anthony with $HOME in test-with-vm.sh
- Replace /Users/anthony with ~ in archived debugging doc
- Makes repo safe to share publicly without personal info
- Clarify Voodoo 1/2 require separate 2D card as primary - Clarify Banshee/Voodoo 3 are set as primary video card directly - Note only one Voodoo card can be used at a time - Fix step numbering (was skipping 4)
Replace remaining x86-64 mnemonics (CMOVS, CMOVAE, IMUL, PADDW, SHL, SHR, SAR) with ARM64 equivalents (CSEL, MUL, ADD, LSL, LSR, ASR). Fix register references (EBX → w5, xmm_00_ff_w → neon_00_ff_w). Correct alpha blend shift comments to reflect doubled input values.
- Fix approximate line reference (~51 → ~45) for register assignment block - Update stale voodoo_generate() docblock to describe all 6 phases - Document all three rounds of comment audits in CHANGELOG.md
The ARM64 codegen used file-scope statics (last_block, next_block_to_write, voodoo_jit_hit/gen_count, voodoo_recomp) that are shared across all Voodoo instances and render threads. This creates race conditions when multiple render threads (up to 4) access them concurrently via odd_even indexing. Add 7 new fields to voodoo_t (jit_last_block, jit_next_block_to_write, jit_recomp, jit_hit_count, jit_gen_count, jit_exec_count, jit_verify_mismatches) and update voodoo_get_block(), voodoo_codegen_init(), and render loop references to use instance state. Also includes previously uncommitted bounds-checking additions to ARM64 code emission and branch patching macros (arm64_codegen_check_emit_bounds, arm64_codegen_check_patch_pos, arm64_codegen_check_branch_offset).
Add verification notes to changelog documenting successful build and VM launch after refactoring cache/counter state from globals to per-instance voodoo_t.
- Add bounds-checked code emission with interpreter fallback on overflow instead of fatal() process abort. Overflow slots are memoized as "rejected" to avoid repeated regen churn on the same key. - Fix cache probe order: lookup now starts from jit_last_block hint and wraps, restoring intended locality behavior. - Convert JIT debug/recomp counters to ATOMIC_INT to reduce races between render threads. - Add per-slot valid/rejected flags to prevent reuse of partial blocks. - Correct v12-v15 register comment to match actual save/restore (d12-d13).
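The probe order and the valid/rejected flags described above can be sketched as a small lookup routine. Everything here is hypothetical (slot layout, the single `key` field, the return codes); the real cache compares a full set of mode parameters. The sketch only shows starting at the hint, wrapping around, and treating a memoized "rejected" slot as a signal to fall back to the interpreter without regenerating.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_CACHE_SLOTS 8

typedef struct {
    uint32_t key;      /* stands in for the full pipeline-config key */
    bool     valid;    /* block was completely emitted */
    bool     rejected; /* emission overflowed; do not retry this key */
} demo_slot_t;

/* Returns the slot index on a hit, -1 on a miss (caller may generate),
 * or -2 when the key is memoized as rejected (caller uses the interpreter).
 * hint is assumed to be in 0..DEMO_CACHE_SLOTS-1. */
static int demo_cache_probe(const demo_slot_t *slots, int hint, uint32_t key)
{
    for (int i = 0; i < DEMO_CACHE_SLOTS; i++) {
        int idx = (hint + i) % DEMO_CACHE_SLOTS; /* start at hint, wrap */
        if (slots[idx].key != key)
            continue;
        if (slots[idx].rejected)
            return -2;
        if (slots[idx].valid)
            return idx;
    }
    return -1;
}
```

Starting at the most recently used slot means a scanline that reuses the previous configuration hits on the first comparison.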
Add comprehensive comments documenting 3D graphics concepts at each pipeline stage, algorithm explanations, and variable origin notes. One code change: cc_mselect == 1 → CC_MSELECT_CLOCAL (named constant).
Corrected interpreter FPS baselines (were far too low), reframed JIT benefit around CPU utilization reduction (20-50%) rather than overstated 8x FPS gains. Replaced defunct 3dfxarchive.com with VOGONS Wiki and VOGONS Vintage Driver Library.
Two-command workflow for new contributors: - `deps`: installs all required Homebrew packages - `build`: clean configure + build + ad-hoc codesign with JIT entitlements
JIT debug logging can produce hundreds of MB in minutes — warn users to only enable when actively debugging and clean up afterwards.
Documents all macro consolidation changes, what was/wasn't changed, before/after examples, deferral rationale, and R7 endianness verification.
Docker-based build system for fully-bundled AppImages targeting Raspberry Pi 5 and other aarch64 Linux systems: - Dockerfile: Debian Bullseye ARM64 build environment - AppImageBuilder.yml: appimage-builder recipe matching official 86Box - build.sh: host-side launcher with persistent cmake cache volume - README.md: step-by-step build guide - .gitignore: excludes output/ directory Bundles glibc 2.31, Qt5 + Wayland, all runtime deps — no install needed on target. Reference added to testing guide.
Compile OpenAL 1.23.1, rtmidi 4.0.0, FluidSynth 2.3.0, and SDL2 2.0.20 from source with minimal features to eliminate transitive dependencies (libnsl, libjack, libpulse) that cause runtime failures on Fedora. - Dockerfile: add build deps for source compilation, pin appimage-builder to same commit as official (22fefa2), remove distro -dev packages for the four libs - build-appimage.sh: add Step 0 to compile libs with guards for rebuild caching, set PKG_CONFIG_PATH for custom libs, copy .so files into AppDir, use source tree .desktop and .metainfo.xml - AppImageBuilder.yml: trim ~40 packages (crypto/TLS, systemd/udev, excess Qt modules, image format libs, cups/avahi), add file exclusions matching official, add required comp: gzip field - README.md: update bundled libraries documentation Tested on Fedora ARM64 (no missing lib errors) and Raspberry Pi OS. AppImage size reduced from 77 MB to 65 MB.
New scripts/analyze-jit-log.py performs automated health analysis of Voodoo JIT debug logs — single streaming pass handles multi-GB logs, checks compilation stats, interpreter fallbacks, pipeline coverage, pixel output quality, and produces a summary verdict. Also adds a VOODOO JIT: INIT line at the start of debug logs (ARM64 only) recording render_threads, use_recompiler, and jit_debug level.
Document the analyze-jit-log.py script in the debug logging section and update the issue reporting workflow to run the analyzer first.
Remove two dead code emissions identified by the codegen audit: - MINOR-1: MOV_V(5, 1) in dual-TMU combine saved v1 into v5 but v5 was never read back. ARM64 SMULL_4S_4H doesn't clobber inputs like x86-64's PMULLW sequence, so the save was unnecessary. - MINOR-3: MOVI_V2D_ZERO(2) zeroed v2 as an unpacking constant but v2 was never referenced. ARM64 uses UXTL for byte-to-halfword widening instead of x86-64's PUNPCKLBW which needs a zero register. Saves 8 bytes per generated code block. Update audit report to mark both items as resolved.
H7: replace 14 LDR d16 (alookup[1]) loads with pinned v8 register H8: eliminate 14 redundant MOV v17 before USHR (use different Rd/Rn) Saves 28 instructions per alpha-blended pixel, no functional change.
22GB log (323M lines) analyzed in 38s — VERDICT: HEALTHY. 75,041 blocks, 276 pipeline configs, 5.2B pixels, 0 errors.
…+M5+M6+L1+L2) LDP pairing at 4 sites (M1), hoist fogColor to v11 replacing dead neon_minus_254 (M3), extract TMU0 alpha once in dual-TMU TCA (M5), BFI for no-dither RGB565 packing (M6), CBZ for tmu_w guard (L1), eliminate MOV w11,w7 in texture fetch (L2). New encoding macros: ARM64_BFI, ARM64_CBZ_X_PLACEHOLDER, ARM64_CBNZ_X_PLACEHOLDER. Tested: Q3, Turok, 3DMark99, 3DMark2000, UT99. 68k blocks compiled, 4.4B pixels, 274 pipeline configs, zero errors. Full feature parity audit against x86-64 reference confirmed.
…8: R2-24+R2-25+R2-13+R2-27) - Move STATE_x LDR before loop — redundant reload every iteration (R2-24) - Eliminate MOV w4,w28 in loop control — use w28 directly, reorder CMP before MOV (R2-25) - Remove redundant MOV v16,v0 in color combine multiply — v0 unmodified before SMULL (R2-13) - MVN directly from w28 in stipple pattern — eliminate intermediate MOV copy (R2-27) Saves ~3 instructions/pixel unconditionally, +1 in stipple paths.
…t Batch 8 Fix: TMU0 alpha extraction (FMOV w13, s7 + LSR w13, 24) was placed at the TCA section start, AFTER the tca_sub_clocal block that reads w13. Moved extraction before the SMULL so w13 is valid when first read. This caused rendering artifacts when tca_sub_clocal was active in dual-TMU texture combine paths. Also reverts Batch 8 (R2-24/R2-25/R2-13/R2-27) pending re-validation. LDP pairings (Batch 7/M1) verified correct against struct layout.
…r counter Batch 8 (Round 2 audit): R2-24 (STATE_x LDR before loop), R2-25 (eliminate MOV w4,w28 in loop control), R2-13 (remove redundant MOV v16,v0 in cc multiply), R2-27 (MVN directly from w28 in stipple). ~4 insns/pixel saved. Debug logging: removed the < 50 cap on EXECUTE/POST logging so full session logs are generated. Added jit_interp_count to track interpreter fallback executions, with a summary line at shutdown (jit_exec vs interp_exec vs gen). Tested: 5.5B pixels, 351 configs, 174M JIT executions, 0 errors (Voodoo 3).
Consolidated per-batch and per-category instruction savings table. ~84-105 insns/pixel removed across 8 batches. Updated Batch 8 status to DONE with commit hash.
Batch 9, 10, D, JIT cache expansion, perf benchmarking, debug logging levels — with savings estimates, risk levels, and recommended order.
Real workloads hit 180-351 unique pipeline configurations, causing massive cache thrashing with only 8 slots (258K recompilations for 351 configs in Q3). Expanding to 32 slots captures the full working set after warmup. Memory cost: 512KB -> 2MB (well under 4MB budget).
…+R2-08) R2-07: Eliminate ebp_store memory round-trip in bilinear path. Hold bilinear lookup index in w17 (IP1 scratch) instead of STR/LDR through STATE_ebp_store. Saves 1 STR + 1 LDR per bilinear-textured pixel. R2-12: Replace 8 FMOV+DUP_V4H_LANE(x,x,0) pairs with single DUP_V4H_GPR. All sites broadcast values in 0-255 range (alpha, LOD frac, detail blend), so 16-bit GPR-to-vector broadcast is semantically identical. Sites: TC_MSELECT_DETAIL (TMU0/1), TC_MSELECT_LOD_FRAC (TMU0/1), CC_MSELECT_AOTHER, CC_MSELECT_ALOCAL (both paths), CC_MSELECT_TEX. R2-08: Cache original LOD in w11 before ADD w6,w6,#4 in point-sample path. Eliminates LDR w11,[x0,#STATE_lod] reload in S clamp/wrap section.
R2-23: EMIT_MOV_IMM64, EMIT_LOAD_NEON_CONST, and dither_rb_addr now skip zero halfwords -- MOVZ targets the first non-zero halfword, then MOVK only for remaining non-zero ones. Saves ~10 instructions per block across 4 sites (7 prologue ptrs + 3 NEON consts + 1 dither ptr). New ARM64_MOVZ_X_HW(d, imm16, hw) encoding macro added.
R2-09: (1<<48) dividend for perspective W division now uses a single MOVZ X4, #1, LSL #48 instead of 4 instructions (MOVZ #0 + 3 MOVK). Saves 3 instructions per perspective-textured block.
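The zero-halfword skipping in R2-23 can be sketched as a small emitter. The function name and output-buffer convention are hypothetical and differ from the macro-based emitter in the PR; the opcode bases 0xD2800000 (MOVZ Xd) and 0xF2800000 (MOVK Xd) are the standard A64 encodings, with the halfword index in bits 22:21 and the 16-bit immediate in bits 20:5.

```c
#include <stddef.h>
#include <stdint.h>

/* Emit the shortest MOVZ/MOVK sequence that loads value into Xrd, skipping
 * halfwords that are zero. Writes up to 4 instruction words to out and
 * returns how many were written. */
static size_t demo_emit_mov_imm64(uint32_t out[4], unsigned rd, uint64_t value)
{
    size_t n = 0;
    for (unsigned hw = 0; hw < 4; hw++) {
        uint32_t imm16 = (uint32_t)((value >> (16u * hw)) & 0xFFFFu);
        if (imm16 == 0)
            continue; /* MOVZ already zeroes every other halfword */
        /* First non-zero halfword uses MOVZ; later ones use MOVK. */
        uint32_t base = (n == 0) ? 0xD2800000u : 0xF2800000u;
        out[n++] = base | (hw << 21) | (imm16 << 5) | (rd & 31u);
    }
    if (n == 0)
        out[n++] = 0xD2800000u | (rd & 31u); /* value == 0: MOVZ Xd, #0 */
    return n;
}
```

For the R2-09 case, `1ULL << 48` into X4 produces exactly one word, 0xD2E00024, which is `MOVZ X4, #1, LSL #48`.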
FOG_W: the interpreter computes fog_a = (w >> 32) & 0xff, masking to the low byte before clamping (making clamp a no-op). The JIT was doing CLAMP(w >> 32) without the mask, so when w >> 32 exceeded 255 the JIT would saturate to 0xFF while the interpreter wrapped to the low byte. This caused 20 VERIFY MISMATCH errors (224 differing pixels across 844M rendered) in fog-W pipeline configs. Fix: replace the 4-instruction BIC+CMP+CSEL clamp with a single AND #0xFF, matching the interpreter exactly. FOG_Z: the interpreter uses (z >> 20) & 0xff but the JIT (copied from x86-64) was using z >> 12. Correct the shift to match the interpreter. This path was not triggered by the current test suite but is a latent correctness bug.
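The FOG_W divergence is easy to reproduce in scalar C. The two hypothetical helpers below contrast the interpreter's mask (as quoted in the commit) with the clamp the JIT previously performed; they agree whenever `w >> 32` is in 0..255 and differ otherwise.

```c
#include <stdint.h>

/* Interpreter behaviour per the commit text: keep only the low byte of
 * w >> 32. The cast avoids shifting a negative signed value. */
static int demo_fog_a_masked(int64_t w)
{
    return (int)(((uint64_t)w >> 32) & 0xffu);
}

/* Previous JIT behaviour: saturate w >> 32 to the range 0..255. */
static int demo_fog_a_clamped(int64_t w)
{
    if (w < 0)
        return 0;
    int64_t v = w >> 32;
    return v > 255 ? 255 : (int)v;
}
```

With `w = 0x0000012300000000`, the shifted value is 0x123: the mask yields 0x23 while the clamp yields 0xFF, which is the kind of mismatch verify mode reported.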
Verify mode (jit_debug=2) fixes: - Add EXECUTE/POST/PIXEL logging inside verify block (was skipped due to !jit_verify_active guard on the normal JIT path) - Replace alloca() with malloc()/free() for verify buffers to prevent stack overflow (SIGBUS) on long runs - Both changes are verify-mode-only, no impact on normal JIT operation Enhanced log analyzer (scripts/analyze-jit-log.c): - Per-fogMode mismatch breakdown with fog enabled/disabled annotation - Top 10 pipeline config mismatch table - Diff magnitude distribution (±0-1, ±2-3, ±4-6, ±7+ buckets) - Max absolute |dR|, |dG|, |dB| tracking - Match rate percentage calculation - All counters are per-worker thread, merged after scan Documentation: - verify-mismatch-analysis.md: full root cause analysis concluding all verify mismatches are test harness artifacts, not real JIT bugs. Documents x86-64 JIT's divide-by-2 fog rounding technique (inherited by ARM64 port) with line-level source citations. - CHANGELOG.md: complete verify mode debugging chronicle - TESTING-GUIDE.md: expanded verify mode section with known limitations and recommended validation workflow - optimization-summary.md: added verify mode findings, updated remaining work items (cache expansion and debug levels now done) - Preserved raw data: final-validation-log.txt, verify-mismatch-pre-fog-fix.txt
…fied 4 parallel verification agents analyzed suspected JIT-interpreter differences: - Finding 1 (alpha blend /255): FALSE ALARM — math is identical - Finding 2 (TMU1 RGB negate ordering): real ±1, inherited from x86-64 - Finding 3 (AFUNC_ASATURATE): real bug in INTERPRETER (dest_r instead of src_r) - Finding 4 (zaColor depth bias): real clamp-vs-truncate, inherited from x86-64 Zero ARM64-specific bugs found. ASATURATE interpreter fix pending.
…h interpreter Two accuracy fixes making the ARM64 JIT more accurate than the x86-64 JIT: Finding 2: TMU1 tc_sub_clocal negate ordering — the interpreter computes (-clocal * factor) >> 8 (negate first), both JITs computed -(clocal * factor >> 8) (negate last), producing ±1 when clocal*factor % 256 != 0. Fixed by negating clocal before the widening multiply. Same instruction count (4). Finding 4: zaColor depth bias clamp — the interpreter uses CLAMP16() to clamp depth+bias to [0, 0xFFFF], both JITs truncated via AND/UXTH (wrapping). Fixed with SXTH sign-extension + CMP/CSEL clamp sequence matching the interpreter. Verify mode: mismatches dropped 68% (2.9M→926K events), differing pixels dropped 74% (22.3M→5.8M), match rate improved 99.24%→99.61%. jit_debug=1 remains HEALTHY (zero errors).
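Both findings come down to small arithmetic identities that can be checked in scalar C. The helpers below are hypothetical illustrations, not code from either JIT: the first pair shows why negating before or after an arithmetic right shift can differ by one, and the second pair contrasts clamping with 16-bit truncation. `demo_asr8` computes floor(x / 256) explicitly so the sketch does not depend on how a compiler shifts negative values.

```c
#include <stdint.h>

/* floor(x / 256), i.e. what an arithmetic shift right by 8 computes. */
static int32_t demo_asr8(int32_t x)
{
    return x >= 0 ? (x >> 8) : -(((-x) + 255) >> 8);
}

/* Interpreter order per the commit: negate first, then multiply and shift. */
static int32_t demo_negate_first(int32_t clocal, int32_t factor)
{
    return demo_asr8(-clocal * factor);
}

/* Previous JIT order: multiply and shift, then negate. */
static int32_t demo_negate_last(int32_t clocal, int32_t factor)
{
    return -demo_asr8(clocal * factor);
}

/* Clamp to [0, 0xFFFF], as CLAMP16() is described in the commit. */
static int32_t demo_clamp16(int32_t v)
{
    return v < 0 ? 0 : (v > 0xFFFF ? 0xFFFF : v);
}

/* Truncate to 16 bits, wrapping on overflow (the AND/UXTH behaviour). */
static int32_t demo_truncate16(int32_t v)
{
    return v & 0xFFFF;
}
```

The two negate orders agree when `clocal * factor` is a multiple of 256 and differ by one otherwise, because the shift rounds toward negative infinity rather than toward zero.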
Remove Linux ARM64 AppImage bullet from README (not distributing builds). Move completed optimization/audit docs to voodoo-arm64-port/archive/. Delete README2.md draft.
- scripts/analyze-jit-log.c: POSIX version (mmap + pthreads) - scripts/analyze-jit-log-win32.c: Win32 version (CreateFileMapping + _beginthreadex) - scripts/README-jit-analyzer.md: usage documentation
Clean up comments in vid_voodoo_codegen_arm64.h (no code changes): - Remove optimization batch/tracking IDs (Batch N/XX, RN-XX, H4, M1, etc.) - Delete stale comments referencing removed code - Rename leftover x86-64 xmm_ prefixes to neon_ to match actual variable names - Fix incorrect bit-range notation in RGB565 packing comment - Fix misidentified register values and alignment claims - Clarify opaque internal references and ambiguous notation - Restore line breaks collapsed by earlier comment deletions
Pad params_read_idx, params_write_idx, and render_voodoo_busy to separate 128-byte cache lines on ARM64 to eliminate false sharing between render threads and the FIFO/CPU thread. Add a short spin-wait (256 yield iterations) in render_thread() before sleeping to absorb burst triangle submissions without context-switch overhead. All changes guarded behind __aarch64__ || _M_ARM64 — zero impact on x86-64 or other platforms.
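One way to express this kind of padding in C11 is with `_Alignas` on each hot field. The struct below is a hypothetical layout for illustration; the PR may use explicit padding arrays or a different arrangement inside `voodoo_t`. Only the three field names and the 128-byte figure come from the commit text.

```c
#include <stddef.h>

#define DEMO_CACHE_LINE 128

/* Each index starts on its own 128-byte boundary, so a write to one by the
 * FIFO/CPU thread does not invalidate the line holding another that a
 * render thread is polling. */
typedef struct demo_render_sync_t {
    _Alignas(DEMO_CACHE_LINE) volatile int params_read_idx;
    _Alignas(DEMO_CACHE_LINE) volatile int params_write_idx;
    _Alignas(DEMO_CACHE_LINE) volatile int render_voodoo_busy;
} demo_render_sync_t;

_Static_assert(offsetof(demo_render_sync_t, params_write_idx) -
               offsetof(demo_render_sync_t, params_read_idx) == DEMO_CACHE_LINE,
               "read/write indices must not share a cache line");
_Static_assert(sizeof(demo_render_sync_t) == 3 * DEMO_CACHE_LINE,
               "struct should occupy exactly three padded lines");
```

The cost is memory: three `int` fields grow from 12 bytes to 384, which is why padding like this is usually reserved for fields written by one thread and polled by another.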
Document render_threads = 1 as recommended for ARM64 JIT on Apple Silicon. CPU emulation runs on E-cores which get starved when multiple render threads heavily load P-cores. Note that symmetric ARM64 platforms (RPi 4/5) may behave differently.
Three JIT infrastructure improvements:
1. Move DEPTHOP_NEVER and AFUNC_NEVER checks before the prologue. Previously emitted bare RET after 20 registers were saved and SP decremented by 176 bytes, an ABI violation. Now emits RET before any register saves, making it correct.
2. Replace round-robin cache eviction with LRU. Each slot carries a last_used timestamp from a per-partition monotonic counter. On miss, the slot with the smallest timestamp is evicted. Rejected slots get last_used=0 for immediate eviction. Array layout changed from interleaved (stride-4) to contiguous per-partition for better cache locality. voodoo_generate() now returns actual code size; I-cache flush narrowed from full BLOCK_SIZE to actual emitted bytes.
3. Skip redundant mprotect toggles on Linux ARM64 where JIT pages are allocated RWX. The arm64_jit_rwx flag short-circuits set_writable and set_executable, avoiding TLB shootdown overhead. No-op on macOS and Windows where W^X is enforced.
Also: log analyzer updated for new GENERATE format (code_size= replaces block=, output shows min/avg/max code size stats), technical reference updated, comment fixes applied.
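The LRU policy in item 2 can be sketched in a few lines. The slot struct, partition size, and function names are hypothetical; only the rule itself (evict the smallest `last_used`, stamp slots from a monotonic counter, give rejected slots a timestamp of 0) is taken from the commit text.

```c
#include <stdint.h>

#define DEMO_LRU_SLOTS 4

typedef struct {
    uint32_t key;
    uint64_t last_used; /* 0 means "evict me first" (e.g. a rejected slot) */
} demo_lru_slot_t;

/* Pick the slot with the smallest timestamp; ties go to the lowest index. */
static int demo_lru_victim(const demo_lru_slot_t *slots)
{
    int victim = 0;
    for (int i = 1; i < DEMO_LRU_SLOTS; i++) {
        if (slots[i].last_used < slots[victim].last_used)
            victim = i;
    }
    return victim;
}

/* Stamp a slot on every hit or fill. The counter is pre-incremented so a
 * live slot never carries the reserved value 0. */
static void demo_lru_touch(demo_lru_slot_t *slot, uint64_t *counter)
{
    slot->last_used = ++(*counter);
}
```

Keeping one counter per partition, as the commit describes, means each render thread's slots age independently of the others.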
… voodoo_t Fixes shared LRU eviction state between SLI cards — file-static jit_generation[4] meant the second card's init would reset the first card's LRU timestamps. Now each voodoo_t owns its own counters.
Summary
Briefly describe what you are submitting.
Checklist
References
Provide links to datasheets or other documentation that helped you implement this pull request.