ARM64 Voodoo recompiler by skiretic#6887

Merged
OBattler merged 68 commits into 86Box:master from skiretic:master
Mar 8, 2026
Conversation

@OBattler OBattler commented Mar 8, 2026

Member

Summary

Briefly describe what you are submitting.

Checklist

References

Provide links to datasheets or other documentation that helped you implement this pull request.

skiretic and others added 30 commits February 15, 2026 18:46
* Add ARM64 Voodoo JIT codegen scaffolding (Phase 1)

Create vid_voodoo_codegen_arm64.h with:
- voodoo_arm64_data_t struct mirroring x86-64 layout
- 292 ARM64 instruction encoding macros (GPR + NEON)
- Struct offset constants for JIT field access
- NEON lookup tables (alookup, aminuslookup, bilinear_lookup)
- voodoo_codegen_init/close for executable memory management
- voodoo_get_block with W^X toggle + I-cache flush
- voodoo_generate with prologue/epilogue (save/restore callee-saved
  GPRs x19-x28, FP/LR, NEON d8-d13; load pinned constants; pixel
  loop skeleton with x-coordinate increment and loop branch)

Guard changes:
- vid_voodoo_render.h: add __aarch64__/_M_ARM64 to NO_CODEGEN gate
- vid_voodoo_render.c: add ARM64 include path for new codegen header

Also add planning docs, build script, and changelog.


* Add compile-time static assertions for all struct offset constants

Verify all 50 STATE_* and PARAMS_* offset constants against actual
offsetof() values using _Static_assert. This catches any layout
differences between assumed and actual struct layouts at compile time.

All assertions pass on ARM64 (Apple Silicon, LP64).
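
The offset-check technique can be sketched on a small stand-in struct. Everything named `demo_*` or `DEMO_*` below is hypothetical and only illustrates the pattern; the real `STATE_*` and `PARAMS_*` constants and struct layouts are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical state struct standing in for the real Voodoo state. */
typedef struct {
    int32_t x;    /* expected at byte offset 0 */
    int32_t pad;  /* expected at byte offset 4 */
    int64_t z;    /* expected at byte offset 8 */
} demo_state_t;

/* Hand-written offsets of the kind a JIT bakes into load/store encodings. */
#define DEMO_STATE_x 0
#define DEMO_STATE_z 8

/* Compilation fails if the assumed offsets drift from the real layout. */
_Static_assert(offsetof(demo_state_t, x) == DEMO_STATE_x, "x offset mismatch");
_Static_assert(offsetof(demo_state_t, z) == DEMO_STATE_z, "z offset mismatch");

/* Read field z through the raw offset, as generated code would. */
static int64_t demo_load_z(const demo_state_t *s) {
    assert(s != NULL);
    int64_t z;
    memcpy(&z, (const unsigned char *)s + DEMO_STATE_z, sizeof z);
    return z;
}
```

Because the check happens at compile time, a layout change breaks the build instead of silently corrupting state at run time.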


* Update checklist: mark Phase 1 items complete (build verified)

All scaffolding items done except runtime test. Build passes with
all struct offset assertions verified at compile time.


* Mark agent verification checklist item complete


* Fix missing stdint.h include in ARM64 codegen header

Add #include <stdint.h> to vid_voodoo_codegen_arm64.h to resolve
uint8_t, uint16_t, uint32_t type errors. Build now passes cleanly.


* Update changelog with Phase 1 build fix

Document stdint.h fix and build verification.


* Phase 1 complete: runtime test passed

- Add clean-build-and-sign.sh script for full clean builds
- Mark Phase 1 runtime test as complete in checklist
- Document runtime test success in changelog

Runtime test verified: emulator launches, Voodoo initializes,
rendering falls through to interpreter (as expected - pixel
pipeline not yet implemented).

Phase 1 is now 100% complete. Ready for Phase 2.
* Phase 2: Implement pixel loop, stipple test, W-depth, Z-depth, and depth test

Add the core pixel pipeline loop structure to the ARM64 Voodoo JIT:

- Stipple test: both pattern stipple (bit lookup from real_y/x position)
  and rotating stipple (ROR + TBZ on bit 31)
- Tiled X calculation for tiled framebuffer modes
- W-depth computation using CLZ (ARM64 equivalent of x86 BSR) with proper
  clamping to 0..0xFFFF
- Z-buffer depth from state->z with SAR 12 and signed clamping
- Depth bias (zaColor addition with 16-bit mask)
- All 8 DEPTHOP modes: NEVER (immediate RET), LESSTHAN, EQUAL,
  LESSTHANEQUAL, GREATERTHAN, NOTEQUAL, GREATERTHANEQUAL, ALWAYS
- Per-pixel state increments: ib/ig/ir/ia via NEON 4xS32 ADD/SUB,
  z via GPR, tmu0/tmu1 s/t via NEON 2xD64 ADD/SUB, tmu0/tmu1 w and
  global w via GPR 64-bit ADD/SUB
- Pixel and texel counter increments
- Forward branch patching macros (PATCH_FORWARD_BCOND, PATCH_FORWARD_B,
  PATCH_FORWARD_TBxZ, PATCH_FORWARD_CBxZ)
- ARM64 bitmask immediate macros (AND_BITMASK, ANDS_BITMASK, ORR_BITMASK,
  TST_BITMASK, AND_MASK convenience wrappers)
- Texture fetch placeholder calls for Phase 3 integration

The depth test uses unsigned comparison (depth values are 0..0xFFFF) with
appropriate ARM64 condition codes: CS for >=, HI for >, LS for <=, CC for <.
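
As a rough illustration of the unsigned condition-code mapping, the sketch below pairs each comparison with the ARM64 condition that holds after `CMP new_depth, old_depth`. The operand order, the `demo_*` names, and the enum numbering are assumptions for illustration; only the condition-field values and the `B.cond` layout are architectural.

```c
#include <assert.h>
#include <stdint.h>

/* ARM64 condition-field values (architectural). */
enum { COND_EQ = 0x0, COND_NE = 0x1, COND_CS = 0x2, COND_CC = 0x3,
       COND_HI = 0x8, COND_LS = 0x9 };

/* Hypothetical depth-op identifiers; the numbering is illustrative only. */
typedef enum { OP_LESSTHAN, OP_EQUAL, OP_LESSTHANEQUAL,
               OP_GREATERTHAN, OP_GREATERTHANEQUAL, OP_NOTEQUAL } demo_depthop_t;

/* Condition under which the unsigned test "new OP old" passes,
 * assuming flags were set by CMP new, old. */
static unsigned demo_pass_cond(demo_depthop_t op) {
    switch (op) {
    case OP_LESSTHAN:         return COND_CC; /* unsigned <  */
    case OP_EQUAL:            return COND_EQ;
    case OP_LESSTHANEQUAL:    return COND_LS; /* unsigned <= */
    case OP_GREATERTHAN:      return COND_HI; /* unsigned >  */
    case OP_GREATERTHANEQUAL: return COND_CS; /* unsigned >= */
    default:                  return COND_NE; /* OP_NOTEQUAL */
    }
}

/* Encode B.cond with a signed offset counted in 4-byte instructions. */
static uint32_t demo_encode_b_cond(unsigned cond, int32_t insn_offset) {
    assert(cond <= 0xF);
    assert(insn_offset >= -(1 << 18) && insn_offset < (1 << 18));
    return 0x54000000u | (((uint32_t)insn_offset & 0x7FFFFu) << 5) | cond;
}
```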


* Add test VM launch helper script

Create scripts/test-with-vm.sh to launch 86Box with the
Windows 98 Low End test VM (configured with Voodoo card).

Usage: ./scripts/test-with-vm.sh


* Phase 2 complete: mark checklist and update changelog

Runtime test passed:
- Quake 3: black screen (expected)
- 3DMark 99: gray screen (expected)
- No crashes, depth pipeline executes correctly

Color/texture pipeline not yet implemented (Phase 3-4).
Implement codegen_texture_fetch() and TMU combine paths for ARM64:
- Perspective-correct W division using SDIV (replaces x86 IDIV)
- LOD calculation via CLZ (replaces BSR, inverted: 63-CLZ)
- Point-sampled texture lookup with clamp/wrap S/T
- Bilinear filtered 4-tap blend using NEON (UXTL+MUL+ADD+EXT+USHR+SQXTUN)
- Mirror S/T via TBZ+MVN (replaces TEST+JZ+NOT)
- TMU0-only, TMU1-passthrough, and dual-TMU combine paths
- Dual-TMU tc_mselect/tc_add/tc_invert for RGB + tca_* for alpha
- trexInit1 override path
- Upstream bug at x86 line 1303 (0x8E) NOT ported — correct ADD used
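
The "63-CLZ" inversion mentioned above can be shown in isolation. This is a minimal sketch using the GCC/Clang builtin `__builtin_clzll`; `demo_bsr64` is a hypothetical helper name, not a function from the codegen header.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the highest set bit (what x86 BSR returns), derived from a
 * leading-zero count as ARM64 CLZ provides it. */
static unsigned demo_bsr64(uint64_t v) {
    assert(v != 0); /* BSR is undefined for zero input */
    return 63u - (unsigned)__builtin_clzll(v);
}
```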

Bug fixes discovered during review:
- Bilinear LDR D addressing: added LSL w4, w4, #2 before 64-bit texel
  pair loads to convert texel index to byte offset (ARM64 LDR Dt has no
  LSL #2 option, unlike x86 MOVQ with *4 scaling)
- LOD mantissa shift: changed LSR_REG to LSR_REG_X (64-bit) since the
  W reciprocal after LSL #8 can exceed 32 bits
- AND_BITMASK for 0xF0 mask: corrected from (N=0,immr=24,imms=27) which
  was an invalid encoding to (N=0,immr=28,imms=3) per ARM64 logical
  immediate rules
- Added #include <stddef.h> for offsetof() in _Static_assert checks
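
To check logical-immediate fields such as the corrected 0xF0 mask, a small decoder helps. The sketch below handles only the N=0, 32-bit-element case and uses hypothetical `demo_*` names; it is not the `AND_BITMASK` macro itself.

```c
#include <assert.h>
#include <stdint.h>

/* Decode a logical immediate for the N=0, 32-bit-element case only
 * (imms < 31): (imms + 1) consecutive ones, rotated right by immr. */
static uint32_t demo_decode_bitmask32(unsigned immr, unsigned imms) {
    assert(immr < 32 && imms < 31);
    uint32_t ones = (1u << (imms + 1)) - 1u;
    return immr ? ((ones >> immr) | (ones << (32 - immr))) : ones;
}

/* Encode AND Wd, Wn, #imm (32-bit, N=0) from precomputed fields. */
static uint32_t demo_encode_and_imm32(unsigned rd, unsigned rn,
                                      unsigned immr, unsigned imms) {
    assert(rd < 32 && rn < 32 && immr < 32 && imms < 31);
    return 0x12000000u | (immr << 16) | (imms << 10) | (rn << 5) | rd;
}
```

With immr=28 and imms=3 the decoder yields four ones rotated into bits 4 to 7, which is 0xF0.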
* Phase 4: Color/alpha combine pipeline for ARM64 Voodoo JIT

Implement the complete color and alpha combine stages translating
x86-64 codegen lines ~1689-2228 to ARM64/NEON instructions.

Color select pipeline:
- CC_LOCALSELECT_ITER_RGB, CC_LOCALSELECT_TEX, CC_LOCALSELECT_COLOR1
- Local color select override via tex_a bit 7 (TBZ/TBNZ branching)

Chroma key test:
- Compare selected RGB source against params->chromaKey (24-bit mask)
- Skip pixel on match using CBZ forward branch

Alpha pipeline:
- Alpha select: A_SEL_ITER_A (with CLAMP), A_SEL_TEX, A_SEL_COLOR1
- CCA local select: ITER_A, COLOR0, ITER_Z (with CLAMP)
- Alpha mask test via TBZ on bit 0
- Full CCA combine: zero_other, sub_clocal, mselect (ZERO/ALOCAL/
  AOTHER/ALOCAL2/TEX), reverse_blend, multiply+shift, add, clamp,
  invert_output

Color combine pipeline:
- cc_zero_other, cc_sub_clocal using NEON 4x16 arithmetic
- cc_mselect: ZERO, CLOCAL, AOTHER, ALOCAL, TEX, TEXRGB
- Reverse blend (XOR with 0xFF + add 1)
- Signed multiply via SMULL+SSHR+SQXTN (3 insns vs 5 on SSE2)
- cc_add (add clocal back)
- SQXTUN pack + cc_invert_output
- Result saved to v13 for fog stage

Fix skip position patching: chroma uses CBZ (PATCH_FORWARD_CBxZ),
alpha mask uses TBZ (PATCH_FORWARD_TBxZ).


* Update checklist: mark Phase 4 color/alpha combine items complete


* Add JIT validation logging for ARM64 Voodoo codegen

Add rate-limited diagnostic logging at three critical JIT pipeline
points to verify code generation and execution during testing:

1. Cache HIT (first 20 occurrences) - logs block reuse with mode params
2. Code GENERATE (unlimited) - logs every JIT compilation with full config
3. Code EXECUTE (first 50 occurrences) - logs JIT dispatch with coordinates

All logging is gated behind VOODOO_JIT_DEBUG / VOODOO_JIT_DEBUG_EXEC
defines (set to 1) and uses pclog() for output to the 86Box log file.
Set to 0 to disable before release.


* Fix Phase 1 stack frame size and comment

Validation found that the prologue comment claimed d14/d15 were saved
at SP-144, but the actual code only saves d8-d13 (3 NEON register pairs).
Since v14/v15 are never used in the generated code, this reduces the
frame size from 144 to 128 bytes, saving 16 bytes per JIT call.

Changes:
- Remove misleading "SP-144: d14, d15" comment
- Reduce frame size from 144 to 128 bytes (prologue and epilogue)
- Stack remains 16-byte aligned (128 = 8 × 16)

Addresses finding from voodoo-arch validation of Phase 1.


* Update changelog: Phase 4 validation and frame size fix


* Update checklist: mark validation complete for phases 1-4

All 4 phases validated against official 3dfx specifications:
- Phase 1: ARM64 ABI compliance verified, frame size optimized
- Phase 2: All depth test modes validated
- Phase 3: Texture pipeline validated, upstream bug fix verified
- Phase 4: Color/alpha combine validated, ARM64 improvements noted
* Phase 5+6: Complete Voodoo ARM64 pixel pipeline

Implements the full remaining pixel rendering pipeline:

**Phase 5 (Fog + Alpha Test + Blend):**
- Fog: FOG_CONSTANT, FOG_ADD, FOG_MULT modes
- Fog sources: w_depth table lookup, Z, alpha, W
- Alpha test: all 8 AFUNC comparison modes (NEVER, LESSTHAN, EQUAL,
  LESSTHANEQUAL, GREATERTHAN, NOTEQUAL, GREATERTHANEQUAL, ALWAYS)
- Alpha blend: dest_afunc (9 modes: AZERO, ASRC_ALPHA, A_COLOR,
  ADST_ALPHA, AONE, AOMSRC_ALPHA, AOM_COLOR, AOMDST_ALPHA,
  ACOLORBEFOREFOG)
- Alpha blend: src_afunc (9 modes: AZERO, ASRC_ALPHA, A_COLOR,
  ADST_ALPHA, AONE, AOMSRC_ALPHA, AOM_COLOR, AOMDST_ALPHA, ASATURATE)

**Phase 6 (Framebuffer Write + Dithering):**
- Dithering: 4x4 and 2x2 dither pattern support
- RGB565 pack with dither or direct shift-and-mask
- Framebuffer write: linear and tiled addressing modes
- Depth write: alpha-buffer and non-alpha-buffer paths
- Per-pixel state increments: dRdX, dGdX, dBdX, dAdX, dZdX, dSdX,
  dTdX, dWdX (for both TMUs + global W)
- Pixel and texel counter updates

Total addition: 677 lines of ARM64 codegen completing the full
scanline rasterizer from prologue through per-pixel write-back.


* Add JIT debug logging runtime toggle to Voodoo card settings

Replace compile-time #define VOODOO_JIT_DEBUG with a CONFIG_BINARY UI
checkbox ("JIT Debug Logging", default OFF). When enabled, opens
<vm_dir>/voodoo_jit.log and writes all JIT GENERATE/EXECUTE/HIT
diagnostics there via fprintf. When disabled, no file is created and
no logging overhead. The toggle is purely observational and never
affects the JIT-vs-interpreter control flow.


* Add debugging docs, test script, and gitignore updates


* Fix Voodoo 2 detection failure when Dynamic Recompiler is enabled

Move jit_debug config entry outside the #ifndef NO_CODEGEN guard so the
voodoo_config[] array structure is stable regardless of codegen state.
Previously, both recompiler and jit_debug entries were conditionally
compiled, causing config field index misalignment when loading VMs saved
with a different codegen setting.


* Move jit_debug config outside NO_CODEGEN guard in Banshee configs

Apply the same fix from the voodoo_config[] array (a1163d6) to
banshee_sgram_config[], banshee_sgram_16mbonly_config[], and
banshee_sdram_config[]. The jit_debug entry is now unconditionally
present in all four device config arrays for consistency.


* Fix Voodoo 2 non-perspective texture alignment bug + add JIT verify mode

ARM64 STR_X to STATE_tex_s (offset 188) silently encoded as offset 184
due to unsigned-offset 8-byte alignment truncation. The non-perspective
texture path wrote tex_s to the wrong location, causing every pixel to
sample from texture column S=0. Fixed by using STR_W (4-byte aligned).
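
The truncation is easy to reproduce arithmetically. The sketch below assumes an encoding macro that computes the scaled imm12 field as `offset >> size_log2` without an alignment check; the `demo_*` helpers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset actually encoded by an unsigned-offset LDR/STR when the
 * imm12 field is computed as offset >> size_log2 with no alignment check. */
static uint32_t demo_encoded_offset(uint32_t byte_offset, unsigned size_log2) {
    assert(size_log2 <= 3); /* 0=byte, 1=half, 2=word, 3=doubleword */
    uint32_t imm12 = (byte_offset >> size_log2) & 0xFFFu;
    return imm12 << size_log2;
}

/* Nonzero when the offset survives encoding unchanged for this access size. */
static int demo_offset_is_encodable(uint32_t byte_offset, unsigned size_log2) {
    return demo_encoded_offset(byte_offset, size_log2) == byte_offset;
}
```

For offset 188, an 8-byte access (`size_log2 = 3`) encodes 184, while a 4-byte access (`size_log2 = 2`) encodes 188 exactly.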

Also adds JIT verification mode (jit_debug=2) that runs both JIT and
interpreter per scanline and compares pixel output for debugging.
* Support generic ARM64 (Linux, Windows) for Voodoo JIT codegen

The ARM64 Voodoo JIT was already mostly portable — W^X calls were behind
__APPLE__ guards, and the NO_CODEGEN gate already included _M_ARM64.
The only missing piece was a Windows ARM64 I-cache flush path using
FlushInstructionCache instead of GCC/Clang's __clear_cache builtin.
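
A portable flush wrapper along these lines might look as follows. This is a sketch under the assumption that `_WIN32` selects the Windows path; `demo_flush_icache` is a hypothetical name and the real guard macros in the codegen header may differ.

```c
#include <assert.h>
#include <stddef.h>
#if defined(_WIN32)
#include <windows.h>
#endif

/* Make freshly written code visible to instruction fetch. */
static void demo_flush_icache(void *start, size_t len) {
    assert(start != NULL);
#if defined(_WIN32)
    /* Windows: no __clear_cache builtin under MSVC, use the Win32 API. */
    FlushInstructionCache(GetCurrentProcess(), start, len);
#else
    /* GCC/Clang builtin; takes begin and end pointers. */
    __builtin___clear_cache((char *)start, (char *)start + len);
#endif
}
```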


* Add comprehensive testing and technical documentation

- Add TESTING-GUIDE.md: User-facing build/test guide for macOS ARM64
  - Prerequisites, build instructions, VM setup
  - Voodoo configuration (Dynamic Recompiler toggle, JIT debug logging)
  - Testing matrix, what to look for, issue reporting

- Add ARM64-CODEGEN-TECHNICAL.md: Deep technical reference
  - Architecture overview, register allocation, encoding macros
  - Complete pipeline phases walkthrough (prologue through epilogue)
  - Key differences from x86-64, known issues, maintenance guide

- Archive old planning/debug docs to voodoo-arm64-port/archive/
  - Moved 11 working docs to archive (checklist, debug sessions, etc.)
  - Keeps main directory clean with only user-facing documentation


* Remove hardcoded username from VM paths

- Replace /Users/anthony with $HOME in test-with-vm.sh
- Replace /Users/anthony with ~ in archived debugging doc
- Makes repo safe to share publicly without personal info
- Clarify Voodoo 1/2 require separate 2D card as primary
- Clarify Banshee/Voodoo 3 are set as primary video card directly
- Note only one Voodoo card can be used at a time
- Fix step numbering (was skipping 4)
Replace remaining x86-64 mnemonics (CMOVS, CMOVAE, IMUL, PADDW, SHL,
SHR, SAR) with ARM64 equivalents (CSEL, MUL, ADD, LSL, LSR, ASR).
Fix register references (EBX → w5, xmm_00_ff_w → neon_00_ff_w).
Correct alpha blend shift comments to reflect doubled input values.
- Fix approximate line reference (~51 → ~45) for register assignment block
- Update stale voodoo_generate() docblock to describe all 6 phases
- Document all three rounds of comment audits in CHANGELOG.md
The ARM64 codegen used file-scope statics (last_block, next_block_to_write,
voodoo_jit_hit/gen_count, voodoo_recomp) that are shared across all Voodoo
instances and render threads. This creates race conditions when multiple
render threads (up to 4) access them concurrently via odd_even indexing.

Add 7 new fields to voodoo_t (jit_last_block, jit_next_block_to_write,
jit_recomp, jit_hit_count, jit_gen_count, jit_exec_count,
jit_verify_mismatches) and update voodoo_get_block(), voodoo_codegen_init(),
and render loop references to use instance state.

Also includes previously uncommitted bounds-checking additions to ARM64
code emission and branch patching macros (arm64_codegen_check_emit_bounds,
arm64_codegen_check_patch_pos, arm64_codegen_check_branch_offset).
Add verification notes to changelog documenting successful build and
VM launch after refactoring cache/counter state from globals to
per-instance voodoo_t.
- Add bounds-checked code emission with interpreter fallback on overflow
  instead of fatal() process abort. Overflow slots are memoized as
  "rejected" to avoid repeated regen churn on the same key.
- Fix cache probe order: lookup now starts from jit_last_block hint and
  wraps, restoring intended locality behavior.
- Convert JIT debug/recomp counters to ATOMIC_INT to reduce races
  between render threads.
- Add per-slot valid/rejected flags to prevent reuse of partial blocks.
- Correct v12-v15 register comment to match actual save/restore (d12-d13).
Add comprehensive comments documenting 3D graphics concepts at each
pipeline stage, algorithm explanations, and variable origin notes.
One code change: cc_mselect == 1 → CC_MSELECT_CLOCAL (named constant).
Corrected interpreter FPS baselines (were far too low), reframed
JIT benefit around CPU utilization reduction (20-50%) rather than
overstated 8x FPS gains. Replaced defunct 3dfxarchive.com with
VOGONS Wiki and VOGONS Vintage Driver Library.
Two-command workflow for new contributors:
- `deps`: installs all required Homebrew packages
- `build`: clean configure + build + ad-hoc codesign with JIT entitlements
JIT debug logging can produce hundreds of MB in minutes — warn users
to only enable when actively debugging and clean up afterwards.
Documents all macro consolidation changes, what was/wasn't changed,
before/after examples, deferral rationale, and R7 endianness verification.
Docker-based build system for fully-bundled AppImages targeting
Raspberry Pi 5 and other aarch64 Linux systems:
- Dockerfile: Debian Bullseye ARM64 build environment
- AppImageBuilder.yml: appimage-builder recipe matching official 86Box
- build.sh: host-side launcher with persistent cmake cache volume
- README.md: step-by-step build guide
- .gitignore: excludes output/ directory

Bundles glibc 2.31, Qt5 + Wayland, all runtime deps — no install
needed on target. Reference added to testing guide.
Compile OpenAL 1.23.1, rtmidi 4.0.0, FluidSynth 2.3.0, and SDL2 2.0.20
from source with minimal features to eliminate transitive dependencies
(libnsl, libjack, libpulse) that cause runtime failures on Fedora.

- Dockerfile: add build deps for source compilation, pin appimage-builder
  to same commit as official (22fefa2), remove distro -dev packages for
  the four libs
- build-appimage.sh: add Step 0 to compile libs with guards for rebuild
  caching, set PKG_CONFIG_PATH for custom libs, copy .so files into AppDir,
  use source tree .desktop and .metainfo.xml
- AppImageBuilder.yml: trim ~40 packages (crypto/TLS, systemd/udev, excess
  Qt modules, image format libs, cups/avahi), add file exclusions matching
  official, add required comp: gzip field
- README.md: update bundled libraries documentation

Tested on Fedora ARM64 (no missing lib errors) and Raspberry Pi OS.
AppImage size reduced from 77 MB to 65 MB.
New scripts/analyze-jit-log.py performs automated health analysis of
Voodoo JIT debug logs — single streaming pass handles multi-GB logs,
checks compilation stats, interpreter fallbacks, pipeline coverage,
pixel output quality, and produces a summary verdict.

Also adds a VOODOO JIT: INIT line at the start of debug logs (ARM64
only) recording render_threads, use_recompiler, and jit_debug level.
Document the analyze-jit-log.py script in the debug logging section
and update the issue reporting workflow to run the analyzer first.
Remove two dead code emissions identified by the codegen audit:

- MINOR-1: MOV_V(5, 1) in dual-TMU combine saved v1 into v5 but v5
  was never read back. ARM64 SMULL_4S_4H doesn't clobber inputs like
  x86-64's PMULLW sequence, so the save was unnecessary.

- MINOR-3: MOVI_V2D_ZERO(2) zeroed v2 as an unpacking constant but
  v2 was never referenced. ARM64 uses UXTL for byte-to-halfword
  widening instead of x86-64's PUNPCKLBW which needs a zero register.

Saves 8 bytes per generated code block. Update audit report to mark
both items as resolved.
H7: replace 14 LDR d16 (alookup[1]) loads with pinned v8 register
H8: eliminate 14 redundant MOV v17 before USHR (use different Rd/Rn)

Saves 28 instructions per alpha-blended pixel, no functional change.
skiretic and others added 28 commits February 20, 2026 22:59
22GB log (323M lines) analyzed in 38s — VERDICT: HEALTHY.
75,041 blocks, 276 pipeline configs, 5.2B pixels, 0 errors.
…+M5+M6+L1+L2)

LDP pairing at 4 sites (M1), hoist fogColor to v11 replacing dead
neon_minus_254 (M3), extract TMU0 alpha once in dual-TMU TCA (M5),
BFI for no-dither RGB565 packing (M6), CBZ for tmu_w guard (L1),
eliminate MOV w11,w7 in texture fetch (L2).

New encoding macros: ARM64_BFI, ARM64_CBZ_X_PLACEHOLDER,
ARM64_CBNZ_X_PLACEHOLDER.

Tested: Q3, Turok, 3DMark99, 3DMark2000, UT99. 68k blocks compiled,
4.4B pixels, 274 pipeline configs, zero errors. Full feature parity
audit against x86-64 reference confirmed.
…8: R2-24+R2-25+R2-13+R2-27)

- Move STATE_x LDR before loop — redundant reload every iteration (R2-24)
- Eliminate MOV w4,w28 in loop control — use w28 directly, reorder CMP before MOV (R2-25)
- Remove redundant MOV v16,v0 in color combine multiply — v0 unmodified before SMULL (R2-13)
- MVN directly from w28 in stipple pattern — eliminate intermediate MOV copy (R2-27)

Saves ~3 instructions/pixel unconditionally, +1 in stipple paths.
…t Batch 8

Fix: TMU0 alpha extraction (FMOV w13, s7 + LSR w13, 24) was placed at
the TCA section start, AFTER the tca_sub_clocal block that reads w13.
Moved extraction before the SMULL so w13 is valid when first read.

This caused rendering artifacts when tca_sub_clocal was active in
dual-TMU texture combine paths.

Also reverts Batch 8 (R2-24/R2-25/R2-13/R2-27) pending re-validation.
LDP pairings (Batch 7/M1) verified correct against struct layout.
…r counter

Batch 8 (Round 2 audit): R2-24 (STATE_x LDR before loop), R2-25 (eliminate
MOV w4,w28 in loop control), R2-13 (remove redundant MOV v16,v0 in cc
multiply), R2-27 (MVN directly from w28 in stipple). ~4 insns/pixel saved.

Debug logging: removed the < 50 cap on EXECUTE/POST logging so full session
logs are generated. Added jit_interp_count to track interpreter fallback
executions, with a summary line at shutdown (jit_exec vs interp_exec vs gen).

Tested: 5.5B pixels, 351 configs, 174M JIT executions, 0 errors (Voodoo 3).
Consolidated per-batch and per-category instruction savings table.
~84-105 insns/pixel removed across 8 batches. Updated Batch 8 status
to DONE with commit hash.
Batch 9, 10, D, JIT cache expansion, perf benchmarking, debug logging
levels — with savings estimates, risk levels, and recommended order.
Real workloads hit 180-351 unique pipeline configurations, causing
massive cache thrashing with only 8 slots (258K recompilations for
351 configs in Q3). Expanding to 32 slots captures the full working
set after warmup. Memory cost: 512KB -> 2MB (well under 4MB budget).
…+R2-08)

R2-07: Eliminate ebp_store memory round-trip in bilinear path. Hold bilinear
lookup index in w17 (IP1 scratch) instead of STR/LDR through STATE_ebp_store.
Saves 1 STR + 1 LDR per bilinear-textured pixel.

R2-12: Replace 8 FMOV+DUP_V4H_LANE(x,x,0) pairs with single DUP_V4H_GPR.
All sites broadcast values in 0-255 range (alpha, LOD frac, detail blend),
so 16-bit GPR-to-vector broadcast is semantically identical. Sites:
TC_MSELECT_DETAIL (TMU0/1), TC_MSELECT_LOD_FRAC (TMU0/1),
CC_MSELECT_AOTHER, CC_MSELECT_ALOCAL (both paths), CC_MSELECT_TEX.

R2-08: Cache original LOD in w11 before ADD w6,w6,#4 in point-sample path.
Eliminates LDR w11,[x0,#STATE_lod] reload in S clamp/wrap section.
R2-23: EMIT_MOV_IMM64, EMIT_LOAD_NEON_CONST, and dither_rb_addr now
skip zero halfwords -- MOVZ targets the first non-zero halfword, then
MOVK only for remaining non-zero ones. Saves ~10 instructions per
block across 4 sites (7 prologue ptrs + 3 NEON consts + 1 dither ptr).
New ARM64_MOVZ_X_HW(d, imm16, hw) encoding macro added.

R2-09: (1<<48) dividend for perspective W division now uses a single
MOVZ X4, #1, LSL #48 instead of 4 instructions (MOVZ #0 + 3 MOVK).
Saves 3 instructions per perspective-textured block.
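
The zero-halfword skipping can be sketched as a standalone emitter. `demo_emit_mov_imm64` is a hypothetical function, not the `EMIT_MOV_IMM64` macro; the MOVZ/MOVK base opcodes are the standard 64-bit encodings.

```c
#include <assert.h>
#include <stdint.h>

/* Emit a MOVZ/MOVK sequence that loads imm into Xd, skipping zero
 * halfwords. Writes at most 4 words to out[] and returns the count. */
static int demo_emit_mov_imm64(uint32_t out[4], unsigned rd, uint64_t imm) {
    assert(rd < 31);
    int n = 0;
    for (unsigned hw = 0; hw < 4; hw++) {
        uint32_t chunk = (uint32_t)((imm >> (16 * hw)) & 0xFFFFu);
        if (chunk == 0)
            continue; /* MOVZ already zeroes every other halfword */
        /* First non-zero halfword uses MOVZ, later ones use MOVK. */
        uint32_t base = (n == 0) ? 0xD2800000u : 0xF2800000u;
        out[n++] = base | (hw << 21) | (chunk << 5) | rd;
    }
    if (n == 0)
        out[n++] = 0xD2800000u | rd; /* imm == 0: MOVZ Xd, #0 */
    return n;
}
```

For the `(1<<48)` dividend this produces the single instruction `MOVZ X4, #1, LSL #48`.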
FOG_W: the interpreter computes fog_a = (w >> 32) & 0xff, masking to
the low byte before clamping (making clamp a no-op). The JIT was doing
CLAMP(w >> 32) without the mask, so when w >> 32 exceeded 255 the JIT
would saturate to 0xFF while the interpreter wrapped to the low byte.
This caused 20 VERIFY MISMATCH errors (224 differing pixels across
844M rendered) in fog-W pipeline configs.

Fix: replace the 4-instruction BIC+CMP+CSEL clamp with a single AND
#0xFF, matching the interpreter exactly.

FOG_Z: the interpreter uses (z >> 20) & 0xff but the JIT (copied from
x86-64) was using z >> 12. Correct the shift to match the interpreter.
This path was not triggered by the current test suite but is a latent
correctness bug.
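
The clamp-versus-mask difference can be reproduced with plain integer arithmetic. The sketch below assumes `w` is a 64-bit and `z` a 32-bit signed value and restricts inputs to non-negative numbers; the `demo_*` helpers are hypothetical and model only the expressions quoted above.

```c
#include <assert.h>
#include <stdint.h>

static int demo_clamp255(int64_t v) {
    return v < 0 ? 0 : (v > 255 ? 255 : (int)v);
}

/* Old JIT behaviour per the commit text: saturate w >> 32 to 0..255. */
static int demo_fog_w_clamped(int64_t w) {
    assert(w >= 0);
    return demo_clamp255(w >> 32);
}

/* Interpreter behaviour: keep only the low byte of w >> 32. */
static int demo_fog_w_masked(int64_t w) {
    assert(w >= 0);
    return (int)((w >> 32) & 0xff);
}

/* FOG_Z source with a selectable shift (20 for the interpreter, 12 before the fix). */
static int demo_fog_z(int32_t z, unsigned shift) {
    assert(z >= 0 && shift < 32);
    return (z >> shift) & 0xff;
}
```

The two FOG_W variants agree whenever `w >> 32` fits in a byte and diverge only above 255, which is consistent with the small number of differing pixels reported.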
Verify mode (jit_debug=2) fixes:
- Add EXECUTE/POST/PIXEL logging inside verify block (was skipped
  due to !jit_verify_active guard on the normal JIT path)
- Replace alloca() with malloc()/free() for verify buffers to prevent
  stack overflow (SIGBUS) on long runs
- Both changes are verify-mode-only, no impact on normal JIT operation

Enhanced log analyzer (scripts/analyze-jit-log.c):
- Per-fogMode mismatch breakdown with fog enabled/disabled annotation
- Top 10 pipeline config mismatch table
- Diff magnitude distribution (±0-1, ±2-3, ±4-6, ±7+ buckets)
- Max absolute |dR|, |dG|, |dB| tracking
- Match rate percentage calculation
- All counters are per-worker thread, merged after scan

Documentation:
- verify-mismatch-analysis.md: full root cause analysis concluding
  all verify mismatches are test harness artifacts, not real JIT bugs.
  Documents x86-64 JIT's divide-by-2 fog rounding technique (inherited
  by ARM64 port) with line-level source citations.
- CHANGELOG.md: complete verify mode debugging chronicle
- TESTING-GUIDE.md: expanded verify mode section with known limitations
  and recommended validation workflow
- optimization-summary.md: added verify mode findings, updated remaining
  work items (cache expansion and debug levels now done)
- Preserved raw data: final-validation-log.txt, verify-mismatch-pre-fog-fix.txt
…fied

4 parallel verification agents analyzed suspected JIT-interpreter differences:
- Finding 1 (alpha blend /255): FALSE ALARM — math is identical
- Finding 2 (TMU1 RGB negate ordering): real ±1, inherited from x86-64
- Finding 3 (AFUNC_ASATURATE): real bug in INTERPRETER (dest_r instead of src_r)
- Finding 4 (zaColor depth bias): real clamp-vs-truncate, inherited from x86-64

Zero ARM64-specific bugs found. ASATURATE interpreter fix pending.
…h interpreter

Two accuracy fixes making the ARM64 JIT more accurate than the x86-64 JIT:

Finding 2: TMU1 tc_sub_clocal negate ordering — the interpreter computes
(-clocal * factor) >> 8 (negate first), both JITs computed -(clocal * factor >> 8)
(negate last), producing ±1 when clocal*factor % 256 != 0. Fixed by negating
clocal before the widening multiply. Same instruction count (4).
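
The ±1 discrepancy follows from where the negation sits relative to the arithmetic shift. A minimal sketch, with hypothetical `demo_*` helpers and an explicit floor-style shift so the result does not depend on implementation-defined behaviour:

```c
#include <assert.h>

/* Arithmetic shift right by 8 (floor division by 256), written so that
 * negative inputs behave the same on every compiler. */
static int demo_asr8(int v) {
    return v >= 0 ? (v >> 8) : -((-v + 255) >> 8);
}

/* Interpreter ordering: negate first, then shift. */
static int demo_negate_first(int clocal, int factor) {
    assert(clocal >= 0 && clocal <= 255 && factor >= 0 && factor <= 256);
    return demo_asr8(-clocal * factor);
}

/* Previous JIT ordering: shift first, then negate. */
static int demo_negate_last(int clocal, int factor) {
    assert(clocal >= 0 && clocal <= 255 && factor >= 0 && factor <= 256);
    return -demo_asr8(clocal * factor);
}
```

With `clocal = 1, factor = 1` the first form gives -1 and the second gives 0; with `clocal = 2, factor = 128` the product is a multiple of 256 and both give -1.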

Finding 4: zaColor depth bias clamp — the interpreter uses CLAMP16() to clamp
depth+bias to [0, 0xFFFF], both JITs truncated via AND/UXTH (wrapping). Fixed
with SXTH sign-extension + CMP/CSEL clamp sequence matching the interpreter.
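
Clamping and truncation differ only when depth plus bias leaves the 16-bit range. The sketch below assumes the bias is a sign-extended 16-bit value, as the SXTH step suggests; `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp to the 16-bit depth range, as CLAMP16() is described to do. */
static int32_t demo_clamp16(int32_t v) {
    return v < 0 ? 0 : (v > 0xFFFF ? 0xFFFF : v);
}

/* Interpreter-style bias: saturate at the range limits. */
static int32_t demo_bias_clamped(int32_t depth, int16_t bias) {
    assert(depth >= 0 && depth <= 0xFFFF);
    return demo_clamp16(depth + bias);
}

/* Previous JIT-style bias: keep the low 16 bits, wrapping on overflow. */
static int32_t demo_bias_truncated(int32_t depth, int16_t bias) {
    assert(depth >= 0 && depth <= 0xFFFF);
    return (depth + bias) & 0xFFFF;
}
```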

Verify mode: mismatches dropped 68% (2.9M→926K events), differing pixels
dropped 74% (22.3M→5.8M), match rate improved 99.24%→99.61%.
jit_debug=1 remains HEALTHY (zero errors).
Remove Linux ARM64 AppImage bullet from README (not distributing builds).
Move completed optimization/audit docs to voodoo-arm64-port/archive/.
Delete README2.md draft.
- scripts/analyze-jit-log.c: POSIX version (mmap + pthreads)
- scripts/analyze-jit-log-win32.c: Win32 version (CreateFileMapping + _beginthreadex)
- scripts/README-jit-analyzer.md: usage documentation
Clean up comments in vid_voodoo_codegen_arm64.h (no code changes):
- Remove optimization batch/tracking IDs (Batch N/XX, RN-XX, H4, M1, etc.)
- Delete stale comments referencing removed code
- Rename leftover x86-64 xmm_ prefixes to neon_ to match actual variable names
- Fix incorrect bit-range notation in RGB565 packing comment
- Fix misidentified register values and alignment claims
- Clarify opaque internal references and ambiguous notation
- Restore line breaks collapsed by earlier comment deletions
Pad params_read_idx, params_write_idx, and render_voodoo_busy to
separate 128-byte cache lines on ARM64 to eliminate false sharing
between render threads and the FIFO/CPU thread. Add a short spin-wait
(256 yield iterations) in render_thread() before sleeping to absorb
burst triangle submissions without context-switch overhead.

All changes guarded behind __aarch64__ || _M_ARM64 — zero impact on
x86-64 or other platforms.
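
One way to express this padding in C11 is per-member alignment. The sketch below is an assumption about the approach, not the actual `voodoo_t` change: `demo_thread_sync_t` and `demo_spin_for_work` are hypothetical, and the spin loop omits the yield hint and the sleep fallback.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define DEMO_CACHE_LINE 128

/* Each index gets its own 128-byte line so that writes by one thread
 * do not invalidate the line another thread is polling. */
typedef struct {
    _Alignas(DEMO_CACHE_LINE) atomic_int params_read_idx;
    _Alignas(DEMO_CACHE_LINE) atomic_int params_write_idx;
    _Alignas(DEMO_CACHE_LINE) atomic_int render_voodoo_busy;
} demo_thread_sync_t;

/* Poll briefly for pending work; returns 1 if work appeared, 0 if the
 * caller should fall back to sleeping. */
static int demo_spin_for_work(demo_thread_sync_t *s, int max_spins) {
    assert(s != NULL);
    for (int i = 0; i < max_spins; i++) {
        if (atomic_load(&s->params_read_idx) != atomic_load(&s->params_write_idx))
            return 1;
    }
    return 0;
}
```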
Document render_threads = 1 as recommended for ARM64 JIT on Apple
Silicon. CPU emulation runs on E-cores which get starved when multiple
render threads heavily load P-cores. Note that symmetric ARM64 platforms
(RPi 4/5) may behave differently.
Three JIT infrastructure improvements:

1. Move DEPTHOP_NEVER and AFUNC_NEVER checks before the prologue.
   Previously emitted bare RET after 20 registers were saved and SP
   decremented by 176 bytes — an ABI violation. Now emits RET before
   any register saves, making it correct.

2. Replace round-robin cache eviction with LRU. Each slot carries a
   last_used timestamp from a per-partition monotonic counter. On miss,
   the slot with the smallest timestamp is evicted. Rejected slots get
   last_used=0 for immediate eviction. Array layout changed from
   interleaved (stride-4) to contiguous per-partition for better cache
   locality. voodoo_generate() now returns actual code size; I-cache
   flush narrowed from full BLOCK_SIZE to actual emitted bytes.

3. Skip redundant mprotect toggles on Linux ARM64 where JIT pages are
   allocated RWX. The arm64_jit_rwx flag short-circuits set_writable
   and set_executable, avoiding TLB shootdown overhead. No-op on macOS
   and Windows where W^X is enforced.

Also: log analyzer updated for new GENERATE format (code_size= replaces
block=, output shows min/avg/max code size stats), technical reference
updated, comment fixes applied.
… voodoo_t

Fixes shared LRU eviction state between SLI cards — file-static
jit_generation[4] meant the second card's init would reset the first
card's LRU timestamps. Now each voodoo_t owns its own counters.
@OBattler OBattler merged commit bca9de2 into 86Box:master Mar 8, 2026
10 of 44 checks passed

2 participants