Skip to content

feat: add WebGPU compute radix sort (4-bit portable + OneSweep NVIDIA)#8620

Merged
mvaligursky merged 3 commits into
mainfrom
mv-radix-8bit-subgroup
Apr 24, 2026
Merged

feat: add WebGPU compute radix sort (4-bit portable + OneSweep NVIDIA)#8620
mvaligursky merged 3 commits into
mainfrom
mv-radix-8bit-subgroup

Conversation

@mvaligursky

@mvaligursky mvaligursky commented Apr 20, 2026

Copy link
Copy Markdown
Contributor

Summary

  • New: ComputeOneSweepRadixSort — single-sweep 8-bit radix sort with fused histogram / decoupled-lookback / reorder. Ported from b0nes164/GPUSorting (Thomas Smith, MIT). NVIDIA Turing+ only — requires forward-thread-progress guarantees that Apple Silicon, Mali and Adreno do not provide reliably. Not wired up in production yet.
  • Improved: existing ComputeRadixSort (4-bit multi-pass) — decouples dispatch/scan size from buffer capacity so a small sort following a large one no longer pays the full capacity cost. Also adds a radixBits getter so callers can align key-bit counts generically across sort backends.
  • New: interactive benchmark example (test/radix-sort-compute) comparing both backends across 4/8/16/24/32-bit key widths and element counts from 1K to 50M, with optional CPU-reference validation.

No public API changes

Both ComputeRadixSort and ComputeOneSweepRadixSort are tagged @ignore — they are engine-internal and not part of the published API. The radixBits getter added to ComputeRadixSort is likewise internal, used by the gsplat manager / benchmark to route key-bit counts between backends.

Production routing

gsplat-manager.js continues to use new ComputeRadixSort(device) unconditionally. A follow-up PR will add a thin facade that routes NVIDIA adapters to ComputeOneSweepRadixSort. This PR is deliberately just "make the backend available + benchmark it".

Dependencies

Depends on device.minSubgroupSize / device.maxSubgroupSize added in #8645 (already on main). ComputeOneSweepRadixSort parametrises its shared-memory layout from device.maxSubgroupSize to support 16 / 32 / 64 / 128 lane subgroup hardware from the same source.

Performance

Sort-only GPU time (Forward pass excluded), 24-bit keys, 20 warmup + 50 measured frames per cell. OneSweep-safe below is the CSDLDF-backed fallback that got cut.

NVIDIA RTX 2070 (Turing) — OneSweep dominates across the full range (up to ~5× faster than 4-bit at 25M+). 8-coalesced is a respectable second but has no routing advantage over OneSweep. OneSweep-safe (CSDLDF) collapses badly past 5M — the reason the CSDLDF fallback was dropped.

2070

NVIDIA Quadro M2000 (Ampere) — same shape as Turing: OneSweep wins, CSDLDF-backed variant degrades sharply past a few million elements.

Screenshot 2026-04-23 at 18 21 13

Apple M1 (metal-3) — 4-bit is the clear winner across the range that matters (4–8M target). 8-ranked regresses past 5M (0.53× at 20M) and 8-coalesced is similar. OneSweep (plain and CSDLDF-safe) is non-competitive on Apple because lookback stalls are uncapped. This chart drove the "ship 4-bit as the single portable fallback" decision.

m1

Apple M4 (metal-3) — very different story: 8-ranked is 1.2–1.7× faster than 4-bit across every size and even beats OneSweep. M4 is not the Apple priority though (it already has a fast local-compute GSplat path), and since we ship one Apple fallback, M1/M2 win the tie-break. Leaving 8-ranked in the branch history for a potential future M4-specific routing.

M4

Experiments that didn't make it

Several alternative radix-sort variants were explored before settling on "4-bit portable + OneSweep on NVIDIA". Listing them here so the reasoning isn't lost — the history is preserved on branch mv-radix-8bit-subgroup up to commit be5ffb304 if anyone needs to resurrect one.

  • CSDLDF (Chained-Scan Decoupled-Lookback Decoupled-Fallback) scan kernel, plus a dedicated scan-kernel factory that switched between CSDLDF and Blelloch. Intended to give OneSweep a portable fallback for devices without forward-thread-progress guarantees. On NVIDIA it validates correctly but serialises badly (tens to >100× slower than plain lookback at larger sizes). On Apple M4 it's slightly faster than plain OneSweep at 2–3M (10–20% win), but both OneSweep variants are dominated by 4-bit on Apple, so the theoretical advantage doesn't translate into a real win for any device in the current matrix. Removed for now — the algorithm itself is sound and may be worth resurrecting if a device shows up where plain lookback deadlocks and CSDLDF beats 4-bit.

  • 8-bit shared-memory radix sort (no subgroup intrinsics). Wider radix → fewer passes, but the resulting 256-bucket bitmasks blow out shared memory enough that Mali hit a recursion cliff and M4 regressed vs 4-bit. Removed.

  • 8-bit subgroup radix sort in four reorder variants (ballot, packed, ranked, coalesced). Uses subgroup intrinsics for ranking. Genuinely strong on Apple M4 — ranked is 1.2–1.7× faster than 4-bit across every size and beats OneSweep too. However, M1/M2 are the Apple priority (M4 already has a fast local-compute GSplat path), and on M1/M2 the ranked variant regresses past 5M (0.53× at 20M). Since we only ship one Apple fallback, M1/M2 perf wins, so 4-bit is the safer pick. 8-coalesced is also competitive on NVIDIA Ampere up to ~4M but OneSweep covers that range comfortably. Removed for simplicity — if an M4-specific (or desktop-class Apple) routing becomes worthwhile, ranked is the variant to revisit.

  • 6-bit shared-memory radix sort, with two optimisations on top (bank-conflict-friendly bitmask layout, shadow-copy bitmask reads). Attractive on paper as the middle ground between 4-bit and 8-bit. Apple M4 and Pixel 8 Pro looked promising, but Apple M1/M2 regressed vs 4-bit in the 4–8M target range (the range that actually matters for those devices — M1/M2 lacks the compute perf headroom of M4). Shadow-copy was reverted to test whether LDS-occupancy pressure explained the regression; the regression remained, so the cause is architectural rather than LDS pressure. Removed.

Net result: one portable backend (ComputeRadixSort, 4-bit) that wins or is close-second on every target, plus one NVIDIA-specialised backend (ComputeOneSweepRadixSort, OneSweep) for where the hardware/driver can actually deliver on its lookback-based fusion.

Platform coverage / benchmarks

Benchmarking was done on:

  • NVIDIA RTX 2070 (Turing), NVIDIA Quadro M2000 (Ampere)
  • Apple M1, M2, M4
  • iPhone 13 Pro (A15)
  • Android Pixel 8 Pro (Tensor G3 / Mali-G715 Valhall)
  • Android Pixel 10 Pro (Tensor G5 / Imagination IMG)

For the Unified GSplat manager's dynamic 10–20 bit key range, ComputeRadixSort validates and performs consistently across the full platform matrix; ComputeOneSweepRadixSort validates and wins by a meaningful margin on NVIDIA but fails validation on the mobile SoCs tested.

Test plan

  • Run benchmark on NVIDIA (RTX 2070+): both backends validate across 1K–50M elements and 4/8/16/24/32-bit widths
  • Run benchmark on Apple M1/M2/M4: ComputeRadixSort validates across full range; OneSweep should remain disabled
  • Run benchmark on Android (Pixel 8/10 Pro): ComputeRadixSort validates across full range
  • Existing GSplat scenes using gsplat-manager.js continue to sort correctly (still routes to ComputeRadixSort)

@LeXXik

LeXXik commented Apr 20, 2026

Copy link
Copy Markdown
Contributor

3090:

image

Note: WebGPU GPU Sort step breaks here - black screen.

- ComputeRadixSort: 4-bit shared-memory multi-pass radix sort, WebGPU-
  only, runs on every device with no subgroup requirement. Selected as
  the portable fallback after benchmarking wider radixes (6-bit, 8-bit
  shared, 8-bit subgroup variants) on Apple M1/M2/M4, NVIDIA, and
  Android Mali/IMG; 4-bit is the single most consistent choice in every
  target range.

- ComputeOneSweepRadixSort: single-sweep 8-bit radix sort with fused
  histogram / decoupled-lookback / reorder, ported from b0nes164/
  GPUSorting. NVIDIA Turing+ only (gated by callers) -- requires
  forward-thread-progress guarantees that Apple Silicon and mobile
  Mali/Adreno do not provide reliably.

- Radix-sort benchmark example (test/radix-sort-compute): compares the
  two backends across 4/8/16/24/32-bit key widths and element counts
  from 1K to 50M, with optional validation against a CPU reference
  implementation.
@mvaligursky mvaligursky changed the title feat: experimental 8-bit radix sort with CSDLDF scan + subgroup variants feat: add WebGPU compute radix sort (4-bit portable + OneSweep NVIDIA) Apr 23, 2026
Leftover from an earlier iteration that had a scan-kernel factory with
Blelloch and CSDLDF variants. PrefixSumKernel.constructor only takes
device, so the object was silently ignored.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an engine-internal WebGPU OneSweep radix sort backend (plus WGSL kernels), improves the existing portable 4-bit compute radix sort’s dispatch/scan sizing, and ships an interactive benchmark/validation example to compare the backends.

Changes:

  • Add OneSweep WGSL kernels (global histogram, scan, fused binning/lookback).
  • Add ComputeOneSweepRadixSort implementation and export it.
  • Update ComputeRadixSort to decouple dispatch/scan size from buffer capacity and expose radixBits; expand the example to benchmark/validate both modes.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/scene/shader-lib/wgsl/chunks/radix-sort/onesweep-scan.js New WGSL scan kernel for OneSweep histogram prefixes.
src/scene/shader-lib/wgsl/chunks/radix-sort/onesweep-global-hist.js New WGSL fused global histogram kernel (vec4 loads).
src/scene/shader-lib/wgsl/chunks/radix-sort/onesweep-binning.js New WGSL fused binning + lookback kernel (OneSweep core).
src/scene/graphics/compute-radix-sort.js Dispatch/scan sizing fixed to follow current elementCount; add radixBits; rename GPU profiler pass labels.
src/scene/graphics/compute-onesweep-radix-sort.js New OneSweep compute sort backend, shaders/formats, buffer management, dispatch flow.
src/index.js Export ComputeOneSweepRadixSort.
examples/src/examples/test/radix-sort-compute.example.mjs Add mode selection, overlays, benchmark + validation harnesses.
examples/src/examples/test/radix-sort-compute.controls.mjs Add UI controls for mode/render/validation and benchmark/validate actions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/scene/graphics/compute-onesweep-radix-sort.js Outdated
Comment thread src/scene/graphics/compute-onesweep-radix-sort.js
Comment thread examples/src/examples/test/radix-sort-compute.example.mjs
Comment thread src/scene/graphics/compute-onesweep-radix-sort.js
@mvaligursky mvaligursky marked this pull request as ready for review April 24, 2026 08:32
@mvaligursky mvaligursky merged commit 6a2f905 into main Apr 24, 2026
8 checks passed
@mvaligursky mvaligursky deleted the mv-radix-8bit-subgroup branch April 24, 2026 08:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants