feat: add WebGPU compute radix sort (4-bit portable + OneSweep NVIDIA)#8620
Merged
Conversation
Contributor
- ComputeRadixSort: 4-bit shared-memory multi-pass radix sort, WebGPU- only, runs on every device with no subgroup requirement. Selected as the portable fallback after benchmarking wider radixes (6-bit, 8-bit shared, 8-bit subgroup variants) on Apple M1/M2/M4, NVIDIA, and Android Mali/IMG; 4-bit is the single most consistent choice in every target range. - ComputeOneSweepRadixSort: single-sweep 8-bit radix sort with fused histogram / decoupled-lookback / reorder, ported from b0nes164/ GPUSorting. NVIDIA Turing+ only (gated by callers) -- requires forward-thread-progress guarantees that Apple Silicon and mobile Mali/Adreno do not provide reliably. - Radix-sort benchmark example (test/radix-sort-compute): compares the two backends across 4/8/16/24/32-bit key widths and element counts from 1K to 50M, with optional validation against a CPU reference implementation.
ef7dd6c to
f107d20
Compare
Leftover from an earlier iteration that had a scan-kernel factory with Blelloch and CSDLDF variants. PrefixSumKernel.constructor only takes device, so the object was silently ignored.
Contributor
There was a problem hiding this comment.
Pull request overview
Adds an engine-internal WebGPU OneSweep radix sort backend (plus WGSL kernels), improves the existing portable 4-bit compute radix sort’s dispatch/scan sizing, and ships an interactive benchmark/validation example to compare the backends.
Changes:
- Add OneSweep WGSL kernels (global histogram, scan, fused binning/lookback).
- Add
ComputeOneSweepRadixSortimplementation and export it. - Update
ComputeRadixSortto decouple dispatch/scan size from buffer capacity and exposeradixBits; expand the example to benchmark/validate both modes.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| src/scene/shader-lib/wgsl/chunks/radix-sort/onesweep-scan.js | New WGSL scan kernel for OneSweep histogram prefixes. |
| src/scene/shader-lib/wgsl/chunks/radix-sort/onesweep-global-hist.js | New WGSL fused global histogram kernel (vec4 loads). |
| src/scene/shader-lib/wgsl/chunks/radix-sort/onesweep-binning.js | New WGSL fused binning + lookback kernel (OneSweep core). |
| src/scene/graphics/compute-radix-sort.js | Dispatch/scan sizing fixed to follow current elementCount; add radixBits; rename GPU profiler pass labels. |
| src/scene/graphics/compute-onesweep-radix-sort.js | New OneSweep compute sort backend, shaders/formats, buffer management, dispatch flow. |
| src/index.js | Export ComputeOneSweepRadixSort. |
| examples/src/examples/test/radix-sort-compute.example.mjs | Add mode selection, overlays, benchmark + validation harnesses. |
| examples/src/examples/test/radix-sort-compute.controls.mjs | Add UI controls for mode/render/validation and benchmark/validate actions. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Summary
ComputeOneSweepRadixSort— single-sweep 8-bit radix sort with fused histogram / decoupled-lookback / reorder. Ported from b0nes164/GPUSorting (Thomas Smith, MIT). NVIDIA Turing+ only — requires forward-thread-progress guarantees that Apple Silicon, Mali and Adreno do not provide reliably. Not wired up in production yet.ComputeRadixSort(4-bit multi-pass) — decouples dispatch/scan size from buffer capacity so a small sort following a large one no longer pays the full capacity cost. Also adds aradixBitsgetter so callers can align key-bit counts generically across sort backends.test/radix-sort-compute) comparing both backends across 4/8/16/24/32-bit key widths and element counts from 1K to 50M, with optional CPU-reference validation.No public API changes
Both
ComputeRadixSortandComputeOneSweepRadixSortare tagged@ignore— they are engine-internal and not part of the published API. TheradixBitsgetter added toComputeRadixSortis likewise internal, used by the gsplat manager / benchmark to route key-bit counts between backends.Production routing
gsplat-manager.jscontinues to usenew ComputeRadixSort(device)unconditionally. A follow-up PR will add a thin facade that routes NVIDIA adapters toComputeOneSweepRadixSort. This PR is deliberately just "make the backend available + benchmark it".Dependencies
Depends on
device.minSubgroupSize/device.maxSubgroupSizeadded in #8645 (already on main).ComputeOneSweepRadixSortparametrises its shared-memory layout fromdevice.maxSubgroupSizeto support 16 / 32 / 64 / 128 lane subgroup hardware from the same source.Performance
Sort-only GPU time (Forward pass excluded), 24-bit keys, 20 warmup + 50 measured frames per cell.
OneSweep-safebelow is the CSDLDF-backed fallback that got cut.NVIDIA RTX 2070 (Turing) — OneSweep dominates across the full range (up to ~5× faster than 4-bit at 25M+).
8-coalescedis a respectable second but has no routing advantage over OneSweep.OneSweep-safe(CSDLDF) collapses badly past 5M — the reason the CSDLDF fallback was dropped.NVIDIA Quadro M2000 (Ampere) — same shape as Turing: OneSweep wins, CSDLDF-backed variant degrades sharply past a few million elements.
Apple M1 (metal-3) — 4-bit is the clear winner across the range that matters (4–8M target).
8-rankedregresses past 5M (0.53× at 20M) and8-coalescedis similar. OneSweep (plain and CSDLDF-safe) is non-competitive on Apple because lookback stalls are uncapped. This chart drove the "ship 4-bit as the single portable fallback" decision.Apple M4 (metal-3) — very different story:
8-rankedis 1.2–1.7× faster than 4-bit across every size and even beats OneSweep. M4 is not the Apple priority though (it already has a fast local-compute GSplat path), and since we ship one Apple fallback, M1/M2 win the tie-break. Leaving8-rankedin the branch history for a potential future M4-specific routing.Experiments that didn't make it
Several alternative radix-sort variants were explored before settling on "4-bit portable + OneSweep on NVIDIA". Listing them here so the reasoning isn't lost — the history is preserved on branch
mv-radix-8bit-subgroupup to commitbe5ffb304if anyone needs to resurrect one.CSDLDF (Chained-Scan Decoupled-Lookback Decoupled-Fallback) scan kernel, plus a dedicated scan-kernel factory that switched between CSDLDF and Blelloch. Intended to give OneSweep a portable fallback for devices without forward-thread-progress guarantees. On NVIDIA it validates correctly but serialises badly (tens to >100× slower than plain lookback at larger sizes). On Apple M4 it's slightly faster than plain OneSweep at 2–3M (10–20% win), but both OneSweep variants are dominated by 4-bit on Apple, so the theoretical advantage doesn't translate into a real win for any device in the current matrix. Removed for now — the algorithm itself is sound and may be worth resurrecting if a device shows up where plain lookback deadlocks and CSDLDF beats 4-bit.
8-bit shared-memory radix sort (no subgroup intrinsics). Wider radix → fewer passes, but the resulting 256-bucket bitmasks blow out shared memory enough that Mali hit a recursion cliff and M4 regressed vs 4-bit. Removed.
8-bit subgroup radix sort in four reorder variants (
ballot,packed,ranked,coalesced). Uses subgroup intrinsics for ranking. Genuinely strong on Apple M4 —rankedis 1.2–1.7× faster than 4-bit across every size and beats OneSweep too. However, M1/M2 are the Apple priority (M4 already has a fast local-compute GSplat path), and on M1/M2 the ranked variant regresses past 5M (0.53× at 20M). Since we only ship one Apple fallback, M1/M2 perf wins, so 4-bit is the safer pick.8-coalescedis also competitive on NVIDIA Ampere up to ~4M but OneSweep covers that range comfortably. Removed for simplicity — if an M4-specific (or desktop-class Apple) routing becomes worthwhile,rankedis the variant to revisit.6-bit shared-memory radix sort, with two optimisations on top (bank-conflict-friendly bitmask layout, shadow-copy bitmask reads). Attractive on paper as the middle ground between 4-bit and 8-bit. Apple M4 and Pixel 8 Pro looked promising, but Apple M1/M2 regressed vs 4-bit in the 4–8M target range (the range that actually matters for those devices — M1/M2 lacks the compute perf headroom of M4). Shadow-copy was reverted to test whether LDS-occupancy pressure explained the regression; the regression remained, so the cause is architectural rather than LDS pressure. Removed.
Net result: one portable backend (
ComputeRadixSort, 4-bit) that wins or is close-second on every target, plus one NVIDIA-specialised backend (ComputeOneSweepRadixSort, OneSweep) for where the hardware/driver can actually deliver on its lookback-based fusion.Platform coverage / benchmarks
Benchmarking was done on:
For the Unified GSplat manager's dynamic 10–20 bit key range,
ComputeRadixSortvalidates and performs consistently across the full platform matrix;ComputeOneSweepRadixSortvalidates and wins by a meaningful margin on NVIDIA but fails validation on the mobile SoCs tested.Test plan
ComputeRadixSortvalidates across full range; OneSweep should remain disabledComputeRadixSortvalidates across full rangegsplat-manager.jscontinue to sort correctly (still routes toComputeRadixSort)