Fix wp.tid() / launch-dim mismatches exposed by warp-lang 1.13 templated launch_bounds#2546
Conversation
apply_forces_kernel uses a single `tid = wp.tid()` and indexes its joint_q/joint_f arrays as [tid, 0, 0], so the launch is meant to iterate one thread per world. The call site was launching with dim=joint_f.shape (a 3-tuple like (nworld, 1, 3)), so wp.tid() was returning values larger than the world count. Under the old shape[4]+ndim launch_bounds_t, wp.tid() happened to unravel to the first-dimension index, which kept accesses in-bounds but redundantly computed the same forces ndof times per world. After upstream warp-lang's templated launch_bounds_t<N> change (newton-physicsGH-1270), kernel_dim is inferred from the number of values returned by wp.tid() and the user-provided dim is normalised to match. A single-variable wp.tid() now deduces kernel_dim=1, collapses the 3-D launch to a flat 1-D bounds, and wp.tid() returns values in [0, nworld*1*ndof), producing out-of-bounds reads into joint_q. Launch with dim=joint_f.shape[0] so each thread maps to exactly one world, matching the kernel body.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughTwo small fixes: the Warp kernel launch in the non-torch control path now uses the batch/world count ( Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
newton/examples/selection/example_selection_cartpole.py (1)
157-157: Consider adding a CHANGELOG entry under [Unreleased] → Fixed.Per repo guidelines, user-facing changes should have a CHANGELOG.md entry. This fix resolves intermittent segfaults/heap corruption in the
selection_cartpoleexample under recent Warp versions, which is user-visible. Suggest something like:- Fix `apply_forces_kernel` launch dimension in the selection cartpole example to prevent out-of-bounds access under Warp's templated `launch_bounds_t`.As per coding guidelines: "Check that any user-facing change includes a corresponding entry in CHANGELOG.md under the [Unreleased] section."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@newton/examples/selection/example_selection_cartpole.py` at line 157, Add a user-facing CHANGELOG.md entry under the [Unreleased] → Fixed section describing the bugfix for the selection cartpole example: note that the launch dimension for apply_forces_kernel in newton/examples/selection/example_selection_cartpole.py was corrected to prevent out-of-bounds access and intermittent segfaults/heap corruption under recent Warp templated launch_bounds_t; use a concise message such as: "Fix `apply_forces_kernel` launch dimension in the selection cartpole example to prevent out-of-bounds access under Warp's templated `launch_bounds_t`."
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@newton/examples/selection/example_selection_cartpole.py`:
- Line 157: Add a user-facing CHANGELOG.md entry under the [Unreleased] → Fixed
section describing the bugfix for the selection cartpole example: note that the
launch dimension for apply_forces_kernel in
newton/examples/selection/example_selection_cartpole.py was corrected to prevent
out-of-bounds access and intermittent segfaults/heap corruption under recent
Warp templated launch_bounds_t; use a concise message such as: "Fix
`apply_forces_kernel` launch dimension in the selection cartpole example to
prevent out-of-bounds access under Warp's templated `launch_bounds_t`."
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 66ee54be-4632-48ae-9349-cff7941996d3
📒 Files selected for processing (1)
newton/examples/selection/example_selection_cartpole.py
_tiled_sum_kernel is launched with a 2-D dim=(num_blocks, tile_size) and block_dim=tile_size — a manual equivalent of the pre-newton-physicsGH-1270 wp.launch_tiled desugaring. The kernel body used `block_id = wp.tid()`, which inferred kernel_dim=1 under upstream warp-lang's templated launch_bounds_t<N> and caused the launch grid to collapse to a flat 1-D bounds, so wp.tid() returned values in [0, num_blocks*tile_size) instead of block indices in [0, num_blocks). Switch to `block_id, _ = wp.tid()` so kernel_dim=2 matches the launch arity. This is launcher-agnostic: the tid()-arity-2 form unravels correctly for both wp.launch(dim=(num_blocks, tile_size)) and wp.launch_tiled(dim=num_blocks, block_dim=tile_size) (which takes the ndim+1 branch in _construct_tiled_bounds).
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Replace the silent fold/pad in _normalize_launch_dim with a hard ValueError on rank mismatch. Introduce _prepare_launch_dim(dim, kernel, *, tiled=False) which canonicalizes dim, bypasses validation for kernels with no wp.tid() call (flattening any shape to total thread count via math.prod), and raises ValueError with a concrete migration hint listing 2-4 valid fix options computed from the specific call. Wire into wp.launch(), Launch.set_dim(), wp.launch_tiled() via _construct_tiled_bounds, and the experimental JAX FFI paths (custom_call.py, ffi.py). Remove _normalize_launch_dim entirely. Launch.set_dim() preserves tiled semantics on recorded tiled launches. The launch_tiled rank-mismatch error type changes from RuntimeError to ValueError for consistency with the regular launch path; kernels that do not call wp.tid() continue to accept any dim. Previously excess dimensions were silently folded into the last slot, which could cause out-of-bounds indexing and heap corruption in downstream kernels — the trigger took a few hours of debugging to isolate in Newton (newton-physics/newton#2546) because the overflow accumulates until an allocator validator notices. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Eric Shi <ershi@nvidia.com>
Replace the silent fold/pad in _normalize_launch_dim with a hard ValueError on rank mismatch. Introduce _prepare_launch_dim(dim, kernel, *, tiled=False) which canonicalizes dim, bypasses validation for kernels with no wp.tid() call (flattening any shape to total thread count via math.prod), and raises ValueError with a concrete migration hint listing 2-4 valid fix options computed from the specific call. Wire into wp.launch(), Launch.set_dim(), wp.launch_tiled() via _construct_tiled_bounds, and the experimental JAX FFI paths (custom_call.py, ffi.py). Remove _normalize_launch_dim entirely. Launch.set_dim() preserves tiled semantics on recorded tiled launches. The launch_tiled rank-mismatch error type changes from RuntimeError to ValueError for consistency with the regular launch path; kernels that do not call wp.tid() continue to accept any dim. Previously excess dimensions were silently folded into the last slot, which could cause out-of-bounds indexing and heap corruption in downstream kernels — the trigger took a few hours of debugging to isolate in Newton (newton-physics/newton#2546) because the overflow accumulates until an allocator validator notices. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Eric Shi <ershi@nvidia.com>
Replace the silent fold/pad in _normalize_launch_dim with a hard ValueError on rank mismatch. Introduce _prepare_launch_dim(dim, kernel, *, tiled=False) which canonicalizes dim, bypasses validation for kernels with no wp.tid() call (flattening any shape to total thread count via math.prod), and raises ValueError with a concrete migration hint listing 2-4 valid fix options computed from the specific call. Wire into wp.launch(), Launch.set_dim(), wp.launch_tiled() via _construct_tiled_bounds, and the experimental JAX FFI paths (custom_call.py, ffi.py). Remove _normalize_launch_dim entirely. Launch.set_dim() preserves tiled semantics on recorded tiled launches. The launch_tiled rank-mismatch error type changes from RuntimeError to ValueError for consistency with the regular launch path; kernels that do not call wp.tid() continue to accept any dim. Previously excess dimensions were silently folded into the last slot, which could cause out-of-bounds indexing and heap corruption in downstream kernels — the trigger took a few hours of debugging to isolate in Newton (newton-physics/newton#2546) because the overflow accumulates until an allocator validator notices. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Eric Shi <ershi@nvidia.com>
Replace the silent fold/pad in _normalize_launch_dim with a hard ValueError on rank mismatch. Introduce _prepare_launch_dim(dim, kernel, *, tiled=False) which canonicalizes dim, bypasses validation for kernels with no wp.tid() call (flattening any shape to total thread count via math.prod), and raises ValueError with a concrete migration hint listing 2-4 valid fix options computed from the specific call. Wire into wp.launch(), Launch.set_dim(), wp.launch_tiled() via _construct_tiled_bounds, and the experimental JAX FFI paths (custom_call.py, ffi.py). Remove _normalize_launch_dim entirely. Launch.set_dim() preserves tiled semantics on recorded tiled launches. The launch_tiled rank-mismatch error type changes from RuntimeError to ValueError for consistency with the regular launch path; kernels that do not call wp.tid() continue to accept any dim. Previously excess dimensions were silently folded into the last slot, which could cause out-of-bounds indexing and heap corruption in downstream kernels — the trigger took a few hours of debugging to isolate in Newton (newton-physics/newton#2546) because the overflow accumulates until an allocator validator notices. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Eric Shi <ershi@nvidia.com>
Replace the silent fold/pad in _normalize_launch_dim with a hard ValueError on rank mismatch. Introduce _prepare_launch_dim(dim, kernel, *, tiled=False) which canonicalizes dim, bypasses validation for kernels with no wp.tid() call (flattening any shape to total thread count via math.prod), and raises ValueError with a concrete migration hint listing 2-4 valid fix options computed from the specific call. Wire into wp.launch(), Launch.set_dim(), wp.launch_tiled() via _construct_tiled_bounds, and the experimental JAX FFI paths (custom_call.py, ffi.py). Remove _normalize_launch_dim entirely. Launch.set_dim() preserves tiled semantics on recorded tiled launches. The launch_tiled rank-mismatch error type changes from RuntimeError to ValueError for consistency with the regular launch path; kernels that do not call wp.tid() continue to accept any dim. Previously excess dimensions were silently folded into the last slot, which could cause out-of-bounds indexing and heap corruption in downstream kernels — the trigger took a few hours of debugging to isolate in Newton (newton-physics/newton#2546) because the overflow accumulates until an allocator validator notices. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Eric Shi <ershi@nvidia.com>
The pre-release audit found user-facing commits that landed since v1.1.0 without an [Unreleased] entry. Backfill five of them here. Kamino-related commits (newton-physics#2280, newton-physics#2517, newton-physics#2554, newton-physics#2575) are intentionally omitted: Kamino is still alpha and not currently tracked in CHANGELOG. - Changed: contact-conversion overhead reduction in SolverMuJoCo ([newton-physicsGH-2393](newton-physics#2393)) - Fixed: target_voxel_size ignored on the texture-SDF path ([newton-physicsGH-2464](newton-physics#2464)) - Fixed: Newton-to-mujoco-warp material-combination mismatch ([newton-physicsGH-2439](newton-physics#2439)) - Fixed: eq_objtype mismatch on joint equality / mimic constraints in SolverMuJoCo ([newton-physicsGH-2459](newton-physics#2459)) - Fixed: _tiled_sum_kernel launch-dim handling under warp-lang 1.13 ([newton-physicsGH-2546](newton-physics#2546))
Description
Two latent
wp.tid()/ launch-dim mismatches that are silently benign onwarp-lang1.12.x but become out-of-bounds reads (or, for_tiled_sum_kernel, wildly wrong tile offsets) on 1.13 nightlies because of the templatedlaunch_bounds_t<N>work landed inNVIDIA/warp#1270. Under the oldshape[4] + ndimlaunch bounds, a single-variablewp.tid()unravelled to the first-dim indexcoord.i; under the new templated bounds,kernel_dimis inferred from the number of values unpacked fromwp.tid()and the launch dim is normalised to match, so a scalartid = wp.tid()now returns a flat index across all launch dims.1.
apply_forces_kernelinexample_selection_cartpole.pyThe launch passes a 3-tuple shape
(nworld, 1, ndof). Old:tid ∈ [0, nworld)— the kernel just ran the same workndoftimes per world. New:tid ∈ [0, nworld*ndof), drivingjoint_q[tid, 0, 0]out of bounds.The kernel only needs one thread per world, so the launch is fixed to
dim=joint_f.shape[0].2.
_tiled_sum_kernelinsolve_rheology.pyThis is a manual 2-D launch equivalent to the old
wp.launch_tileddesugaring (pre-GH-1270launch_tiledappendedblock_dimontodim). Under the new semantics,kernel_dim=1collapses the launch to a flat(num_blocks*tile_size,), andblock_id * TILE_SIZEwalks far past the data buffer.Fixed by unpacking the second index:
block_id, _ = wp.tid(). That's launcher-agnostic — works with both the currentwp.launch(dim=(num_blocks, tile_size), block_dim=tile_size)and a futurewp.launch_tiled(dim=num_blocks, block_dim=tile_size)(which takes thekernel_dim == ndim + 1branch in_construct_tiled_bounds).Not migrating to
wp.launch_tiledhere because the code caches awp.Launchviarecord_cmd=Trueand callsLaunch.set_dim(...)each iteration, andset_dimcurrently rebuildsboundsfrom scratch vialaunch_bounds_t(...), which drops thetiled=Trueflag. That looks like a separate follow-on issue in warp-lang worth fixing upstream.Checklist
test_selection.example_selection_cartpole_cpu;test_implicit_mpmcovers_tiled_sum_kernelon CUDA viaArraySquaredNorm)CHANGELOG.mdhas been updated (if user-facing change)Test plan
warp-lang==1.13.0.dev20260415and1.13.0.dev20260422,test_selection.example_selection_cartpole_cpuwent from a 0–30% pass rate to 10/10 locally:wp.config.mode="debug"that the same test no longer tripsarray.h:457(3-D index bounds) after the cartpole fix._tiled_sum_kernelis exercised bySolverImplicitMPM(CUDA-only path viaArraySquaredNorm).test_implicit_mpmcovers it on CUDA.Bug fix
Steps to reproduce (without this PR, against any
warp-lang>=1.13.0.dev20260415):Fails intermittently with
AssertionError: -11 != 0(SIGSEGV) or-6(SIGABRT /double free or corruption). Underwp.config.mode="debug"the assertion is caught explicitly inwarp/native/array.h:457as an out-of-bounds read into thejoint_qarray.Related
Also unblocks #2528 (warp-lang nightly bump), where the same test fails on every OS.
Reference: NVIDIA/warp#1385 (discussion of the surfaced bugs).
Summary by CodeRabbit