Fix wp.tid() / launch-dim mismatches exposed by warp-lang 1.13 templated launch_bounds by adenzler-nvidia · Pull Request #2546 · newton-physics/newton

adenzler-nvidia · 2026-04-22T20:04:54Z

Description

Two latent wp.tid() / launch-dim mismatches that are silently benign on warp-lang 1.12.x but become out-of-bounds reads (or, for _tiled_sum_kernel, wildly wrong tile offsets) on 1.13 nightlies because of the templated launch_bounds_t<N> work landed in NVIDIA/warp#1270. Under the old shape[4] + ndim launch bounds, a single-variable wp.tid() unravelled to the first-dim index coord.i; under the new templated bounds, kernel_dim is inferred from the number of values unpacked from wp.tid() and the launch dim is normalised to match, so a scalar tid = wp.tid() now returns a flat index across all launch dims.

1. `apply_forces_kernel` in `example_selection_cartpole.py`

@wp.kernel
def apply_forces_kernel(joint_q: wp.array3d[float], joint_f: wp.array3d[float]):
    tid = wp.tid()
    if joint_q[tid, 0, 0] > 0.0:
        ...

wp.launch(apply_forces_kernel, dim=joint_f.shape, inputs=[joint_q, joint_f])

The launch passes a 3-tuple shape (nworld, 1, ndof). Old: tid ∈ [0, nworld) — the kernel just ran the same work ndof times per world. New: tid ∈ [0, nworld*ndof), driving joint_q[tid, 0, 0] out of bounds.

The kernel only needs one thread per world, so the launch is fixed to dim=joint_f.shape[0].

2. `_tiled_sum_kernel` in `solve_rheology.py`

@wp.kernel
def _tiled_sum_kernel(data, partial_sums):
    block_id = wp.tid()
    tile = wp.tile_load(data[0], shape=TILE_SIZE, offset=block_id * TILE_SIZE)
    ...

wp.launch(_tiled_sum_kernel, dim=(num_blocks, tile_size), ..., block_dim=tile_size, record_cmd=True)

This is a manual 2-D launch equivalent to the old wp.launch_tiled desugaring (pre-GH-1270 launch_tiled appended block_dim onto dim). Under the new semantics, kernel_dim=1 collapses the launch to a flat (num_blocks*tile_size,), and block_id * TILE_SIZE walks far past the data buffer.

Fixed by unpacking the second index: block_id, _ = wp.tid(). That's launcher-agnostic — works with both the current wp.launch(dim=(num_blocks, tile_size), block_dim=tile_size) and a future wp.launch_tiled(dim=num_blocks, block_dim=tile_size) (which takes the kernel_dim == ndim + 1 branch in _construct_tiled_bounds).

Not migrating to wp.launch_tiled here because the code caches a wp.Launch via record_cmd=True and calls Launch.set_dim(...) each iteration, and set_dim currently rebuilds bounds from scratch via launch_bounds_t(...), which drops the tiled=True flag. That looks like a separate follow-on issue in warp-lang worth fixing upstream.

Checklist

New or existing tests cover these changes (test_selection.example_selection_cartpole_cpu; test_implicit_mpm covers _tiled_sum_kernel on CUDA via ArraySquaredNorm)
The documentation is up to date with these changes
CHANGELOG.md has been updated (if user-facing change)

Test plan

On warp-lang==1.13.0.dev20260415 and 1.13.0.dev20260422, test_selection.example_selection_cartpole_cpu went from a 0–30% pass rate to 10/10 locally:
```
CUDA_VISIBLE_DEVICES= uv run --extra dev -m newton.tests \
    -k test_selection.example_selection_cartpole_cpu
```
Verified with wp.config.mode="debug" that the same test no longer trips array.h:457 (3-D index bounds) after the cartpole fix.
_tiled_sum_kernel is exercised by SolverImplicitMPM (CUDA-only path via ArraySquaredNorm). test_implicit_mpm covers it on CUDA.

Bug fix

Steps to reproduce (without this PR, against any warp-lang>=1.13.0.dev20260415):

CUDA_VISIBLE_DEVICES= uv run --extra dev -m newton.tests -k test_selection.example_selection_cartpole_cpu

Fails intermittently with AssertionError: -11 != 0 (SIGSEGV) or -6 (SIGABRT / double free or corruption). Under wp.config.mode="debug" the assertion is caught explicitly in warp/native/array.h:457 as an out-of-bounds read into the joint_q array.

Related

Also unblocks #2528 (warp-lang nightly bump), where the same test fails on every OS.

Reference: NVIDIA/warp#1385 (discussion of the surfaced bugs).

Summary by CodeRabbit

Bug Fixes
- Fixed kernel execution sizing in the CartPole example to ensure correct force application across batches, improving simulation accuracy.
- Resolved thread-indexing behavior in the implicit MPM rheology solver for stable and correct tile-based reductions, improving solver reliability.

apply_forces_kernel uses a single `tid = wp.tid()` and indexes its joint_q/joint_f arrays as [tid, 0, 0], so the launch is meant to iterate one thread per world. The call site was launching with dim=joint_f.shape (a 3-tuple like (nworld, 1, 3)), so wp.tid() was returning values larger than the world count. Under the old shape[4]+ndim launch_bounds_t, wp.tid() happened to unravel to the first-dimension index, which kept accesses in-bounds but redundantly computed the same forces ndof times per world. After upstream warp-lang's templated launch_bounds_t<N> change (newton-physicsGH-1270), kernel_dim is inferred from the number of values returned by wp.tid() and the user-provided dim is normalised to match. A single-variable wp.tid() now deduces kernel_dim=1, collapses the 3-D launch to a flat 1-D bounds, and wp.tid() returns values in [0, nworld*1*ndof), producing out-of-bounds reads into joint_q. Launch with dim=joint_f.shape[0] so each thread maps to exactly one world, matching the kernel body.

coderabbitai · 2026-04-22T20:05:08Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: cdb48e14-1dca-4205-8254-5a8117d284bc

📥 Commits

Reviewing files that changed from the base of the PR and between bd1dcfd and a418272.

📒 Files selected for processing (1)

newton/_src/solvers/implicit_mpm/solve_rheology.py

📝 Walkthrough

Walkthrough

Two small fixes: the Warp kernel launch in the non-torch control path now uses the batch/world count (joint_f.shape[0]) as dim; a tiled reduction kernel unpacks wp.tid() into (block_id, _) to match a two-component return and preserve indexing semantics.

Changes

Cohort / File(s)	Summary
Kernel Launch Parameter `newton/examples/selection/example_selection_cartpole.py`	Changed `wp.launch` for `apply_forces_kernel` to pass `dim=joint_f.shape[0]` (batch/world count) instead of the full tensor shape.
Kernel Thread Indexing `newton/_src/solvers/implicit_mpm/solve_rheology.py`	Updated `_tiled_sum_kernel` to unpack `wp.tid()` as `(block_id, _)`, keeping reduction logic intact while matching the two-component tid signature.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly addresses the main changes: fixing wp.tid() and launch-dim mismatches in kernel launches across two files, which is the core issue described in the PR objectives.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

newton/examples/selection/example_selection_cartpole.py (1)
157-157: Consider adding a CHANGELOG entry under [Unreleased] → Fixed.

Per repo guidelines, user-facing changes should have a CHANGELOG.md entry. This fix resolves intermittent segfaults/heap corruption in the selection_cartpole example under recent Warp versions, which is user-visible. Suggest something like:
- Fix `apply_forces_kernel` launch dimension in the selection cartpole example to prevent out-of-bounds access under Warp's templated `launch_bounds_t`.
As per coding guidelines: "Check that any user-facing change includes a corresponding entry in CHANGELOG.md under the [Unreleased] section."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@newton/examples/selection/example_selection_cartpole.py` at line 157, Add a
user-facing CHANGELOG.md entry under the [Unreleased] → Fixed section describing
the bugfix for the selection cartpole example: note that the launch dimension
for apply_forces_kernel in
newton/examples/selection/example_selection_cartpole.py was corrected to prevent
out-of-bounds access and intermittent segfaults/heap corruption under recent
Warp templated launch_bounds_t; use a concise message such as: "Fix
`apply_forces_kernel` launch dimension in the selection cartpole example to
prevent out-of-bounds access under Warp's templated `launch_bounds_t`."

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@newton/examples/selection/example_selection_cartpole.py`:
- Line 157: Add a user-facing CHANGELOG.md entry under the [Unreleased] → Fixed
section describing the bugfix for the selection cartpole example: note that the
launch dimension for apply_forces_kernel in
newton/examples/selection/example_selection_cartpole.py was corrected to prevent
out-of-bounds access and intermittent segfaults/heap corruption under recent
Warp templated launch_bounds_t; use a concise message such as: "Fix
`apply_forces_kernel` launch dimension in the selection cartpole example to
prevent out-of-bounds access under Warp's templated `launch_bounds_t`."

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 66ee54be-4632-48ae-9349-cff7941996d3

📥 Commits

Reviewing files that changed from the base of the PR and between 63953d7 and bd1dcfd.

📒 Files selected for processing (1)

newton/examples/selection/example_selection_cartpole.py

_tiled_sum_kernel is launched with a 2-D dim=(num_blocks, tile_size) and block_dim=tile_size — a manual equivalent of the pre-newton-physicsGH-1270 wp.launch_tiled desugaring. The kernel body used `block_id = wp.tid()`, which inferred kernel_dim=1 under upstream warp-lang's templated launch_bounds_t<N> and caused the launch grid to collapse to a flat 1-D bounds, so wp.tid() returned values in [0, num_blocks*tile_size) instead of block indices in [0, num_blocks). Switch to `block_id, _ = wp.tid()` so kernel_dim=2 matches the launch arity. This is launcher-agnostic: the tid()-arity-2 form unravels correctly for both wp.launch(dim=(num_blocks, tile_size)) and wp.launch_tiled(dim=num_blocks, block_dim=tile_size) (which takes the ndim+1 branch in _construct_tiled_bounds).

codecov · 2026-04-22T20:39:28Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

📢 Thoughts on this report? Let us know!

Replace the silent fold/pad in _normalize_launch_dim with a hard ValueError on rank mismatch. Introduce _prepare_launch_dim(dim, kernel, *, tiled=False) which canonicalizes dim, bypasses validation for kernels with no wp.tid() call (flattening any shape to total thread count via math.prod), and raises ValueError with a concrete migration hint listing 2-4 valid fix options computed from the specific call. Wire into wp.launch(), Launch.set_dim(), wp.launch_tiled() via _construct_tiled_bounds, and the experimental JAX FFI paths (custom_call.py, ffi.py). Remove _normalize_launch_dim entirely. Launch.set_dim() preserves tiled semantics on recorded tiled launches. The launch_tiled rank-mismatch error type changes from RuntimeError to ValueError for consistency with the regular launch path; kernels that do not call wp.tid() continue to accept any dim. Previously excess dimensions were silently folded into the last slot, which could cause out-of-bounds indexing and heap corruption in downstream kernels — the trigger took a few hours of debugging to isolate in Newton (newton-physics/newton#2546) because the overflow accumulates until an allocator validator notices. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Eric Shi <ershi@nvidia.com>

The pre-release audit found user-facing commits that landed since v1.1.0 without an [Unreleased] entry. Backfill five of them here. Kamino-related commits (newton-physics#2280, newton-physics#2517, newton-physics#2554, newton-physics#2575) are intentionally omitted: Kamino is still alpha and not currently tracked in CHANGELOG. - Changed: contact-conversion overhead reduction in SolverMuJoCo ([newton-physicsGH-2393](newton-physics#2393)) - Fixed: target_voxel_size ignored on the texture-SDF path ([newton-physicsGH-2464](newton-physics#2464)) - Fixed: Newton-to-mujoco-warp material-combination mismatch ([newton-physicsGH-2439](newton-physics#2439)) - Fixed: eq_objtype mismatch on joint equality / mimic constraints in SolverMuJoCo ([newton-physicsGH-2459](newton-physics#2459)) - Fixed: _tiled_sum_kernel launch-dim handling under warp-lang 1.13 ([newton-physicsGH-2546](newton-physics#2546))

coderabbitai Bot reviewed Apr 22, 2026

View reviewed changes

adenzler-nvidia changed the title ~~Fix apply_forces_kernel launch dim in selection cartpole example~~ Fix wp.tid() / launch-dim mismatches exposed by warp-lang 1.13 templated launch_bounds Apr 22, 2026

c0d1f1ed mentioned this pull request Apr 22, 2026

Alert when wp.launch() dim has more dimensions than the kernel's wp.tid() unpacks NVIDIA/warp#1389

Open

eric-heiden approved these changes Apr 22, 2026

View reviewed changes

eric-heiden added this pull request to the merge queue Apr 22, 2026

Merged via the queue into newton-physics:main with commit 329f31c Apr 23, 2026
25 checks passed

shi-eric mentioned this pull request Apr 23, 2026

Hard-error on wp.launch() dim-rank mismatch NVIDIA/warp#1390

Closed

3 tasks

shi-eric mentioned this pull request Apr 24, 2026

[REQ] Enhance ease of use for grid-stride loop NVIDIA/warp#1270

Open

c0d1f1ed mentioned this pull request Apr 24, 2026

heap corruption / segfault during Python GC inside capture_while (found in 1.13.0.dev20260422) NVIDIA/warp#1385

Closed

preist-nvidia mentioned this pull request May 4, 2026

Clean up CHANGELOG before 1.2 release branch cut #2684

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix wp.tid() / launch-dim mismatches exposed by warp-lang 1.13 templated launch_bounds#2546

Fix wp.tid() / launch-dim mismatches exposed by warp-lang 1.13 templated launch_bounds#2546
eric-heiden merged 2 commits into
newton-physics:mainfrom
adenzler-nvidia:adenzler/fix-cartpole-apply-forces-tid

adenzler-nvidia commented Apr 22, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 22, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

codecov Bot commented Apr 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

adenzler-nvidia commented Apr 22, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

1. apply_forces_kernel in example_selection_cartpole.py

2. _tiled_sum_kernel in solve_rheology.py

Checklist

Test plan

Bug fix

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Apr 22, 2026

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

adenzler-nvidia commented Apr 22, 2026 •

edited by coderabbitai Bot

Loading

1. `apply_forces_kernel` in `example_selection_cartpole.py`

2. `_tiled_sum_kernel` in `solve_rheology.py`

coderabbitai Bot commented Apr 22, 2026 •

edited

Loading