Skip to content

Fix wp.tid() / launch-dim mismatches exposed by warp-lang 1.13 templated launch_bounds#2546

Merged
eric-heiden merged 2 commits into
newton-physics:mainfrom
adenzler-nvidia:adenzler/fix-cartpole-apply-forces-tid
Apr 23, 2026
Merged

Fix wp.tid() / launch-dim mismatches exposed by warp-lang 1.13 templated launch_bounds#2546
eric-heiden merged 2 commits into
newton-physics:mainfrom
adenzler-nvidia:adenzler/fix-cartpole-apply-forces-tid

Conversation

@adenzler-nvidia

@adenzler-nvidia adenzler-nvidia commented Apr 22, 2026

Copy link
Copy Markdown
Member

Description

Two latent wp.tid() / launch-dim mismatches that are silently benign on warp-lang 1.12.x but become out-of-bounds reads (or, for _tiled_sum_kernel, wildly wrong tile offsets) on 1.13 nightlies because of the templated launch_bounds_t<N> work landed in NVIDIA/warp#1270. Under the old shape[4] + ndim launch bounds, a single-variable wp.tid() unravelled to the first-dim index coord.i; under the new templated bounds, kernel_dim is inferred from the number of values unpacked from wp.tid() and the launch dim is normalised to match, so a scalar tid = wp.tid() now returns a flat index across all launch dims.

1. apply_forces_kernel in example_selection_cartpole.py

@wp.kernel
def apply_forces_kernel(joint_q: wp.array3d[float], joint_f: wp.array3d[float]):
    tid = wp.tid()
    if joint_q[tid, 0, 0] > 0.0:
        ...

wp.launch(apply_forces_kernel, dim=joint_f.shape, inputs=[joint_q, joint_f])

The launch passes a 3-tuple shape (nworld, 1, ndof). Old: tid ∈ [0, nworld) — the kernel just ran the same work ndof times per world. New: tid ∈ [0, nworld*ndof), driving joint_q[tid, 0, 0] out of bounds.

The kernel only needs one thread per world, so the launch is fixed to dim=joint_f.shape[0].

2. _tiled_sum_kernel in solve_rheology.py

@wp.kernel
def _tiled_sum_kernel(data, partial_sums):
    block_id = wp.tid()
    tile = wp.tile_load(data[0], shape=TILE_SIZE, offset=block_id * TILE_SIZE)
    ...

wp.launch(_tiled_sum_kernel, dim=(num_blocks, tile_size), ..., block_dim=tile_size, record_cmd=True)

This is a manual 2-D launch equivalent to the old wp.launch_tiled desugaring (pre-GH-1270 launch_tiled appended block_dim onto dim). Under the new semantics, kernel_dim=1 collapses the launch to a flat (num_blocks*tile_size,), and block_id * TILE_SIZE walks far past the data buffer.

Fixed by unpacking the second index: block_id, _ = wp.tid(). That's launcher-agnostic — works with both the current wp.launch(dim=(num_blocks, tile_size), block_dim=tile_size) and a future wp.launch_tiled(dim=num_blocks, block_dim=tile_size) (which takes the kernel_dim == ndim + 1 branch in _construct_tiled_bounds).

Not migrating to wp.launch_tiled here because the code caches a wp.Launch via record_cmd=True and calls Launch.set_dim(...) each iteration, and set_dim currently rebuilds bounds from scratch via launch_bounds_t(...), which drops the tiled=True flag. That looks like a separate follow-on issue in warp-lang worth fixing upstream.

Checklist

  • New or existing tests cover these changes (test_selection.example_selection_cartpole_cpu; test_implicit_mpm covers _tiled_sum_kernel on CUDA via ArraySquaredNorm)
  • The documentation is up to date with these changes
  • CHANGELOG.md has been updated (if user-facing change)

Test plan

  • On warp-lang==1.13.0.dev20260415 and 1.13.0.dev20260422, test_selection.example_selection_cartpole_cpu went from a 0–30% pass rate to 10/10 locally:
    CUDA_VISIBLE_DEVICES= uv run --extra dev -m newton.tests \
        -k test_selection.example_selection_cartpole_cpu
    
  • Verified with wp.config.mode="debug" that the same test no longer trips array.h:457 (3-D index bounds) after the cartpole fix.
  • _tiled_sum_kernel is exercised by SolverImplicitMPM (CUDA-only path via ArraySquaredNorm). test_implicit_mpm covers it on CUDA.

Bug fix

Steps to reproduce (without this PR, against any warp-lang>=1.13.0.dev20260415):

CUDA_VISIBLE_DEVICES= uv run --extra dev -m newton.tests -k test_selection.example_selection_cartpole_cpu

Fails intermittently with AssertionError: -11 != 0 (SIGSEGV) or -6 (SIGABRT / double free or corruption). Under wp.config.mode="debug" the assertion is caught explicitly in warp/native/array.h:457 as an out-of-bounds read into the joint_q array.

Related

Also unblocks #2528 (warp-lang nightly bump), where the same test fails on every OS.

Reference: NVIDIA/warp#1385 (discussion of the surfaced bugs).

Summary by CodeRabbit

  • Bug Fixes
    • Fixed kernel execution sizing in the CartPole example to ensure correct force application across batches, improving simulation accuracy.
    • Resolved thread-indexing behavior in the implicit MPM rheology solver for stable and correct tile-based reductions, improving solver reliability.

apply_forces_kernel uses a single `tid = wp.tid()` and indexes its
joint_q/joint_f arrays as [tid, 0, 0], so the launch is meant to
iterate one thread per world.  The call site was launching with
dim=joint_f.shape (a 3-tuple like (nworld, 1, 3)), so wp.tid() was
returning values larger than the world count.

Under the old shape[4]+ndim launch_bounds_t, wp.tid() happened to
unravel to the first-dimension index, which kept accesses in-bounds
but redundantly computed the same forces ndof times per world.
After upstream warp-lang's templated launch_bounds_t<N> change
(newton-physicsGH-1270), kernel_dim is inferred from the number of values returned
by wp.tid() and the user-provided dim is normalised to match.  A
single-variable wp.tid() now deduces kernel_dim=1, collapses the 3-D
launch to a flat 1-D bounds, and wp.tid() returns values in
[0, nworld*1*ndof), producing out-of-bounds reads into joint_q.

Launch with dim=joint_f.shape[0] so each thread maps to exactly one
world, matching the kernel body.
@coderabbitai

coderabbitai Bot commented Apr 22, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: cdb48e14-1dca-4205-8254-5a8117d284bc

📥 Commits

Reviewing files that changed from the base of the PR and between bd1dcfd and a418272.

📒 Files selected for processing (1)
  • newton/_src/solvers/implicit_mpm/solve_rheology.py

📝 Walkthrough

Walkthrough

Two small fixes: the Warp kernel launch in the non-torch control path now uses the batch/world count (joint_f.shape[0]) as dim; a tiled reduction kernel unpacks wp.tid() into (block_id, _) to match a two-component return and preserve indexing semantics.

Changes

Cohort / File(s) Summary
Kernel Launch Parameter
newton/examples/selection/example_selection_cartpole.py
Changed wp.launch for apply_forces_kernel to pass dim=joint_f.shape[0] (batch/world count) instead of the full tensor shape.
Kernel Thread Indexing
newton/_src/solvers/implicit_mpm/solve_rheology.py
Updated _tiled_sum_kernel to unpack wp.tid() as (block_id, _), keeping reduction logic intact while matching the two-component tid signature.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly addresses the main changes: fixing wp.tid() and launch-dim mismatches in kernel launches across two files, which is the core issue described in the PR objectives.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
newton/examples/selection/example_selection_cartpole.py (1)

157-157: Consider adding a CHANGELOG entry under [Unreleased] → Fixed.

Per repo guidelines, user-facing changes should have a CHANGELOG.md entry. This fix resolves intermittent segfaults/heap corruption in the selection_cartpole example under recent Warp versions, which is user-visible. Suggest something like:

- Fix `apply_forces_kernel` launch dimension in the selection cartpole example to prevent out-of-bounds access under Warp's templated `launch_bounds_t`.

As per coding guidelines: "Check that any user-facing change includes a corresponding entry in CHANGELOG.md under the [Unreleased] section."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@newton/examples/selection/example_selection_cartpole.py` at line 157, Add a
user-facing CHANGELOG.md entry under the [Unreleased] → Fixed section describing
the bugfix for the selection cartpole example: note that the launch dimension
for apply_forces_kernel in
newton/examples/selection/example_selection_cartpole.py was corrected to prevent
out-of-bounds access and intermittent segfaults/heap corruption under recent
Warp templated launch_bounds_t; use a concise message such as: "Fix
`apply_forces_kernel` launch dimension in the selection cartpole example to
prevent out-of-bounds access under Warp's templated `launch_bounds_t`."
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@newton/examples/selection/example_selection_cartpole.py`:
- Line 157: Add a user-facing CHANGELOG.md entry under the [Unreleased] → Fixed
section describing the bugfix for the selection cartpole example: note that the
launch dimension for apply_forces_kernel in
newton/examples/selection/example_selection_cartpole.py was corrected to prevent
out-of-bounds access and intermittent segfaults/heap corruption under recent
Warp templated launch_bounds_t; use a concise message such as: "Fix
`apply_forces_kernel` launch dimension in the selection cartpole example to
prevent out-of-bounds access under Warp's templated `launch_bounds_t`."

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 66ee54be-4632-48ae-9349-cff7941996d3

📥 Commits

Reviewing files that changed from the base of the PR and between 63953d7 and bd1dcfd.

📒 Files selected for processing (1)
  • newton/examples/selection/example_selection_cartpole.py

_tiled_sum_kernel is launched with a 2-D dim=(num_blocks, tile_size)
and block_dim=tile_size — a manual equivalent of the pre-newton-physicsGH-1270
wp.launch_tiled desugaring.  The kernel body used
`block_id = wp.tid()`, which inferred kernel_dim=1 under upstream
warp-lang's templated launch_bounds_t<N> and caused the launch grid
to collapse to a flat 1-D bounds, so wp.tid() returned values in
[0, num_blocks*tile_size) instead of block indices in [0, num_blocks).

Switch to `block_id, _ = wp.tid()` so kernel_dim=2 matches the launch
arity.  This is launcher-agnostic: the tid()-arity-2 form unravels
correctly for both wp.launch(dim=(num_blocks, tile_size)) and
wp.launch_tiled(dim=num_blocks, block_dim=tile_size) (which takes the
ndim+1 branch in _construct_tiled_bounds).
@adenzler-nvidia adenzler-nvidia changed the title Fix apply_forces_kernel launch dim in selection cartpole example Fix wp.tid() / launch-dim mismatches exposed by warp-lang 1.13 templated launch_bounds Apr 22, 2026
@codecov

codecov Bot commented Apr 22, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

📢 Thoughts on this report? Let us know!

@eric-heiden eric-heiden added this pull request to the merge queue Apr 22, 2026
Merged via the queue into newton-physics:main with commit 329f31c Apr 23, 2026
25 checks passed
shi-eric added a commit to shi-eric/warp that referenced this pull request Apr 23, 2026
Replace the silent fold/pad in _normalize_launch_dim with a hard
ValueError on rank mismatch. Introduce _prepare_launch_dim(dim, kernel,
*, tiled=False) which canonicalizes dim, bypasses validation for
kernels with no wp.tid() call (flattening any shape to total thread
count via math.prod), and raises ValueError with a concrete migration
hint listing 2-4 valid fix options computed from the specific call.

Wire into wp.launch(), Launch.set_dim(), wp.launch_tiled() via
_construct_tiled_bounds, and the experimental JAX FFI paths
(custom_call.py, ffi.py). Remove _normalize_launch_dim entirely.
Launch.set_dim() preserves tiled semantics on recorded tiled launches.

The launch_tiled rank-mismatch error type changes from RuntimeError
to ValueError for consistency with the regular launch path; kernels
that do not call wp.tid() continue to accept any dim.

Previously excess dimensions were silently folded into the last slot,
which could cause out-of-bounds indexing and heap corruption in
downstream kernels — the trigger took a few hours of debugging to
isolate in Newton (newton-physics/newton#2546) because the overflow
accumulates until an allocator validator notices.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Eric Shi <ershi@nvidia.com>
shi-eric added a commit to shi-eric/warp that referenced this pull request Apr 23, 2026
Replace the silent fold/pad in _normalize_launch_dim with a hard
ValueError on rank mismatch. Introduce _prepare_launch_dim(dim, kernel,
*, tiled=False) which canonicalizes dim, bypasses validation for
kernels with no wp.tid() call (flattening any shape to total thread
count via math.prod), and raises ValueError with a concrete migration
hint listing 2-4 valid fix options computed from the specific call.

Wire into wp.launch(), Launch.set_dim(), wp.launch_tiled() via
_construct_tiled_bounds, and the experimental JAX FFI paths
(custom_call.py, ffi.py). Remove _normalize_launch_dim entirely.
Launch.set_dim() preserves tiled semantics on recorded tiled launches.

The launch_tiled rank-mismatch error type changes from RuntimeError
to ValueError for consistency with the regular launch path; kernels
that do not call wp.tid() continue to accept any dim.

Previously excess dimensions were silently folded into the last slot,
which could cause out-of-bounds indexing and heap corruption in
downstream kernels — the trigger took a few hours of debugging to
isolate in Newton (newton-physics/newton#2546) because the overflow
accumulates until an allocator validator notices.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Eric Shi <ershi@nvidia.com>
shi-eric added a commit to shi-eric/warp that referenced this pull request Apr 23, 2026
Replace the silent fold/pad in _normalize_launch_dim with a hard
ValueError on rank mismatch. Introduce _prepare_launch_dim(dim, kernel,
*, tiled=False) which canonicalizes dim, bypasses validation for
kernels with no wp.tid() call (flattening any shape to total thread
count via math.prod), and raises ValueError with a concrete migration
hint listing 2-4 valid fix options computed from the specific call.

Wire into wp.launch(), Launch.set_dim(), wp.launch_tiled() via
_construct_tiled_bounds, and the experimental JAX FFI paths
(custom_call.py, ffi.py). Remove _normalize_launch_dim entirely.
Launch.set_dim() preserves tiled semantics on recorded tiled launches.

The launch_tiled rank-mismatch error type changes from RuntimeError
to ValueError for consistency with the regular launch path; kernels
that do not call wp.tid() continue to accept any dim.

Previously excess dimensions were silently folded into the last slot,
which could cause out-of-bounds indexing and heap corruption in
downstream kernels — the trigger took a few hours of debugging to
isolate in Newton (newton-physics/newton#2546) because the overflow
accumulates until an allocator validator notices.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Eric Shi <ershi@nvidia.com>
shi-eric added a commit to shi-eric/warp that referenced this pull request Apr 23, 2026
Replace the silent fold/pad in _normalize_launch_dim with a hard
ValueError on rank mismatch. Introduce _prepare_launch_dim(dim, kernel,
*, tiled=False) which canonicalizes dim, bypasses validation for
kernels with no wp.tid() call (flattening any shape to total thread
count via math.prod), and raises ValueError with a concrete migration
hint listing 2-4 valid fix options computed from the specific call.

Wire into wp.launch(), Launch.set_dim(), wp.launch_tiled() via
_construct_tiled_bounds, and the experimental JAX FFI paths
(custom_call.py, ffi.py). Remove _normalize_launch_dim entirely.
Launch.set_dim() preserves tiled semantics on recorded tiled launches.

The launch_tiled rank-mismatch error type changes from RuntimeError
to ValueError for consistency with the regular launch path; kernels
that do not call wp.tid() continue to accept any dim.

Previously excess dimensions were silently folded into the last slot,
which could cause out-of-bounds indexing and heap corruption in
downstream kernels — the trigger took a few hours of debugging to
isolate in Newton (newton-physics/newton#2546) because the overflow
accumulates until an allocator validator notices.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Eric Shi <ershi@nvidia.com>
shi-eric added a commit to shi-eric/warp that referenced this pull request Apr 23, 2026
Replace the silent fold/pad in _normalize_launch_dim with a hard
ValueError on rank mismatch. Introduce _prepare_launch_dim(dim, kernel,
*, tiled=False) which canonicalizes dim, bypasses validation for
kernels with no wp.tid() call (flattening any shape to total thread
count via math.prod), and raises ValueError with a concrete migration
hint listing 2-4 valid fix options computed from the specific call.

Wire into wp.launch(), Launch.set_dim(), wp.launch_tiled() via
_construct_tiled_bounds, and the experimental JAX FFI paths
(custom_call.py, ffi.py). Remove _normalize_launch_dim entirely.
Launch.set_dim() preserves tiled semantics on recorded tiled launches.

The launch_tiled rank-mismatch error type changes from RuntimeError
to ValueError for consistency with the regular launch path; kernels
that do not call wp.tid() continue to accept any dim.

Previously excess dimensions were silently folded into the last slot,
which could cause out-of-bounds indexing and heap corruption in
downstream kernels — the trigger took a few hours of debugging to
isolate in Newton (newton-physics/newton#2546) because the overflow
accumulates until an allocator validator notices.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Eric Shi <ershi@nvidia.com>
preist-nvidia added a commit to preist-nvidia/newton that referenced this pull request May 4, 2026
The pre-release audit found user-facing commits that landed since v1.1.0
without an [Unreleased] entry. Backfill five of them here. Kamino-related
commits (newton-physics#2280, newton-physics#2517, newton-physics#2554, newton-physics#2575) are intentionally omitted: Kamino is
still alpha and not currently tracked in CHANGELOG.

- Changed: contact-conversion overhead reduction in SolverMuJoCo
  ([newton-physicsGH-2393](newton-physics#2393))
- Fixed: target_voxel_size ignored on the texture-SDF path
  ([newton-physicsGH-2464](newton-physics#2464))
- Fixed: Newton-to-mujoco-warp material-combination mismatch
  ([newton-physicsGH-2439](newton-physics#2439))
- Fixed: eq_objtype mismatch on joint equality / mimic constraints in
  SolverMuJoCo ([newton-physicsGH-2459](newton-physics#2459))
- Fixed: _tiled_sum_kernel launch-dim handling under warp-lang 1.13
  ([newton-physicsGH-2546](newton-physics#2546))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants