Alert when `wp.launch()` dim has more dimensions than the kernel's `wp.tid()` unpacks #1389

### Bug Description

### Background

[GH-1270](https://github.com/NVIDIA/warp/issues/1270) templated `launch_bounds_t<N>` on a per-kernel `kernel_dim` inferred from how `wp.tid()` is unpacked in the kernel AST. As a side effect, when the user calls `wp.launch()` with a `dim=` whose rank exceeds `kernel_dim`, Warp's [`_normalize_launch_dim`](warp/_src/context.py#L8111-L8130) silently folds the excess dimensions into the last retained slot. For scalar-`tid` kernels this collapses to a flat linear index; for 2-D kernels it collapses trailing dims into the second slot; etc.
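The fold described above can be modeled in a few lines of plain Python. This is a minimal sketch of the behavior as stated in this issue, not Warp's actual `_normalize_launch_dim`; the name `fold_launch_dim` is hypothetical and the example shapes are made up.

```python
import math


def fold_launch_dim(dim, kernel_dim):
    """Model of the fold: keep the first kernel_dim - 1 entries of dim and
    multiply every remaining entry into the last retained slot.

    Hypothetical helper for illustration only; not part of Warp.
    Assumes kernel_dim >= 1.
    """
    dim = tuple(dim)
    if len(dim) <= kernel_dim:
        return dim  # rank already fits, nothing to fold
    head = dim[: kernel_dim - 1]
    tail = math.prod(dim[kernel_dim - 1 :])
    return head + (tail,)


print(fold_launch_dim((4, 1, 6), 1))  # (24,)   scalar-tid kernel: flat range
print(fold_launch_dim((2, 3, 4), 2))  # (2, 12) 2-D kernel: trailing dims merged
```

The total thread count is unchanged by the fold; only the meaning of each `wp.tid()` component changes, which is why nothing fails at launch time.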

The CHANGELOG calls this out as a breaking change and suggests `i, _ = wp.tid()` as the migration pattern for users who relied on the old unravel behavior. The behavior is deliberate and documented.

However, the silent fold is a real footgun. See [newton-physics/newton#2546](https://github.com/newton-physics/newton/pull/2546) for a concrete instance: a kernel used `tid = wp.tid()` with `dim=joint_f.shape` (a 3-tuple). Under the old semantics this dispatched `nworld * 1 * ndofs` threads with `tid = coord.i` ∈ `[0, nworld)` repeated — redundant but idempotent, so it silently "worked." Under the new semantics Warp flattens the launch to `nworld * 1 * ndofs` threads with `tid` as the flat linear index, so `joint_q[tid, 0, 0]` walks past the first-dim bound of the 3-D array. The result is intermittent access violations, heap corruption detected later inside `wp_lookup` when the damage lands on NT-heap LFH metadata, or silent numerical corruption when the damage lands in a neighbouring Warp array. This was diagnosed over several days with faulthandler traces that kept pointing at detection sites rather than the corruption site, because the overflow accumulates until the Windows heap validator notices.
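To make the failure mode concrete, the following sketch enumerates the scalar `tid` values under both semantics in plain Python. The sizes `nworld = 2` and `ndofs = 3` are invented for illustration, and the code models the indexing described above rather than running a Warp kernel.

```python
import itertools

nworld, ndofs = 2, 3          # made-up sizes for illustration
shape = (nworld, 1, ndofs)    # stands in for joint_f.shape

# Old semantics as described above: one thread per coordinate, and a scalar
# tid receives the first coordinate, so values repeat within [0, nworld).
old_tids = [coord[0] for coord in itertools.product(*(range(n) for n in shape))]

# New semantics: the launch is flattened and tid is the flat linear index.
new_tids = list(range(nworld * 1 * ndofs))

# An access like joint_q[tid, 0, 0] is only in bounds while tid < nworld.
old_out_of_bounds = [t for t in old_tids if t >= shape[0]]
new_out_of_bounds = [t for t in new_tids if t >= shape[0]]

print(old_tids)           # [0, 0, 0, 1, 1, 1]
print(old_out_of_bounds)  # []
print(new_out_of_bounds)  # [2, 3, 4, 5]
```

Both launches run the same number of threads, so nothing about the launch itself looks different; only the out-of-range index values reveal the change.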

The important observation from that debug session: **Newton never actually wanted the fold behavior**. It wanted first-dim indexing only, accidentally got correct but repeated indexes under old semantics via the unravel form, and had no way to notice when the semantics changed except through hard-to-reproduce memory-corruption crashes downstream. The silent fold hid the migration mistake long enough to corrupt memory.

### The four valid user intents for `dim=shape` where `shape=(3, 3)`

Given a launch shape of `(3, 3)`, there are four distinct thread-ID patterns a user might want:

| Intent (thread IDs produced) | Correct usage |
|---|---|
| `(0,0), (0,1), (0,2), (1,0), …, (2,2)` — full 2-D work | `i, j = wp.tid()` with `dim=shape` |
| `0, 0, 0, 1, 1, 1, 2, 2, 2` — first-dim only, with repetition | `i, _ = wp.tid()` with `dim=shape` |
| `0, 1, 2, …, 8` — flat linear across all 9 threads | `i = wp.tid()` with `dim=math.prod(shape)` |
| `0, 1, 2` — first-dim only, no repetition | `i = wp.tid()` with `dim=shape[0]` |

All four are expressible cleanly without the implicit fold. The ambiguous form `i = wp.tid()` with `dim=(3, 3)` currently produces the third pattern silently. For a user whose code worked under the old behavior (correct results, just with redundant threads), that is the pattern they are least likely to have intended.
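The four patterns in the table can be reproduced in plain Python for `shape = (3, 3)`. This is only a model of the thread IDs each form yields, with the corresponding `wp.tid()` and `dim=` spelling noted in comments; it does not launch a Warp kernel.

```python
import itertools
import math

shape = (3, 3)
coords = list(itertools.product(*(range(n) for n in shape)))

# Row 1: i, j = wp.tid() with dim=shape
full_2d = coords
# Row 2: i, _ = wp.tid() with dim=shape
first_dim_repeated = [i for i, _ in coords]
# Row 3: i = wp.tid() with dim=math.prod(shape)
flat_linear = list(range(math.prod(shape)))
# Row 4: i = wp.tid() with dim=shape[0]
first_dim_once = list(range(shape[0]))

print(first_dim_repeated)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
print(flat_linear)         # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(first_dim_once)      # [0, 1, 2]
```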

### Options to address

1. **Warn, don't error.** Keep the current silent fold, but emit a `UserWarning` at launch time whenever `ndim(dim) > kernel_dim`. Zero-cost migration for code that genuinely wants the fold; users who hit it by accident see the warning and fix it. Weakest remedy — warnings are easy to miss in noisy log output, and in the Newton case the first visible symptom was a heap-corruption crash with no obvious connection to the kernel at fault.

2. **Error, with an explicit opt-in.** Raise `ValueError` by default when `ndim(dim) > kernel_dim`, and add a way to say "I actually want the fold" — either a `flatten=True` kwarg on `wp.launch()`, a helper `wp.fold_dim(shape, kernel_dim)`, or a `wp.config.allow_dim_flatten` flag. Preserves the documented behavior for users who want it; forces a deliberate choice.

3. **Error, no opt-in.** Raise `ValueError` unconditionally. Users who want the flatten behavior rewrite the call with an explicit scalar or manually-folded shape (`dim=math.prod(shape)` for `kernel_dim=1`, `dim=(shape[0], shape[1]*shape[2])` for partial folds, etc.). No new API surface; no silent footgun. Users genuinely relying on the fold pay a one-time migration cost.
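The difference between options 1 and 3 comes down to what a launch-time rank check does when it fires. The sketch below is a hypothetical stand-alone check, not Warp code: `check_launch_dim` and its `on_excess` parameter are names invented here, and the real check would live wherever Warp normalizes `dim`.

```python
import warnings


def check_launch_dim(dim, kernel_dim, on_excess="error"):
    """Hypothetical rank check; not part of Warp.

    on_excess="warn"  models option 1 (warn, then let the fold proceed).
    on_excess="error" models option 3 (refuse the launch).
    """
    dim = (dim,) if isinstance(dim, int) else tuple(dim)
    if len(dim) <= kernel_dim:
        return dim  # rank fits what wp.tid() unpacks
    message = (
        f"launch dim {dim} has {len(dim)} dimensions but the kernel "
        f"unpacks only {kernel_dim} from wp.tid(); pass an explicitly "
        f"folded dim instead"
    )
    if on_excess == "warn":
        warnings.warn(message, UserWarning, stacklevel=2)
        return dim  # caller would go on to fold the excess dimensions
    raise ValueError(message)


check_launch_dim((3, 3), kernel_dim=2)   # fine: rank matches
check_launch_dim(9, kernel_dim=1)        # fine: explicit scalar dim
try:
    check_launch_dim((3, 3), kernel_dim=1)
except ValueError as exc:
    print(f"rejected: {exc}")
```

Option 2 would add a third branch that skips the error when the caller has explicitly opted in.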

### Recommendation

Option 3: hard error, no opt-in. The Newton bug illustrates exactly why the silent fold is dangerous: a caller who thought they were indexing the first dimension of a 3-D array got the flat linear index instead, which ran past that dimension's bound, and the only feedback channel was a delayed memory-corruption crash. That caller did not want the fold; they wanted the first-dim-only pattern (row 4 of the table above) and had always wanted it. Under the old unravel behavior they accidentally got row 2, which gives the same results with redundant threads, and when the semantics flipped to the fold behavior (row 3), there was no signal that anything had changed until production crashes started rolling in.

For genuine fold use cases — rare, and trivially rewritable as a scalar or partially-folded shape — the user is better served by spelling it out at the call site. The fold ambiguity for `kernel_dim ≥ 2` (which dims collapse into which) is itself a small footgun that benefits from being made explicit.
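Spelling the fold out at the call site might look like the following. The shape `(2, 3, 4)` is an arbitrary example, and the "leading" variant is included only to show that more than one partial fold is possible for a 2-D kernel, which is the ambiguity an explicit `dim=` removes.

```python
import math

shape = (2, 3, 4)  # arbitrary example shape

# Scalar-tid kernel: one flat linear index over every element.
flat = math.prod(shape)

# 2-D kernel, trailing dims merged into the second slot.
fold_trailing = (shape[0], shape[1] * shape[2])

# 2-D kernel, leading dims merged into the first slot instead.
fold_leading = (shape[0] * shape[1], shape[2])

print(flat)           # 24
print(fold_trailing)  # (2, 12)
print(fold_leading)   # (6, 4)
```

All three cover the same 24 threads; they differ only in how each thread's index decomposes, which the reader of the call site can now see directly.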

Option 2's opt-in kwarg addresses the same problem but adds API bureaucracy for a corner case. If the friction from option 3 turns out to be larger than expected, adding an opt-in later is a strictly additive change — it can be bolted on without further breakage.
