Alert when wp.launch() dim has more dimensions than the kernel's wp.tid() unpacks #1389

@c0d1f1ed

Bug Description

Background

GH-1270 templated launch_bounds_t<N> on a per-kernel kernel_dim inferred from how wp.tid() is unpacked in the kernel AST. As a side effect, when the user calls wp.launch() with a dim= whose rank exceeds kernel_dim, Warp's _normalize_launch_dim silently folds the excess dimensions into the last retained slot. For scalar-tid kernels this collapses to a flat linear index; for 2-D kernels it collapses trailing dims into the second slot; etc.
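The fold described above can be modelled in a few lines of plain Python. This is only a sketch of the behavior as this issue describes it, not Warp's actual _normalize_launch_dim; the function name fold_launch_dim and the example shape are hypothetical, and kernel_dim is assumed to be at least 1.

```python
import math


def fold_launch_dim(dim, kernel_dim):
    """Hypothetical model of the fold: merge excess trailing dims into the last retained slot."""
    dim = tuple(dim)
    if len(dim) <= kernel_dim:
        # Rank already fits the kernel; nothing to fold.
        return dim
    kept = dim[: kernel_dim - 1]                # leading dims that survive unchanged
    folded = math.prod(dim[kernel_dim - 1 :])   # everything else collapses into one slot
    return kept + (folded,)


print(fold_launch_dim((4, 1, 6), 1))  # (24,)  scalar-tid kernel: flat linear index
print(fold_launch_dim((4, 1, 6), 2))  # (4, 6) 2-D kernel: trailing dims merged into slot 2
print(fold_launch_dim((4, 1, 6), 3))  # (4, 1, 6) rank matches, unchanged
```

The total thread count is the same in every case; only the way each thread's ID is decomposed changes, which is why the fold goes unnoticed at launch time.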

The CHANGELOG calls this out as a breaking change and suggests i, _ = wp.tid() as the migration pattern for users who relied on the old unravel behavior. The behavior is deliberate and documented.

However, the silent fold is a real footgun. See newton-physics/newton#2546 for a concrete instance: a kernel used tid = wp.tid() with dim=joint_f.shape (a 3-tuple). Under the old semantics this dispatched nworld * 1 * ndofs threads with tid = coord.i[0], a value in [0, nworld) that repeats across threads. That was redundant but idempotent, so it silently "worked." Under the new semantics Warp flattens the launch to nworld * 1 * ndofs threads with tid as the flat linear index, so joint_q[tid, 0, 0] walks past the first-dim bound of the 3-D array. The result is one of three failure modes: intermittent access violations; heap corruption detected later inside wp_lookup, when the damage lands on NT-heap LFH metadata; or silent numerical corruption, when the damage lands in a neighbouring Warp array. Diagnosis took several days because the faulthandler traces kept pointing at detection sites rather than the corruption site: the overflow accumulates until the Windows heap validator notices.
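To make the overrun concrete, the two semantics can be simulated in plain Python without Warp. This is a sketch under stated assumptions: the shape (4, 1, 6) is a made-up stand-in for (nworld, 1, ndofs), and the two lists only model the tid values as this issue describes them.

```python
import itertools
import math

shape = (4, 1, 6)   # illustrative (nworld, 1, ndofs); values are made up
nworld = shape[0]

# Old semantics as described above: one thread per coordinate,
# and a scalar tid receives the first coordinate only.
old_tids = [coord[0] for coord in itertools.product(*(range(n) for n in shape))]

# New semantics as described above: a scalar tid is the flat linear index.
new_tids = list(range(math.prod(shape)))

# Any tid >= nworld indexes past the first dimension of a (nworld, 1, ndofs) array.
old_oob = [t for t in old_tids if t >= nworld]
new_oob = [t for t in new_tids if t >= nworld]

print(len(old_tids), len(old_oob))  # 24 0
print(len(new_tids), len(new_oob))  # 24 20
```

With these numbers, 20 of the 24 threads index out of bounds under the new semantics, while the old semantics stayed in bounds by repeating each valid index ndofs times.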

The important observation from that debug session: Newton never actually wanted the fold behavior. It wanted first-dim indexing only, accidentally got correct but repeated indexes under old semantics via the unravel form, and had no way to notice when the semantics changed except through hard-to-reproduce memory-corruption crashes downstream. The silent fold hid the migration mistake long enough to corrupt memory.

The four valid user intents for dim=shape where shape=(3, 3)

Given a launch shape of (3, 3), there are four distinct thread-ID patterns a user might want:

| Intent (thread IDs produced) | Correct usage |
| --- | --- |
| (0,0), (0,1), (0,2), (1,0), …, (2,2): full 2-D work | i, j = wp.tid() with dim=shape |
| 0, 0, 0, 1, 1, 1, 2, 2, 2: first-dim only, with repetition | i, _ = wp.tid() with dim=shape |
| 0, 1, 2, …, 8: flat linear across all 9 threads | i = wp.tid() with dim=math.prod(shape) |
| 0, 1, 2: first-dim only, no repetition | i = wp.tid() with dim=shape[0] |

All four are expressible cleanly without the implicit fold. The ambiguous form, i = wp.tid() with dim=(3, 3), currently produces the third pattern silently. For callers whose code ran correctly under the old behavior, only slower than intended because of the redundant threads, the third pattern is the least likely to be the one they wanted.
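The four patterns can be enumerated without Warp to see exactly which IDs each one yields. The sketch below is plain Python that only models the thread IDs; the function name thread_id_patterns and its dictionary keys are hypothetical labels, and each comment names the Warp usage from the table that the list corresponds to.

```python
import itertools
import math


def thread_id_patterns(shape):
    """Return the thread IDs for each of the four intents, keyed by a descriptive label."""
    coords = list(itertools.product(*(range(n) for n in shape)))
    return {
        # i, j = wp.tid() with dim=shape
        "full_2d": coords,
        # i, _ = wp.tid() with dim=shape
        "first_dim_repeated": [c[0] for c in coords],
        # i = wp.tid() with dim=math.prod(shape)
        "flat_linear": list(range(math.prod(shape))),
        # i = wp.tid() with dim=shape[0]
        "first_dim_once": list(range(shape[0])),
    }


patterns = thread_id_patterns((3, 3))
for name, ids in patterns.items():
    print(name, ids)
```

For shape (3, 3), only first_dim_repeated and first_dim_once stay within [0, 3); flat_linear reaches 8, which is the out-of-bounds hazard when the ID is used as a first-dimension index.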

Options to address

  1. Warn, don't error. Keep the current silent fold, but emit a UserWarning at launch time whenever ndim(dim) > kernel_dim. Zero-cost migration for code that genuinely wants the fold; users who hit it by accident see the warning and fix it. Weakest remedy — warnings are easy to miss in noisy log output, and in the Newton case the first visible symptom was a heap-corruption crash with no obvious connection to the kernel at fault.

  2. Error, with an explicit opt-in. Raise ValueError by default when ndim(dim) > kernel_dim, and add a way to say "I actually want the fold" — either a flatten=True kwarg on wp.launch(), a helper wp.fold_dim(shape, kernel_dim), or a wp.config.allow_dim_flatten flag. Preserves the documented behavior for users who want it; forces a deliberate choice.

  3. Error, no opt-in. Raise ValueError unconditionally. Users who want the flatten behavior rewrite the call with an explicit scalar or manually-folded shape (dim=math.prod(shape) for kernel_dim=1, dim=(shape[0], shape[1]*shape[2]) for partial folds, etc.). No new API surface; no silent footgun. Users genuinely relying on the fold pay a one-time migration cost.
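The check common to all three options is a rank comparison at launch time. A minimal sketch follows, assuming a hypothetical helper check_launch_dim with a hypothetical mode switch; none of this is existing Warp API, and the message text is illustrative only. Mode "warn" corresponds to option 1 and mode "error" to option 3.

```python
import warnings


def check_launch_dim(dim, kernel_dim, mode="error"):
    """Hypothetical launch-time guard: flag a dim whose rank exceeds the kernel's tid rank."""
    ndim = len(dim) if isinstance(dim, (tuple, list)) else 1
    if ndim <= kernel_dim:
        return  # rank fits; nothing to report
    msg = (
        f"launch dim {tuple(dim)} has {ndim} dimensions but the kernel unpacks "
        f"wp.tid() into {kernel_dim}; pass an explicit scalar or pre-folded shape"
    )
    if mode == "warn":
        # Option 1: keep the fold but tell the user.
        warnings.warn(msg, UserWarning, stacklevel=2)
    else:
        # Option 3: refuse the launch outright.
        raise ValueError(msg)


# Usage: the Newton-style call is rejected before any thread runs.
try:
    check_launch_dim((4, 1, 6), kernel_dim=1)
except ValueError as exc:
    print("rejected:", exc)
```

Option 2 would add one more branch that skips the error when the caller has opted in; the comparison itself stays the same.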

Recommendation

Option 3 — hard error, no opt-in. The Newton bug illustrates exactly why the silent fold is dangerous: a caller who thought they were indexing the first dimension of a 3-D array got the linear-index behavior instead, exceeding the shape dimensions, and the only feedback channel was a delayed memory-corruption crash. That caller did not want the fold; they wanted the first-dim-only pattern (row 4 of the table above) and had always wanted it. They accidentally got the old unravel behavior (row 2) for free, and when the semantics flipped to the fold behavior (row 3), there was no signal that anything had changed until production crashes started rolling in.

For genuine fold use cases — rare, and trivially rewritable as a scalar or partially-folded shape — the user is better served by spelling it out at the call site. The fold ambiguity for kernel_dim ≥ 2 (which dims collapse into which) is itself a small footgun that benefits from being made explicit.
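Spelling the fold out at the call site takes one expression, and it also makes the choice of which dims collapse visible. The sketch below uses a made-up 3-D shape and hypothetical variable names; it computes the dim values a caller would pass to wp.launch() and does not call Warp itself.

```python
import math

shape = (2, 3, 4)  # illustrative 3-D shape; values are made up

# kernel_dim = 1: explicit flat launch, i = wp.tid()
dim_flat = math.prod(shape)

# kernel_dim = 2: two different explicit folds, i, j = wp.tid()
dim_fold_trailing = (shape[0], shape[1] * shape[2])  # keep dim 0, merge dims 1 and 2
dim_fold_leading = (shape[0] * shape[1], shape[2])   # merge dims 0 and 1, keep dim 2

print(dim_flat)           # 24
print(dim_fold_trailing)  # (2, 12)
print(dim_fold_leading)   # (6, 4)
```

Both 2-D folds launch the same 24 threads but assign different (i, j) pairs to them, so writing the fold explicitly documents which decomposition the kernel expects.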

Option 2's opt-in kwarg addresses the same problem but adds API bureaucracy for a corner case. If the friction from option 3 turns out to be larger than expected, adding an opt-in later is a strictly additive change — it can be bolted on without further breakage.
