[Newton] Add capture safety guards and fix WrenchComposer stale COM pose#4779
Conversation
- Add capture_unsafe decorator; guard 8 Tier 2 derived properties in ArticulationData and RigidObjectData to prevent silent stale reads during CUDA graph capture - Fix WrenchComposer caching body_com_pose_w at init (Tier 2, never refreshed); replace with Tier 1 body_link_pose_w + body_com_pos_b and inline COM pose computation in wrench kernels - Consolidate resolve_1d_mask into shared utility with capture guards; replace duplicated implementations in ArticulationData and ManagerBasedEnvWarp - Remove @warp_capturable(False) from event functions now that the underlying write paths are capture-safe
Greptile SummaryThis PR addresses two correctness issues in the Newton warp-first backend and introduces supporting infrastructure for CUDA graph capture safety. Key changes:
Confidence Score: 4/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant Caller
participant WrenchComposer
participant resolve_1d_mask
participant WarpKernel as add_forces_and_torques_at_position
participant _com_pose_from_link
Caller->>WrenchComposer: add_forces_and_torques(forces, env_ids, ...)
WrenchComposer->>resolve_1d_mask: ids=env_ids, all_mask, scratch_mask
resolve_1d_mask-->>WrenchComposer: env_mask (wp.array[bool])
WrenchComposer->>resolve_1d_mask: ids=body_ids, all_mask, scratch_mask
resolve_1d_mask-->>WrenchComposer: body_mask (wp.array[bool])
WrenchComposer->>WarpKernel: launch(env_mask, body_mask, forces, body_link_pose_w, body_com_pos_b, ...)
note over WarpKernel: For each (env, body) where masks are True
WarpKernel->>_com_pose_from_link: link_pose_w[env,body], com_pos_b[env,body]
_com_pose_from_link-->>WarpKernel: com_pose_w (inline, no alloc)
WarpKernel->>WarpKernel: cast force/torque to COM frame
WarpKernel-->>WrenchComposer: composed_forces_b, composed_torques_b updated
Last reviewed commit: add76e8 |
| if wp.get_device().is_capturing: | ||
| raise RuntimeError( | ||
| "ObservationManager.reset requires env_mask(wp.array[bool]) during capture. " | ||
| "Do not pass env_ids on captured paths." | ||
| ) |
There was a problem hiding this comment.
Incorrect error message in capture guard
The RuntimeError message here is copy-pasted from WrenchComposer.reset and does not accurately describe the actual failure. This code is inside the observation compute path (not ObservationManager.reset), and the problem is that the circular buffer itself is not capture-safe — it's unrelated to env_ids. A developer hitting this error would likely be confused about what to fix.
| if wp.get_device().is_capturing: | |
| raise RuntimeError( | |
| "ObservationManager.reset requires env_mask(wp.array[bool]) during capture. " | |
| "Do not pass env_ids on captured paths." | |
| ) | |
| if wp.get_device().is_capturing: | |
| raise RuntimeError( | |
| "Observation terms with history (circular buffer) are not CUDA-graph-capture-safe. " | |
| "Disable history for observation terms used inside a captured graph, or restructure " | |
| "the graph to exclude history-buffered terms." | |
| ) |
| elif isinstance(ids, wp.array): | ||
| if ids.shape[0] == 0: | ||
| return scratch_mask | ||
| ids_wp = ids |
There was a problem hiding this comment.
Missing dtype validation for wp.array ids — regression from old implementation
When ids is a wp.array with a dtype other than wp.int32, the code assigns it to ids_wp without any type check and immediately passes it to _populate_mask_from_ids, which expects wp.array(dtype=wp.int32). Warp will then emit a low-level type-mismatch error at kernel launch instead of a helpful TypeError.
The previous implementation in manager_based_env_warp.py explicitly validated this:
if env_ids.dtype == wp.int32:
...
raise TypeError(
f"Unsupported wp.array dtype for env_ids: {env_ids.dtype}. Expected wp.int32 indices or wp.bool mask."
)This validation was lost during the consolidation. Consider adding it back:
| elif isinstance(ids, wp.array): | |
| if ids.shape[0] == 0: | |
| return scratch_mask | |
| ids_wp = ids | |
| elif isinstance(ids, wp.array): | |
| if ids.shape[0] == 0: | |
| return scratch_mask | |
| if ids.dtype != wp.int32: | |
| raise TypeError( | |
| f"Unsupported wp.array dtype for ids: {ids.dtype}. Expected wp.int32 index array." | |
| ) | |
| ids_wp = ids |
|
@greptileai Full review again. |
| if wp.get_device().is_capturing: | ||
| # The only capturable path is | ||
| # 1. both ids and mask are None. | ||
| # 2. mask is a wp.array | ||
| # Note for case 1, the function needs to be consistently calling with both None to avoid launch problem as | ||
| # capture relies on fixed memory address of the input parameters. | ||
| if ids is not None or (mask is not None and not isinstance(mask, wp.array)): | ||
| raise RuntimeError( | ||
| "resolve_1d_mask is only capturable when mask is a wp.array or both ids and mask are None." | ||
| ) |
There was a problem hiding this comment.
slice(None) incorrectly rejected during CUDA graph capture
slice(None) is semantically identical to ids=None ("select all elements"), and the safe resolution for this path is to return all_mask — a stable warp array pointer that is capture-compatible. However, the capture guard fires first because slice(None) is not None evaluates to True, raising a RuntimeError even though no unsafe dynamic allocation would occur.
The normalization that handles this case correctly is on line 281:
if ids is None or (isinstance(ids, slice) and ids == slice(None)):
return all_mask…but it is unreachable during capture. To fix this, normalize slice(None) to None before the capture guard:
# Normalize slice(None) so the capture guard treats it identically to ids=None
if isinstance(ids, slice) and ids == slice(None):
ids = None
if wp.get_device().is_capturing:
if ids is not None or (mask is not None and not isinstance(mask, wp.array)):
raise RuntimeError(
"resolve_1d_mask is only capturable when mask is a wp.array or both ids and mask are None."
)Any caller that passes ids=slice(None) during a captured graph will currently get a spurious error and need to work around it by passing None instead — which may not be obvious.
1) Event fix VS baseline torchReward
* env 10 uses repeat avg (r01-r03) from regress-test-env-10. The reward varies a lot thus not meaningful. Time
2) Event fix VS without event fixReward
* env 10 uses repeat avg (r01-r03) from regress-test-env-10 Time
|
add76e8 to
387c52a
Compare
|
https://github.com/greptileai Full review again. |
|
Looks awesome! + @AntoineRichard for viz on wrench composer changes |
…on (#4945) ## Summary * Cherry-picks [Newton] Migrate more envs and mdps to warp (#4690) onto develop * Cherry-picks [Newton] Add capture safety guards and fix WrenchComposer stale COM pose (#4779) onto develop ### Changes included - Warp-first MDP terms (observations, rewards, events, terminations, actions) for manager-based envs - Tested warp env configs: Ant, Humanoid, Cartpole, locomotion velocity (A1, AnymalB/C/D, Cassie, G1, Go1/2, H1), Franka/UR10 reach - ManagerCallSwitch max_mode cap and scene capture config - MDP kernels made graph-capturable with consolidated test infrastructure - capture_unsafe safety guards on lazy-evaluated derived properties in articulation/rigid_object data - WrenchComposer fix: use fresh COM pose buffers instead of stale cached link poses ### Dropped - G1-29-DOF warp env (Isaac-Velocity-Flat-G1-Warp-v1): removed because the stable g1_29_dofs task config does not exist on develop (only on dev/newton). Warp env PRs should only add warp frontends for envs that already exist in the stable package. ## Dependencies Must be merged **after** these PRs (in order): 1. #4905 (merged) 2. #4829 ## Validated base Validated against develop at 7588fa9. ## Test plan - [x] Run warp env training sweep across all manager-based env configs (14/14 pass, mode=2, 4096 envs, 300 iters) - [ ] Run test_mdp_warp_parity.py and test_mdp_warp_parity_new_terms.py - [ ] Run test_action_warp_parity.py - [ ] Verify WrenchComposer COM pose is fresh (not stale) during graph replay --------- Co-authored-by: Antoine Richard <antoiner@nvidia.com> Co-authored-by: Kelly Guo <kellyg@nvidia.com>
…on (isaac-sim#4945) ## Summary * Cherry-picks [Newton] Migrate more envs and mdps to warp (isaac-sim#4690) onto develop * Cherry-picks [Newton] Add capture safety guards and fix WrenchComposer stale COM pose (isaac-sim#4779) onto develop ### Changes included - Warp-first MDP terms (observations, rewards, events, terminations, actions) for manager-based envs - Tested warp env configs: Ant, Humanoid, Cartpole, locomotion velocity (A1, AnymalB/C/D, Cassie, G1, Go1/2, H1), Franka/UR10 reach - ManagerCallSwitch max_mode cap and scene capture config - MDP kernels made graph-capturable with consolidated test infrastructure - capture_unsafe safety guards on lazy-evaluated derived properties in articulation/rigid_object data - WrenchComposer fix: use fresh COM pose buffers instead of stale cached link poses ### Dropped - G1-29-DOF warp env (Isaac-Velocity-Flat-G1-Warp-v1): removed because the stable g1_29_dofs task config does not exist on develop (only on dev/newton). Warp env PRs should only add warp frontends for envs that already exist in the stable package. ## Dependencies Must be merged **after** these PRs (in order): 1. isaac-sim#4905 (merged) 2. isaac-sim#4829 ## Validated base Validated against develop at 7588fa9. ## Test plan - [x] Run warp env training sweep across all manager-based env configs (14/14 pass, mode=2, 4096 envs, 300 iters) - [ ] Run test_mdp_warp_parity.py and test_mdp_warp_parity_new_terms.py - [ ] Run test_action_warp_parity.py - [ ] Verify WrenchComposer COM pose is fresh (not stale) during graph replay --------- Co-authored-by: Antoine Richard <antoiner@nvidia.com> Co-authored-by: Kelly Guo <kellyg@nvidia.com>
…on (isaac-sim#4945) ## Summary * Cherry-picks [Newton] Migrate more envs and mdps to warp (isaac-sim#4690) onto develop * Cherry-picks [Newton] Add capture safety guards and fix WrenchComposer stale COM pose (isaac-sim#4779) onto develop ### Changes included - Warp-first MDP terms (observations, rewards, events, terminations, actions) for manager-based envs - Tested warp env configs: Ant, Humanoid, Cartpole, locomotion velocity (A1, AnymalB/C/D, Cassie, G1, Go1/2, H1), Franka/UR10 reach - ManagerCallSwitch max_mode cap and scene capture config - MDP kernels made graph-capturable with consolidated test infrastructure - capture_unsafe safety guards on lazy-evaluated derived properties in articulation/rigid_object data - WrenchComposer fix: use fresh COM pose buffers instead of stale cached link poses ### Dropped - G1-29-DOF warp env (Isaac-Velocity-Flat-G1-Warp-v1): removed because the stable g1_29_dofs task config does not exist on develop (only on dev/newton). Warp env PRs should only add warp frontends for envs that already exist in the stable package. ## Dependencies Must be merged **after** these PRs (in order): 1. isaac-sim#4905 (merged) 2. isaac-sim#4829 ## Validated base Validated against develop at 7588fa9. ## Test plan - [x] Run warp env training sweep across all manager-based env configs (14/14 pass, mode=2, 4096 envs, 300 iters) - [ ] Run test_mdp_warp_parity.py and test_mdp_warp_parity_new_terms.py - [ ] Run test_action_warp_parity.py - [ ] Verify WrenchComposer COM pose is fresh (not stale) during graph replay --------- Co-authored-by: Antoine Richard <antoiner@nvidia.com> Co-authored-by: Kelly Guo <kellyg@nvidia.com>
Summary
capture_unsafedecorator; guard 8 Tier 2 derived properties in ArticulationData and RigidObjectData to prevent silent stale reads during CUDA graph capturebody_com_pose_wat init (Tier 2, never refreshed); replace with Tier 1body_link_pose_w+body_com_pos_band inline COM pose in wrench kernelsresolve_1d_maskinto shared utility with capture guards; replace duplicated implementations in ArticulationData and ManagerBasedEnvWarp@warp_capturable(False)from event functions now that underlying write paths are capture-safeTest plan
env-ids=0,1,2,3,5,6,7,8,9,10,11,12,13,14 default=2capture_unsafeRuntimeError during training (all warp MDP terms use Tier 1 data)is_global=Truewrench path uses fresh COM poses (manual test)