Raycast sensor optimization #948
Merged
kevinzakka (Collaborator) requested changes on Apr 28, 2026 and left a comment:
This is awesome, thanks @bd-pdomanico! Two small asks before merging:
- Rebase onto the latest main
- Add a changelog entry
Thanks!
Replaces boolean-mask indexing operations with `masked_fill_` and computes `hit_pos_w` from a clamped distance, so that `hit_pos_w` for misses lands at `world_origins`. Together these effectively remove all CUDA syncs from the ray postprocess, unblocking the CPU thread while GPU-based sensing occurs.

Also optimizes `quat_from_matrix`. This likewise removes CUDA sync operations and brings the implementation up to date with the latest from https://github.com/facebookresearch/pytorch3d/blob/main/pytorch3d/transforms/rotation_conversions.py.

Profiling now shows (device: NVIDIA GeForce RTX 4090):

| shape | legacy (us) | current (us) | speedup | max \|Δ\| |
| --- | --- | --- | --- | --- |
| B = 4096 | 231.26 | 51.01 | 4.53x | 1.79e-07 |
| B = 16384 | 243.55 | 56.07 | 4.34x | 1.79e-07 |
| B = 65536 | 282.05 | 53.93 | 5.23x | 1.79e-07 |
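The two postprocess changes described above can be illustrated with a minimal sketch. This is not the code from the PR: the function name `postprocess_rays`, the argument names, the `(B, N)` layout, and the convention that a negative distance marks a miss are all assumptions made for illustration. The point is that every operation has a fixed output shape, whereas boolean-mask indexing such as `origins_w[miss]` has a data-dependent output shape and typically forces the host to wait on the device.

```python
import torch


def postprocess_rays(
    origins_w: torch.Tensor,  # (B, N, 3) ray origins in world frame (hypothetical layout)
    dirs_w: torch.Tensor,     # (B, N, 3) unit ray directions in world frame
    dist: torch.Tensor,       # (B, N) hit distances; negative = miss (assumption)
    max_distance: float,
) -> tuple[torch.Tensor, torch.Tensor]:
    miss = dist < 0

    # Clamping the distance to >= 0 makes a miss evaluate to
    # origin + 0 * direction, i.e. the hit position collapses onto the origin.
    # No mask-based gather/scatter is needed.
    hit_pos_w = origins_w + dist.clamp(min=0.0).unsqueeze(-1) * dirs_w

    # masked_fill_ writes a constant wherever the mask is True and keeps the
    # tensor shape fixed, so nothing about the mask has to be read on the host.
    dist_out = dist.clone().masked_fill_(miss, max_distance)
    return hit_pos_w, dist_out


origins = torch.ones(1, 2, 3)
dirs = torch.tensor([[[0.0, 0.0, -1.0], [0.0, 0.0, -1.0]]])
dist = torch.tensor([[2.0, -1.0]])  # first ray hits at 2 m, second ray misses
hit_pos, dist_filled = postprocess_rays(origins, dirs, dist, max_distance=5.0)
```

Here the first ray lands at `(1, 1, -1)` and the missed ray stays at its origin `(1, 1, 1)`. Filling misses with `max_distance` is only one possible convention, chosen for the sketch.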
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
861d59d to fa6231c
kevinzakka added a commit that referenced this pull request on Apr 28, 2026:
The previous mujoco 3.7 nightly was yanked, breaking CI. This bumps both mujoco (to 3.8.1 nightly) and mujoco-warp (to a post-3.8.0 commit that includes the cache_kernel fix from google-deepmind/mujoco_warp#1318). The multiccd enable flag was removed in mujoco 3.8 (it became default-on), so the test that exercised it now uses the energy flag instead. Fixes #948
kevinzakka added a commit that referenced this pull request on Apr 28, 2026:
…ipt (#954) The @torch.compile(fullgraph=True) decorator added in #948 hits dynamo's recompile_limit (8) when the full test suite runs, causing test_raycast_sensor.py to fail with FailOnRecompileLimitHit. The shape specialization on the leading batch dim exhausts the budget across the many env constructions that happen during the suite. Switching back to @torch.jit.script avoids dynamo entirely while keeping the pytorch3d-style sync-free implementation rewrite from #948 (which is the actual source of the speedup). Only the decorator changes. Also drops the corresponding changelog entry, since the speedup attribution is now muddier.
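For readers unfamiliar with the "pytorch3d-style sync-free implementation" this commit message refers to, here is a rough sketch modeled on pytorch3d's `matrix_to_quaternion`. It is not the mjlab code: the name `quat_from_matrix_sketch` is hypothetical, it assumes a `(B, 3, 3)` input and `(w, x, y, z)` output order, and it uses a plain `clamp` before the square root where pytorch3d has a dedicated helper. The best-conditioned candidate is selected with `argmax` plus `gather`, which keeps all shapes fixed and avoids boolean-mask indexing. The decorator is left off so the sketch runs in any context; the commit above describes applying `@torch.jit.script` to the real function.

```python
import torch


def quat_from_matrix_sketch(matrix: torch.Tensor) -> torch.Tensor:
    """(B, 3, 3) rotation matrices -> (B, 4) quaternions, (w, x, y, z) order."""
    m00, m01, m02 = matrix[:, 0, 0], matrix[:, 0, 1], matrix[:, 0, 2]
    m10, m11, m12 = matrix[:, 1, 0], matrix[:, 1, 1], matrix[:, 1, 2]
    m20, m21, m22 = matrix[:, 2, 0], matrix[:, 2, 1], matrix[:, 2, 2]

    # Twice the absolute value of each quaternion component, shape (B, 4).
    # clamp keeps the sqrt argument non-negative without a mask.
    q_abs = torch.sqrt(
        torch.clamp(
            torch.stack(
                [
                    1.0 + m00 + m11 + m22,
                    1.0 + m00 - m11 - m22,
                    1.0 - m00 + m11 - m22,
                    1.0 - m00 - m11 + m22,
                ],
                dim=-1,
            ),
            min=0.0,
        )
    )

    # Four candidate quaternions, each scaled by 2 * q_abs[i], shape (B, 4, 4).
    quat_by_rijk = torch.stack(
        [
            torch.stack([q_abs[:, 0] ** 2, m21 - m12, m02 - m20, m10 - m01], dim=-1),
            torch.stack([m21 - m12, q_abs[:, 1] ** 2, m10 + m01, m02 + m20], dim=-1),
            torch.stack([m02 - m20, m10 + m01, q_abs[:, 2] ** 2, m12 + m21], dim=-1),
            torch.stack([m10 - m01, m20 + m02, m21 + m12, q_abs[:, 3] ** 2], dim=-1),
        ],
        dim=-2,
    )

    # Floor the denominator so poorly conditioned candidates stay finite.
    quat_candidates = quat_by_rijk / (2.0 * torch.clamp(q_abs, min=0.1).unsqueeze(-1))

    # Pick the candidate with the largest denominator via gather (fixed shape).
    idx = q_abs.argmax(dim=-1)                        # (B,)
    gather_idx = idx.view(-1, 1, 1).expand(-1, 1, 4)  # (B, 1, 4)
    return torch.gather(quat_candidates, 1, gather_idx).squeeze(1)


rotations = torch.tensor(
    [
        [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],   # identity
        [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],  # 90 degrees about z
    ]
)
quats = quat_from_matrix_sketch(rotations)
```

The sketch does not normalize the sign of the result (pytorch3d applies a standardization step afterwards), and the `clamp` followed by `sqrt` has an unbounded gradient at zero, so treat it as an inference-time illustration only.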
sibisibi pushed a commit to DAVIAN-Robotics/mjlab that referenced this pull request on May 5, 2026:
The previous mujoco 3.7 nightly was yanked, breaking CI. This bumps both mujoco (to 3.8.1 nightly) and mujoco-warp (to a post-3.8.0 commit that includes the cache_kernel fix from google-deepmind/mujoco_warp#1318). The multiccd enable flag was removed in mujoco 3.8 (it became default-on), so the test that exercised it now uses the energy flag instead. Fixes mujocolab#948
With both of these changes, `Mjlab-Velocity-Rough-Unitree-Go1` trains ~3% faster.