Iterate elliptic-cone JTCJ over contact support pairs#1411
Closed
adenzler-nvidia wants to merge 1 commit into
The sparse elliptic-cone Hessian assembly (JTCJ) previously ran output-stationary: one thread per (contact, dense dof-pair), scanning the contact's column indices to locate each pair. When nefc >> nv the overwhelming majority of those dof-pairs do not appear in the contact's support, so the kernel spent almost all of its time scanning and skipping. Restructure it to be input-stationary: launch one thread per (contact, support-pair), decode the pair index directly into the two participating dofs, register-sum the cone block, and accumulate into H with a single atomic add. This touches only the pairs that actually contribute and exposes far more parallelism. The pair dimension is bounded by jtcj_max_pairs, derived once in io.py from the deepest geom-body dof chain. The math is unchanged.
thowell
reviewed
Jun 5, 2026
rowadr0 = efc_J_rowadr_in[worldid, efcid0]
pos1 = int(0)
rem = pairid
while rem >= rownnz - pos1:
Collaborator
do we save some computations with something like the following?
dif = rownnz - pos1
while rem >= dif:
    rem -= dif
    ...
    dif = rownnz - pos1
Collaborator
fyi i tested this and didn't see a speedup on clutter, so leaving as is
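For readers following the thread, here is a minimal plain-Python sketch of the pair decode being discussed. It assumes the flat `pairid` enumerates pairs `(pos1, pos2)` with `pos1 <= pos2` over the row's `rownnz` nonzeros, diagonal included; that layout is inferred from the loop condition, not confirmed by the diff. The function names `decode_pair` and `decode_pair_cached` and the range guard are hypothetical.

```python
def decode_pair(pairid, rownnz):
    """Decode a flat pair index into (pos1, pos2), 0 <= pos1 <= pos2 < rownnz.

    Assumed layout: row pos1 of the upper triangle holds rownnz - pos1 pairs.
    """
    if not 0 <= pairid < rownnz * (rownnz + 1) // 2:
        raise ValueError("pairid out of range")  # guard added for the sketch
    pos1 = 0
    rem = pairid
    while rem >= rownnz - pos1:
        rem -= rownnz - pos1
        pos1 += 1
    return pos1, pos1 + rem


def decode_pair_cached(pairid, rownnz):
    """Same decode, caching the row length as suggested in the review."""
    if not 0 <= pairid < rownnz * (rownnz + 1) // 2:
        raise ValueError("pairid out of range")
    pos1 = 0
    rem = pairid
    dif = rownnz - pos1
    while rem >= dif:
        rem -= dif
        pos1 += 1
        dif = rownnz - pos1
    return pos1, pos1 + rem
```

Both variants return the same pairs; the cached one only avoids recomputing `rownnz - pos1` once per iteration, which is consistent with the report above that it made no measurable difference.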
thowell
approved these changes
Jun 5, 2026
thowell
left a comment
Collaborator
this is AWESOME
thanks @adenzler-nvidia!
Merged
Collaborator
#1413 is merged - thanks again @adenzler-nvidia
`update_gradient_JTCJ_sparse` assembled the elliptic-cone Hessian output-stationary: one thread per (contact, dense dof-pair), scanning the contact's `colind` to locate each pair. With `nefc >> nv` almost every dof-pair is absent from a given contact's support, so ~99% of the work was scan-and-skip.

This makes it input-stationary: one thread per (contact, support-pair), decoding the pair index directly into its two dofs, register-summing the cone block, and accumulating into H with a single atomic add (which also fixes a latent write race in the old `+=`). The pair count is bounded by a new `jtcj_max_pairs`, computed once in `put_model` from MuJoCo's mass-matrix sparsity. The math is unchanged; results match the old kernel to within ~1e-4, with the differences coming from float atomic ordering.

End-to-end, elliptic cone, vs main:

The JTCJ kernel alone drops from ~6.2 ms to ~140 µs (~44×) on aloha; gains scale with contact count. The dense (small-`nv`) and island paths are untouched. Existing elliptic solver/forward tests pass.
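As a rough illustration of the input-stationary scheme described above, the NumPy sketch below loops over (contact, support-pair) items, decodes each pair, sums the cone block, and adds once into H. It is not the Warp kernel. It assumes every row of a contact shares one sorted column support, that only the lower triangle of H is written, and that a serial `+=` stands in for the atomic add. The function names and the `(colind, Jc, C)` tuple layout are hypothetical.

```python
import numpy as np


def jtcj_input_stationary(nv, contacts):
    """Accumulate the lower triangle of sum_c J_c^T C_c J_c.

    contacts: list of (colind, Jc, C) where colind holds the sorted dof
    indices in the contact's support, Jc is (dim, nnz) and C is (dim, dim).
    """
    H = np.zeros((nv, nv))
    for colind, Jc, C in contacts:
        nnz = len(colind)
        npairs = nnz * (nnz + 1) // 2
        for pairid in range(npairs):  # one GPU thread per item in the kernel
            # decode the flat pair index into two support positions
            pos1, rem = 0, pairid
            while rem >= nnz - pos1:
                rem -= nnz - pos1
                pos1 += 1
            pos2 = pos1 + rem
            # "register-sum" the cone block for this dof pair
            h = Jc[:, pos2] @ C @ Jc[:, pos1]
            # a single accumulation; the kernel would use an atomic add here
            H[colind[pos2], colind[pos1]] += h
    return H


def jtcj_dense_reference(nv, contacts):
    """Dense reference: scatter each contact's rows, then form J^T C J."""
    H = np.zeros((nv, nv))
    for colind, Jc, C in contacts:
        J = np.zeros((Jc.shape[0], nv))
        J[:, colind] = Jc
        H += J.T @ C @ J
    return H
```

Comparing `jtcj_input_stationary(nv, contacts)` against `np.tril(jtcj_dense_reference(nv, contacts))` on random inputs is a quick way to check that the pair enumeration covers every contributing entry exactly once.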