Skip to content

Optimization: qderiv_actuator_passive_actuation#1243

Merged
thowell merged 8 commits into
google-deepmind:mainfrom
Kenny-Vilella:dev/kvilella/optimize_qderiv_actuator_passive_actuation
Mar 21, 2026
Merged

Optimization: qderiv_actuator_passive_actuation#1243
thowell merged 8 commits into
google-deepmind:mainfrom
Kenny-Vilella:dev/kvilella/optimize_qderiv_actuator_passive_actuation

Conversation

@Kenny-Vilella

Copy link
Copy Markdown
Collaborator

Optimized the qderiv_actuator_passive_actuation kernel in the sparse case.
It was taking around 10% of the time in the g1 environment in newton, it is now almost negligible.

Some benchmark numbers from Newton using menagerie asset:

Robot Unit Worlds DoFs Bodies This branch (ms) main (ms) delta (%)
go2 x2 8192 36 26 79.1 81.5 -3%
g1 x1 8192 49 44 103.1 112.6 -8%
t1 x3 4096 87 72 57.5 67.3 -15%
g1 x2 8192 98 88 180.0 217.3 -17%
apollo x3 8192 114 108 103.2 139.7 -26%
humanoid x5 8192 135 65 152.4 161.8 -6%

For Mjwarp benchmarks, I do not see any measurable perf difference.

@Kenny-Vilella Kenny-Vilella left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not yet totally sure about the code for complex coupled actuators.
Added a unit tests, but wonder if threre is more complex use cases that should also be tested.

Comment thread mujoco_warp/_src/derivative.py Outdated
Comment thread mujoco_warp/_src/derivative.py Outdated
rowadr = moment_rowadr_in[worldid, actid]

for i in range(rownnz):
dofi = moment_colind_in[worldid, rowadr + i]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it more efficient to create a variable like rowadri = rowadr + i and reuse below?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment thread mujoco_warp/_src/derivative.py Outdated
Comment thread mujoco_warp/_src/derivative.py Outdated
Comment thread mujoco_warp/_src/derivative.py Outdated
Comment thread mujoco_warp/_src/derivative_test.py Outdated
<flag gravity="disable"/>
</option>
<worldbody>
{body_xml}{close_xml} </worldbody>

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we update this xml string to include the actual strings that are generated above?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.
It's not pretty though^^

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@erikfrey what do we think about the this, do we prefer having the entire xml specified in the string or generating parts of the xml string with code?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hahaha, okay @Kenny-Vilella yeah that's not pretty :-)

here's my order of preference stack ranked from worst to best:

  • giant xml with deep indent embedded in derivative_test.py
  • programmatic string manipulation to create model in derivative_test.py
  • xml with deep indent saved mujoco_warp/test_data
  • programmatic model manipulation using mjspec in derivative_test.py
  • find a smaller XML that tests the same thing (e.g. skip the 35 dofs, just manually set jacobian="sparse" in xml)

@thowell

thowell commented Mar 19, 2026

Copy link
Copy Markdown
Collaborator

@Kenny-Vilella mujoco actuators are independent. nice speedups with this pr!

Comment thread mujoco_warp/_src/derivative.py Outdated

for i in range(rownnz):
rowadri = rowadr + i
dofi = moment_colind_in[worldid, rowadri]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we consider moving this moment_colind_in read after the continue?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment thread mujoco_warp/_src/derivative.py Outdated

for j in range(i + 1):
rowadrj = rowadr + j
dofj = moment_colind_in[worldid, rowadrj]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we consider moving this moment_colind_in read after the continue?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment thread mujoco_warp/_src/derivative_test.py Outdated
<flag gravity="disable"/>
</option>
<worldbody>
<body pos="0.1 0 0">

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this should be indented in the worldbody scope

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, it should be fine now?

@thowell

thowell commented Mar 20, 2026

Copy link
Copy Markdown
Collaborator

left a few comments, otherwise lgtm. thanks!

@thowell thowell merged commit 80e146c into google-deepmind:main Mar 21, 2026
10 checks passed
@Kenny-Vilella Kenny-Vilella deleted the dev/kvilella/optimize_qderiv_actuator_passive_actuation branch March 23, 2026 01:09
thowell pushed a commit that referenced this pull request May 6, 2026
…okup table (#1334)

* Fix _qderiv_actuator_passive_actuation_sparse row indexing into qM_fullm

The kernel introduced in #1243 searches qM_fullm for the (row, col) elemid
using m.M_rownnz / m.M_rowadr. Those arrays describe MuJoCo's compact mass
matrix sparsity, where joints whose internal block is treated as
diagonal-only by mj_factorM (e.g. free joints) contribute one entry per
internal dof. qM_fullm_i / qM_fullm_j, however, are built by walking
dof_parentid and include the full chained internal block, so the two
layouts diverge whenever a joint with diagonal-only compact storage
precedes any actuated dof in qvel order.

In that case the kernel reads the wrong slice of qMj, the inner
`qMj[row_startk] == col` check never fires, and the actuator's
contribution to qDeriv is silently dropped. Downstream factor_solve_i
then sees (M - dt*qDeriv) without the actuator damping for the affected
dof, and the implicit step diverges. The bug only surfaces with this
specific topology (free joint at qvel start + actuated dof after it),
which is why the existing serial-chain unit test (PR #1243's
test_smooth_vel_sparse_tendon_coupled, no free joint) does not catch it.

Repro: a single free body followed by an actuated hinge; with the
buggy indexing, mjwarp's deriv_smooth_vel diagonal entry for the hinge
is qM only (0.001), while MuJoCo's reference is qM + dt*(kp+kv) (0.003).

Fix: build chain-aware row offsets qM_fullm_rownnz / qM_fullm_rowadr
alongside qM_fullm_i / qM_fullm_j in io.py and pass them to the kernel
instead of the compact M_rownnz / M_rowadr.

Adds test_smooth_vel_sparse_free_joint_precedes_actuator covering the
minimum reproducer.

* Address review: drop unused body name and is_sparse assert

Per @thowell's PR feedback (#1334), remove the redundant body name
attribute and the is_sparse assertion in the new regression test.

* Switch sparse qDeriv lookup to qM_fullm_elemid (Kenny's approach)

Replace the chain-aware (qM_fullm_rownnz, qM_fullm_rowadr) row offsets
plus linear search through qMj with a dense (nv x nv) qM_fullm_elemid
lookup table built in io.py. The kernel does an O(1) reverse lookup
instead of walking the row to find the matching column.

For typical robot models (humanoid, three_humanoids) kernel time is
unchanged within trial noise. For deep chains with high-rownnz actuators
(e.g. tendons spanning a long serial chain) the kernel asymptotically
drops from O(depth^3) to O(depth^2) per actuator thread; a 100-link
chain with a tendon actuator runs ~8x faster end-to-end on a 5090.

Memory cost: nv^2 * 4 bytes (40 KiB at nv=100, 1 MiB at nv=500).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants