Improve FMA usage in laqr5 #681
Rearrange the application of the Householder reflector
to save one instruction per dot product if FMA is
available.
The update from the right, H * (I - tau * v * v**T),
for example, changes from
H - (tau * (H * v)) * v**T
to
H - (H * (v * tau)) * v**T.
The instruction savings are due to the special structure
of v: its first component is implicitly one, so that
component's storage slot holds tau instead.
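The two groupings above can be compared in a minimal plain-Python sketch. It assumes a real reflector with three components and an explicit leading one. The names `apply_right_old` and `apply_right_new` are illustrative only and are not LAPACK routines. The sketch checks that the results agree up to rounding. It does not measure instruction counts, since Python does not expose FMA scheduling.

```python
def apply_right_old(H, v, tau):
    """Compute H - (tau * (H v)) v^T, scaling every dot product by tau."""
    out = []
    for row in H:
        refsum = tau * sum(h * x for h, x in zip(row, v))
        out.append([h - refsum * x for h, x in zip(row, v)])
    return out


def apply_right_new(H, v, tau):
    """Compute H - (H (v tau)) v^T, forming v * tau once outside the row loop."""
    w = [x * tau for x in v]  # scaled reflector, reused for every row
    out = []
    for row in H:
        refsum = sum(h * x for h, x in zip(row, w))
        out.append([h - refsum * x for h, x in zip(row, v)])
    return out


def max_abs_diff(A, B):
    """Largest entrywise difference between two equally shaped matrices."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


if __name__ == "__main__":
    H = [[4.0, 1.0, -2.0], [0.5, 3.0, 1.5], [2.0, -1.0, 0.25]]
    v = [1.0, 0.5, -0.25]              # first component is one
    tau = 2.0 / sum(x * x for x in v)  # makes I - tau v v^T a reflector
    print(max_abs_diff(apply_right_old(H, v, tau), apply_right_new(H, v, tau)))
```

In the second variant the multiplication by tau moves out of the per-row work, which is the kind of rearrangement the description refers to. How many instructions this saves in compiled Fortran depends on the compiler and target.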
Codecov Report
@@           Coverage Diff            @@
##           master    #681    +/-    ##
========================================
  Coverage    0.00%    0.00%
========================================
  Files        1894     1894
  Lines      184062   184140    +78
========================================
- Misses     184062   184140    +78
thijssteel left a comment
Nice PR. Wouldn't surprise me if these small changes result in a noticeable speedup.
H( K+1, K+1 ) = H( K+1, K+1 ) - REFSUM
H( K+2, K+1 ) = H( K+2, K+1 ) - REFSUM*V( 2, M )
H( K+3, K+1 ) = H( K+3, K+1 ) - REFSUM*V( 3, M )
T1 = CONJG( V( 1, M ) )
Since this is only a single column, this probably doesn't actually add anything performance-wise, but I like it for consistency.
Good spot, Thijs, I missed that one. I will upload a revision in the next few days. My experiments show an improvement for small problems; fewer flops are good for accuracy and performance. Let me use this opportunity to say thank you for your contribution with the optimal bulge packing. It's great work :-)
Description
Rearrange the application of the Householder reflector to save one instruction per dot product if fused multiply-add (FMA) is available. The proposed compute pattern is already used by dlahqr.