JIT: speed up fgComputePreds#35352
Conversation
Interaction of `fgComputePreds` and `fgAddRefPred` could be quadratic in the number of preds. Usually the number of preds is small (1 or 2) but in some cases seen from compiled regular expressions it could be in the thousands. On one such case a single call to fgComputePreds was taking ~20% of jit time. Since we build the pred list in sorted order we can take advantage of this to avoid searching the list for potential duplicates in `fgAddRefPred` when it is called from `fgComputePreds` -- the only possible duplicate entry is at the end of the list. This doesn't address perf of subsequent calls to `fgAddRefPred` but likely those happen somewhat random and are unlikely to be as costly.
|
cc @dotnet/jit-contrib @stephentoub No diffs on FX. Local test case total time drops from 4.5 seconds to 3.6 seconds. Before-after profiles below: |
|
Thanks for working on this, Andy. |
|
You may also notice that This method wants to mark an interval [x,y] of blocks. It can be called repeatedly on ever-increasing spans of blocks (eg [x,y], then [x,y+1], then [x, y+2], ...) so for long chains of these we can see quadratic marking cost as we don't have an effective way of determining that subsequent calls may not need to mark the full interval. Of course we can also see partial overlaps, eg [x,y] then [x-5, y-5] but x is a required to be a backedge target and those tend to be sparse. So perhaps remembering for each x the maximum y that is marked might be sufficient to fix the perf issues there. We might also need a fast way to go from bbNum to block; we build this mapping in the importer and so it's probably already set up. |
|
Also a bit surprised to see the @sandreenko do we need to run this phase for x64? Comments suggest the results for x64 mainly feed into sanity checks in the emitter. |
|
Where do you see StackLevelSetter? I can't find it on your screenshot.
Correct, it is needed only for debug purposes on other platforms, it sets runtime/src/coreclr/src/jit/codegencommon.cpp Lines 2363 to 2375 in a4ae12a It is time complexity is O( |
|
Thought the stack level setter might be churning through memory, since every insertion does an allocation, but the overall memory usage in this pool is pretty small: Turns out there are no So it's puzzling why perfview reports so much time in Similarly it's puzzling why the iterator seems to be showing up like it does. |
|
Lots of failures with messages like going to bounce this. |



Interaction of
fgComputePredsandfgAddRefPredcould be quadratic in thenumber of preds.
Usually the number of preds is small (1 or 2) but in some cases seen from
compiled regular expressions it could be in the thousands. On one such case
a single call to
fgComputePredswas taking ~20% of jit time.Since we build the pred list in sorted order we can take advantage of this
to avoid searching the list for potential duplicates in
fgAddRefPredwhenit is called from
fgComputePreds-- the only possible duplicate entry isat the end of the list.
This doesn't address perf of subsequent calls to
fgAddRefPredbutthose happen somewhat randomly and are unlikely to be as costly.