Skip to content

JIT: speed up fgComputePreds#35352

Merged
AndyAyersMS merged 1 commit intodotnet:masterfrom
AndyAyersMS:SpeedUpPredListDupHandling
Apr 24, 2020
Merged

JIT: speed up fgComputePreds#35352
AndyAyersMS merged 1 commit intodotnet:masterfrom
AndyAyersMS:SpeedUpPredListDupHandling

Conversation

@AndyAyersMS
Copy link
Member

Interaction of fgComputePreds and fgAddRefPred could be quadratic in the
number of preds.

Usually the number of preds is small (1 or 2) but in some cases seen from
compiled regular expressions it could be in the thousands. On one such case
a single call to fgComputePreds was taking ~20% of jit time.

Since we build the pred list in sorted order we can take advantage of this
to avoid searching the list for potential duplicates in fgAddRefPred when
it is called from fgComputePreds -- the only possible duplicate entry is
at the end of the list.

This doesn't address perf of subsequent calls to fgAddRefPred but
those happen somewhat randomly and are unlikely to be as costly.

Interaction of `fgComputePreds` and `fgAddRefPred` could be quadratic in the
number of preds.

Usually the number of preds is small (1 or 2) but in some cases seen from
compiled regular expressions it could be in the thousands. On one such case
a single call to fgComputePreds was taking ~20% of jit time.

Since we build the pred list in sorted order we can take advantage of this
to avoid searching the list for potential duplicates in `fgAddRefPred` when
it is called from `fgComputePreds` -- the only possible duplicate entry is
at the end of the list.

This doesn't address perf of subsequent calls to `fgAddRefPred` but likely
those happen somewhat random and are unlikely to be as costly.
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 23, 2020
@AndyAyersMS
Copy link
Member Author

cc @dotnet/jit-contrib @stephentoub

No diffs on FX.

Local test case total time drops from 4.5 seconds to 3.6 seconds. Before-after profiles below:

regex

@stephentoub
Copy link
Member

Thanks for working on this, Andy.

@AndyAyersMS
Copy link
Member Author

You may also notice that fgMarkBackwardJump creeping up in importance, this can also do quadratic work.

This method wants to mark an interval [x,y] of blocks. It can be called repeatedly on ever-increasing spans of blocks (eg [x,y], then [x,y+1], then [x, y+2], ...) so for long chains of these we can see quadratic marking cost as we don't have an effective way of determining that subsequent calls may not need to mark the full interval. Of course we can also see partial overlaps, eg [x,y] then [x-5, y-5] but x is a required to be a backedge target and those tend to be sparse. So perhaps remembering for each x the maximum y that is marked might be sufficient to fix the perf issues there. We might also need a fast way to go from bbNum to block; we build this mapping in the importer and so it's probably already set up.

@AndyAyersMS
Copy link
Member Author

Also a bit surprised to see the StackLevelSetter be so costly -- note the method in this test case is a very large method, so walking through the IR doing anything is expensive.

@sandreenko do we need to run this phase for x64? Comments suggest the results for x64 mainly feed into sanity checks in the emitter.

@sandreenko
Copy link
Contributor

Where do you see StackLevelSetter? I can't find it on your screenshot.

@sandreenko do we need to run this phase for x64?

Correct, it is needed only for debug purposes on other platforms, it sets fgPtrArgCntMax that is used under a noway_assert:

unsigned maxAllowedStackDepth = compiler->fgPtrArgCntMax + // Max number of pointer-sized stack arguments.
compiler->compHndBBtabCount + // Return address for locally-called finallys
genTypeStSz(TYP_LONG) + // longs/doubles may be transferred via stack, etc
(compiler->compTailCallUsed ? 4 : 0); // CORINFO_HELP_TAILCALL args
#if defined(UNIX_X86_ABI)
// Convert maxNestedAlignment to DWORD count before adding to maxAllowedStackDepth.
assert(maxNestedAlignment % sizeof(int) == 0);
maxAllowedStackDepth += maxNestedAlignment / sizeof(int);
#endif
noway_assert(GetEmitter()->emitMaxStackDepth <= maxAllowedStackDepth);
}
#endif // EMIT_TRACK_STACK_DEPTH

It is time complexity is O(number of nodes * JitHashTable complexity) that should be O(number of nodes), so I would be surprised if on big methods its percentage is bigger than on small methods. Does it indicate a problem with hashTable?

@AndyAyersMS
Copy link
Member Author

Here's where it shows up ("after" picture) -- about 2.4 % of process time, so maybe 3% of jit time?

image

There's too much inlining going on in release to get a full picture of where the time is spent:

image

@AndyAyersMS
Copy link
Member Author

Thought the stack level setter might be churning through memory, since every insertion does an allocation, but the overall memory usage in this pool is pretty small:

       fgArgInfoPtrArr |      62216 |   0.17%

Turns out there are no PUTARG_STK or PUTARG_SPLIT nodes in the IR for this method, so the hash table is empty; evidently most of those allocations above come from morph.

So it's puzzling why perfview reports so much time in PopArgumentsFromCall given that the method is only invoked on calls, and should quickly discover for each call there is nothing to do.

Similarly it's puzzling why the iterator seems to be showing up like it does.

@AndyAyersMS
Copy link
Member Author

Lots of failures with messages like

##[error]The job running on agent NetCorePublic-Pool 107 ran longer than the maximum time of 60 minutes. 
##[warning]Agent NetCorePublic-Pool 107 did not respond to a cancelation request with 00:05:00.

going to bounce this.

@AndyAyersMS AndyAyersMS reopened this Apr 24, 2020
@AndyAyersMS AndyAyersMS merged commit af36c6d into dotnet:master Apr 24, 2020
@AndyAyersMS AndyAyersMS deleted the SpeedUpPredListDupHandling branch April 24, 2020 19:20
@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants