JIT: speed up fgComputePreds by AndyAyersMS · Pull Request #35352 · dotnet/runtime

AndyAyersMS · 2020-04-23T17:39:22Z

Interaction of fgComputePreds and fgAddRefPred could be quadratic in the
number of preds.

Usually the number of preds is small (1 or 2) but in some cases seen from
compiled regular expressions it could be in the thousands. On one such case
a single call to fgComputePreds was taking ~20% of jit time.

Since we build the pred list in sorted order we can take advantage of this
to avoid searching the list for potential duplicates in fgAddRefPred when
it is called from fgComputePreds -- the only possible duplicate entry is
at the end of the list.

This doesn't address perf of subsequent calls to fgAddRefPred but
those happen somewhat randomly and are unlikely to be as costly.

Interaction of `fgComputePreds` and `fgAddRefPred` could be quadratic in the number of preds. Usually the number of preds is small (1 or 2) but in some cases seen from compiled regular expressions it could be in the thousands. On one such case a single call to fgComputePreds was taking ~20% of jit time. Since we build the pred list in sorted order we can take advantage of this to avoid searching the list for potential duplicates in `fgAddRefPred` when it is called from `fgComputePreds` -- the only possible duplicate entry is at the end of the list. This doesn't address perf of subsequent calls to `fgAddRefPred` but likely those happen somewhat random and are unlikely to be as costly.

AndyAyersMS · 2020-04-23T17:41:48Z

cc @dotnet/jit-contrib @stephentoub

No diffs on FX.

Local test case total time drops from 4.5 seconds to 3.6 seconds. Before-after profiles below:

stephentoub · 2020-04-23T17:42:56Z

Thanks for working on this, Andy.

AndyAyersMS · 2020-04-23T18:16:39Z

You may also notice that fgMarkBackwardJump creeping up in importance, this can also do quadratic work.

This method wants to mark an interval [x,y] of blocks. It can be called repeatedly on ever-increasing spans of blocks (eg [x,y], then [x,y+1], then [x, y+2], ...) so for long chains of these we can see quadratic marking cost as we don't have an effective way of determining that subsequent calls may not need to mark the full interval. Of course we can also see partial overlaps, eg [x,y] then [x-5, y-5] but x is a required to be a backedge target and those tend to be sparse. So perhaps remembering for each x the maximum y that is marked might be sufficient to fix the perf issues there. We might also need a fast way to go from bbNum to block; we build this mapping in the importer and so it's probably already set up.

AndyAyersMS · 2020-04-23T20:29:32Z

Also a bit surprised to see the StackLevelSetter be so costly -- note the method in this test case is a very large method, so walking through the IR doing anything is expensive.

@sandreenko do we need to run this phase for x64? Comments suggest the results for x64 mainly feed into sanity checks in the emitter.

sandreenko · 2020-04-23T21:11:34Z

Where do you see StackLevelSetter? I can't find it on your screenshot.

@sandreenko do we need to run this phase for x64?

Correct, it is needed only for debug purposes on other platforms, it sets fgPtrArgCntMax that is used under a noway_assert:

runtime/src/coreclr/src/jit/codegencommon.cpp

Lines 2363 to 2375 in a4ae12a

    
                   unsigned maxAllowedStackDepth = compiler->fgPtrArgCntMax +    // Max number of pointer-sized stack arguments. 
        
                                                   compiler->compHndBBtabCount + // Return address for locally-called finallys 
        
                                                   genTypeStSz(TYP_LONG) +       // longs/doubles may be transferred via stack, etc 
        
                                                   (compiler->compTailCallUsed ? 4 : 0); // CORINFO_HELP_TAILCALL args 
        
           #if defined(UNIX_X86_ABI) 
        
                   // Convert maxNestedAlignment to DWORD count before adding to maxAllowedStackDepth. 
        
                   assert(maxNestedAlignment % sizeof(int) == 0); 
        
                   maxAllowedStackDepth += maxNestedAlignment / sizeof(int); 
        
           #endif 
        
                   noway_assert(GetEmitter()->emitMaxStackDepth <= maxAllowedStackDepth); 
        
               } 
        
           #endif // EMIT_TRACK_STACK_DEPTH

It is time complexity is O(number of nodes * JitHashTable complexity) that should be O(number of nodes), so I would be surprised if on big methods its percentage is bigger than on small methods. Does it indicate a problem with hashTable?

AndyAyersMS · 2020-04-23T21:59:09Z

Here's where it shows up ("after" picture) -- about 2.4 % of process time, so maybe 3% of jit time?

There's too much inlining going on in release to get a full picture of where the time is spent:

AndyAyersMS · 2020-04-24T00:55:55Z

Thought the stack level setter might be churning through memory, since every insertion does an allocation, but the overall memory usage in this pool is pretty small:

       fgArgInfoPtrArr |      62216 |   0.17%

Turns out there are no PUTARG_STK or PUTARG_SPLIT nodes in the IR for this method, so the hash table is empty; evidently most of those allocations above come from morph.

So it's puzzling why perfview reports so much time in PopArgumentsFromCall given that the method is only invoked on calls, and should quickly discover for each call there is nothing to do.

Similarly it's puzzling why the iterator seems to be showing up like it does.

AndyAyersMS · 2020-04-24T15:54:25Z

Lots of failures with messages like

##[error]The job running on agent NetCorePublic-Pool 107 ran longer than the maximum time of 60 minutes. 
##[warning]Agent NetCorePublic-Pool 107 did not respond to a cancelation request with 00:05:00.

going to bounce this.

Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 23, 2020

BruceForstall approved these changes Apr 23, 2020

View reviewed changes

jaredpar mentioned this pull request Apr 23, 2020

OSX machines are de-provisioned during CI / PR runs leading to failures #34472

Closed

AndyAyersMS mentioned this pull request Apr 23, 2020

Predecessor List does not always maintain sorted order. #8720

Closed

AndyAyersMS closed this Apr 24, 2020

AndyAyersMS reopened this Apr 24, 2020

AndyAyersMS merged commit af36c6d into dotnet:master Apr 24, 2020

AndyAyersMS deleted the SpeedUpPredListDupHandling branch April 24, 2020 19:20

ghost locked as resolved and limited conversation to collaborators Dec 9, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JIT: speed up fgComputePreds#35352

JIT: speed up fgComputePreds#35352
AndyAyersMS merged 1 commit intodotnet:masterfrom
AndyAyersMS:SpeedUpPredListDupHandling

AndyAyersMS commented Apr 23, 2020

Uh oh!

AndyAyersMS commented Apr 23, 2020

Uh oh!

stephentoub commented Apr 23, 2020

Uh oh!

AndyAyersMS commented Apr 23, 2020

Uh oh!

AndyAyersMS commented Apr 23, 2020

Uh oh!

sandreenko commented Apr 23, 2020

Uh oh!

AndyAyersMS commented Apr 23, 2020

Uh oh!

AndyAyersMS commented Apr 24, 2020

Uh oh!

AndyAyersMS commented Apr 24, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

AndyAyersMS commented Apr 23, 2020

Uh oh!

AndyAyersMS commented Apr 23, 2020

Uh oh!

stephentoub commented Apr 23, 2020

Uh oh!

AndyAyersMS commented Apr 23, 2020

Uh oh!

AndyAyersMS commented Apr 23, 2020

Uh oh!

sandreenko commented Apr 23, 2020

Uh oh!

AndyAyersMS commented Apr 23, 2020

Uh oh!

AndyAyersMS commented Apr 24, 2020

Uh oh!

AndyAyersMS commented Apr 24, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants