-
Notifications
You must be signed in to change notification settings - Fork 291
Add profiling support to UCM #5856
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
When profiling is turned on, the runtime will record a stack snapshot approximately every 500 applications. Then after evaluation is complete, the information will be processed and displayed, either on screen or to a file.
The new approach uses type class specialization to switch between a normal interpreter and a profiled interpreter. In the normal interpreter, all 'profiling' calls are just optimized away. This removes the periodic yielding from the interpreter, as using the same system to yield would incur slowdowns in normal operation.
This makes non-allocating loops preemptible, so that they can't starve other threads. This should make tuning the thread time slice via RTS flags more reliable, so that manual yielding in the interpreter is no longer necessary.
|
Oh, one other notable thing about this PR... Because of the more invasive nature of the profiling ticks, I've removed the manual yield calls from the interpreter. I'm reasonably sure that what that actually did was reduce the time slices of threads running unison code, as well as allowing preemption of tight loops. Instead, I've added options ( |
It was previously only based on the cabal.project.freeze file, which changes very rarely. However, a change to build flags (as happened in unisonweb#5856) can cause dependencies to need to be rebuilt. But if the cache key was a perfect match, we wouldn’t cache the updated dependency builds. This adds a hash for all Cabal files to the key, so any change that would affect dependencies causes a cache miss. It also adds the old key (which is a prefix of the new one) as a restore key, so even when cabal files change, we load a cache for the correct LTS (GHC version, etc.) which can prevent cache churn in the period after LTS updates.
This adds a couple commands which record and output profiling information after running a unison program.
run.profiledprovides a relatively short output immediately in ucm on completion of the program. The top 25 most costly functions are calculated, and the recorded call tree is filtered to only include those as leaves, which hopefully trims the output to be somewhat reasonable. This is roughly the information you'd be looking for when actually using the full trace.run.profiled.fulltakes an additional file name argument, and writes all the information to that file. Most of the time, this is a table of the top 25 most costly functions and a full breakdown of all the recorded calls in a tree, showing local and inherited time costs for each function. However, if the file name ends in.ticksor.folded, an alternate file format is used that just records the number of entries. This latter file can be fed directly intoflamegraph.plto produce an SVG flame graph of the profile.The implementation of the profiler got a little tricky. I came to the conclusion that to get good results, it was necessary to check the profiler almost every iteration through the interpreter, which has performance impacts. So, I created a type class that allows specializing the interpreter loop to the profiler. This means that for the non-profiling loop, the implementation is essentially identical to what we had before, and has no decreased performance. Running with profiling is noticeably slower, by contrast.
I also tried to use some GHC internal mechanisms to get better sampling. When possible, the way the profiler works is to set timeouts, and record a tick when the timeout happens. This means the actual expense of instructions determines how likely it is for a stack to be recorded in the profile (so, e.g. unison functions that do expensive foreign calls should show up more in the profile).
However, it seems that this functionality doesn't exist on Windows. So, instead, I implemented a profiler that periodically calls
getCPUTimeto decide whether to record a tick. It also uses a counter because callinggetCPUTimeevery iteration seemed to be too expensive (like, another order of magnitude slower than the other profiler). I'm uncertain how this methodology is affected by the relative cost of basic operations, so I used conditional compilation to only use it when the other option doesn't exist.For functions introduced by compilation, currently no names are shown, just hashes. I'll work on improving that in a follow up PR.