@dolio (Contributor) commented Sep 2, 2025

This adds a couple of commands that record and output profiling information after running a Unison program.

  • run.profiled prints a relatively short report directly in ucm when the program completes. It computes the top 25 most costly functions and filters the recorded call tree so that only those functions appear as leaves, which hopefully keeps the output reasonably small. This is roughly the information you'd otherwise be digging through the full trace for.
  • run.profiled.full takes an additional file name argument and writes all the profiling information to that file. By default this is a table of the top 25 most costly functions plus a full breakdown of every recorded call in a tree, showing local and inherited time costs for each function. However, if the file name ends in .ticks or .folded, an alternate format is used that just records the entry counts. That file can be fed directly into flamegraph.pl to produce an SVG flame graph of the profile (an example of the format follows this list).
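For reference, the folded format consumed by flamegraph.pl is one line per distinct stack, with frames semicolon-separated from root to leaf, followed by the sample count for that stack. The frame names below are made up purely to show the shape:

```
main;outer;inner 842
main;outer 97
main;other 12
```

Rendering is then just `flamegraph.pl profile.folded > profile.svg`.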

The implementation of the profiler got a little tricky. I concluded that to get good results, the profiler needs to be checked on almost every iteration through the interpreter, which has performance impacts. So, I created a type class that allows specializing the interpreter loop to the profiler. For the non-profiling loop, the specialized implementation is essentially identical to what we had before, with no performance loss; running with profiling enabled, by contrast, is noticeably slower.
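As a rough sketch of the specialization trick (all names here are hypothetical, not the runtime's actual identifiers): the loop takes the profiler as a class-constrained parameter, and SPECIALIZE/INLINE pragmas let GHC compile a separate copy of the loop per mode, so the no-op ticks in the non-profiling copy are erased entirely.

```haskell
import Data.IORef (IORef, modifyIORef')

data Stack = Stack  -- stand-in for the interpreter's call stack

class Profiler p where
  -- consulted on (almost) every iteration of the interpreter loop
  tick :: p -> Stack -> IO ()

data NoProfile = NoProfile

instance Profiler NoProfile where
  {-# INLINE tick #-}
  tick _ _ = pure ()  -- optimized away in the specialized loop

newtype Sampling = Sampling (IORef Int)

instance Profiler Sampling where
  tick (Sampling ref) _stk = modifyIORef' ref (+ 1)  -- check timers, record, etc.

-- Each specialization is its own compiled copy of the loop, so the
-- NoProfile version is the same code as an unprofiled interpreter.
eval :: Profiler p => p -> Stack -> Int -> IO ()
eval _ _   0 = pure ()
eval p stk n = tick p stk >> eval p stk (n - 1)
{-# SPECIALIZE eval :: NoProfile -> Stack -> Int -> IO () #-}
{-# SPECIALIZE eval :: Sampling  -> Stack -> Int -> IO () #-}
```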

I also tried to use some GHC-internal mechanisms to get better sampling. When available, the profiler works by setting timeouts and recording a tick when a timeout fires. This way, the actual expense of instructions determines how likely a stack is to be recorded in the profile (so, e.g., Unison functions that make expensive foreign calls should show up proportionally more in the profile).
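To illustrate one way this can work (a minimal sketch, not the PR's actual code; GHC.Event is the only real API used here, and the helper names are invented): a repeating timeout sets a shared flag, and the interpreter's tick does a cheap flag read, recording a stack only when the timer has fired.

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import GHC.Event (getSystemTimerManager, registerTimeout)

-- One shared flag: the timer sets it, the interpreter polls it.
newtype SampleFlag = SampleFlag (IORef Bool)

newSampleFlag :: IO SampleFlag
newSampleFlag = SampleFlag <$> newIORef False

-- Arm a repeating timeout that requests a sample every `usec` microseconds.
armSampler :: Int -> SampleFlag -> IO ()
armSampler usec flag@(SampleFlag ref) = do
  mgr <- getSystemTimerManager
  _ <- registerTimeout mgr usec $ do
    writeIORef ref True   -- request a sample
    armSampler usec flag  -- re-arm for the next period
  pure ()

-- In the interpreter's tick: cheap read on the common path; a stack is
-- recorded only when the timer fired, so samples land in proportion to
-- elapsed time rather than iteration count.
checkSample :: SampleFlag -> IO () -> IO ()
checkSample (SampleFlag ref) recordStack = do
  fire <- readIORef ref
  if fire
    then writeIORef ref False >> recordStack
    else pure ()
```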

However, it seems that this functionality doesn't exist on Windows. So instead, I implemented a profiler that periodically calls getCPUTime to decide whether to record a tick. It also uses a counter, because calling getCPUTime every iteration proved too expensive (another order of magnitude slower than the timeout-based profiler). I'm uncertain how this methodology is affected by the relative cost of basic operations, so I used conditional compilation to fall back to it only where the timeout mechanism doesn't exist.
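A sketch of that counter-plus-clock scheme (helper names are hypothetical; getCPUTime from System.CPUTime is the real API, returning picoseconds of CPU time):

```haskell
import Data.IORef (IORef, readIORef, writeIORef)
import System.CPUTime (getCPUTime)  -- CPU time in picoseconds

data CpuSampler = CpuSampler
  { countdown :: IORef Int      -- iterations left until the next clock check
  , lastTick  :: IORef Integer  -- CPU time of the last recorded sample
  }

-- Check the clock only every `stride` iterations; calling getCPUTime on
-- every pass through the interpreter is far too expensive.
cpuTick :: Int -> Integer -> CpuSampler -> IO () -> IO ()
cpuTick stride periodPs s recordStack = do
  n <- readIORef (countdown s)
  if n > 0
    then writeIORef (countdown s) (n - 1)  -- cheap path, most iterations
    else do
      writeIORef (countdown s) stride
      now  <- getCPUTime
      prev <- readIORef (lastTick s)
      if now - prev >= periodPs
        then writeIORef (lastTick s) now >> recordStack
        else pure ()
```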

For functions introduced by compilation, no names are currently shown, just hashes. I'll work on improving that in a follow-up PR.

dolio added 8 commits August 26, 2025 19:06

When profiling is turned on, the runtime will record a stack snapshot
approximately every 500 applications. Then after evaluation is complete,
the information will be processed and displayed, either on screen or to a
file.

The new approach uses type class specialization to switch between
a normal interpreter and a profiled interpreter. In the normal
interpreter, all 'profiling' calls are just optimized away.

This removes the periodic yielding from the interpreter, as using
the same system to yield would incur slowdowns in normal operation.

This makes non-allocating loops preemptible, so that they can't starve
other threads. This should make tuning the thread time slice via RTS
flags more reliable, so that manual yielding in the interpreter is no
longer necessary.

dolio commented Sep 2, 2025

Oh, one other notable thing about this PR...

Because of the more invasive nature of the profiling ticks, I've removed the manual yield calls from the interpreter. I'm reasonably sure that all they actually did was reduce the time slices of threads running Unison code, while also allowing tight loops to be preempted.

Instead, I've added an option (-fno-omit-yields) to the build configuration that makes GHC always insert potential yield points into code. Normally it omits these from non-allocating loops, but with this flag they're emitted anyway, so arithmetic loops can still be preempted by GHC's scheduler (and it doesn't seem to significantly lower performance). Together with the -C RTS option, this should give behavior similar to the old manual yielding: set a smaller time slice (the default is 20ms), and the more reliable preemption should ensure that threads actually get only that slice.
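Concretely, the relevant knobs look something like this (a hypothetical stanza, not the repository's actual cabal file; -threaded, -rtsopts, and -fno-omit-yields are real flags):

```cabal
-- keep yield checks in non-allocating loops
ghc-options: -threaded -rtsopts -fno-omit-yields
```

With -rtsopts enabled, running the program with `+RTS -C0.005 -RTS` then shrinks the scheduler's time slice from the default 20ms to 5ms.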

@aryairani aryairani merged commit ae336e4 into trunk Sep 2, 2025
1 check passed
@aryairani aryairani deleted the topic/profiling branch September 2, 2025 17:38
sellout added a commit to sellout/unison that referenced this pull request Oct 14, 2025
It was previously based only on the cabal.project.freeze file, which
changes very rarely. However, a change to build flags (as happened
in unisonweb#5856) can require dependencies to be rebuilt, and if the
cache key was an exact match, the updated dependency builds would
never be cached.

This adds a hash for all Cabal files to the key, so any change that
would affect dependencies causes a cache miss. It also adds the old key
(which is a prefix of the new one) as a restore key, so even when cabal
files change, we load a cache for the correct LTS (GHC version, etc.)
which can prevent cache churn in the period after LTS updates.
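A hypothetical actions/cache step showing the shape of that change (the path and key names are illustrative, not the repository's actual workflow; hashFiles and restore-keys are real GitHub Actions features):

```yaml
- uses: actions/cache@v4
  with:
    path: ~/.cabal/store
    # exact key: the freeze file plus a hash over every .cabal file
    key: cabal-${{ hashFiles('cabal.project.freeze') }}-${{ hashFiles('**/*.cabal') }}
    # the old key is a prefix of the new one, so it still restores the
    # nearest cache for the same LTS when only cabal files change
    restore-keys: |
      cabal-${{ hashFiles('cabal.project.freeze') }}-
```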