Conversation

@dolio (Contributor) commented Aug 1, 2025

This PR adds a counter to the execution of the interpreter. Each time a function application occurs (on either the fast or the slow path), the counter is decremented; when it reaches 0, the interpreter calls Control.Concurrent.yield before continuing. The counter starts at 5,000.
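For readers skimming the description, here is a minimal, self-contained sketch of that pattern. The names (`run`, `step`) and the loop shape are hypothetical and only illustrate the idea; the actual change lives in the guards of `apply'`, shown in the diff further down.

```haskell
-- Minimal sketch of the yield-counter pattern (hypothetical names; the real
-- interpreter threads much more state through its application loop).
import qualified Control.Concurrent as CNC

yieldSteps :: Int
yieldSteps = 5000

-- `step` stands in for one interpreter transition. Each application
-- decrements the counter; at 0 we yield to the RTS scheduler and reset.
run :: Int -> s -> (s -> IO (Maybe s)) -> IO ()
run yld st step
  | yld <= 0 = do
      CNC.yield
      run yieldSteps st step
  | otherwise = do
      next <- step st
      case next of
        Nothing  -> pure ()
        Just st' -> run (yld - 1) st' step
```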

The goal is to avoid certain thread starvation scenarios. GHC's thread preemption is based on memory allocation: if a thread is executing code that does not allocate any memory, the RTS has no way of preempting it. It's actually possible to write loops in Unison that cause very little memory to be allocated, meaning that such threads could starve other threads.
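As an illustration of the underlying GHC behavior (my own example, not part of the PR): a tight allocation-free Haskell loop shows the same effect. Compiled with -O on a single capability, the forked loop contains no heap checks, so the allocation-triggered context switch never fires and the main thread's short sleep can stretch until the loop finishes (unless the module is built with -fno-omit-yields).

```haskell
-- Hypothetical demonstration of allocation-based preemption failing.
-- Build with: ghc -O Starve.hs (the default non-threaded RTS runs
-- everything on a single capability).
import Control.Concurrent (forkIO, threadDelay)

-- With -O this compiles to a loop over an unboxed Int with no allocation,
-- hence no heap-check points at which the RTS could preempt the thread.
spin :: Int -> Int
spin 0 = 0
spin n = spin (n - 1)

main :: IO ()
main = do
  _ <- forkIO (print (spin 2000000000)) -- tight, allocation-free loop
  threadDelay 1000                      -- asks for ~1ms; may wait much longer
  putStrLn "main finally ran"
```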

The counter itself seems to be free (actually, the altered code seems to be optimized better, making numeric code faster), so the only performance implication is the overhead of the yield itself, which is determined by how often it fires. A count of 5,000 is only a little slower on our benchmarks, but considerably improves latency on a thread starvation benchmark (from seconds to milliseconds).

Benchmark comparisons vs. the 0.5.44 release are below. I know it looks like most things got worse, but the release build actually got better timings than my local build from before I made these changes, so my local builds may just be a little slower in general. The ones that actually seem worse are the fibs and (for some reason) CAS on IORefs.

| Benchmark | 0.5.44 | This PR |
|---|---|---|
| Mutate a Ref 1000 times | 147.278µs | 154.553µs |
| Mutate a local Remote.Ref 10k times | 66.706991ms | 70.887834ms |
| Do 10k arithmetic operations | 100.687µs | 101.182µs |
| List.map increment (range 0 1000) | 199.117µs | 211.367µs |
| List.map murmurHash (range 0 1000) | 659.266µs | 719.155µs |
| Multimap.fromList (range 0 1000) | 92.674µs | 93.718µs |
| Stream functions | 8.625797ms | 9.027248ms |
| Value.serializeUncompressed (10k element map) | 24.309563ms | 25.521379ms |
| Value.serializeCompressed (10k element map) | 30.743942ms | 32.187171ms |
| Value.deserializeCompressed (10k element map) | 68.330335ms | 69.632748ms |
| Json.toText (per document) | 50.165µs | 51.022µs |
| Json parsing (per document) | 15.585µs | 14.898µs |
| Json complex parsing (per document) | 20.326µs | 19.479µs |
| Json complex decoding (per document) | 116.316µs | 115.395µs |
| Decode Nat | 194ns | 199ns |
| Generate 100 random numbers | 132.655µs | 129.236µs |
| List.foldLeft | 1.68029ms | 1.739753ms |
| Count to 1 million | 41.24927ms | 47.40688ms |
| Count to N (per element) | 96ns | 104ns |
| Count to 1000 | 96.397µs | 104.998µs |
| CAS an IO.ref 1000 times | 169.548µs | 178.61µs |
| List.range (per element) | 60ns | 56ns |
| List.range 0 1000 | 80.869µs | 79.87µs |
| Set.fromList (range 0 1000) | 40.203µs | 39.208µs |
| Map.fromList (range 0 1000) | 39.39µs | 38.721µs |
| NatMap.fromList (range 0 1000) | 3.425246ms | 3.577727ms |
| Map.lookup (1k element map) | 225ns | 229ns |
| Map.insert (1k element map) | 339ns | 344ns |
| Shuffle a 1000 element array | 2.085716ms | 2.093931ms |
| Mutably mergesort a 1000 element array | 2.778179ms | 2.788406ms |
| List.at (1k element list) | 192ns | 201ns |
| Text.split / | 3.566µs | 3.622µs |
| Two match | 6.362905ms | 6.738259ms |
| Four match | 6.410241ms | 6.758093ms |
| Thirty match | 5.887544ms | 6.216448ms |
| fib1 | 188.355µs | 192.991µs |
| fib2 | 472.384µs | 489.29µs |
| fib3 | 475.395µs | 494.204µs |

Member commented on lines +767 to +771:

```haskell
  | yld <= 0 = do
      CNC.yield
      apply' yieldSteps env henv activeThreads stk k ck args val
  | otherwise =
      apply' (yld - 1) env henv activeThreads stk k ck args val
```

Does it make any difference to swap these around? Put the most common case first.

Suggested change:

```diff
-  | yld <= 0 = do
-      CNC.yield
-      apply' yieldSteps env henv activeThreads stk k ck args val
-  | otherwise =
-      apply' (yld - 1) env henv activeThreads stk k ck args val
+  | yld > 0 = apply' (yld - 1) env henv activeThreads stk k ck args val
+  | otherwise = do
+      CNC.yield
+      apply' yieldSteps env henv activeThreads stk k ck args val
```

@dolio (Contributor, Author) replied:

It doesn't seem to make any difference.

What does make a difference is turning it into something like:

```haskell
do
  yld <- if yld > 0
           then pure $ yld - 1
           else yieldSteps <$ CNC.yield
  apply' yld ...
```

which is slower for some reason.

@pchiusano (Member) left a comment:

Looks good! Left one comment. @SystemFw give this a go when you have a chance.

@aryairani (Contributor) commented:

Should I hold off on merging until @SystemFw gives it a go, or is it okay? It's okay with me.

@pchiusano (Member) commented:

Okay, we're going to merge this, as it did help with the microbenchmark and seems like a good idea, but the specific issue we were encountering isn't fixed yet.

@pchiusano merged commit 16cf83b into trunk on Aug 4, 2025. 31 checks passed.
@pchiusano deleted the topic/interp-yield branch on August 4, 2025 at 21:35.