Improve performance of atomic load in toplevel code#47578
Conversation
|
It was somewhat intended. I wanted to move it before every CallInstr, instead of doing it after the calls or ccalls, so that it was synchronized with any other atomic barriers written by the user |
2277af7 to
fe9f220
Compare
I implemented that suggestion, @vtjnash. |
fe9f220 to
59dc30f
Compare
|
Is the linux32 GC corruption known? I don't see how it could be related... As this is a simple performance fix for an issue that could otherwise trip up people benchmarking code in the REPL (which seems like a common thing to do), I think it would be good to back-port this to 1.8 |
|
It might be an OOM error showing up as something else. Does retrying fix it? |
|
No, it seems to happen every time... |
59dc30f to
edcc9d2
Compare
|
After the rebase, CI looks better. |
|
The 32bit windows failure is still there, as well as on other PRs rebased ontop of this (as an example https://github.com/JuliaLang/julia/commits/vc/vtune) |
As noted in #47561, the performance of toplevel code is pretty bad because of the atomic barrier (loading the global world age into the task-local one) that's emitted after every instruction now. @vtjnash noted that monotonic ordering is sufficient, which recovers some performance (1.5s -> 0.7s). But the biggest improvement comes from not emitting the atomic operation between every statement, but only when we perform a call.
Fixes #47561