Skip to content

GC-preserve a Task's scope on start, to prevent freeing unrooted scopes via ScopedThunk#60620

Closed
NHDaly wants to merge 3 commits intomasterfrom
nhd-59483-unrooted-Task-scope-fix
Closed

GC-preserve a Task's scope on start, to prevent freeing unrooted scopes via ScopedThunk#60620
NHDaly wants to merge 3 commits intomasterfrom
nhd-59483-unrooted-Task-scope-fix

Conversation

@NHDaly
Copy link
Copy Markdown
Member

@NHDaly NHDaly commented Jan 9, 2026

Explanation of the issue in 59483:

When a task enters a new scope, the exception handler explicitly roots the new scope, but not the old scope.
The old scope doesn't need to be rooted, because it was already rooted when it was first entered
BUT, here's the issue:
A task can also inherit a scope from another task, when it's spawned!
In that case, if the parent task leaves the scope, it stops rooting it.
THEN if the child task enters a new scope, no one is rooting its old scope.
AND if the new scope is not a child of the old scope (i.e. via a ScopedThunk), then now no one is rooting the old scope.
So now the old scope can be freed.
And so when the task pops back, there's a crash.

This is now fixed by rooting the Scope when starting a new Task.

Fixes #59483.

@NHDaly NHDaly added the backport 1.12 Change should be backported to release-1.12 label Jan 9, 2026
@NHDaly
Copy link
Copy Markdown
Member Author

NHDaly commented Jan 9, 2026

Note for the backport (@KristofferC): You won't be able to backport the test because ScopedThunk doesn't exist on julia 1.12. You can write an equivalent test using ScopedValues.jl, but i think you could also just drop the test during the backport.

EDIT: @KristofferC I have adjusted the test so it should pass on 1.12 as well. Therefore this should be a clean backport.

@Keno
Copy link
Copy Markdown
Member

Keno commented Jan 9, 2026

Does it make more sense to instead preserve the previous scope when we install a new one? It feels a bit odd to double-preserve the current scope.

@vchuravy
Copy link
Copy Markdown
Member

vchuravy commented Jan 9, 2026

Looking at #52309 the interpreter version does have a GC preserve, so maybe we just need to emit a GC preserve in the codegen version?

@Keno
Copy link
Copy Markdown
Member

Keno commented Jan 9, 2026

Looking at #52309 the interpreter version does have a GC preserve, so maybe we just need to emit a GC preserve in the codegen version?

Both interpreter and codegen preserve the new scope. However, if a scope was inherited this can leave the initial scope of a task dangling, which is what this PR fixes. My proposal was to switch both interpreter and codegen to preserving the old scope.

@NHDaly
Copy link
Copy Markdown
Member Author

NHDaly commented Jan 9, 2026

Yeah that seems like it would be a reasonable approach as well, Keno, and it might be easier to understand. (It's what the LLMs wanted to do to fix this 😛)
I have no preference; I'm happy with whatever convention you all feel is best

@JeffBezanson
Copy link
Copy Markdown
Member

I agree it is more intuitive to preserve the old scope. Maybe we didn't do it that way because many times that would just be preserving nothing though?

@Keno
Copy link
Copy Markdown
Member

Keno commented Jan 9, 2026

Maybe we didn't do it that way because many times that would just be preserving nothing though?

I think I just wasn't thinking carefully - let's try preserving the old scope - makes more sense to me and lets the optimizer delete it if there are no safepoints.

@vchuravy
Copy link
Copy Markdown
Member

Ah, yeah I missed that detail.

old_scope = current_task.scope
try
    current_task.scope = new_scope # Rooted through task object
finally 
    current_task.scope = old_scope
end

Is the behavior that we implement, and indeed we should only root the old_scope across the try ... finally (and it would have normally if this was Julia code) The new_scope should only be rooted through the task.

I guess the only thing that confuses me is that I would have expected GC lowering to put the old_scope onto the shadow stack?

@NHDaly
Copy link
Copy Markdown
Member Author

NHDaly commented Jan 12, 2026

@Keno / @JeffBezanson / @vchuravy: My instinct would be to merge this PR as-is, now, since it is a minimal change, we know it fixes the issue, and it can be easily backported to 1.12 as a bug-fix.

Then, maybe in a separate PR we can make the changes you're suggesting above, which would change the "calling convention" for GC-preserving the scope? I feel like that is a slightly bigger change, and i'd rather make that only on master, rather than backporting it to 1.12 before we've had a chance to identify any unintended consequences of that change / or any cases we've missed.

Does that sound okay to you all? Any objections to merging this one now, and then following up with the proposal above?

EDIT: I'll start a branch now to make the proposed change.

@NHDaly
Copy link
Copy Markdown
Member Author

NHDaly commented Jan 12, 2026

The new_scope should only be rooted through the task.

E.g. I'm confused about this part: Are you sure we don't also need to root the new task? From what i've read, we don't scan all of the Tasks' scope fields.
Or do we just also need to insert a write-barrier here from the Task to the new scope?
These kinds of details are why i don't want to change this to a new approach in this bugfix PR - i think that should be separate.

EDIT: I think as long as we also do #60617, then it should be okay to only root the old scope. Is that right?

@vchuravy
Copy link
Copy Markdown
Member

No, I don't think we should merge this PR as is. The change on its own is quite weird.

We must preserve the old scope across a restore point, and here we a preserving the new scope. The new scope shold be rooted through the task object itself, so it's very counter intuitive to have it put onto a GC frame again.

If we don't preserve the old scope correctly there might be other opportunities for bugs, What about nested with scopes?

@Keno
Copy link
Copy Markdown
Member

Keno commented Jan 12, 2026

No, we shouldn't ship disparate fixes to different versions - that leaves the potential for bugs in the backports that are not present on master, which is always annoying.

@vchuravy
Copy link
Copy Markdown
Member

Or do we just also need to insert a write-barrier here from the Task to the new scope?

A write barrier should be inserted, it's technically not needed, since the task is young and thus the wb is a no-op, but it's cleaner to have it here.

@Keno
Copy link
Copy Markdown
Member

Keno commented Jan 12, 2026

Write barrier fix is in #60617, which is also needed.

@NHDaly
Copy link
Copy Markdown
Member Author

NHDaly commented Jan 12, 2026

@vchuravy I think the WB fix is needed, because if the Task is promoted to old (and then it's switched away) then nothing is rooting the Scope, so it could be freed if the scope is young but the task is old?

@NHDaly
Copy link
Copy Markdown
Member Author

NHDaly commented Jan 12, 2026

@Keno: I'm not sure i agree with that principle. Different versions already are different since they have different features and refactorings merged. I feel like backporting the most minimal fix to the code that's present on that version is always preferable to backporting larger changes, since larger changes introduce the possibility of introducing new bugs in the patch releases (which TBH we see happen all the time).

I do agree with you guys that the proposed change would make the code more sensible overall, but it will be a bigger change, and it requires #60617 as well. (I'm working on that now, in another branch, but not sure i've gotten it right yet.)

@KristofferC: Since you are the backport-king, do you have a preference?

@Keno
Copy link
Copy Markdown
Member

Keno commented Jan 12, 2026

and it requires #60617 as well

Both changes require that PR

@NHDaly
Copy link
Copy Markdown
Member Author

NHDaly commented Jan 12, 2026

@Keno can you explain why? it seemed like this PR fixes the issue on its own. But are you saying there's another possible issue as well that is fixed by #60617?

@NHDaly
Copy link
Copy Markdown
Member Author

NHDaly commented Jan 12, 2026

Also, just to make sure: Do we actually scan the Task's scope field during the GC mark phase?
If not, we will need to add that as part of the new approach as well, yes?
So far, I don't think we do?

vchuravy added a commit that referenced this pull request Jan 12, 2026
Fixes
#60620 (comment)
previously we were preserving the new scope across the exception
handler, instead of the scope that was actually stored in the eh
context.

The additionaly write_barriers are probably not necessary (the task
should always be young.

Alternative to #60620
@Keno
Copy link
Copy Markdown
Member

Keno commented Jan 12, 2026

Also, just to make sure: Do we actually scan the Task's scope field during the GC mark phase?
If not, we will need to add that as part of the new approach as well, yes?
So far, I don't think we do?

We scan all pointer fields:

julia/src/gc-stock.c

Lines 2360 to 2370 in 64d4518

uint8_t *obj8_begin = (uint8_t *)jl_dt_layout_ptrs(layout);
uint8_t *obj8_end = obj8_begin + npointers;
// assume tasks always reference young objects: set lowest bit
uintptr_t nptr = (npointers << 2) | 1 | bits;
new_obj = gc_mark_obj8(ptls, obj8_parent, obj8_begin, obj8_end, nptr);
if (new_obj != NULL) {
if (!remset_object)
goto mark_obj;
else
gc_ptr_queue_push(mq, new_obj);
}

@vchuravy
Copy link
Copy Markdown
Member

Does #60647 make sense to you @NHDaly?

The crux is that jl_enter_handler stores ct->scope, but we only preserve the new scope (and thus lose the root to the old scope).

And yes we scan all pointer fields of a task.

@Keno
Copy link
Copy Markdown
Member

Keno commented Jan 12, 2026

@Keno can you explain why? it seemed like this PR fixes the issue on its own. But are you saying there's another possible issue as well that is fixed by #60617?

There are two unrelated bugs, but both can result in the scope (or for #60617 any other object referenced by the task, including on the stack) to not be marked under specific circumstances. This fix is likely more frequent, but we suspect the other is an issue as well.

@NHDaly
Copy link
Copy Markdown
Member Author

NHDaly commented Jan 12, 2026

Okay excellent. Thanks folks. Yes, @vchuravy the fix in #60647 makes sense to me. I had basically the same change locally and was about to push a branch, but i was working on how to reword the comment. :) I've left that as a suggestion on your PR. I'll approve as well.

NHDaly pushed a commit to RelationalAI/julia that referenced this pull request Jan 12, 2026
Fixes
JuliaLang#60620 (comment)
previously we were preserving the new scope across the exception
handler, instead of the scope that was actually stored in the eh
context.

The additionaly write_barriers are probably not necessary (the task
should always be young.

Alternative to JuliaLang#60620
@vchuravy vchuravy added backport 1.11 Change should be backported to release-1.11 backport 1.13 Change should be backported to release-1.13 labels Jan 13, 2026
@vchuravy

This comment was marked as outdated.

@vchuravy
Copy link
Copy Markdown
Member

Replaced by #60647

@vchuravy vchuravy closed this Jan 13, 2026
vchuravy added a commit that referenced this pull request Jan 13, 2026
Fixes
#60620 (comment)
previously we were preserving the new scope across the exception
handler, instead of the scope that was actually stored in the eh
context.

The additionaly write_barriers are probably not necessary (the task
should always be young.

Alternative to #60620
@vchuravy vchuravy deleted the nhd-59483-unrooted-Task-scope-fix branch January 14, 2026 13:49
@KristofferC KristofferC removed the backport 1.12 Change should be backported to release-1.12 label Feb 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport 1.11 Change should be backported to release-1.11 backport 1.13 Change should be backported to release-1.13

Projects

None yet

Development

Successfully merging this pull request may close these issues.

GC error on Julia 1.12

5 participants