Refactor how IoContext aborts are implemented by kentonv · Pull Request #4165 · cloudflare/workerd

kentonv · 2025-05-18T19:32:37Z

IoContext abort logic evolved gradually and, much like evolution, the result was a mess.

This PR attempts to improve this.

Note that this is based on top of #4161; please review that first, and I'll keep this one as a draft in the meantime.

Aside from cleaning things up, this fixes two bugs:

IoContext::abortException was only set for aborts caused by someone explicitly calling IoContext::abort(), but many kinds of aborts (such as input gate and output gate failures) did not come in through this path. So, the previous attempt to make run() rethrow the abort exception if called post-abort did not always work. Now, all kinds of aborts call abort(), so the exception propagates properly.
In actors, aborting a request did not actually cancel I/O tasks, bit it did cancel the drain() thus causing all IncomingRequests to end. If someone continued to hold open a stub to the actor, I/O tasks could keep running and then generate spurious expected !incomingRequests.empty() errors later when they tried to re-enter the isolate.
There were various race where an aborted context might throw some exception at the client that was not the abort exception, such as throwing "This promise will never resolve" or "The script will never generate a response". Many of these should be fixed now to actually propagate the abort exception.

src/workerd/api/actor-state.c++

src/workerd/api/global-scope.c++

src/workerd/io/io-context.h

kentonv · 2025-05-20T01:36:36Z

The internal build will fail until internal PR 10559 is merged to improve the output gate test. That PR can be merged first; it doesn't depend on this PR, it just enables this PR.

src/workerd/io/io-context.c++

src/workerd/api/actor-state.c++

kentonv · 2025-05-20T23:05:53Z

Just rebased. Unfortunately there's one test in the internal codebase that does depend on that error message so I had to go open a PR for that...

irvinebroque

There were various race where an aborted context might throw some exception at the client that was not the abort exception, such as throwing "This promise will never resolve" or "The script will never generate a response". Many of these should be fixed now to actually propagate the abort exception.

This is awesome — is there a basic example we could use in a changelog entry?

Want to highlight to customers that we are fixing things that result in opaque errors for them.

src/workerd/io/io-context.c++

Will make subsequent changes easier.

Prior to this, `abort()` was just one one way of fulfilling the abort promise, but other ways were joined into the promise in the constructor. Confusingly, this meant that `abortException` was not set correctly when aborting for any of those other reasons. Now the constructor calls `abortWhen()` to wait on each of those promises that were previously joined in. When any promise passed to `abortWhen()` eventually rejects, `abort()` is called.

I can't think of any reason why this was done in a different place, so let's consolidate.

… as well.

I couldn't find any existing test.

I've always wanted to do this. Node's process.exit() already figured it out, so this copies its technique. Relatedly, we can have `runImpl()` rethrow the correct abort exception, yay.

This means they don't need the `finalize()` mechanism anymore. Also this means we don't need to separately join with `onAbort()` in `WorkerEntrypoint::request()`, becaues the thing `ServiceWorkerGlobalScope::request()` returns is already an `awaitJs()` promise.

Note: This has nothing to do with GC finalization. This was sort of an alternative abort mechanism where we could cause all promises exported from the IoContext to reject when we detected a hang via PendingEvents. We can just use `abort()` instead.

It doesn't really make sense that we'd null it out specifically after a hang but not in other cases. Also remove the seemingly-irrelevant nulling out of `promiseContextTag` after a hang.

Fixes the errors in our test.

This should prevent lingering tasks from failing later due to there being no IncomingRequest. I'm pretty sure the only way that happens is when the actor was aborted.

There are now two methods: * js.terminateNextExecution(), which has the previous behavior. * js.terminateExecutionNow(), which forces the termination to happen synchronously and always throws. This lets us take the hack that had proliferated to three different places and keep it in one place.

balusch · 2025-09-03T14:20:32Z

src/workerd/jsg/jsg.c++

+
+  // HACK: This has been observed to reliably make V8 check the termination flag and raise the
+  //   uncatchable termination exception.
+  jsg::check(v8::JSON::Stringify(v8Context(), str()));


That's an interesting (and weird, as always in V8) finding. Is there any V8 doc or discussion about it?

Nope. It was just the first thing that worked and we didn't spend too much time thinking about it. Probably there are many V8 APIs we could call which should cause it to check the termination flag.

Of course, it would be better to add an explicit API to V8 to ask it to check the flag but that takes time and there's so much to do. :) If this stops working the tests will tell us.

Yeah, having an explicit API would definitely be ideal, though I imagine it wouldn’t be easy to get into V8. Still, I’m impressed you guys found this workaround — thanks for sharing it!

kentonv requested a review from ObsidianMinor May 18, 2025 19:32

ObsidianMinor reviewed May 19, 2025

View reviewed changes

src/workerd/api/actor-state.c++ Outdated Show resolved Hide resolved

ObsidianMinor reviewed May 19, 2025

View reviewed changes

src/workerd/api/global-scope.c++ Outdated Show resolved Hide resolved

ObsidianMinor reviewed May 19, 2025

View reviewed changes

src/workerd/io/io-context.h Show resolved Hide resolved

kentonv mentioned this pull request May 19, 2025

Wrap top-level server-side RPC call in RevokerMembrane #4127

Closed

Base automatically changed from kenton/reentry-callback to main May 20, 2025 01:11

kentonv force-pushed the kenton/refactor-abort branch from f8f8b6a to 482bf4d Compare May 20, 2025 01:33

kentonv marked this pull request as ready for review May 20, 2025 01:34

kentonv requested review from a team as code owners May 20, 2025 01:34

kentonv commented May 20, 2025

View reviewed changes

src/workerd/io/io-context.c++ Outdated Show resolved Hide resolved

jasnell reviewed May 20, 2025

View reviewed changes

src/workerd/api/actor-state.c++ Outdated Show resolved Hide resolved

kentonv force-pushed the kenton/refactor-abort branch from 482bf4d to 6f4aa13 Compare May 20, 2025 16:01

jasnell approved these changes May 20, 2025

View reviewed changes

ObsidianMinor approved these changes May 20, 2025

View reviewed changes

kentonv force-pushed the kenton/refactor-abort branch from 6f4aa13 to c7df745 Compare May 20, 2025 23:03

irvinebroque reviewed May 21, 2025

View reviewed changes

src/workerd/io/io-context.c++ Outdated Show resolved Hide resolved

kentonv force-pushed the kenton/refactor-abort branch 2 times, most recently from 5b82ef4 to b7b0431 Compare May 21, 2025 22:07

kentonv added 9 commits May 22, 2025 18:10

Cleanup: Use onAbort() rather than abortPromise.addBranch() directly.

38acd93

Will make subsequent changes easier.

Refactor: Watch OutputGate brokenness in IoContext instead of Actor.

5706658

I can't think of any reason why this was done in a different place, so let's consolidate.

Refactor: Report actor constructor failure through IoContext::abort()…

134092c

… as well.

Test: Add a test for "The script will never generate a response."

3a548d6

I couldn't find any existing test.

Refactor: Make ctx.abort() actually terminate execution.

aef30c5

I've always wanted to do this. Node's process.exit() already figured it out, so this copies its technique. Relatedly, we can have `runImpl()` rethrow the correct abort exception, yay.

Cleanup: Don't make IoContext::tasks nullable.

41b8dd6

It doesn't really make sense that we'd null it out specifically after a hang but not in other cases. Also remove the seemingly-irrelevant nulling out of `promiseContextTag` after a hang.

kentonv added 4 commits May 22, 2025 18:10

Pass actual abort reason to canceler in IoContext destructor.

c7ed7f0

Fixes the errors in our test.

Cancel all tasks in IncomingRequest destructor if aborted.

f1a1a0e

This should prevent lingering tasks from failing later due to there being no IncomingRequest. I'm pretty sure the only way that happens is when the actor was aborted.

Improve the error message for hung worker event.

0542d70

kentonv force-pushed the kenton/refactor-abort branch from b7b0431 to 44d3004 Compare May 22, 2025 23:10

kentonv merged commit cc0d789 into main May 23, 2025
18 checks passed

kentonv deleted the kenton/refactor-abort branch May 23, 2025 01:28

balusch reviewed Sep 3, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor how IoContext aborts are implemented#4165

Refactor how IoContext aborts are implemented#4165
kentonv merged 13 commits intomainfrom
kenton/refactor-abort

kentonv commented May 18, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kentonv commented May 20, 2025

Uh oh!

Uh oh!

Uh oh!

kentonv commented May 20, 2025

Uh oh!

irvinebroque left a comment

Uh oh!

Uh oh!

Uh oh!

balusch Sep 3, 2025

Uh oh!

kentonv Sep 3, 2025

Uh oh!

balusch Sep 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

kentonv commented May 18, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kentonv commented May 20, 2025

Uh oh!

Uh oh!

Uh oh!

kentonv commented May 20, 2025

Uh oh!

irvinebroque left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

balusch Sep 3, 2025

Choose a reason for hiding this comment

Uh oh!

kentonv Sep 3, 2025

Choose a reason for hiding this comment

Uh oh!

balusch Sep 4, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants