Refactor actor handling in workerd, including some bugfixes by kentonv · Pull Request #4041 · cloudflare/workerd

kentonv · 2025-04-28T22:51:47Z

This started with me refactoring the actor handling in server.c++ to make it more amenable to changes I want to make.

Along the way, I realized there were actual bugs here, and fixed a few:

- In production, when an actor becomes broken, all existing stubs become broken. Reusing the stubs continuously throws the same error. But in workerd, this was not the case: instead, the next call on a stub would start a new instance of the actor, "fixing itself". It's important that workerd match production so that people can properly test their code, so this fixes the bug.
- RPC calls on broken stubs were producing "internal error", and under the hood a "PromiseFulfiller was never fulfilled" Sentry error. They should now rethrow the correct exception. (This affects production, too.)
- The type of `ctx.id` in a Durable Object is supposed to be the special `DurableObjectId` type, but due to a bug introduced years ago, in workerd it was being filled in as a plain string.

There are a lot of commits here because I tried to keep each one small and easy to review. They should go fast.

Keep in mind server.c++ only applies to workerd -- this code is not used in production.

I believe @MellowYarker is most familiar with this code so should be the main reviewer. Adding @shrima-cf because I think she recently hit these bugs while writing tests.

An RPC call on an already-broken stub should throw the brokenness reason, but instead it threw an "internal error" which was a "PromiseFulfiller not fulfilled" error under the hood. I'm fixing that here because I need it fixed for a test I'm writing in a later commit.

In production, when an actor breaks, any stubs pointing at it are permanently broken. Using them again will repeatedly rethrow the broken reason. In workerd, though, the stub would "repair itself", switching to a new DO on subsequent requests. This is incorrect! This commit makes workerd match production.
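To make the difference concrete, here is a toy model of the two stub behaviors. It is a sketch only: `ToyActor`, `ToyNamespace`, `SelfRepairingStub`, `PinnedStub` and `demoBrokenStub` are hypothetical names invented for illustration, not workerd or Workers runtime APIs, and the model assumes that a stub obtained after the break reaches a fresh instance.

```typescript
// Toy model only; none of these names exist in workerd.
class ToyActor {
  id: string;
  generation: number;
  brokenReason: Error | undefined = undefined;

  constructor(id: string, generation: number) {
    this.id = id;
    this.generation = generation;
  }

  handle(request: string): string {
    // A broken actor rethrows its brokenness reason on every call.
    if (this.brokenReason) throw this.brokenReason;
    return `${this.id}#${this.generation}: ${request}`;
  }

  breakWith(reason: Error): void {
    this.brokenReason = reason;
  }
}

class ToyNamespace {
  private actors = new Map<string, ToyActor>();
  private nextGeneration = 1;

  // Returns the live instance, starting a fresh one if none exists or the old one broke.
  getOrStart(id: string): ToyActor {
    const existing = this.actors.get(id);
    if (existing && !existing.brokenReason) return existing;
    const fresh = new ToyActor(id, this.nextGeneration++);
    this.actors.set(id, fresh);
    return fresh;
  }
}

// Models the old workerd behavior: the actor is looked up again on every call,
// so a broken actor is silently replaced.
class SelfRepairingStub {
  private ns: ToyNamespace;
  private id: string;
  constructor(ns: ToyNamespace, id: string) {
    this.ns = ns;
    this.id = id;
  }
  call(request: string): string {
    return this.ns.getOrStart(this.id).handle(request);
  }
}

// Models the production behavior: the stub stays bound to the instance it first reached.
class PinnedStub {
  private ns: ToyNamespace;
  private id: string;
  private actor: ToyActor | undefined = undefined;
  constructor(ns: ToyNamespace, id: string) {
    this.ns = ns;
    this.id = id;
  }
  call(request: string): string {
    if (!this.actor) this.actor = this.ns.getOrStart(this.id);
    return this.actor.handle(request);
  }
}

function demoBrokenStub(): string[] {
  const ns = new ToyNamespace();
  const pinned = new PinnedStub(ns, "a");
  const selfRepairing = new SelfRepairingStub(ns, "a");
  const results: string[] = [];
  const attempt = (stub: { call(request: string): string }): void => {
    try {
      results.push(stub.call("ping"));
    } catch (e) {
      results.push(`threw: ${(e as Error).message}`);
    }
  };

  attempt(pinned); // starts generation 1
  ns.getOrStart("a").breakWith(new Error("actor broke"));
  attempt(pinned); // rethrows the broken reason
  attempt(pinned); // and keeps rethrowing it
  attempt(selfRepairing); // quietly lands on generation 2
  return results;
}
```

The pinned stub is the behavior a test should be able to rely on: once the actor breaks, the caller has to obtain a new stub to reach a new instance.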

This API wasn't actually aborting the actors, it was just leaving them unreachable. With the recent changes, the existing stubs would continue pointing at the old instances of the objects, which still worked. But even before those changes, there would have been a problem if the actors were doing work in the background -- that work would keep running even after they were supposedly aborted, even as new instances could start up in parallel, leading to split brain.
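The split-brain risk can be illustrated with a small synchronous model. This is a sketch under assumptions: `ToyBackgroundActor`, `ToyRegistry`, `unlinkOnly`, `abort` and `writersAfterReset` are hypothetical names for illustration, and background work is reduced to a single `tick()` that appends to a shared log.

```typescript
// Toy model only; none of these names exist in workerd.
class ToyBackgroundActor {
  generation: number;
  aborted = false;

  constructor(generation: number) {
    this.generation = generation;
  }

  // One step of background work: record which instance wrote, unless aborted.
  tick(log: number[]): void {
    if (!this.aborted) log.push(this.generation);
  }
}

class ToyRegistry {
  private current: ToyBackgroundActor | undefined = undefined;
  private nextGeneration = 1;

  get(): ToyBackgroundActor {
    if (!this.current) this.current = new ToyBackgroundActor(this.nextGeneration++);
    return this.current;
  }

  // Models the bug: the registry forgets the actor, but the actor keeps running.
  unlinkOnly(): void {
    this.current = undefined;
  }

  // Models the fix: the actor is stopped before the registry forgets it.
  abort(): void {
    if (this.current) this.current.aborted = true;
    this.current = undefined;
  }
}

// Returns the generations that still wrote to the log after a reset.
function writersAfterReset(reset: (registry: ToyRegistry) => void): number[] {
  const registry = new ToyRegistry();
  const log: number[] = [];
  const oldInstance = registry.get();
  reset(registry);
  const newInstance = registry.get();
  oldInstance.tick(log);
  newInstance.tick(log);
  return log;
}
```

With `unlinkOnly()` both generations write to the log, which is the split-brain case described above; with `abort()` only the new generation does.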

In a Durable Object, the type of `ctx.id` is supposed to be of type `DurableObjectId`, and in production, it is. Way way back in #605, a bug was introduced such that `ctx.id` became a string for Durable Objects in workerd. But `ctx.id` is only supposed to be a string for "ephemeral objects" (aka colo-local actors). Somehow, nobody ever noticed? Probably because the only thing most people would do with it anyway is stringify it, which JavaScript is really good at doing automatically. Anyway, this fixes the bug, restoring the proper type. (Also this tidies up ownership of the key strings in the actor map. The key is now owned by the ActorContainer.)
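A short sketch of why the wrong type went unnoticed. `IdLike` and `ToyId` are hypothetical stand-ins that model only the `toString()` and `equals()` methods of the real ID type, and `label` and `hasEquals` are illustrative helpers, not runtime APIs.

```typescript
// Stand-in for the runtime's ID type; only two methods are modeled here.
interface IdLike {
  toString(): string;
  equals(other: IdLike): boolean;
}

class ToyId implements IdLike {
  private hex: string;
  constructor(hex: string) {
    this.hex = hex;
  }
  toString(): string {
    return this.hex;
  }
  equals(other: IdLike): boolean {
    return other.toString() === this.hex;
  }
}

// Implicit stringification gives the same text for a string and an ID object,
// which is why most code could not tell the two apart.
function label(id: IdLike | string): string {
  return `object ${id}`;
}

// Code that calls methods on the ID only works when it really is an object.
function hasEquals(id: IdLike | string): boolean {
  return typeof id !== "string" && typeof id.equals === "function";
}
```

Here `label("abc")` and `label(new ToyId("abc"))` produce the same string, while `hasEquals("abc")` is false, so only code that called methods on the ID would have hit the bug.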

kentonv · 2025-04-29T14:16:20Z

(Rebased on main in hopes of fixing internal build.)

The commit that eliminated ActorContainerRef accidentally removed the code to update the last-access time. This restores it in the appropriate places. I'm tacking this on the end of the PR as going back and rewriting history would be tedious at this point.

kentonv · 2025-04-30T20:51:16Z

The internal build had passed, but after my last commit it broke again, not because of anything I changed, but because it always uses the latest version of the internal codebase, which has had breaking changes since the commit that my workerd PR is based on.

I'm not going to rebase again for now -- let's just assume the internal build is not actually broken by my change.

kentonv · 2025-05-02T14:36:55Z

@MellowYarker ping?

MellowYarker · 2025-05-02T23:09:50Z

@kentonv sorry, didn't see this at all, I actually just saw the internal PR, was interested, and then followed the trail of breadcrumbs and found this PR. Will take a look some time tomorrow.

MellowYarker

Pretty much LGTM, though there's a lot of changes so I want to take another pass before approving. Thanks for cleaning this code up, and sorry for not reviewing earlier. I need to clean up my github notifications...

src/workerd/server/server.c++

MellowYarker

Thanks for the cleanup!

kentonv · 2025-05-07T19:48:05Z

For some reason GitHub isn't accepting @MellowYarker as an approver. This is obviously wrong, so I am going to use my powers to bypass the merge check.

kentonv requested review from MellowYarker and shrima-cf April 28, 2025 22:51

kentonv requested review from a team as code owners April 28, 2025 22:51

kentonv force-pushed the kenton/refactor-actor branch 2 times, most recently from 147c1fb to 22e635e April 29, 2025 00:43

kentonv mentioned this pull request Apr 29, 2025

Implement Durable Object alarms in workerd #605

Merged

kentonv added 21 commits April 29, 2025 09:10

Cleanup: Remove redundant map erase in ~ActorContainer.

ba7da24

The map values are type `Own<ActorContainer>`. If we're in `ActorContainer`'s destructor, then the map entry must already have been erased, obviously.

Cleanup: Remove redundant IIFE.

3f30574

AFAICT this IIFE didn't accomplish anything as its result was immediately returned.

Refactor: getActorImpl() only needs a lock to create a new Actor.

c391f18

We can create the ActorContainer and even return an existing Actor without taking the isolate lock, as we aren't executing any JS code in the target isolate.

Refactor: Don't use [&] capture for makeActorCache.

d791eff

Since we have to make a copy of the ID anyway, capture the copy upfront.

Refactor: Move onBroken monitoring into ActorContainer.

7598f5d

Keeping a whole separate table for these, keyed on the same string, is a waste. It would be pretty bad if they got out-of-sync anyway.

Move: ActorContainer::handleShutdown() to private.

6fd1dcd

Pure code move, no changes.

Refactor: Make ActorContainer::actor private.

7111e69

Cleanup: Remove ActorContainer::onBrokenTriggered.

859b616

Turns out this is no longer used.

Test: Add test coverage for aborting an actor with hibernated WebSockets.

9e6677f

I want to make sure I don't break this with subsequent changes.

Refactor: Prepare for ActorContainer to live past becoming broken.

9b45f4a

In the next commit I'll be removing ActorContainerRef and just make ActorContainer itself refcounted. This lays some groundwork for that.

Refactor: Delete ActorContainerRef. Just make ActorContainer refcounted.

22b3d64

This seems more straightforward.

Refactor: Convert getActorImpl() into coroutine.

243da8e

Refactor: Move bulk of getActorImpl() into ActorContainer::getActor().

37eb570

The body of `start()` is entirely moved code with no logic changes.

Refactor: Inline ActorContainer::tryGetActor() and setActor().

01eb05e

`getActor()` is the only public interface now.

Cleanup: Make sure ActorContainer::start() is only called once at a time.

9207d52

Refactor: Make public method ActorNamespace::getActorContainer().

9917aab

In the process, get rid of getActorImpl() and GetActorResult.

Refactor: Move startRequest logic to ActorContainer.

3c0543c

kentonv force-pushed the kenton/refactor-actor branch from 22e635e to 4f936ed April 29, 2025 14:11

kentonv mentioned this pull request May 2, 2025

Don't take locks in Worker::Actor's constructor or destructor #4080

Merged

MellowYarker reviewed May 3, 2025

Improve comments per code review feedback.

da8178d

MellowYarker approved these changes May 7, 2025

kentonv merged commit 2958fd7 into main May 7, 2025
18 checks passed

kentonv deleted the kenton/refactor-actor branch May 7, 2025 19:48

kentonv merged 23 commits into main from kenton/refactor-actor
