Skip to content

Refactor actor handling in workerd, including some bugfixes#4041

Merged
kentonv merged 23 commits intomainfrom
kenton/refactor-actor
May 7, 2025
Merged

Refactor actor handling in workerd, including some bugfixes#4041
kentonv merged 23 commits intomainfrom
kenton/refactor-actor

Conversation

@kentonv
Copy link
Member

@kentonv kentonv commented Apr 28, 2025

This started with me refactoring the actor handling in server.c++ to make it more amenable to changes I want to make.

Along the way, I realized there were actual bugs here, and fixed a few:

  1. In production, when an actor becomes broken, all existing stubs become broken. Reusing the stubs continuously throws the same error. But in workerd, this was not the case: instead, the next call on a stub would start a new instance of the actor, "fixing itself". It's important that workerd match production so that people can properly test their code, so this fixes the bug.
  2. RPC calls on broken stubs were producing "internal error", and under the hood a "PromiseFulfiller was never fulfilled" Sentry error. They should now rethrow the correct exception. (This affects production, too.)
  3. The type of ctx.id in a Durable Object is supposed to be the special DurableObjectId type, but due to a bug introduced years ago, in workerd it was being filled in as a plain string.

There are a lot of commits here because I tried to keep each one small and easy to review. They should go fast.

Keep in mind server.c++ only applies to workerd -- this code is not used in production.

I believe @MellowYarker is most familiar with this code so should be the main reviewer. Adding @shrima-cf because I think she recently hit these bugs while writing tests.

@kentonv kentonv requested review from a team as code owners April 28, 2025 22:51
@kentonv kentonv force-pushed the kenton/refactor-actor branch 2 times, most recently from 147c1fb to 22e635e Compare April 29, 2025 00:43
kentonv added 21 commits April 29, 2025 09:10
The map values are type `Own<ActorContainer>`. If we're in `ActorContainer`'s destructor, then the map entry must already have been erased, obviously.
AFAICT this IIFE didn't accomplish anything as its result was immediately returned.
We can create the ActorContainer and even return an existing Actor without taking the isolate lock, as we aren't executing any JS code in the target isolate.
Since we have to make a copy of the ID anyway, capture the copy upfront.
Keeping a whole separate table for these, keyed on the same string, is a waste. It would be pretty bad if they got out-of-sync anyway.
Turns out this is no longer used.
…ets.

I want to make sure I don't break this with subsequent changes.
In the next commit I'll be removing ActorContainerRef and just make ActorContainer itself refcounted. This lays some groundwork for that.
The body of `start()` is entirely moved code with no logic changes.
`getActor()` is the only public interface now.
In the process, get rid of getActorImpl() and GetActorResult.
An RPC call on an already-broken stub should throw the brokenness reason, but instead it threw an "internal error" which was a "PromiseFulfiller not fulfilled" error under the hood.

I'm fixing that here because I need it fixed for a test I'm writing in a later commit.
In production, when an actor breaks, any stubs pointing at it are permanently broken. Using them again will repeatedly rethrow the broken reason.

In workerd, though, the stub would "repair itself", switching to a new DO on subsequent requests. This is incorrect!

This commit makes workerd match production.
This API wasn't acutally aborting the actors, it was just leaving them unreachable. With the recent changes, the existing stubs would continue pointing at the old instances of the objects, which still worked. But even before those changes, there would have been a problem if the actors were doing work in the background -- that work would keep running even after they were supposedly aborted, even as new instances could start up in parallel, leading to split brain.
In a Durable Object, the type of `ctx.id` is supposed to be of type `DurableObjectId`, and in production, it is.

Way way back in #605, a bug was introduced such that `ctx.id` became a string for Durable Objects in workerd. But `ctx.id` is only supposed to be a string for "ephemeral objects" (aka colo-local actors).

Somehow, nobody ever noticed? Probably because the only thing most people would do with it anyway is stringify it, which JavaScript is really good at doing automatically.

Anyway, this fixes the bug, restoring the proper type.

(Also this tidies up ownership of the key strings in the actor map. The key is now owned by the ActorContainer.)
@kentonv kentonv force-pushed the kenton/refactor-actor branch from 22e635e to 4f936ed Compare April 29, 2025 14:11
@kentonv
Copy link
Member Author

kentonv commented Apr 29, 2025

(Rebased on main in hopes of fixing internal build.)

The commit that eliminated ActorContainerRef accidentally removed the code to update the last-access time. This restores it in the appropriate places. I'm tacking this on the end of the PR as going back and rewriting history would be tedious at this point.
@kentonv
Copy link
Member Author

kentonv commented Apr 30, 2025

The internal build had passed, but after my last commit it broke again, not because of anything I changed, but because it always uses the latest version of the internal codebase, which has had breaking changes since the commit that my workerd PR is based on.

I'm not going to rebase again for now -- let's just assume the internal build is not actually broken by my change.

@kentonv
Copy link
Member Author

kentonv commented May 2, 2025

@MellowYarker ping?

@MellowYarker
Copy link
Contributor

@kentonv sorry, didn't see this at all, I actually just saw the internal PR, was interested, and then followed the trail of breadcrumbs and found this PR. Will take a look some time tomorrow.

Copy link
Contributor

@MellowYarker MellowYarker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty much LGTM, though there's a lot of changes so I want to take another pass before approving. Thanks for cleaning this code up, and sorry for not reviewing earlier. I need to clean up my github notifications...

Copy link
Contributor

@MellowYarker MellowYarker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the cleanup!

@kentonv
Copy link
Member Author

kentonv commented May 7, 2025

For some reason GitHub isn't accepting @MellowYarker as an approver. This is obviously wrong, so I am going to use my powers to bypass the merge check.

@kentonv kentonv merged commit 2958fd7 into main May 7, 2025
18 checks passed
@kentonv kentonv deleted the kenton/refactor-actor branch May 7, 2025 19:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants