EW-8447 Fix CPU profiling by harrishancock · Pull Request #2497 · cloudflare/workerd

harrishancock · 2024-08-08T09:57:50Z

When breakpoint debugging was introduced, it added a new thread, the InspectorService thread. This thread is necessary in order to continue performing WebSocket I/O while the Isolate is paused at a breakpoint. The feature also moved inspector message dispatching onto this same InspectorService thread. For most inspector messages, this turns out to be benign, but not for CPU profiling start/stop messages: those must be dispatched on the thread which we desire to profile. The end result was that breakpoint debugging caused us to profile JavaScript on the InspectorService thread (e.g., anything executed on the Chrome Devtools console) instead of the Isolate thread, as originally intended.

This turns out to be relatively easy to fix, because breakpoint debugging would already dispatch inspector messages on the Isolate thread in a specific case: when the Isolate was paused at a breakpoint, meaning there are already exists some machinery to pass inspector messages across threads. This commit simply makes it so that all messages are passed to the Isolate thread for dispatching, instead of only when paused at a breakpoint.

We (obviously) do not have any tests for CPU profiling, or else we would have caught this issue much earlier. This commit does not add any such tests. Since we'd like to fix this issue sooner than later, my intent is to land this commit first and add tests later -- we will not close related tickets until such tests are implemented.

I have smoke tested both profiling and breakpoint debugging; both work with the change in this commit.

danlapid · 2024-08-08T10:25:44Z

src/workerd/io/worker.c++

+      // Although inspector I/O must happen on the InspectorService thread (to make sure breakpoints
+      // don't block inspector I/O), inspector messages must be actually dispatched on the Isolate
+      // thread. So, we run the dispatch loop on the Isolate thread.
+      auto dispatchLoopPromise = isolateThreadExecutor->executeAsync([this]() {


nit: I would move the executeAsync into the dispatchLoop method, it's a bit clearer for future readers imho.
otherwise lgtm

I felt the same intuition, but wasn't confident the result was actually more readable, given the infamous footgunnery of lambda coroutines. The main choices that I see are:

Rename dispatchLoop() to dispatchLoopImpl() and make a new dispatchLoop() whose body is a call to dispatchLoopImpl() inside a lambda passed to executeAsync().

Or, move dispatchLoop()'s body into a constexpr lambda which accepts incomingQueueNotifier and channel as parameters. (Alternatively, it could accept a WebSocketIoHandler& self parameter, and self. qualify the member accesses.)

Or, move dispatchLoop()'s body into a non-constexpr lambda which captures this, and wrap the lambda in kj::coCapture().

Or, convert dispatchLoop() to a recursive .then() function, which I think would also require a separate dispatchLoopImpl(), since we only want to call executeAsync() a single time.

Or, the PR as written.

Not sure what you mean, seeing as how this (which is just a ptr) should outlive the promise returned by executeAsync there should be no difference between the current implementation and

// Must be called on the Isolate thread. kj::Promise<void> dispatchLoop() { return isolateThreadExecutor->executeAsync([this]() { for (;;) { co_await incomingQueueNotifier->awaitNotification(); KJ_IF_SOME(c, channel) { co_await c.dispatchProtocolMessages(this->incomingQueue); } } }); }

The whole coro lambda fiasco is about local stack owned variables captured in lambdas, this isn't such a variable.

Ah, you are right that that alternative actually is safe, because executeAsync() saves the lambda for the duration of the promise it returns, which is a guarantee I forgot it provides: https://github.com/capnproto/capnproto/blob/6cb55ac7d523c24047e6bbef0679027561f3c0d4/c%2B%2B/src/kj/async.h#L1040-L1041

To make sure we're on the same page, though: my understanding is that this being a pointer does not make this inherently safe. That is, my understanding is that it's the lifetime of the captured pointers/references (i.e., the lifetime of the lambda), not the lifetime of the pointed-to objects (e.g. *this), that matters.

My fear is that someone might see a this-capturing lambda coroutine and assume it is a generally safe pattern to use, and pass such a thing to, say, Promise<T>::then(), which does not preserve the lambda's lifetime. Because of this, I'd much prefer a blanket ban against non-constexpr, non-coCapture()ed coroutine lambdas, even if some usages like this one can be assumed to be safe.

irvinebroque · 2024-08-08T13:34:57Z

👏

harrishancock · 2024-08-08T13:57:20Z

Added needs-internal-pr label until I figure out how to reconcile this with our internal codebase, which has no concept of an "Isolate thread". We might need multiple attachInspector() implementations. :/

When breakpoint debugging was introduced, it added a new thread, the InspectorService thread. This thread is necessary in order to continue performing WebSocket I/O while the Isolate is paused at a breakpoint. The feature _also_ moved inspector message dispatching onto this same InspectorService thread. For most inspector messages, this turns out to be benign, but not for CPU profiling start/stop messages: those must be dispatched on the thread which we desire to profile. The end result was that breakpoint debugging caused us to profile JavaScript on the InspectorService thread (e.g., anything executed on the Chrome Devtools console) instead of the Isolate thread, as originally intended. This turns out to be relatively easy to fix, because breakpoint debugging would already dispatch inspector messages on the Isolate thread in a specific case: when the Isolate was paused at a breakpoint, meaning there already exists some machinery to pass inspector messages across threads. This commit simply makes it so that all messages are passed to the Isolate thread for dispatching, instead of only when paused at a breakpoint. We (obviously) do not have any tests for CPU profiling, or else we would have caught this issue much earlier. This commit does not add any such tests. Since we'd like to fix this issue sooner than later, my intent is to land this commit first and add tests later -- we will not close related tickets until such tests are implemented. I have smoke tested both profiling and breakpoint debugging; both work with the change in this commit.

harrishancock · 2024-08-09T00:37:34Z

We might need multiple attachInspector() implementations. :/

Obviously in retrospect, we already do. I modified the implementation which our internal codebase uses so that no internal change is required anymore (this change is effectively a no-op from its perspective), smoke tested profiling and breakpoint debugging once more on both codebases, and removed the needs-internal-pr label.

Once @danlapid / @mar-cf give me a thumbs up on the latest diff, let's merge.

danlapid · 2024-08-09T14:03:50Z

We might need multiple attachInspector() implementations. :/

Obviously in retrospect, we already do. I modified the implementation which our internal codebase uses so that no internal change is required anymore (this change is effectively a no-op from its perspective), smoke tested profiling and breakpoint debugging once more on both codebases, and removed the needs-internal-pr label.

Once @danlapid / @mar-cf give me a thumbs up on the latest diff, let's merge.

LGTM

irvinebroque · 2024-08-09T14:08:29Z

If we ship this and cut a workerd release, then can get the ball rolling on testing out in Wrangler/Miniflare, both with the current fork of Chrome Devtools and an updated one.

mhart · 2024-08-16T22:33:01Z

Is this in the latest release of workerd? I tried CPU profiling with whichever workerd is distributed with wrangler 3.72.0 and it froze on me when I tried to take a profile. Both aspects of the chrome tools UI, as well as the workerd server itself stopped responding to requests. I couldn't find a way of successfully taking a profile.

irvinebroque · 2024-08-17T00:34:35Z

I don't think it made it to wrangler 3.72.0

cloudflare/workers-sdk#6490

https://github.com/cloudflare/workerd/releases

mhart · 2024-08-17T00:55:40Z

Ah, thanks @irvinebroque – I wonder if we should add the workerd version to wrangler --version output?

@mrbbot

This commit adds a regression test for the CPU profiling bug reported in #1754 and fixed in #2497. It is based on @mrbbot's original reproduction. I backported the test to the commit just prior to #2497, and confirmed that the test caught the original breakage. I wrote the test in JavaScript, because it seemed to have the richest ecosystem of tools for working with the Chrome Devtool Protocol. I originally intended to use Playwright to initiate a CDP connection to workerd, but it seemed to make too many assumptions that it was connecting to a browser. I tried the [chrome-remote-interface](https://github.com/cyrus-and/chrome-remote-interface) library next, and it seemed to work well. Fixes #1754.

@mrbbot

This commit adds a regression test for the CPU profiling bug reported in #1754 and fixed in #2497. It is based on @mrbbot's original reproduction. I backported the test to the commit just prior to #2497, and confirmed that the test caught the original breakage. I wrote the test in JavaScript, because it seemed to have the richest ecosystem of tools for working with the Chrome Devtool Protocol. I originally intended to use Playwright to initiate a CDP connection to workerd, but it seemed to make too many assumptions that it was connecting to a browser. I tried the [chrome-remote-interface](https://github.com/cyrus-and/chrome-remote-interface) library next, and it seemed to work well. Fixes #1754.

@mrbbot

This commit adds a regression test for the CPU profiling bug reported in #1754 and fixed in #2497. It is based on @mrbbot's original reproduction. I backported the test to the commit just prior to #2497, and confirmed that the test caught the original breakage. I wrote the test in JavaScript, because it seemed to have the richest ecosystem of tools for working with the Chrome Devtool Protocol. I originally intended to use Playwright to initiate a CDP connection to workerd, but it seemed to make too many assumptions that it was connecting to a browser. I tried the [chrome-remote-interface](https://github.com/cyrus-and/chrome-remote-interface) library next, and it seemed to work well. Fixes #1754.

@mrbbot

This commit adds a regression test for the CPU profiling bug reported in #1754 and fixed in #2497. It is based on @mrbbot's original reproduction. I backported the test to the commit just prior to #2497, and confirmed that the test caught the original breakage. I wrote the test in JavaScript, because it seemed to have the richest ecosystem of tools for working with the Chrome Devtool Protocol. I originally intended to use Playwright to initiate a CDP connection to workerd, but it seemed to make too many assumptions that it was connecting to a browser. I tried the [chrome-remote-interface](https://github.com/cyrus-and/chrome-remote-interface) library next, and it seemed to work well. Fixes #1754.

harrishancock requested review from a team as code owners August 8, 2024 09:57

harrishancock requested review from danlapid and mar-cf August 8, 2024 09:57

harrishancock mentioned this pull request Aug 8, 2024

Add V8 cross-thread profiling patches #2227

Closed

danlapid approved these changes Aug 8, 2024

View reviewed changes

mar-cf approved these changes Aug 8, 2024

View reviewed changes

harrishancock added the needs-internal-pr label Aug 8, 2024

harrishancock force-pushed the harris/EW-8447-fix-cpu-profiling branch from 4fdcaae to 19c8ed5 Compare August 9, 2024 00:35

harrishancock removed the needs-internal-pr label Aug 9, 2024

harrishancock merged commit 7a101e4 into main Aug 9, 2024

harrishancock deleted the harris/EW-8447-fix-cpu-profiling branch August 9, 2024 14:28

harrishancock mentioned this pull request Aug 20, 2024

EW-8447 Add regression test for CPU profiling #2558

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

EW-8447 Fix CPU profiling#2497

EW-8447 Fix CPU profiling#2497
harrishancock merged 1 commit intomainfrom
harris/EW-8447-fix-cpu-profiling

harrishancock commented Aug 8, 2024

Uh oh!

danlapid Aug 8, 2024

Uh oh!

harrishancock Aug 8, 2024

Uh oh!

danlapid Aug 8, 2024 •

edited

Loading

Uh oh!

harrishancock Aug 8, 2024

Uh oh!

irvinebroque commented Aug 8, 2024

Uh oh!

harrishancock commented Aug 8, 2024

Uh oh!

harrishancock commented Aug 9, 2024 •

edited

Loading

Uh oh!

danlapid commented Aug 9, 2024

Uh oh!

irvinebroque commented Aug 9, 2024

Uh oh!

mhart commented Aug 16, 2024

Uh oh!

irvinebroque commented Aug 17, 2024 •

edited

Loading

Uh oh!

mhart commented Aug 17, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

harrishancock commented Aug 8, 2024

Uh oh!

danlapid Aug 8, 2024

Choose a reason for hiding this comment

Uh oh!

harrishancock Aug 8, 2024

Choose a reason for hiding this comment

Uh oh!

danlapid Aug 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

harrishancock Aug 8, 2024

Choose a reason for hiding this comment

Uh oh!

irvinebroque commented Aug 8, 2024

Uh oh!

harrishancock commented Aug 8, 2024

Uh oh!

harrishancock commented Aug 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

danlapid commented Aug 9, 2024

Uh oh!

irvinebroque commented Aug 9, 2024

Uh oh!

mhart commented Aug 16, 2024

Uh oh!

irvinebroque commented Aug 17, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mhart commented Aug 17, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

danlapid Aug 8, 2024 •

edited

Loading

harrishancock commented Aug 9, 2024 •

edited

Loading

irvinebroque commented Aug 17, 2024 •

edited

Loading