Skip to content

EW-8447 Fix CPU profiling#2497

Merged
harrishancock merged 1 commit intomainfrom
harris/EW-8447-fix-cpu-profiling
Aug 9, 2024
Merged

EW-8447 Fix CPU profiling#2497
harrishancock merged 1 commit intomainfrom
harris/EW-8447-fix-cpu-profiling

Conversation

@harrishancock
Copy link
Copy Markdown
Collaborator

When breakpoint debugging was introduced, it added a new thread, the InspectorService thread. This thread is necessary in order to continue performing WebSocket I/O while the Isolate is paused at a breakpoint. The feature also moved inspector message dispatching onto this same InspectorService thread. For most inspector messages, this turns out to be benign, but not for CPU profiling start/stop messages: those must be dispatched on the thread which we desire to profile. The end result was that breakpoint debugging caused us to profile JavaScript on the InspectorService thread (e.g., anything executed on the Chrome Devtools console) instead of the Isolate thread, as originally intended.

This turns out to be relatively easy to fix, because breakpoint debugging would already dispatch inspector messages on the Isolate thread in a specific case: when the Isolate was paused at a breakpoint, meaning there are already exists some machinery to pass inspector messages across threads. This commit simply makes it so that all messages are passed to the Isolate thread for dispatching, instead of only when paused at a breakpoint.

We (obviously) do not have any tests for CPU profiling, or else we would have caught this issue much earlier. This commit does not add any such tests. Since we'd like to fix this issue sooner than later, my intent is to land this commit first and add tests later -- we will not close related tickets until such tests are implemented.

I have smoke tested both profiling and breakpoint debugging; both work with the change in this commit.

// Although inspector I/O must happen on the InspectorService thread (to make sure breakpoints
// don't block inspector I/O), inspector messages must be actually dispatched on the Isolate
// thread. So, we run the dispatch loop on the Isolate thread.
auto dispatchLoopPromise = isolateThreadExecutor->executeAsync([this]() {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would move the executeAsync into the dispatchLoop method, it's a bit clearer for future readers imho.
otherwise lgtm

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I felt the same intuition, but wasn't confident the result was actually more readable, given the infamous footgunnery of lambda coroutines. The main choices that I see are:

  1. Rename dispatchLoop() to dispatchLoopImpl() and make a new dispatchLoop() whose body is a call to dispatchLoopImpl() inside a lambda passed to executeAsync().
  2. Or, move dispatchLoop()'s body into a constexpr lambda which accepts incomingQueueNotifier and channel as parameters. (Alternatively, it could accept a WebSocketIoHandler& self parameter, and self. qualify the member accesses.)
  3. Or, move dispatchLoop()'s body into a non-constexpr lambda which captures this, and wrap the lambda in kj::coCapture().
  4. Or, convert dispatchLoop() to a recursive .then() function, which I think would also require a separate dispatchLoopImpl(), since we only want to call executeAsync() a single time.
  5. Or, the PR as written.

Copy link
Copy Markdown
Collaborator

@danlapid danlapid Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean, seeing as how this (which is just a ptr) should outlive the promise returned by executeAsync there should be no difference between the current implementation and

// Must be called on the Isolate thread.
kj::Promise<void> dispatchLoop() {
  return isolateThreadExecutor->executeAsync([this]() {
    for (;;) {
      co_await incomingQueueNotifier->awaitNotification();
      KJ_IF_SOME(c, channel) {
        co_await c.dispatchProtocolMessages(this->incomingQueue);
      }
    }
  });
}

The whole coro lambda fiasco is about local stack owned variables captured in lambdas, this isn't such a variable.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you are right that that alternative actually is safe, because executeAsync() saves the lambda for the duration of the promise it returns, which is a guarantee I forgot it provides: https://github.com/capnproto/capnproto/blob/6cb55ac7d523c24047e6bbef0679027561f3c0d4/c%2B%2B/src/kj/async.h#L1040-L1041

To make sure we're on the same page, though: my understanding is that this being a pointer does not make this inherently safe. That is, my understanding is that it's the lifetime of the captured pointers/references (i.e., the lifetime of the lambda), not the lifetime of the pointed-to objects (e.g. *this), that matters.

My fear is that someone might see a this-capturing lambda coroutine and assume it is a generally safe pattern to use, and pass such a thing to, say, Promise<T>::then(), which does not preserve the lambda's lifetime. Because of this, I'd much prefer a blanket ban against non-constexpr, non-coCapture()ed coroutine lambdas, even if some usages like this one can be assumed to be safe.

@irvinebroque
Copy link
Copy Markdown
Collaborator

👏

@harrishancock
Copy link
Copy Markdown
Collaborator Author

Added needs-internal-pr label until I figure out how to reconcile this with our internal codebase, which has no concept of an "Isolate thread". We might need multiple attachInspector() implementations. :/

When breakpoint debugging was introduced, it added a new thread, the InspectorService thread. This thread is necessary in order to continue performing WebSocket I/O while the Isolate is paused at a breakpoint. The feature _also_ moved inspector message dispatching onto this same InspectorService thread. For most inspector messages, this turns out to be benign, but not for CPU profiling start/stop messages: those must be dispatched on the thread which we desire to profile. The end result was that breakpoint debugging caused us to profile JavaScript on the InspectorService thread (e.g., anything executed on the Chrome Devtools console) instead of the Isolate thread, as originally intended.

This turns out to be relatively easy to fix, because breakpoint debugging would already dispatch inspector messages on the Isolate thread in a specific case: when the Isolate was paused at a breakpoint, meaning there already exists some machinery to pass inspector messages across threads. This commit simply makes it so that all messages are passed to the Isolate thread for dispatching, instead of only when paused at a breakpoint.

We (obviously) do not have any tests for CPU profiling, or else we would have caught this issue much earlier. This commit does not add any such tests. Since we'd like to fix this issue sooner than later, my intent is to land this commit first and add tests later -- we will not close related tickets until such tests are implemented.

I have smoke tested both profiling and breakpoint debugging; both work with the change in this commit.
@harrishancock harrishancock force-pushed the harris/EW-8447-fix-cpu-profiling branch from 4fdcaae to 19c8ed5 Compare August 9, 2024 00:35
@harrishancock
Copy link
Copy Markdown
Collaborator Author

harrishancock commented Aug 9, 2024

We might need multiple attachInspector() implementations. :/

Obviously in retrospect, we already do. I modified the implementation which our internal codebase uses so that no internal change is required anymore (this change is effectively a no-op from its perspective), smoke tested profiling and breakpoint debugging once more on both codebases, and removed the needs-internal-pr label.

Once @danlapid / @mar-cf give me a thumbs up on the latest diff, let's merge.

@danlapid
Copy link
Copy Markdown
Collaborator

danlapid commented Aug 9, 2024

We might need multiple attachInspector() implementations. :/

Obviously in retrospect, we already do. I modified the implementation which our internal codebase uses so that no internal change is required anymore (this change is effectively a no-op from its perspective), smoke tested profiling and breakpoint debugging once more on both codebases, and removed the needs-internal-pr label.

Once @danlapid / @mar-cf give me a thumbs up on the latest diff, let's merge.

LGTM

@irvinebroque
Copy link
Copy Markdown
Collaborator

If we ship this and cut a workerd release, then can get the ball rolling on testing out in Wrangler/Miniflare, both with the current fork of Chrome Devtools and an updated one.

@harrishancock harrishancock merged commit 7a101e4 into main Aug 9, 2024
@harrishancock harrishancock deleted the harris/EW-8447-fix-cpu-profiling branch August 9, 2024 14:28
@mhart
Copy link
Copy Markdown
Contributor

mhart commented Aug 16, 2024

Is this in the latest release of workerd? I tried CPU profiling with whichever workerd is distributed with wrangler 3.72.0 and it froze on me when I tried to take a profile. Both aspects of the chrome tools UI, as well as the workerd server itself stopped responding to requests. I couldn't find a way of successfully taking a profile.

@irvinebroque
Copy link
Copy Markdown
Collaborator

irvinebroque commented Aug 17, 2024

I don't think it made it to wrangler 3.72.0

cloudflare/workers-sdk#6490

Screenshot 2024-08-16 at 5 34 07 PM

https://github.com/cloudflare/workerd/releases

@mhart
Copy link
Copy Markdown
Contributor

mhart commented Aug 17, 2024

Ah, thanks @irvinebroque – I wonder if we should add the workerd version to wrangler --version output?

harrishancock added a commit that referenced this pull request Aug 20, 2024
This commit adds a regression test for the CPU profiling bug reported in #1754 and fixed in #2497. It is based on @mrbbot's original reproduction.

I backported the test to the commit just prior to #2497, and confirmed that the test caught the original breakage.

I wrote the test in JavaScript, because it seemed to have the richest ecosystem of tools for working with the Chrome Devtool Protocol. I originally intended to use Playwright to initiate a CDP connection to workerd, but it seemed to make too many assumptions that it was connecting to a browser. I tried the [chrome-remote-interface](https://github.com/cyrus-and/chrome-remote-interface) library next, and it seemed to work well.

Fixes #1754.
harrishancock added a commit that referenced this pull request Aug 20, 2024
This commit adds a regression test for the CPU profiling bug reported in #1754 and fixed in #2497. It is based on @mrbbot's original reproduction.

I backported the test to the commit just prior to #2497, and confirmed that the test caught the original breakage.

I wrote the test in JavaScript, because it seemed to have the richest ecosystem of tools for working with the Chrome Devtool Protocol. I originally intended to use Playwright to initiate a CDP connection to workerd, but it seemed to make too many assumptions that it was connecting to a browser. I tried the [chrome-remote-interface](https://github.com/cyrus-and/chrome-remote-interface) library next, and it seemed to work well.

Fixes #1754.
harrishancock added a commit that referenced this pull request Aug 20, 2024
This commit adds a regression test for the CPU profiling bug reported in #1754 and fixed in #2497. It is based on @mrbbot's original reproduction.

I backported the test to the commit just prior to #2497, and confirmed that the test caught the original breakage.

I wrote the test in JavaScript, because it seemed to have the richest ecosystem of tools for working with the Chrome Devtool Protocol. I originally intended to use Playwright to initiate a CDP connection to workerd, but it seemed to make too many assumptions that it was connecting to a browser. I tried the [chrome-remote-interface](https://github.com/cyrus-and/chrome-remote-interface) library next, and it seemed to work well.

Fixes #1754.
harrishancock added a commit that referenced this pull request Aug 20, 2024
This commit adds a regression test for the CPU profiling bug reported in #1754 and fixed in #2497. It is based on @mrbbot's original reproduction.

I backported the test to the commit just prior to #2497, and confirmed that the test caught the original breakage.

I wrote the test in JavaScript, because it seemed to have the richest ecosystem of tools for working with the Chrome Devtool Protocol. I originally intended to use Playwright to initiate a CDP connection to workerd, but it seemed to make too many assumptions that it was connecting to a browser. I tried the [chrome-remote-interface](https://github.com/cyrus-and/chrome-remote-interface) library next, and it seemed to work well.

Fixes #1754.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants