Skip to content

EW-9366 Report fetch response event earlier#4909

Closed
fhanau wants to merge 1 commit intomainfrom
felix/stw-time-v3
Closed

EW-9366 Report fetch response event earlier#4909
fhanau wants to merge 1 commit intomainfrom
felix/stw-time-v3

Conversation

@fhanau
Copy link
Copy Markdown
Contributor

@fhanau fhanau commented Aug 26, 2025

This should be reported once the return code is available, not in the destructor.

@fhanau fhanau requested review from a team as code owners August 26, 2025 21:04
@fhanau
Copy link
Copy Markdown
Contributor Author

fhanau commented Aug 26, 2025

@danlapid I think you meant something like this in our discussion... the event is reported earlier here, but I'm not sure if we'll always have the IoContext available (needed for an accurate timestamp)?

@github-actions
Copy link
Copy Markdown

github-actions bot commented Aug 26, 2025

The generated output of @cloudflare/workers-types matches the snapshot in types/generated-snapshot 🎉

@danlapid
Copy link
Copy Markdown
Collaborator

While this is nice, we need to report the entire tracing::return event over there and on all other event types as well!

@fhanau
Copy link
Copy Markdown
Contributor Author

fhanau commented Aug 26, 2025

While this is nice, we need to report the entire tracing::return event over there and on all other event types as well!

Please note that tracing::Return is exclusively used to return tracing::FetchResponseInfo so far, which only exists for fetch events. This is the only code path where fetchStatus is set.
As noted, I think that getting the timestamp here right is difficult, did you have anything else in mind with "entire tracing::return event"? If we have a proper timestamp, we can change WorkerTracer::setFetchResponseInfo() to send the return event right away again, but can we be sure that there is an IOContext at this time?

@fhanau fhanau force-pushed the felix/stw-time-v3 branch from 74352a1 to b684d8c Compare August 26, 2025 21:49
@danlapid
Copy link
Copy Markdown
Collaborator

While this is nice, we need to report the entire tracing::return event over there and on all other event types as well!

Please note that tracing::Return is exclusively used to return tracing::FetchResponseInfo so far, which only exists for fetch events. This is the only code path where fetchStatus is set. As noted, I think that getting the timestamp here right is difficult, did you have anything else in mind with "entire tracing::return event"? If we have a proper timestamp, we can change WorkerTracer::setFetchResponseInfo() to send the return event right away again, but can we be sure that there is an IOContext at this time?

The return event should be available on all handlers! Even those where FetchResponseInfo has no info.
As an event it indicates when the return happened which can be before the invocation is done!
And yes, we should have an IoContext here. Do you have a test failing that assertion?

@fhanau fhanau force-pushed the felix/stw-time-v3 branch from b684d8c to 3613f6f Compare August 28, 2025 17:29
@fhanau
Copy link
Copy Markdown
Contributor Author

fhanau commented Aug 28, 2025

While this is nice, we need to report the entire tracing::return event over there and on all other event types as well!

Please note that tracing::Return is exclusively used to return tracing::FetchResponseInfo so far, which only exists for fetch events. This is the only code path where fetchStatus is set. As noted, I think that getting the timestamp here right is difficult, did you have anything else in mind with "entire tracing::return event"? If we have a proper timestamp, we can change WorkerTracer::setFetchResponseInfo() to send the return event right away again, but can we be sure that there is an IOContext at this time?

The return event should be available on all handlers! Even those where FetchResponseInfo has no info. As an event it indicates when the return happened which can be before the invocation is done! And yes, we should have an IoContext here. Do you have a test failing that assertion?

Updated the PR (mostly to capture the intent here, there's some cleanup possible and maybe return reporting isn't always in the right spot). Is this what you had in mind?

Copy link
Copy Markdown
Collaborator

@danlapid danlapid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fhanau fhanau force-pushed the felix/stw-time-v3 branch 2 times, most recently from 2b1a262 to 7269f57 Compare August 29, 2025 16:19
'{"type":"onset","executionModel":"stateless","scriptTags":[],"info":{"type":"fetch","method":"POST","url":"http://placeholder/body-length","headers":[]}}{"type":"return","info":{"type":"fetch","statusCode":200}}{"type":"outcome","outcome":"ok","cpuTime":0,"wallTime":0}',
'{"type":"onset","executionModel":"stateless","scriptTags":[],"info":{"type":"scheduled","scheduledTime":"1970-01-01T00:00:00.000Z","cron":""}}{"type":"outcome","outcome":"ok","cpuTime":0,"wallTime":0}',
'{"type":"onset","executionModel":"stateless","scriptTags":[],"info":{"type":"scheduled","scheduledTime":"1970-01-01T00:00:00.000Z","cron":"* * * * 30"}}{"type":"outcome","outcome":"ok","cpuTime":0,"wallTime":0}',
'{"type":"onset","executionModel":"stateless","scriptTags":[],"info":{"type":"scheduled","scheduledTime":"1970-01-01T00:00:00.000Z","cron":""}}{"type":"return"}{"type":"outcome","outcome":"ok","cpuTime":0,"wallTime":0}',
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a thought while I'm thinking about (not blocking for this PR)... one of the reasons we have a difference between "return" events and "outcome" events is because of waitUntil tasks... however, we don't currently instrument the fact that waitUntil has tasks at all. I wonder if we should wrap waitUntil draining in a span?

@fhanau
Copy link
Copy Markdown
Contributor Author

fhanau commented Sep 2, 2025

@danlapid PR has been updated after moving return event reporting into WorkerEntrypoint – tests are passing, some caveats remain:

  • prewarm() and connect() don't look amenable to adding a return event here (prewarm does nothing, connect doesn't await the waitUntilTasks where the call is placed). I think this is fine, tracing was never supported for these event types as we do not create an onset event for these types.
  • We're not getting return events in some cases. This happens for tests (doesn't affect prod, we don't need to support them) and when there is an exception (which is arguably the right thing since we don't really return properly), but also for the one queue event in tail-worker-test which I don't fully understand.

@danlapid danlapid self-requested a review September 2, 2025 18:46
@danlapid
Copy link
Copy Markdown
Collaborator

danlapid commented Sep 2, 2025

@danlapid PR has been updated after moving return event reporting into WorkerEntrypoint – tests are passing, some caveats remain:

  • prewarm() and connect() don't look amenable to adding a return event here (prewarm does nothing, connect doesn't await the waitUntilTasks where the call is placed). I think this is fine, tracing was never supported for these event types as we do not create an onset event for these types.
  • We're not getting return events in some cases. This happens for tests (doesn't affect prod, we don't need to support them) and when there is an exception (which is arguably the right thing since we don't really return properly), but also for the one queue event in tail-worker-test which I don't fully understand.

prewarm events should never be traced.
Connect events are currently just a passthrough and we don't enable the connect handler yet. Please add a comment such as

// Whenever we implement incoming connections over the `connect` handler we need to remember to add a tracing `return` event with code such as:
// ....

@fhanau fhanau requested a review from mar-cf September 2, 2025 19:18
// but I'd like to avoid that in the non-test case.
return maybeAddGcPassForTest(context, kj::mv(promise));
auto result = co_await maybeAddGcPassForTest(context, kj::mv(promise));
KJ_IF_SOME(t, context.getWorkerTracer()) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somehow, downstream tests segfault in the context.getWorkerTracer() call when run on CI, but pass when running locally

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll have to figure it out and get this landed

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Segfaults have been resolved by using KJ_DEFER here so that the IoContext has not been destructed yet by the time we try to report – I'm not sure if that approach is sound (could it report the return event too early/when there is an exception when we don't want it/still segfault based on a race condition), but that's the only remaining open question for this PR.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The defer works because this function is returning a promise so event->run didn't even start running when you are doing the setReturn.
This does not actually count as working, the return event will fire permaturely.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's one of the concerns I had, but I kinda doubt that's actually the case since it would break our tests – if the return event was sent before we start running the customEvent, it would get sent before the Onset event and just cause tests to fail (or if not that, be at the wrong position in the trace), so the tests suggest that this is being sent at around the right time (not conclusive evidence of course – happy to consider alternatives)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well it would run right after the first async operation in event->run which is after the onset event was submitted.

Copy link
Copy Markdown
Contributor Author

@fhanau fhanau Sep 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mar-cf the other issue I mentioned.
I can see two workarounds here:

  • Report the return event after co_awaiting the customEvent, for all events except for queue (which is the one event that caused UaF based on the IoContext no longer being available, for all other event types it worked fine in tests based on how draining works for them). Adding reporting within the queue event should work, but this is still kinda ugly, we'll need a way to tell that we're not in a queue event.
  • Report the return event separately for each customEvent, within that event. Will require much more maintenance effort, easy to forget when we add a new event type.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to report return event for each customEvent type individually. trace() is not supported properly, but we return there immediately anyway and fixing that is out of scope for this PR.

TRACE_EVENT("workerd", "WorkerEntrypoint::request() deferred proxy step",
PERFETTO_FLOW_FROM_POINTER(this));
proxyTask = kj::mv(deferredProxy.proxyTask);
KJ_IF_SOME(t, context.getWorkerTracer()) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more wrinkle: On a downstream test, this ends up failing the KJ_REQUIRE(KJ_REQUIRE_NONNULL(trace->eventInfo).is<tracing::FetchEventInfo>()); check above, we somehow end up reporting this return event to an alarm event... I guess context ends up pointing to a different worker tracer?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried many things to solve this, but can't quite figure out what the right place/time is to report this

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Saving a reference to the tracer beforehand seems to work, but there's gotta be a better way.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach seems to be acceptable based on internal discussion.

…ent types

This should be reported once the return code is available, not in the
destructor.
@fhanau
Copy link
Copy Markdown
Contributor Author

fhanau commented Sep 19, 2025

Superseded by #5093 #5068

@fhanau fhanau closed this Sep 19, 2025
@fhanau fhanau deleted the felix/stw-time-v3 branch September 20, 2025 21:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants