EW-9366 Report fetch response event earlier by fhanau · Pull Request #4909 · cloudflare/workerd

fhanau · 2025-08-26T21:03:59Z

This should be reported once the return code is available, not in the destructor.

fhanau · 2025-08-26T21:05:09Z

@danlapid I think you meant something like this in our discussion... the event is reported earlier here, but I'm not sure if we'll always have the IoContext available (needed for an accurate timestamp)?

github-actions · 2025-08-26T21:06:20Z

The generated output of @cloudflare/workers-types matches the snapshot in types/generated-snapshot 🎉

danlapid · 2025-08-26T21:19:04Z

While this is nice, we need to report the entire tracing::return event over there and on all other event types as well!

fhanau · 2025-08-26T21:45:09Z

While this is nice, we need to report the entire tracing::return event over there and on all other event types as well!

Please note that tracing::Return is exclusively used to return tracing::FetchResponseInfo so far, which only exists for fetch events. This is the only code path where fetchStatus is set.
As noted, I think that getting the timestamp here right is difficult, did you have anything else in mind with "entire tracing::return event"? If we have a proper timestamp, we can change WorkerTracer::setFetchResponseInfo() to send the return event right away again, but can we be sure that there is an IOContext at this time?

danlapid · 2025-08-26T22:20:31Z

While this is nice, we need to report the entire tracing::return event over there and on all other event types as well!

Please note that tracing::Return is exclusively used to return tracing::FetchResponseInfo so far, which only exists for fetch events. This is the only code path where fetchStatus is set. As noted, I think that getting the timestamp here right is difficult, did you have anything else in mind with "entire tracing::return event"? If we have a proper timestamp, we can change WorkerTracer::setFetchResponseInfo() to send the return event right away again, but can we be sure that there is an IOContext at this time?

The return event should be available on all handlers! Even those where FetchResponseInfo has no info.
As an event it indicates when the return happened which can be before the invocation is done!
And yes, we should have an IoContext here. Do you have a test failing that assertion?

fhanau · 2025-08-28T17:31:02Z

While this is nice, we need to report the entire tracing::return event over there and on all other event types as well!

Please note that tracing::Return is exclusively used to return tracing::FetchResponseInfo so far, which only exists for fetch events. This is the only code path where fetchStatus is set. As noted, I think that getting the timestamp here right is difficult, did you have anything else in mind with "entire tracing::return event"? If we have a proper timestamp, we can change WorkerTracer::setFetchResponseInfo() to send the return event right away again, but can we be sure that there is an IOContext at this time?

The return event should be available on all handlers! Even those where FetchResponseInfo has no info. As an event it indicates when the return happened which can be before the invocation is done! And yes, we should have an IoContext here. Do you have a test failing that assertion?

Updated the PR (mostly to capture the intent here, there's some cleanup possible and maybe return reporting isn't always in the right spot). Is this what you had in mind?

src/workerd/io/tracer.c++

src/workerd/io/tracer.h

src/workerd/api/http.c++

src/workerd/io/tracer.c++

danlapid

LGTM

src/workerd/server/server.c++

jasnell · 2025-08-29T18:54:20Z

src/workerd/api/tail-worker-test.js

      '{"type":"onset","executionModel":"stateless","scriptTags":[],"info":{"type":"fetch","method":"POST","url":"http://placeholder/body-length","headers":[]}}{"type":"return","info":{"type":"fetch","statusCode":200}}{"type":"outcome","outcome":"ok","cpuTime":0,"wallTime":0}',
-      '{"type":"onset","executionModel":"stateless","scriptTags":[],"info":{"type":"scheduled","scheduledTime":"1970-01-01T00:00:00.000Z","cron":""}}{"type":"outcome","outcome":"ok","cpuTime":0,"wallTime":0}',
-      '{"type":"onset","executionModel":"stateless","scriptTags":[],"info":{"type":"scheduled","scheduledTime":"1970-01-01T00:00:00.000Z","cron":"* * * * 30"}}{"type":"outcome","outcome":"ok","cpuTime":0,"wallTime":0}',
+      '{"type":"onset","executionModel":"stateless","scriptTags":[],"info":{"type":"scheduled","scheduledTime":"1970-01-01T00:00:00.000Z","cron":""}}{"type":"return"}{"type":"outcome","outcome":"ok","cpuTime":0,"wallTime":0}',


Just a thought while I'm thinking about (not blocking for this PR)... one of the reasons we have a difference between "return" events and "outcome" events is because of waitUntil tasks... however, we don't currently instrument the fact that waitUntil has tasks at all. I wonder if we should wrap waitUntil draining in a span?

src/workerd/io/worker-entrypoint.c++

src/workerd/io/tracer.h

fhanau · 2025-09-02T18:15:17Z

@danlapid PR has been updated after moving return event reporting into WorkerEntrypoint – tests are passing, some caveats remain:

prewarm() and connect() don't look amenable to adding a return event here (prewarm does nothing, connect doesn't await the waitUntilTasks where the call is placed). I think this is fine, tracing was never supported for these event types as we do not create an onset event for these types.
We're not getting return events in some cases. This happens for tests (doesn't affect prod, we don't need to support them) and when there is an exception (which is arguably the right thing since we don't really return properly), but also for the one queue event in tail-worker-test which I don't fully understand.

danlapid · 2025-09-02T18:55:53Z

@danlapid PR has been updated after moving return event reporting into WorkerEntrypoint – tests are passing, some caveats remain:

prewarm() and connect() don't look amenable to adding a return event here (prewarm does nothing, connect doesn't await the waitUntilTasks where the call is placed). I think this is fine, tracing was never supported for these event types as we do not create an onset event for these types.

We're not getting return events in some cases. This happens for tests (doesn't affect prod, we don't need to support them) and when there is an exception (which is arguably the right thing since we don't really return properly), but also for the one queue event in tail-worker-test which I don't fully understand.

prewarm events should never be traced.
Connect events are currently just a passthrough and we don't enable the connect handler yet. Please add a comment such as

// Whenever we implement incoming connections over the `connect` handler we need to remember to add a tracing `return` event with code such as:
// ....

fhanau · 2025-09-02T19:54:59Z

src/workerd/io/worker-entrypoint.c++

  //   but I'd like to avoid that in the non-test case.
-  return maybeAddGcPassForTest(context, kj::mv(promise));
+  auto result = co_await maybeAddGcPassForTest(context, kj::mv(promise));
+  KJ_IF_SOME(t, context.getWorkerTracer()) {


Somehow, downstream tests segfault in the context.getWorkerTracer() call when run on CI, but pass when running locally

You'll have to figure it out and get this landed

Segfaults have been resolved by using KJ_DEFER here so that the IoContext has not been destructed yet by the time we try to report – I'm not sure if that approach is sound (could it report the return event too early/when there is an exception when we don't want it/still segfault based on a race condition), but that's the only remaining open question for this PR.

The defer works because this function is returning a promise so event->run didn't even start running when you are doing the setReturn.
This does not actually count as working, the return event will fire permaturely.

Yeah, that's one of the concerns I had, but I kinda doubt that's actually the case since it would break our tests – if the return event was sent before we start running the customEvent, it would get sent before the Onset event and just cause tests to fail (or if not that, be at the wrong position in the trace), so the tests suggest that this is being sent at around the right time (not conclusive evidence of course – happy to consider alternatives)

Well it would run right after the first async operation in event->run which is after the onset event was submitted.

@mar-cf the other issue I mentioned.
I can see two workarounds here:

Report the return event after co_awaiting the customEvent, for all events except for queue (which is the one event that caused UaF based on the IoContext no longer being available, for all other event types it worked fine in tests based on how draining works for them). Adding reporting within the queue event should work, but this is still kinda ugly, we'll need a way to tell that we're not in a queue event.

Report the return event separately for each customEvent, within that event. Will require much more maintenance effort, easy to forget when we add a new event type.

Updated to report return event for each customEvent type individually. trace() is not supported properly, but we return there immediately anyway and fixing that is out of scope for this PR.

fhanau · 2025-09-09T20:00:08Z

src/workerd/io/worker-entrypoint.c++

    TRACE_EVENT("workerd", "WorkerEntrypoint::request() deferred proxy step",
        PERFETTO_FLOW_FROM_POINTER(this));
    proxyTask = kj::mv(deferredProxy.proxyTask);
+    KJ_IF_SOME(t, context.getWorkerTracer()) {


One more wrinkle: On a downstream test, this ends up failing the KJ_REQUIRE(KJ_REQUIRE_NONNULL(trace->eventInfo).is<tracing::FetchEventInfo>()); check above, we somehow end up reporting this return event to an alarm event... I guess context ends up pointing to a different worker tracer?

Tried many things to solve this, but can't quite figure out what the right place/time is to report this

Saving a reference to the tracer beforehand seems to work, but there's gotta be a better way.

This approach seems to be acceptable based on internal discussion.

…ent types This should be reported once the return code is available, not in the destructor.

fhanau · 2025-09-19T14:59:11Z

Superseded by #5093 #5068

fhanau requested review from a team as code owners August 26, 2025 21:04

fhanau mentioned this pull request Aug 26, 2025

EW-9366 [o11y] Implement proper timestamp support for STW #4767

Merged

fhanau force-pushed the felix/stw-time-v3 branch from 74352a1 to b684d8c Compare August 26, 2025 21:49

fhanau force-pushed the felix/stw-time-v3 branch from b684d8c to 3613f6f Compare August 28, 2025 17:29