Skip to content

[o11y] Defer STW outcome event reporting#4689

Merged
fhanau merged 1 commit intomainfrom
felix/080525-defer-outcome-reporting
Aug 13, 2025
Merged

[o11y] Defer STW outcome event reporting#4689
fhanau merged 1 commit intomainfrom
felix/080525-defer-outcome-reporting

Conversation

@fhanau
Copy link
Copy Markdown
Contributor

@fhanau fhanau commented Aug 5, 2025

This should address the 'reported tail stream event after stream close' warnings seen in testing.

@fhanau fhanau requested a review from mar-cf August 5, 2025 18:31
@fhanau fhanau requested review from a team as code owners August 5, 2025 18:31
@fhanau
Copy link
Copy Markdown
Contributor Author

fhanau commented Aug 6, 2025

@danlapid @jasnell any questions/concerns here? Based on Sentry reporting this should address the issue at hand here – fetch spans are sometimes being reported after the outcome event, likely when the span only wraps up after a worker invocation has returned (I tried to cause this in a test using ctx.waitUntil(), but couldn't actually reproduce it). I think this is best addressed by deferring outcome event reporting.

@danlapid
Copy link
Copy Markdown
Collaborator

danlapid commented Aug 6, 2025

Can't say I believe in or like this change.
I get that it seems to resolve this issue but this is just more guesswork.
As I said previously, tracing is an IO based thing.
If the WorkerTracer lives longer than the IoContext (onset happens before/as we construct the IoContext, outcome happens after/as we destruct the IoContext) then all of these problems should be gone for good.
I'd rather see this represented clearly in the IoContext class or in a class encompassing the IoContext.

@fhanau
Copy link
Copy Markdown
Contributor Author

fhanau commented Aug 7, 2025

Can't say I believe in or like this change. I get that it seems to resolve this issue but this is just more guesswork. As I said previously, tracing is an IO based thing. If the WorkerTracer lives longer than the IoContext (onset happens before/as we construct the IoContext, outcome happens after/as we destruct the IoContext) then all of these problems should be gone for good. I'd rather see this represented clearly in the IoContext class or in a class encompassing the IoContext.

Kindly note that the WorkerTracer already outlives the IoContext (or rather all events that can be reported from it) since IoContext_IncomingRequest owns WorkerTracer. (Otherwise we would not be able to print the warning in question here, which is reported within WorkerTracer).
An elegant way to avoid this issue would be to delay when the outcome event is reported to WorkerTracer – this currently happens in the destructor of a RequestObserver (RequestObserverWithTracer for workerd and its internal equivalent downstream). If we were to move this reporting to another class or change when this class gets deallocated that could work, but I believe that either would be a significant lift.

@danlapid
Copy link
Copy Markdown
Collaborator

danlapid commented Aug 8, 2025

Can't say I believe in or like this change. I get that it seems to resolve this issue but this is just more guesswork. As I said previously, tracing is an IO based thing. If the WorkerTracer lives longer than the IoContext (onset happens before/as we construct the IoContext, outcome happens after/as we destruct the IoContext) then all of these problems should be gone for good. I'd rather see this represented clearly in the IoContext class or in a class encompassing the IoContext.

Kindly note that the WorkerTracer already outlives the IoContext (or rather all events that can be reported from it) since IoContext_IncomingRequest owns WorkerTracer. (Otherwise we would not be able to print the warning in question here, which is reported within WorkerTracer).

An elegant way to avoid this issue would be to delay when the outcome event is reported to WorkerTracer – this currently happens in the destructor of a RequestObserver (RequestObserverWithTracer for workerd and its internal equivalent downstream). If we were to move this reporting to another class or change when this class gets deallocated that could work, but I believe that either would be a significant lift.

Sounds like changing it so the IoContext destructor tells the WorkerTracer to report the outcome event would be simple then.
Since it already is outlived by it.

Let's go with that.

@fhanau
Copy link
Copy Markdown
Contributor Author

fhanau commented Aug 11, 2025

Sounds like changing it so the IoContext destructor tells the WorkerTracer to report the outcome event would be simple then. Since it already is outlived by it.

Let's go with that.

Based on local testing, I believe this can't work as intended: In the common case, the IoContext/IoContext_IncomingRequest shuts down before metrics get reported (at least in workerd). The WorkerTracer is still being kept alive based on being co-owned by the RequestObserver though, and the outcome gets reported when the RequestObserver shuts down (right before WorkerTracer is deallocated). In this case we can't dispatch the outcome event when the IoContext shuts down because it is simply not available yet.
In the case this PR seeks to address, the RequestObserver is destructed before the IoContext and accordingly reports the outcome event at that time, the IoContext may still add more events so we need to defer dispatching the outcome.

@danlapid
Copy link
Copy Markdown
Collaborator

Sounds like changing it so the IoContext destructor tells the WorkerTracer to report the outcome event would be simple then. Since it already is outlived by it.
Let's go with that.

Based on local testing, I believe this can't work as intended: In the common case, the IoContext/IoContext_IncomingRequest shuts down before metrics get reported (at least in workerd). The WorkerTracer is still being kept alive based on being co-owned by the RequestObserver though, and the outcome gets reported when the RequestObserver shuts down (right before WorkerTracer is deallocated). In this case we can't dispatch the outcome event when the IoContext shuts down because it is simply not available yet. In the case this PR seeks to address, the RequestObserver is destructed before the IoContext and accordingly reports the outcome event at that time, the IoContext may still add more events so we need to defer dispatching the outcome.

I understand that this PR seems like a simple fix to you that addresses the immediate need but this comment just proves how convoluted the request lifecycle is.
We cannot continue like this.
I have yet to hear any valid logical reason why the outcome can change after the IoContext is destructed.
The fact that it could is a bug.

@fhanau
Copy link
Copy Markdown
Contributor Author

fhanau commented Aug 11, 2025

I understand that this PR seems like a simple fix to you that addresses the immediate need but this comment just proves how convoluted the request lifecycle is. We cannot continue like this. I have yet to hear any valid logical reason why the outcome can change after the IoContext is destructed. The fact that it could is a bug.

I'm not aware of the outcome changing after the IoContext shuts down here, or that it would need to be supported. What I see happening is that the order in which RequestObserver and IoContext shut down is not fixed, which puts us into this position. Do you think that needs to be changed so that the RequestObserver destructs first? Based on a brief look on the internal and workerd implementations, it looks to me like the outcome in RequestObserver is being set by RequestObserver::reportFailure, RequestObserver::setOutcome and when checking the return value of invoking a (custom)event. To me it looks like the issue with when outcome is reported is inherent in this design and hard to avoid, please let me know if you can think of a workaround/new approach.

@danlapid
Copy link
Copy Markdown
Collaborator

Ok looked together.
Our codebase is too much of a mess at the moment.
We should clean this up later.

@fhanau fhanau force-pushed the felix/080525-defer-outcome-reporting branch 2 times, most recently from cafc206 to ae1ef21 Compare August 13, 2025 18:20
This should address the 'reported tail stream event after stream close' warnings
seen in testing.
@fhanau fhanau force-pushed the felix/080525-defer-outcome-reporting branch from ae1ef21 to 829156e Compare August 13, 2025 19:06
@github-actions
Copy link
Copy Markdown

github-actions bot commented Aug 13, 2025

The generated output of @cloudflare/workers-types matches the snapshot in types/generated-snapshot 🎉

@fhanau fhanau merged commit dbce470 into main Aug 13, 2025
32 of 33 checks passed
@fhanau fhanau deleted the felix/080525-defer-outcome-reporting branch August 13, 2025 19:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants