Defer TCP socket connect() until DO output gate clears.#6533
Conversation
|
/bonk review |
There was a problem hiding this comment.
Review Summary
- [HIGH] Use-after-free:
addressStrcaptured as non-owningkj::StringPtrin coroutine, then the owningkj::Stringis moved away — the coroutine will dereference freed memory when it resumes afterwaitForOutputLocks(). - [MEDIUM]
ConnectionDatamember ordering:connectTaskis destroyed aftertlsStarter— while currently safe due to coroutine cancellation semantics, this contradicts the stated intent and is fragile against future changes. MovingconnectTaskaftertlsStarterwould make the safety explicit. - [LOW] Stale comment on
connectImpl— the original TODO was removed but the replacement comment is misleading about where the gate wait actually occurs.
This review was generated by an AI assistant and may contain errors. Please verify all findings independently.
|
Review posted on PR #6533 with 3 inline comments. Here's a summary of the findings: [HIGH] Use-after-free on [MEDIUM] [LOW] Misleading function name ( |
dom96
left a comment
There was a problem hiding this comment.
What is the rationale for doing this? Why do we need to prevent network outputs while storage writes are pending?
It also doesn't seem like that is what is happening in this PR. It is actually blocking socket connections completely, shouldn't only writes be blocked instead? Focusing on blocking writes may also simplify the implementation, since connect itself is synchronous, so the code to block it is quite involved and requires these fulfillers (which I don't think are ideal).
2093ffb to
828423e
Compare
That's how output gates work. We don't want applications to have to wait for writes to be confirmed durable before we allow them to continue executing. Instead, we make it impossible to observe that the app has continued executing until the write has become durable. If the write ends up failing, then we prevent the rest of the world from ever knowing that the app kept running -- so from the PoV of the rest of the world, the app stopped exactly where the failed write was performed. This is a huge optimization because it allows an app to perform several writes in rapid succession with only one round trip to durable storage (SRS followers). https://blog.cloudflare.com/durable-objects-easy-fast-correct-choose-three/
Anything that can be observed by the outside world needs to be blocked. That means we have to block both writes and connection initiation -- the remote end can observe that it has received a connection, and in theory you could imagine a protocol where merely connecting to a remote service serves to notify it of something, thus having side effects. |
828423e to
149fae3
Compare
|
/bonk review |
There was a problem hiding this comment.
Review Summary
All three findings from the previous review are resolved by the rewrite to use newPromisedWorkerInterface().
- [RESOLVED] Previously HIGH: use-after-free on
addressStr— eliminated by removing the coroutine entirely. TheHttpClientAdaptercopieshostandheadersinternally, so thePromisedWorkerInterface::connect()coroutine operates on the adapter's owned copy, not on the caller'saddressStr. - [RESOLVED] Previously MEDIUM:
ConnectionDatadestruction order — theconnectTaskfield was removed entirely. - [RESOLVED] Previously LOW: misleading function name — acknowledged with a
TODO(cleanup)comment.
The newPromisedWorkerInterface() approach is correct. The adapter defers the WorkerInterface::connect() call until waitForOutputLocks() resolves, while the KJ HTTP layer handles all string ownership concerns internally. The test-fixture refactoring to expose DummyIoChannelFactory for subclassing is a reasonable improvement.
Two minor nits posted inline.
This review was generated by an AI assistant and may contain errors. Please verify all findings independently.
|
Review posted on PR #6533 with 2 inline comments. Here's a summary: All three previous findings are resolved by the rewrite to use
New findings (both LOW):
The |
7b367f0 to
d9ddc73
Compare
|
/bonk review |
There was a problem hiding this comment.
Review Summary
The newPromisedWorkerInterface() approach is clean and correct — significantly simpler than the earlier fulfiller-based design. The core deferral logic is sound, and the error-preservation changes in handleProxyStatus are a necessary consequence of making connect errors asynchronous.
Findings:
- [MEDIUM]
handleProxyStatuserror-preservation pattern is fragile (sockets.c++:434-443) — heap-allocatedoriginalErroraccessed via raw reference across lambda boundaries; safe today but the lifetime dependency betweenerrorHandlerandfuncis implicit and easy to break in future refactors. - [MEDIUM]
waitForOutputLocksIfNecessary()should be preferred overwaitForOutputLocks()(sockets.c++:263) — avoids unnecessary promise allocation for non-actor contexts where the gate is always open. - [LOW] Test
co_await kj::evalLaterrobustness (sockets-test.c++:104-108) — a single event loop turn is likely sufficient given KJ's synchronous continuation resolution, but worth a brief comment explaining why. - [LOW]
KJ_EXPECT_LOGbranch duplication in server-test (server-test.c++:1258-1269) — only theKJ_EXPECT_LOGline differs between branches; the test connection logic is duplicated.
This review was generated by an AI assistant and may contain errors. Please verify all findings independently.
|
Review posted on PR #6533 with 4 inline comments. Here's the summary: All three findings from my previous reviews are resolved by the rewrite to New findings (2 MEDIUM, 2 LOW):
|
d9ddc73 to
59a7f82
Compare
|
/bonk review |
There was a problem hiding this comment.
Review Summary
All three findings from my earlier CHANGES_REQUESTED review are resolved by the rewrite to newPromisedWorkerInterface(). The approach is significantly cleaner — the KJ HTTP adapter layer handles string ownership internally, and there are no manual lifetime puzzles.
New findings:
- [MEDIUM]
handleProxyStatuserror-preservation pattern (sockets.c++:434-443) — heap-allocatedMaybe<kj::Exception>accessed via raw reference across lambda boundaries. Safe today but the lifetime dependency is implicit. Also, TypeErrors from rejected connects now go through the "Socket proxy disconnected abruptly" log path, which is misleading. - [LOW] Server-test branch duplication (
server-test.c++:1258-1269) — onlyKJ_EXPECT_LOGdiffers between branches; hoist it before the shared connection logic. - [LOW] Test lambda indentation (
sockets-test.c++:82-83) — body aligns withSetupParamsbrace instead of being indented within the lambda.
This review was generated by an AI assistant and may contain errors. Please verify all findings independently.
|
Review posted successfully on PR #6533 with 3 inline comments. Here's a summary of what was posted: Resolved from prior reviews:
New findings posted:
|
7185e31 to
c97dba6
Compare
Wrap the WorkerInterface in newPromisedWorkerInterface() behind the TCP_SOCKET_CONNECT_OUTPUT_GATE autogate. The promised interface defers connect() until IoContext::waitForOutputLocks() resolves, preventing premature network output while DO storage writes are still pending.
c97dba6 to
a375586
Compare
Kenton has answered the query regarding why we want waiting on lock at connect level as well.
all comments either addressed or outdated.
|
The generated output of |
|
Hmm I was not expecting my approval to dismiss @dom96's review, I was expecting him to take another look and say if he is happy with it now. |
Defer TCP socket connect() until DO output gate clears.
Approach:
Wrap the WorkerInterface in newPromisedWorkerInterface() behind the TCP_SOCKET_CONNECT_OUTPUT_GATE autogate. The promised interface defers connect() until IoContext::waitForOutputLocks() resolves, preventing premature network output while DO storage writes are still pending.