Skip to content

Defer TCP socket connect() until DO output gate clears.#6533

Merged
vaibhavSharma-cf merged 2 commits into
mainfrom
vaibhav/socket-output-gate
Apr 13, 2026
Merged

Defer TCP socket connect() until DO output gate clears.#6533
vaibhavSharma-cf merged 2 commits into
mainfrom
vaibhav/socket-output-gate

Conversation

@vaibhavSharma-cf

@vaibhavSharma-cf vaibhavSharma-cf commented Apr 8, 2026

Copy link
Copy Markdown
Contributor

Defer TCP socket connect() until DO output gate clears.

Approach:
Wrap the WorkerInterface in newPromisedWorkerInterface() behind the TCP_SOCKET_CONNECT_OUTPUT_GATE autogate. The promised interface defers connect() until IoContext::waitForOutputLocks() resolves, preventing premature network output while DO storage writes are still pending.

@vaibhavSharma-cf vaibhavSharma-cf requested review from a team as code owners April 8, 2026 19:21
@vaibhavSharma-cf vaibhavSharma-cf self-assigned this Apr 8, 2026
@vaibhavSharma-cf

Copy link
Copy Markdown
Contributor Author

/bonk review

ask-bonk[bot]
ask-bonk Bot previously requested changes Apr 8, 2026

@ask-bonk ask-bonk Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

  • [HIGH] Use-after-free: addressStr captured as non-owning kj::StringPtr in coroutine, then the owning kj::String is moved away — the coroutine will dereference freed memory when it resumes after waitForOutputLocks().
  • [MEDIUM] ConnectionData member ordering: connectTask is destroyed after tlsStarter — while currently safe due to coroutine cancellation semantics, this contradicts the stated intent and is fragile against future changes. Moving connectTask after tlsStarter would make the safety explicit.
  • [LOW] Stale comment on connectImpl — the original TODO was removed but the replacement comment is misleading about where the gate wait actually occurs.

This review was generated by an AI assistant and may contain errors. Please verify all findings independently.

Comment thread src/workerd/api/sockets.c++ Outdated
Comment thread src/workerd/api/sockets.h Outdated
Comment thread src/workerd/api/sockets.c++
@ask-bonk

ask-bonk Bot commented Apr 8, 2026

Copy link
Copy Markdown
Contributor

Review posted on PR #6533 with 3 inline comments. Here's a summary of the findings:

[HIGH] Use-after-free on addressStr (sockets.c++:299): The doConnect coroutine captures addressStr as a non-owning kj::StringPtr, but the owning kj::String is moved into setupSocket() on line 323 after the coroutine is created. When the coroutine resumes after co_await ioContext.waitForOutputLocks(), it dereferences a dangling pointer. Fix: have the coroutine take ownership of the string (change parameter to kj::String).

[MEDIUM] ConnectionData member destruction order (sockets.h:189-192): The declared order puts connectTask before tlsStarter, meaning tlsStarter is destroyed first (reverse order). This contradicts the PR description's stated intent of cancelling the connect task before tlsStarter is destroyed. While currently safe due to coroutine cancellation semantics, the ordering should be swapped to make the safety guarantee explicit and resilient to future changes.

[LOW] Misleading function name (sockets.c++:354): The comment on connectImpl explains the output gate is now handled inside connectImplNoOutputLock, but the function name implies it does not handle output locks.

github run

Comment thread src/workerd/api/tests/socket-output-gate-test.wd-test Outdated
Comment thread src/workerd/api/sockets.c++ Outdated
dom96
dom96 previously requested changes Apr 9, 2026

@dom96 dom96 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the rationale for doing this? Why do we need to prevent network outputs while storage writes are pending?

It also doesn't seem like that is what is happening in this PR. It is actually blocking socket connections completely, shouldn't only writes be blocked instead? Focusing on blocking writes may also simplify the implementation, since connect itself is synchronous, so the code to block it is quite involved and requires these fulfillers (which I don't think are ideal).

@vaibhavSharma-cf vaibhavSharma-cf force-pushed the vaibhav/socket-output-gate branch 2 times, most recently from 2093ffb to 828423e Compare April 9, 2026 15:10
@kentonv

kentonv commented Apr 9, 2026

Copy link
Copy Markdown
Member

What is the rationale for doing this? Why do we need to prevent network outputs while storage writes are pending?

That's how output gates work. We don't want applications to have to wait for writes to be confirmed durable before we allow them to continue executing. Instead, we make it impossible to observe that the app has continued executing until the write has become durable. If the write ends up failing, then we prevent the rest of the world from ever knowing that the app kept running -- so from the PoV of the rest of the world, the app stopped exactly where the failed write was performed. This is a huge optimization because it allows an app to perform several writes in rapid succession with only one round trip to durable storage (SRS followers).

https://blog.cloudflare.com/durable-objects-easy-fast-correct-choose-three/

shouldn't only writes be blocked instead?

Anything that can be observed by the outside world needs to be blocked. That means we have to block both writes and connection initiation -- the remote end can observe that it has received a connection, and in theory you could imagine a protocol where merely connecting to a remote service serves to notify it of something, thus having side effects.

Comment thread src/workerd/api/sockets.c++ Outdated
@vaibhavSharma-cf vaibhavSharma-cf force-pushed the vaibhav/socket-output-gate branch from 828423e to 149fae3 Compare April 10, 2026 11:23
@vaibhavSharma-cf

Copy link
Copy Markdown
Contributor Author

/bonk review

@ask-bonk ask-bonk Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

All three findings from the previous review are resolved by the rewrite to use newPromisedWorkerInterface().

  • [RESOLVED] Previously HIGH: use-after-free on addressStr — eliminated by removing the coroutine entirely. The HttpClientAdapter copies host and headers internally, so the PromisedWorkerInterface::connect() coroutine operates on the adapter's owned copy, not on the caller's addressStr.
  • [RESOLVED] Previously MEDIUM: ConnectionData destruction order — the connectTask field was removed entirely.
  • [RESOLVED] Previously LOW: misleading function name — acknowledged with a TODO(cleanup) comment.

The newPromisedWorkerInterface() approach is correct. The adapter defers the WorkerInterface::connect() call until waitForOutputLocks() resolves, while the KJ HTTP layer handles all string ownership concerns internally. The test-fixture refactoring to expose DummyIoChannelFactory for subclassing is a reasonable improvement.

Two minor nits posted inline.


This review was generated by an AI assistant and may contain errors. Please verify all findings independently.

Comment thread src/workerd/api/sockets.c++ Outdated
Comment thread src/workerd/api/sockets-test.c++ Outdated
@ask-bonk

ask-bonk Bot commented Apr 10, 2026

Copy link
Copy Markdown
Contributor

Review posted on PR #6533 with 2 inline comments. Here's a summary:

All three previous findings are resolved by the rewrite to use newPromisedWorkerInterface():

  • Previously HIGH (use-after-free) — eliminated entirely. The HttpClientAdapter copies host/headers internally, so the deferred coroutine operates on owned copies.
  • Previously MEDIUM (destruction order) — the connectTask field was removed.
  • Previously LOW (misleading name) — acknowledged with TODO(cleanup).

New findings (both LOW):

  1. Comment on sockets.c++:261 exceeds typical line width — suggested wrapping.
  2. Lambda body indentation in sockets-test.c++:80-82 is inconsistent — suggested fix.

The newPromisedWorkerInterface() approach is correct and much simpler than the previous fulfiller-based approach. The KJ HTTP adapter layer handles all string ownership internally, making the code safe without any manual lifetime management.

github run

@vaibhavSharma-cf vaibhavSharma-cf force-pushed the vaibhav/socket-output-gate branch 7 times, most recently from 7b367f0 to d9ddc73 Compare April 10, 2026 15:20
@vaibhavSharma-cf

Copy link
Copy Markdown
Contributor Author

/bonk review

@ask-bonk ask-bonk Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

The newPromisedWorkerInterface() approach is clean and correct — significantly simpler than the earlier fulfiller-based design. The core deferral logic is sound, and the error-preservation changes in handleProxyStatus are a necessary consequence of making connect errors asynchronous.

Findings:

  • [MEDIUM] handleProxyStatus error-preservation pattern is fragile (sockets.c++:434-443) — heap-allocated originalError accessed via raw reference across lambda boundaries; safe today but the lifetime dependency between errorHandler and func is implicit and easy to break in future refactors.
  • [MEDIUM] waitForOutputLocksIfNecessary() should be preferred over waitForOutputLocks() (sockets.c++:263) — avoids unnecessary promise allocation for non-actor contexts where the gate is always open.
  • [LOW] Test co_await kj::evalLater robustness (sockets-test.c++:104-108) — a single event loop turn is likely sufficient given KJ's synchronous continuation resolution, but worth a brief comment explaining why.
  • [LOW] KJ_EXPECT_LOG branch duplication in server-test (server-test.c++:1258-1269) — only the KJ_EXPECT_LOG line differs between branches; the test connection logic is duplicated.

This review was generated by an AI assistant and may contain errors. Please verify all findings independently.

Comment thread src/workerd/api/sockets.c++ Outdated
Comment thread src/workerd/api/sockets.c++ Outdated
Comment thread src/workerd/api/sockets-test.c++
Comment thread src/workerd/server/server-test.c++ Outdated
@ask-bonk

ask-bonk Bot commented Apr 10, 2026

Copy link
Copy Markdown
Contributor

Review posted on PR #6533 with 4 inline comments. Here's the summary:

All three findings from my previous reviews are resolved by the rewrite to newPromisedWorkerInterface().

New findings (2 MEDIUM, 2 LOW):

  1. [MEDIUM] handleProxyStatus error-preservation pattern (sockets.c++:434-443) — uses a heap-allocated Maybe<kj::Exception> accessed via raw reference across two lambdas. Safe today because awaitIo guarantees func outlives errorHandler, but the lifetime dependency is implicit. Also, TypeErrors from rejected connects now get logged as "Socket proxy disconnected abruptly" when the autogate is on, which is misleading.

  2. [MEDIUM] waitForOutputLocks() vs waitForOutputLocksIfNecessary() (sockets.c++:263) — the current code always allocates a promise even in non-actor contexts. Minor optimization opportunity, though READY_NOW is essentially free.

  3. [LOW] Test co_await kj::evalLater pattern (sockets-test.c++:104-108) — correct but could use a comment explaining why a single turn suffices.

  4. [LOW] Server-test branch duplication (server-test.c++:1258-1269) — the KJ_EXPECT_LOG can be hoisted before the shared connection logic to eliminate the else branch.

github run

@vaibhavSharma-cf vaibhavSharma-cf force-pushed the vaibhav/socket-output-gate branch from d9ddc73 to 59a7f82 Compare April 10, 2026 18:39
@vaibhavSharma-cf

Copy link
Copy Markdown
Contributor Author

/bonk review

@ask-bonk ask-bonk Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

All three findings from my earlier CHANGES_REQUESTED review are resolved by the rewrite to newPromisedWorkerInterface(). The approach is significantly cleaner — the KJ HTTP adapter layer handles string ownership internally, and there are no manual lifetime puzzles.

New findings:

  • [MEDIUM] handleProxyStatus error-preservation pattern (sockets.c++:434-443) — heap-allocated Maybe<kj::Exception> accessed via raw reference across lambda boundaries. Safe today but the lifetime dependency is implicit. Also, TypeErrors from rejected connects now go through the "Socket proxy disconnected abruptly" log path, which is misleading.
  • [LOW] Server-test branch duplication (server-test.c++:1258-1269) — only KJ_EXPECT_LOG differs between branches; hoist it before the shared connection logic.
  • [LOW] Test lambda indentation (sockets-test.c++:82-83) — body aligns with SetupParams brace instead of being indented within the lambda.

This review was generated by an AI assistant and may contain errors. Please verify all findings independently.

Comment thread src/workerd/api/sockets.c++ Outdated
Comment thread src/workerd/server/server-test.c++ Outdated
Comment thread src/workerd/api/sockets-test.c++
@ask-bonk

ask-bonk Bot commented Apr 10, 2026

Copy link
Copy Markdown
Contributor

Review posted successfully on PR #6533 with 3 inline comments. Here's a summary of what was posted:

Resolved from prior reviews:

  • Previously HIGH (use-after-free on addressStr) — eliminated by the newPromisedWorkerInterface() rewrite
  • Previously MEDIUM (ConnectionData destruction order) — connectTask field removed entirely
  • Previously LOW (misleading function name) — acknowledged with TODO(cleanup)

New findings posted:

  1. [MEDIUM] handleProxyStatus error-preservation pattern (sockets.c++:443) — the originalErrorRef raw reference captured by errorHandler has an implicit lifetime dependency on func owning originalError. Safe today but fragile. Also noted that TypeErrors now go through the misleading "Socket proxy disconnected abruptly" log path.
  2. [LOW] Server-test branch duplication (server-test.c++:1258-1269) — the KJ_EXPECT_LOG can be hoisted before the shared connection logic to eliminate the else branch.
  3. [LOW] Test lambda indentation (sockets-test.c++:81-84) — inconsistent alignment with suggested fix.

github run

@vaibhavSharma-cf vaibhavSharma-cf force-pushed the vaibhav/socket-output-gate branch 5 times, most recently from 7185e31 to c97dba6 Compare April 13, 2026 12:33
Wrap the WorkerInterface in newPromisedWorkerInterface() behind the
TCP_SOCKET_CONNECT_OUTPUT_GATE autogate. The promised interface defers
connect() until IoContext::waitForOutputLocks()
resolves, preventing premature network output while DO storage writes
are still pending.
@vaibhavSharma-cf vaibhavSharma-cf force-pushed the vaibhav/socket-output-gate branch from c97dba6 to a375586 Compare April 13, 2026 13:41
@vaibhavSharma-cf vaibhavSharma-cf enabled auto-merge (squash) April 13, 2026 16:18
@vaibhavSharma-cf vaibhavSharma-cf dismissed dom96’s stale review April 13, 2026 16:23

Kenton has answered the query regarding why we want waiting on lock at connect level as well.

@vaibhavSharma-cf vaibhavSharma-cf dismissed ask-bonk[bot]’s stale review April 13, 2026 16:24

all comments either addressed or outdated.

@github-actions

github-actions Bot commented Apr 13, 2026

Copy link
Copy Markdown

The generated output of @cloudflare/workers-types matches the snapshot in types/generated-snapshot 🎉

@vaibhavSharma-cf vaibhavSharma-cf merged commit 50c021a into main Apr 13, 2026
33 of 34 checks passed
@vaibhavSharma-cf vaibhavSharma-cf deleted the vaibhav/socket-output-gate branch April 13, 2026 17:32
@kentonv

kentonv commented Apr 13, 2026

Copy link
Copy Markdown
Member

Hmm I was not expecting my approval to dismiss @dom96's review, I was expecting him to take another look and say if he is happy with it now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants