fix(spanner): replace multiplexed session request loop with shared in-flight creation#14215
fix(spanner): replace multiplexed session request loop with shared in-flight creation#14215
Conversation
There was a problem hiding this comment.
Code Review
This pull request adds two new tests to verify a fix for multiplexed session creation, specifically addressing potential deadlocks when a request is cancelled or the client is closed. The tests cover important concurrency scenarios. I've found a logical issue in one of the new tests, TestMultiplexedSessionCreationCloseReleasesWaiters, where the test setup would lead to a timeout instead of correctly testing the intended scenario. My feedback includes a correction for this test.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request provides a solid fix for the deadlock issue in multiplexed session creation. The move from a channel-based request loop to a shared in-flight creation latch (multiplexedSessionCreation) significantly simplifies the concurrency logic and makes it more robust against context cancellations and client shutdown. The new implementation is much easier to follow and reason about.
I have one suggestion regarding a new test case in spanner/session_test.go, which appears to have some stale comments and a misleading name from a previous implementation. Cleaning this up would improve the long-term maintainability of the test suite.
Overall, this is an excellent improvement.
olavloite
left a comment
There was a problem hiding this comment.
LGTM, with an open question regarding the upper bound for the timeout for CreateSession. I'm fine with leaving this as it is for now, as CreateSession should never take more than 30s. But I do think that we should at least do a follow-up to fix it, as it is not logical.
spanner/session.go
Outdated
| if !force && p.multiplexedSession != nil { | ||
| return nil | ||
| } | ||
| ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second) |
There was a problem hiding this comment.
This will effectively make 30 seconds the upper bound for a timeout for the CreateSession RPC. That is: If someone configures CreateSession to have a 60 second timeout, then that will be ignored, and this 30s value will effectively be used instead. Is that intentional?
There was a problem hiding this comment.
Thanks for pointing this out, this got copied from the old code which used 30s for the startup/refresh force create of mux session, the new code used 30s for every shared multiplexed-session creation epoch.
Removed it now,
- the manager still owns cancellation for shutdown
- CreateSession timeout policy now comes only from the configured GAPIC call options
- default behavior remains 30s(default gax), but a user-configured longer CreateSession timeout is no longer capped at 30s by the manager layer.
PR created by the Librarian CLI to initialize a release. Merging this PR will auto trigger a release. Librarian Version: v0.8.4 Language Image: us-central1-docker.pkg.dev/cloud-sdk-librarian-prod/images-prod/librarian-go@sha256:ac1efa3ad3c6d99efed878535b3a0fe63d0cce6701c335f03811d8b49004d652 <details><summary>spanner: v1.89.0</summary> ## [v1.89.0](spanner/v1.88.0...spanner/v1.89.0) (2026-03-26) ### Features * Add E2E fallback to the spanner client. (#13518) ([16af6a1](16af6a1c)) * include cache updates and routing hint into BeginTransaction and Commit request/response respectively (PiperOrigin-RevId: 878019893) ([9c80b8b](9c80b8b4)) * add SI, adapt, split point related proto (PiperOrigin-RevId: 871366927) ([d3eb851](d3eb851d)) * Add gRPC A66/A94 metrics (#13825) ([d695802](d695802a)) * support Scan from string to NullUUID (#14128) ([d897b6d](d897b6db)) ### Bug Fixes * replace multiplexed session request loop with shared in-flight creation (#14215) ([3e3bd2d](3e3bd2d3)) * guard rollback when aborted commit cleared session handle (#14218) ([6315105](63151055)) ### Documentation * A comment for field `routing_hint` in messages `.google.spanner.v1.ResultSet` and `.google.spanner.v1.PartialResultSet` are changed (PiperOrigin-RevId: 878019893) ([9c80b8b](9c80b8b4)) * A comment in message `.google.spanner.v1.TransactionOptions.ReadWrite.ReadLockMode` is changed (PiperOrigin-RevId: 878019893) ([9c80b8b](9c80b8b4)) * A comment for message `ListCloudInstancesAction` is changed (PiperOrigin-RevId: 871366927) ([d3eb851](d3eb851d)) * A comment for field `execution_options` in message `.google.spanner.executor.v1.StartTransactionAction` is changed (PiperOrigin-RevId: 871366927) ([d3eb851](d3eb851d)) * A comment for message `TransactionExecutionOptions` is changed (PiperOrigin-RevId: 871366927) ([d3eb851](d3eb851d)) </details>
Fixes: #14212
Reworked multiplexed session initialization to use a shared in-flight creation latch instead of the old request
goroutine and unbuffered channels. This fixes the startup deadlock triggered by a cancelled read-only transaction
waiter and adds regression coverage for cancellation and client shutdown while session creation is in flight.
sessionManager used an unbuffered multiplexedSessionReq channel and a single goroutine to coordinate multiplexed session creation. A cancelled waiter could cause that goroutine to return, leaving future callers blocked forever trying to send on the request channel during takeMultiplexed.
Replace that request/ack flow with a shared in-flight creation record on sessionManager. When no multiplexed session exists, the manager starts one background creation with a manager-owned context and exposes a done channel for all waiters in that creation epoch. Waiters now block on done or their own context, and completion is broadcast by closing the shared channel. This removes the deadlock where one cancelled context could kill shared coordination, and it also lets Client.Close release any waiters blocked behind an in-flight initial create.