Skip to content

fix(datastore): detach rollback context from transaction cancellation#14602

Merged
bhshkh merged 1 commit into
googleapis:mainfrom
bhshkh:fix/ds-retry-rollback-ctx
May 19, 2026
Merged

fix(datastore): detach rollback context from transaction cancellation#14602
bhshkh merged 1 commit into
googleapis:mainfrom
bhshkh:fix/ds-retry-rollback-ctx

Conversation

@bhshkh

@bhshkh bhshkh commented May 16, 2026

Copy link
Copy Markdown
Contributor

Fix a lock contention and high-latency issue in RunInTransaction caused by failing to send Rollback commands when a context timer expires
or is canceled.

Fixes: b/346474631

@product-auto-label product-auto-label Bot added the api: datastore Issues related to the Datastore API. label May 16, 2026
@bhshkh

bhshkh commented May 16, 2026

Copy link
Copy Markdown
Contributor Author

Background (How Datastore Concurrency Works)

When a client updates records inside a Datastore transaction, the Datastore backend places an invisible transaction lock on those records. This prevents other transactions from making concurrent/conflicting changes.

These locks must be released as quickly as possible to avoid "database traffic jams":

  1. If the transaction succeeds, it commits the changes and releases the lock.
  2. If the transaction fails or is canceled, the client library must send a "Rollback" RPC. This tells Datastore to discard the changes and release the lock immediately.

The Bug

When users set a transaction timeout (e.g., using a 5-second context limit) and a Commit is slow or hangs:

  1. The 5-second limit hits. The client correctly aborts the Commit.
  2. The SDK catches this failure and attempts to run Rollback to release the locks.
  3. The problem: The library attempted to send the Rollback RPC using the SAME parent context that had just expired. Since the context was already dead, our networking layer refused to send the Rollback out to the server.
  4. Because the server never received a Rollback, it kept the locks active for up to 103 seconds (until a backend master timeout cleaned them up).
  5. During this time, any new transaction trying to touch those records got blocked, piled up, and failed after 20 seconds with "too much contention" errors.

The Solution

This PR fixes the bug by ensuring the Rollback RPC always goes out, even if the parent context is cancelled or expired:

  • Detached Context: We now wrap the rollback context in context.WithoutCancel(t.ctx). This creates a fresh, healthy context that maintains critical tracing metadata but ignores parental cancellation.
  • Safe Timeout: We bound the rollback context with an independent 5-second limit so that it cannot block your application indefinitely if the network is completely broken.
  • Immediate Lock Release: The Rollback RPC successfully reaches the server, properly releasing locks instantly, enabling subsequent retries to proceed without waiting.

@bhshkh bhshkh marked this pull request as ready for review May 16, 2026 00:56
@bhshkh bhshkh requested review from a team as code owners May 16, 2026 00:56

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the rollbackWithRetry method in datastore/transaction.go to use a detached context with a 5-second timeout, ensuring that rollback RPCs are executed even if the original context is cancelled or timed out. A corresponding test case, TestRunInTransaction_ContextCanceled, has been added to verify that rollbacks occur correctly under context cancellation. I have no feedback to provide.

Comment thread datastore/transaction.go
origCtx := t.ctx
t.ctx = rollbackCtx
defer func() {
t.ctx = origCtx

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the logic for why the rollbackCtx is needed, but why do we have to reset t.ctx back to the original context?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Avoid Side-Effects: rollbackCtx is cancelled immediately after rollbackWithRetry returns (via defer cancel()). If we don't restore t.ctx, the Transaction struct would be left holding a reference to a cancelled context.
  2. Clearer Errors on Misuse: If the transaction is somehow reused after a failed rollback (where state might not be marked as expired), restoring origCtx ensures subsequent operations run under the user's original context (which might still be active) rather than failing immediately with a confusing context canceled error.
  3. Memory Management: It prevents keeping the temporary rollbackCtx and its wrapper in memory if the Transaction object is held elsewhere.

@bhshkh bhshkh merged commit 0a8d10d into googleapis:main May 19, 2026
11 checks passed
@bhshkh bhshkh deleted the fix/ds-retry-rollback-ctx branch May 19, 2026 16:05
bhshkh added a commit that referenced this pull request May 19, 2026
PR created by the Librarian CLI to initialize a release. Merging this PR
will auto trigger a release.

Librarian Version: v0.13.2-0.20260518181009-04b8e642ea4c
Language Image:
us-central1-docker.pkg.dev/cloud-sdk-librarian-prod/images-prod/librarian-go@sha256:b04b076f5eedbb5546bd6fc1404969dd3698c8b19c0f34ae815a84ae735a606a
<details><summary>datastore: v1.24.0</summary>

##
[v1.24.0](datastore/v1.23.0...datastore/v1.24.0)
(2026-05-19)

### Bug Fixes

* detach rollback context from transaction cancellation (#14602)
([0a8d10d](0a8d10d1))

* fix context leak in Iterator and Transaction spans (#14478)
([78de20d](78de20da))

* add retries to emulator (#14591)
([d944830](d9448309))

</details>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api: datastore Issues related to the Datastore API.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants