fix(storage): Fix takeover response handling. #13239
Conversation
Summary of Changes (Gemini Code Assist): This pull request addresses a critical issue in storage write operations where a takeover response might provide a write handle. Previously, this handle was not properly utilized during retries, leading to potential errors due to incorrect identification of write sessions. The changes ensure that the client correctly processes and uses this handle, allowing for safe and idempotent retries after a takeover, thereby improving the robustness of appendable uploads.
Code Review
This pull request fixes an issue with takeover response handling in resumable writes by ensuring the write handle from the server is used in subsequent retries. The core logic change in grpc_writer.go is correct and well-targeted. A new conformance test has been added to verify this fix. My review focuses on the new test code in retry_conformance_test.go, where I've identified several opportunities to improve maintainability and consistency. Specifically, I've suggested refactoring to reduce code duplication, improving the structure of the long test function for better readability, and making error handling more consistent and robust. The changes effectively address the described bug.
if err := b.Create(ctx, projectID, nil); err != nil {
	return err
}
For better error diagnostics and consistency with other error handling in this function, consider wrapping this error with additional context.
Suggested change:

 if err := b.Create(ctx, projectID, nil); err != nil {
-	return err
+	return fmt.Errorf("failed to create bucket: %w", err)
 }
// Close this writer, which will create the appendable unfinalized object
// (there was not enough in Write to trigger a send).
if err := objW.Close(); err != nil {
	return fmt.Errorf("Creation Writer.Close: %v", err)
To enable proper error unwrapping using errors.Is or errors.As, it's better to use the %w format verb instead of %v when wrapping errors. This function is inconsistent, as it uses %w on line 764 but uses %v here and in several other places (e.g., lines 779, 786, 789, 800, 804). Using %w consistently is a modern Go best practice.
Suggested change:

-	return fmt.Errorf("Creation Writer.Close: %v", err)
+	return fmt.Errorf("Creation Writer.Close: %w", err)
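For illustration only (this snippet is not part of the PR), here is a minimal standalone program showing why the wrapping verb matters: errors.Is can match a sentinel through an error built with %w, but not through one flattened with %v.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

func main() {
	// %w records io.EOF in the error chain, so errors.Is can still find it.
	wrapped := fmt.Errorf("Creation Writer.Close: %w", io.EOF)
	fmt.Println(errors.Is(wrapped, io.EOF)) // true

	// %v only formats the text; the chain back to io.EOF is lost.
	flattened := fmt.Errorf("Creation Writer.Close: %v", io.EOF)
	fmt.Println(errors.Is(flattened, io.EOF)) // false
}
```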
func(ctx context.Context, c *Client, fs *resources, preconditions bool) error {
	bucketName := fmt.Sprintf("%s-appendable", bucketIDs.New())
	b := c.Bucket(bucketName)
	if err := b.Create(ctx, projectID, nil); err != nil {
		return err
	}
	defer b.Delete(ctx)

	obj := b.Object(objectIDs.New())
	if preconditions {
		obj = obj.If(Conditions{DoesNotExist: true})
	}

	// Force multiple messages per chunk, and multiple chunks in the object.
	chunkSize := 2 * maxPerMessageWriteSize
	toWrite := generateRandomBytes(chunkSize * 3)

	objW := obj.NewWriter(ctx)
	objW.Append = true
	objW.ChunkSize = chunkSize
	if _, err := objW.Write(toWrite[0:maxPerMessageWriteSize]); err != nil {
		return fmt.Errorf("Writer.Write: %w", err)
	}
	// Close this writer, which will create the appendable unfinalized object
	// (there was not enough in Write to trigger a send).
	if err := objW.Close(); err != nil {
		return fmt.Errorf("Creation Writer.Close: %v", err)
	}

	generation := int64(0)
	if preconditions {
		generation = objW.Attrs().Generation
	}
	objT := b.Object(obj.ObjectName()).Generation(generation)
	w, l, err := objT.NewWriterFromAppendableObject(ctx, &AppendableWriterOpts{ChunkSize: chunkSize})
	if err != nil {
		return fmt.Errorf("NewWriterFromAppendableObject: %v", err)
	}
	if l != int64(maxPerMessageWriteSize) {
		return fmt.Errorf("NewWriterFromAppendableObject unexpected len: got %v, want %v", l, maxPerMessageWriteSize)
	}

	if _, err := w.Write(toWrite[maxPerMessageWriteSize:]); err != nil {
		return fmt.Errorf("Writer.Write: %v", err)
	}
	if err := w.Close(); err != nil {
		return fmt.Errorf("Writer.Close: %v", err)
	}

	if w.Attrs() == nil {
		return fmt.Errorf("Writer.Attrs: expected attrs for written object, got nil")
	}

	// Don't reuse obj, in case preconditions were set on the write request.
	r, err := b.Object(obj.ObjectName()).NewReader(ctx)
	defer r.Close()
	if err != nil {
		return fmt.Errorf("obj.NewReader: %v", err)
	}
	content, err := io.ReadAll(r)
	if err != nil {
		return fmt.Errorf("Reader.Read: %v", err)
	}

	gotMd5 := md5.Sum(content)
	expectedMd5 := md5.Sum(toWrite)
	if d := cmp.Diff(gotMd5, expectedMd5); d != "" {
This new test function is quite long (over 60 lines) and contains multiple logical steps, which can make it difficult to understand and maintain. Consider refactoring it by extracting distinct stages of the test into smaller, well-named helper functions. For example, you could have separate helpers for:
- Setting up and creating the initial appendable object.
- Performing the takeover write.
This would make the main test function a much clearer, high-level description of the test scenario.
gotMd5 := md5.Sum(content)
expectedMd5 := md5.Sum(toWrite)
if d := cmp.Diff(gotMd5, expectedMd5); d != "" {
This block of code for verifying content is duplicated from the test case above (lines 734-741). To improve maintainability and reduce code duplication, consider extracting this logic into a shared helper function. For example:
func verifyContent(content, toWrite []byte) error {
	gotMd5 := md5.Sum(content)
	expectedMd5 := md5.Sum(toWrite)
	if d := cmp.Diff(gotMd5, expectedMd5); d != "" {
		return fmt.Errorf("content mismatch, got %v bytes (md5: %v), want %v bytes (md5: %v)",
			len(content), gotMd5, len(toWrite), expectedMd5)
	}
	return nil
}

Then both tests could simply call return verifyContent(content, toWrite) at the end.
The retry conformance tests will start to pass once the testbench supports single-threaded takeovers, in v0.56.0. The MRD timeout is a preexisting flake, I think.
tritone left a comment
This looks good to me.
Does the intersection of how the write is split up and the retry test scenarios cover a retry both on the initial takeover request and midway through the takeover write? It might be good to note explicitly how you chose how much data to write at each step.
A takeover response can return a handle. In that case, a retry should present that handle. Otherwise it will attempt a takeover when it retries. After a new takeover, the client can't tell if a partial write after the last known flush offset was performed by itself, or by some other writer. Therefore the client can't tell if it's safe to retry from the last known flush offset, and the server will return an error if it repeats offsets. The write handle allows the client to present enough information to the server to correctly identify replays on the same write session, avoiding an error when the client repeats bytes that are already durable.
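As a rough sketch of that decision, using hypothetical request types rather than the actual grpc_writer.go or Cloud Storage gRPC types: when the takeover response supplied a handle, a retry presents it and resumes from the flushed offset; without one, the retry has to start a fresh takeover, and the server may reject offsets it has already seen.

```go
package main

import "fmt"

// Hypothetical request shape for illustration only; the real client uses the
// Cloud Storage gRPC protos, not these types.
type appendRequest struct {
	WriteHandle []byte // identifies the existing write session, if known
	Takeover    bool   // true means "start a new takeover" instead
	Offset      int64
}

// buildRetry sketches the choice the fix makes: reuse the handle returned by
// the takeover response so the retried request is recognized as the same session.
func buildRetry(handleFromTakeover []byte, flushedOffset int64) appendRequest {
	if len(handleFromTakeover) > 0 {
		// Same session: the server can tell that repeated bytes are replays
		// of data it already has, so resending from flushedOffset is safe.
		return appendRequest{WriteHandle: handleFromTakeover, Offset: flushedOffset}
	}
	// No handle: the retry must take over again, and the server can't
	// attribute bytes after the flushed offset to this writer.
	return appendRequest{Takeover: true, Offset: flushedOffset}
}

func main() {
	fmt.Printf("%+v\n", buildRetry([]byte("handle-from-takeover"), 1024))
	fmt.Printf("%+v\n", buildRetry(nil, 1024))
}
```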
Force-pushed from 7a11e36 to f9ddb31.
The test found a real bug (yay!). On rare occasions on my workstation, the takeover writer issues its first write at an earlier offset than the one returned by the server. The real service treats this as an error, but we don't model the full state machine in the testbench and it falls into retry handling, which silently drops these bytes. This probably means that there is a race in the spot where we determine the takeover offset for future writes. I'll figure that out before merging this PR.
OK, the bug is that we don't guarantee the writer has fully handled the completion from the takeover response before returning control to the caller code. The writer only has a fully up-to-date view of the correct buffer offsets once the completion put on the channel during takeover (https://github.com/googleapis/google-cloud-go/pull/13239/files#diff-2b5c89d316046e2b33163778a826af79ec60ff23d70a93cd0d4fa1daa2a494e8R1244) is processed by the writeLoop (https://github.com/googleapis/google-cloud-go/pull/13239/files#diff-2b5c89d316046e2b33163778a826af79ec60ff23d70a93cd0d4fa1daa2a494e8R447-R451), but the caller can get control back before that has happened. We should guarantee that we've handled this completion before returning control to the caller. It's a small fix, I think, so I'll just include it in this PR.
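A minimal sketch of that kind of synchronization, with hypothetical names rather than the real grpc_writer.go types: the takeover path enqueues a completion for the write loop and then blocks on a done channel until the loop has applied it, so the caller never observes stale buffer offsets.

```go
package main

import "fmt"

// completion is a hypothetical stand-in for the item put on the writer's
// channel during takeover; done is closed once the loop has applied it.
type completion struct {
	offset int64
	done   chan struct{}
}

func writeLoop(completions <-chan completion, state *int64) {
	for c := range completions {
		*state = c.offset // update the writer's view of durable bytes
		close(c.done)     // only now is it safe for the caller to proceed
	}
}

func takeover(completions chan<- completion, serverOffset int64) {
	c := completion{offset: serverOffset, done: make(chan struct{})}
	completions <- c
	<-c.done // block until the write loop has processed the completion
}

func main() {
	var flushed int64
	completions := make(chan completion)
	go writeLoop(completions, &flushed)

	takeover(completions, 4096) // returns only after flushed == 4096
	fmt.Println("takeover offset applied:", flushed)
}
```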
tritone left a comment
nice!
🤖 I have created a release *beep* *boop*

## [1.57.1](storage/v1.57.0...storage/v1.57.1) (2025-10-28)

### Bug Fixes

* **storage:** Takeover idempotence. ([#13230](#13230)) ([cc5d2a1](cc5d2a1))
* **storage:** Copy metadata when using Copier with grpc ([#12919](#12919)) ([57a2e80](57a2e80))
* **storage:** Fix takeover response handling. ([#13239](#13239)) ([26d75bc](26d75bc))
* **storage:** Remove default timeout for gRPC operations ([#13022](#13022)) ([b94c3ba](b94c3ba))
* **storage:** Skip download of file outside of target dir ([#12945](#12945)) ([6259aee](6259aee))
* **storage:** Upgrade gRPC service registration func ([8fffca2](8fffca2))

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
Co-authored-by: Brenna N Epp <brennae@google.com>
The initial takeover response wasn't being checked for a redirect. googleapis#13239 addressed successful responses. This addresses error responses.