fix: retry ABORTED writes in bulkCommit #1222
Codecov Report
@@            Coverage Diff             @@
##           node10    #1222      +/-  ##
==========================================
- Coverage   98.42%   98.42%    -0.01%
==========================================
  Files          30       30
  Lines       18539    18626      +87
  Branches     1425     1437      +12
==========================================
+ Hits        18247    18332      +85
- Misses        289      291       +2
  Partials        3        3
dev/src/write-batch.ts
Outdated
>('batchWrite', retryRequest, tag, retryCodes);

// Map the results of the retried request back to the original response.
for (let i = 0; i < abortedIndexes.length; i++) {
I was hoping we could just extract a new BulkCommitBatch and send it off using the existing bulkCommit(). Is that feasible?
I played around with it. Switching it to a recursive bulkCommit() call adds a layer of complexity that the while-loop implementation doesn't have. The only repeated code that using bulkCommit() saves is the call to make the request -- the rest of the reduce/mapping code still needs to be there. Am I missing something here?
I am wondering if maybe this is the wrong layer - what would this look like if we made this change at the callsite of bulkCommit()? We could still have a single while loop, but it would encapsulate both sending the initial batch as well as the retry. I threw this together in this Gist: https://gist.github.com/schmidt-sebastian/abbedb8412976166fe23ef3d323d613b
There is some major code smell in this proposal as we need to somehow extract the failed writes from a completed WriteBatch. We need to find a clean way to do this. I am beginning to think more and more that re-using the WriteBatch wasn't the right decision - after all, we don't actually use much of the Batch functionality, we mainly just use it to parse input, and maybe we could just have a couple of shared helper functions/a shared helper class that does this.
You will probably think that it is not worth changing all of this just to make the code read a bit nicer. I am mostly with you on that, but I do think that we should try to mark the writes that do succeed as successful right away. I don't believe we can get that to work if WriteBatch.bulkCommit() retries under the hood and blocks.
A customer could rightfully expect that we signal a successful write immediately. It brings down the perceived latency of the BatchWrite endpoint and explains why a document shows up in the console or via other listeners. The only reason to artificially delay resolving the individual write Promise is the structure of our code, but we should keep in mind that we are setting an example here for 5 more implementations. We should invest the effort to get this right.
I personally need to do a lot more thinking here to come up with a clean design that is easy to implement (even if it ends up being a large refactor, it should ideally be something that the IDE can do for the most part).
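A minimal sketch of the "resolve successes immediately" idea discussed above. All names here (`Deferred`, `ResultLike`) are hypothetical stand-ins, not the actual BulkWriter API: each write gets its own deferred promise that resolves as soon as its result arrives, and only failed positions carry over into the retry round.

```typescript
// Hypothetical sketch: resolve each write's promise as soon as its result
// arrives, instead of holding all promises until retries complete.
class Deferred<T> {
  promise: Promise<T>;
  resolve!: (value: T) => void;
  reject!: (reason?: unknown) => void;
  constructor() {
    this.promise = new Promise<T>((res, rej) => {
      this.resolve = res;
      this.reject = rej;
    });
  }
}

interface ResultLike {
  statusCode: number; // gRPC codes: 0 = OK, 10 = ABORTED
}

const deferreds = [new Deferred<ResultLike>(), new Deferred<ResultLike>()];

// Simulated first batch response: write 0 succeeds, write 1 is ABORTED.
const firstResponse: ResultLike[] = [{statusCode: 0}, {statusCode: 10}];

const retryIndexes: number[] = [];
firstResponse.forEach((result, i) => {
  if (result.statusCode === 0) {
    deferreds[i].resolve(result); // the caller sees success right away
  } else {
    retryIndexes.push(i); // only this write goes into the retry batch
  }
});

console.log(retryIndexes); // [1]
```

The point of the sketch: the successful write's promise settles before any backoff or retry of its neighbors, which is what brings down perceived latency.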
  "DEADLINE_EXCEEDED",
  "INTERNAL",
  "UNAVAILABLE"
],
This change needs to be made here: https://source.corp.google.com/piper///depot/google3/google/firestore/v1/firestore_grpc_service_config.json and then propagated via our Open Source tooling.
done.
It still seems missing?
update: in cl/318124350.
Force-pushed from b8ce703 to bdfef3e.
Made a pass at moving the retry handling to BulkWriter. One source of complexity that I wish I could remove is mapping the subsequent batch retries back to the original index in resultsMap.
schmidt-sebastian
left a comment
Back to Brian to see if we can use the document path for index tracking.
schmidt-sebastian
left a comment
Thanks for the update!
dev/src/bulk-writer.ts
Outdated
for (let i = 0; i < results.length; i++) {
  const writeResult = results[i];
  if (this.shouldRetry(writeResult.status.code)) {
    indexesToRetry.push(i);
We need to reject the result here if the write fails with an error that is not retryable.
Done. Used your proposed simplification.
if (indexesToRetry.length === 0) {
  this.completedDeferred.resolve();
}
return indexesToRetry;
Can we combine the different error handling code paths here?
Something like:
if (error) {
  results = Array.from({length: this.opCount}, () => ({status: {code: error.code}}));
}
for (let i = 0; i < results.length; i++) {
  const writeResult = results[i];
  if (writeResult.status.code === Status.OK) {
    const originalIndex = originalIndexMap.get(i)!;
    this.resultsMap.get(originalIndex)!.resolve(writeResult);
  } else if (this.shouldRetry(writeResult.status.code)) {
    indexesToRetry.push(i);
  } else {
    this.resultsMap.get(originalIndexMap.get(i)!)!.reject(writeResult.status);
  }
}
if (indexesToRetry.length === 0) {
  this.completedDeferred.resolve();
}
return indexesToRetry;
WOW, this is some poetic refactoring right here 🔥 🔥 🔥
try {
  await this.backoff.backoffAndWait();
  const results = await this.writeBatch.bulkCommit();
  retryIndexes = this.processResults(originalIndexMap, results);
Can we rewrite this so that processResults doesn't take originalIndexMap? It seems like we could apply the retry indexes in this function. I haven't played with this, but I wonder if it would lead to a simplification.
It's possible to not pass in originalIndexMap to processResults -- the successful results are located at the indexes not in retryIndexes. From there we could resolve/reject the individual promises and reject the rest.
However, the main issue is that the resultsMap resolution logic would have to live in bulkCommit(), which then means that processResults() would become findRetryableIndexes(). From there, it would make more sense to have all the logic in bulkCommit(), which would make the method much longer.
With the new refactor, using originalIndexMap in processResults seems a lot cleaner. WDYT?
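A self-contained sketch of the index bookkeeping being discussed (the helper names and shapes here are invented for illustration, not the actual BulkWriter code): each retry round produces a smaller batch, so a map is rebuilt every round to translate a position in the retried batch back to the index of the original bulkCommit() call.

```typescript
// Hypothetical sketch: re-map positions of each retried batch back to the
// indexes of the original bulkCommit() request.
const RETRYABLE = new Set([10]); // e.g. gRPC ABORTED

// Round 1: five writes, identity mapping; indexes 1 and 3 fail retryably.
let originalIndexMap = new Map<number, number>(
  [0, 1, 2, 3, 4].map(i => [i, i] as [number, number])
);
const round1 = [0, 10, 0, 10, 0]; // status code per position

function processResults(
  map: Map<number, number>,
  codes: number[]
): number[] {
  const retry: number[] = [];
  codes.forEach((code, i) => {
    if (RETRYABLE.has(code)) {
      retry.push(i);
    }
    // else: resolve/reject the promise stored under map.get(i) here
  });
  return retry;
}

const retryIndexes = processResults(originalIndexMap, round1);

// Rebuild the map so position j of the next (smaller) batch points at the
// original index it came from.
originalIndexMap = new Map(
  retryIndexes.map((orig, j) => [j, originalIndexMap.get(orig)!] as [number, number])
);

// Round 2 has two writes; position 1 still maps back to original index 3.
console.log(originalIndexMap.get(1)); // 3
```

This is the "mapping the subsequent batch retries back to the original index" complexity mentioned earlier in the thread, isolated from the rest of the commit logic.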
dev/src/bulk-writer.ts
Outdated
while (retryAttempts < MAX_RETRY_ATTEMPTS) {
  let retryIndexes: number[] = [];
  try {
    await this.backoff.backoffAndWait();
Nit: Move this out of the try/catch block.
done.
  "DEADLINE_EXCEEDED",
  "INTERNAL",
  "UNAVAILABLE"
],
It still seems missing?
dev/src/write-batch.ts
Outdated
 * @param indexes List of operation indexes to keep
 * @private
 */
_sliceIndexes(indexes: number[]): void {
The Array slice() method returns a copy, which I think would be preferable here as well. Can we update this to create a new WriteBatch that is marked `_committed = false`? The reason I am asking is that I would ideally not expose a method that mucks with the internals of a WriteBatch, as the potential for future misuse is high.
Good point. I did not consider the future potential misuse part.
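A minimal illustration of the copy-instead-of-mutate suggestion above, using an invented `MiniBatch` stand-in rather than the real WriteBatch class: the method returns a fresh, uncommitted batch containing only the kept operations, leaving the committed batch untouched.

```typescript
// Hypothetical stand-in for WriteBatch, illustrating the review suggestion:
// like Array.prototype.slice(), return a copy rather than mutating in place.
class MiniBatch {
  constructor(
    readonly ops: ReadonlyArray<string>,
    readonly committed: boolean
  ) {}

  // Returns a NEW batch holding only the operations at `indexes`, marked
  // uncommitted so it can be sent again; `this` is left unchanged.
  retainIndexes(indexes: number[]): MiniBatch {
    const kept = indexes.map(i => this.ops[i]);
    return new MiniBatch(kept, /* committed= */ false);
  }
}

const batch = new MiniBatch(['set:a', 'set:b', 'set:c'], true);
const retryBatch = batch.retainIndexes([0, 2]);

console.log(retryBatch.ops);       // ['set:a', 'set:c']
console.log(retryBatch.committed); // false
console.log(batch.ops.length);     // 3 -- original batch is untouched
```

The design point is the one made in the review: no method on the public surface ever flips a committed batch's internal state, so there is nothing to misuse later.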
dev/test/bulk-writer.ts
Outdated
it('retries retryable batchWrite failures', async () => {
  setTimeoutHandler(setImmediate);
  let retryCount = 0;
s/retry/attempt
done.
dev/test/bulk-writer.ts
Outdated
  });
});

it('retries retryable batchWrite failures', async () => {
Is it possible to make these two tests more distinguishable?
Something like:
retries batchWrite when single operation fails vs retries batchWrite when entire RPC fails.
Furthermore, should we test what happens when we have both non-retryable and retryable writes?
Changed names. Also added a non-retryable error to the test for ABORTED retries.
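A sketch of the mixed-failure scenario the review asks for, reduced to its classification logic (the status-code constants are real gRPC codes; the rest is illustrative, not the actual test code): one response containing an OK, a retryable, and a non-retryable status, partitioned into the three outcomes.

```typescript
// Hypothetical sketch of a mixed batchWrite response: one OK write, one
// retryable failure (ABORTED), one non-retryable failure
// (FAILED_PRECONDITION), partitioned into resolve / retry / reject.
const OK = 0;
const FAILED_PRECONDITION = 9; // gRPC code, non-retryable
const ABORTED = 10;            // gRPC code, retryable per this PR
const retryableCodes = new Set([ABORTED]);

const response = [OK, ABORTED, FAILED_PRECONDITION];

const resolved: number[] = [];
const retried: number[] = [];
const rejected: number[] = [];
response.forEach((code, i) => {
  if (code === OK) {
    resolved.push(i);            // resolve this write's promise now
  } else if (retryableCodes.has(code)) {
    retried.push(i);             // goes into the next retry batch
  } else {
    rejected.push(i);            // surface the error to the caller at once
  }
});

console.log({resolved, retried, rejected});
```

A test built on this shape can then assert that the non-retryable write rejects immediately while the ABORTED write is retried until it succeeds.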
thebrianchen
left a comment
Still waiting on the google3 changes to be approved.
Moved to #1243 in order to merge against master.
With throttling enabled and the backend in its current state (no retries on the backend), this implementation successfully gets all ABORTED writes to eventually succeed.