chore: fix some potential threading issues in BulkWriter #529

schmidt-sebastian · 2021-02-06T00:31:56Z

lastOperation is read from multiple threads
the rate limiter is not thread-safe

-lastOperation is read from multiple threads - the rate limiter is not thread-safe

codecov · 2021-02-06T00:33:13Z

Codecov Report

Merging #529 (4473827) into master (3ed79ae) will increase coverage by 0.36%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##             master     #529      +/-   ##
============================================
+ Coverage     73.54%   73.90%   +0.36%     
- Complexity     1058     1098      +40     
============================================
  Files            67       67              
  Lines          5722     5837     +115     
  Branches        686      714      +28     
============================================
+ Hits           4208     4314     +106     
  Misses         1299     1299              
- Partials        215      224       +9

Impacted Files	Coverage Δ	Complexity Δ
...in/java/com/google/cloud/firestore/BulkWriter.java	`100.00% <100.00%> (ø)`	`45.00 <17.00> (+3.00)`
...om/google/cloud/firestore/BulkWriterOperation.java	`100.00% <100.00%> (ø)`	`5.00 <0.00> (-2.00)`
...oogle/cloud/firestore/v1/FirestoreAdminClient.java	`57.46% <0.00%> (-1.89%)`	`38.00% <0.00%> (+8.00%)`	⬇️
.../com/google/cloud/firestore/CustomClassMapper.java	`76.84% <0.00%> (-0.06%)`	`111.00% <0.00%> (ø%)`
...va/com/google/cloud/firestore/BulkCommitBatch.java	`100.00% <0.00%> (ø)`	`7.00% <0.00%> (ø%)`
.../com/google/cloud/firestore/BulkWriterOptions.java	`100.00% <0.00%> (ø)`	`2.00% <0.00%> (ø%)`
...oogle/cloud/firestore/spi/v1/GrpcFirestoreRpc.java	`0.00% <0.00%> (ø)`	`0.00% <0.00%> (ø%)`
...restore/collection/ImmutableSortedMapIterator.java	`0.00% <0.00%> (ø)`	`0.00% <0.00%> (ø%)`
...rc/main/java/com/google/cloud/firestore/Watch.java	`91.30% <0.00%> (+0.03%)`	`56.00% <0.00%> (ø%)`
.../com/google/cloud/firestore/UserDataConverter.java	`95.37% <0.00%> (+0.04%)`	`37.00% <0.00%> (ø%)`
... and 12 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3ed79ae...b9a2f56. Read the comment docs.

schmidt-sebastian · 2021-02-06T00:40:57Z

Hm, this is now super flaky.

schmidt-sebastian · 2021-02-06T02:43:10Z

google-cloud-firestore/src/test/java/com/google/cloud/firestore/BulkWriterTest.java


  @Test
-  public void retriesMaintainCorrectWriteResolutionOrdering() throws Exception {
+  public void retryResolvesBeforeFlush() throws Exception {


I don't think I will be able to deflake the original test as the order of whether doc1, doc2, doc1 or doc1, doc1, doc2 is enqueued is no longer deterministic. I don't think that's a problem though, but it breaks this test.

This behavior is still reasonable though, so it should be fine.

thebrianchen · 2021-02-09T18:17:20Z

google-cloud-firestore/src/test/java/com/google/cloud/firestore/BulkWriterTest.java


  @Test
-  public void retriesMaintainCorrectWriteResolutionOrdering() throws Exception {
+  public void retryResolvesBeforeFlush() throws Exception {


This behavior is still reasonable though, so it should be fine.

thebrianchen · 2021-02-09T18:17:22Z

google-cloud-firestore/src/main/java/com/google/cloud/firestore/BulkWriter.java

   * time a new write is enqueued.
   */
-  private ApiFuture<Void> lastOperation = ApiFutures.immediateFuture(null);
+  private volatile ApiFuture<Void> lastOperation = ApiFutures.immediateFuture(null);


Woah, did not know about volatile in Java.

If lastOperation is only accessed inside the synchronized call, why do we still need the volatile label? My stackoverflow hunting initially led me to think of volatile variables as a simpler/weaker form of synchronization than synchronized, which makes volatile redundant (link, another link). However, after reading another conflicting post that argues volatile is required for reading the variable, I'm now uncertain.

Actually, I realize we're reading lastOperation in flush(), which is out of the synchronized block, which then means that we need to use volatile.

Drive by comment... Although I can't find the documentation to back this up, I'm 99% sure that you basically never need to declare a variable as volatile in Java. In my entire career I have never needed to use volatile, even in heavily multi-threaded code. I think that the only cases where you would need to use volatile would be (a) if you're implementing some sort of custom lock-free synchronization on the variable (e.g. busy waiting) or (b) if the variable's value could be changed asynchronously from native code.

volatile ensures that the value that is read is never read from the CPU cache. Without this, flush() is not guaranteed to see the latest state of lastOperation if lastOperation was modified by another thread.

cc @wilhuff

This volatile is an indicator of broader problems within this implementation.

verifyNotClosed is called unlocked and accesses closed without locking

writesEnqueued is written to by executeWrite also without locking

In Java, volatile is expensive relative to the underlying atomic operations that CPUs can perform--it's like a half lock. If you made all of these values volatile it would actually be more expensive than just synchronizing. In Java synchronization is very cheap in the uncontended case, and this doesn't seem to be a class that should have to optimize for that.

My advice:

don't use volatile; just use synchronization consistently

acquire the lock nearer the API boundary so you're not acquiring it multiple times per private method used to implement the API call

name your internal methods to show you assume the lock is held (e.g. verifyNotClosedLocked)

avoid trying to be fancy with multithreading at all

That is to say, at a high level, the way to fix the problem with flush accessing lastOperation outside a synchronized block is to add a synchronized block. You should also fix the other unsynchonized accesses by adding synchronized blocks. Once you do, you'll notice you're synchronizing multiple times per API invocation, in which case you can hoist the synchronized blocks.

Also, see go/atomic-danger internally.

Thanks for the answer, very helpful. I will look at this a bit more tomorrow (and will likely add more synchronization).

To "defend" the current state of the PR: Both this.closed and this.writesEnqueued is only accessed from the user thread. bulkCommitBatch, lastOperation and rateLimiter are accessed both from the user thread and as a result of a BatchCommit response, in which case they are accessed from the BulkWriter thread.

Time to read go/atomic-danger

I updated this PR to use synchronization. PTAL.

Wow my understanding of Java volatile was very mis-guided. I just read https://developer.android.com/reference/java/util/concurrent/atomic/package-summary and the documents to which it links.

I learned that the following code could loop forever if this.done is not declared volatile:

while (!this.done) { Thread.sleep(100); }

The compiler could choose to read this.done exactly once and re-use that cached value forever, even if another thread comes in and sets this.done to true. Declaring this.done as volatile would preclude this optimization and the thread would eventually see this.done set to true by another thread. I did not know this and have definitely coded some buggy busy waits in the past!

thebrianchen

I see how moving the synchronized calls up and using the locked method naming makes the code easier to understand, thanks!

Also, why were the @GuardedBy annotations removed? I think having them makes it much easier to quickly see which variables are accessed under lock.

For example, I'm looking at some Android source code, and they use the Locked method naming convention along with @GuardedBy annotations for the instance variables that are accessed under lock.

thebrianchen · 2021-02-10T17:37:18Z

google-cloud-firestore/src/main/java/com/google/cloud/firestore/BulkWriter.java

+  private final ScheduledExecutorService bulkWriterExecutor;
+
+  /**
+   * Lock object for all mutable state in bulk writer. Bulk Writer state is accessed from the user


s/Bulk Writer/BulkWriter

thebrianchen · 2021-02-10T19:30:59Z

google-cloud-firestore/src/main/java/com/google/cloud/firestore/BulkWriter.java

   * Used to track when writes are enqueued. The user handler executors cannot be changed after a
   * write has been enqueued.
   */
  private boolean writesEnqueued = false;


Since all accesses to writes enqueued are under lock, should we add a @GuardedBy annotation?

I removed them since this now applies to all state. Based on the Android link, I added them back. There is also some tooling that catches violations when GuardedBy object is accessed under lock (i.e. in Android Studio's build), so it is likely always better to add this annotations.

chore: fix some potential threading issues in BulkWriter

a7d142f

-lastOperation is read from multiple threads - the rate limiter is not thread-safe

schmidt-sebastian requested a review from a team as a code owner February 6, 2021 00:31

schmidt-sebastian requested a review from a team February 6, 2021 00:31

google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Feb 6, 2021

product-auto-label bot added the api: firestore Issues related to the googleapis/java-firestore API. label Feb 6, 2021

schmidt-sebastian requested a review from thebrianchen February 6, 2021 00:33

schmidt-sebastian assigned thebrianchen Feb 6, 2021

schmidt-sebastian closed this Feb 6, 2021

Rewrite test

e161785

schmidt-sebastian reopened this Feb 6, 2021

Format

467aa40

schmidt-sebastian commented Feb 6, 2021

View reviewed changes

thebrianchen approved these changes Feb 9, 2021

View reviewed changes

thebrianchen assigned schmidt-sebastian and unassigned thebrianchen Feb 9, 2021

schmidt-sebastian added 2 commits February 9, 2021 11:58

Deflake sendBatchesWhenSizeLimitIsReached

3336d25

Use synchronized

83a7c20

schmidt-sebastian assigned thebrianchen and wilhuff and unassigned schmidt-sebastian Feb 10, 2021

thebrianchen approved these changes Feb 10, 2021

View reviewed changes

Add GuardedBy back

8cb825f

thebrianchen removed their assignment Feb 10, 2021

Format

b9a2f56

schmidt-sebastian merged commit 6841cd5 into master Feb 11, 2021

schmidt-sebastian deleted the mrschmidt/fixthreading branch February 11, 2021 20:48

thebrianchen mentioned this pull request Feb 25, 2021

firestore.BulkWriterTest: retriesMaintainCorrectWriteResolutionOrdering failed #530

Closed

chore: fix some potential threading issues in BulkWriter #529

chore: fix some potential threading issues in BulkWriter #529

Uh oh!

Conversation

schmidt-sebastian commented Feb 6, 2021

Uh oh!

codecov bot commented Feb 6, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

schmidt-sebastian commented Feb 6, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

thebrianchen left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov bot commented Feb 6, 2021 •

edited

Loading