-
Notifications
You must be signed in to change notification settings - Fork 75
chore: fix some potential threading issues in BulkWriter #529
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
schmidt-sebastian
commented
Feb 6, 2021
- lastOperation is read from multiple threads
- the rate limiter is not thread-safe
-lastOperation is read from multiple threads - the rate limiter is not thread-safe
Codecov Report
@@ Coverage Diff @@
## master #529 +/- ##
============================================
+ Coverage 73.54% 73.90% +0.36%
- Complexity 1058 1098 +40
============================================
Files 67 67
Lines 5722 5837 +115
Branches 686 714 +28
============================================
+ Hits 4208 4314 +106
Misses 1299 1299
- Partials 215 224 +9 Continue to review full report at Codecov.
|
|
Hm, this is now super flaky. |
|
|
||
| @Test | ||
| public void retriesMaintainCorrectWriteResolutionOrdering() throws Exception { | ||
| public void retryResolvesBeforeFlush() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I will be able to deflake the original test as the order of whether doc1, doc2, doc1 or doc1, doc1, doc2 is enqueued is no longer deterministic. I don't think that's a problem though, but it breaks this test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This behavior is still reasonable though, so it should be fine.
|
|
||
| @Test | ||
| public void retriesMaintainCorrectWriteResolutionOrdering() throws Exception { | ||
| public void retryResolvesBeforeFlush() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This behavior is still reasonable though, so it should be fine.
| * time a new write is enqueued. | ||
| */ | ||
| private ApiFuture<Void> lastOperation = ApiFutures.immediateFuture(null); | ||
| private volatile ApiFuture<Void> lastOperation = ApiFutures.immediateFuture(null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woah, did not know about volatile in Java.
If lastOperation is only accessed inside the synchronized call, why do we still need the volatile label? My stackoverflow hunting initially led me to think of volatile variables as a simpler/weaker form of synchronization than synchronized, which makes volatile redundant (link, another link). However, after reading another conflicting post that argues volatile is required for reading the variable, I'm now uncertain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I realize we're reading lastOperation in flush(), which is out of the synchronized block, which then means that we need to use volatile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drive by comment... Although I can't find the documentation to back this up, I'm 99% sure that you basically never need to declare a variable as volatile in Java. In my entire career I have never needed to use volatile, even in heavily multi-threaded code. I think that the only cases where you would need to use volatile would be (a) if you're implementing some sort of custom lock-free synchronization on the variable (e.g. busy waiting) or (b) if the variable's value could be changed asynchronously from native code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
volatile ensures that the value that is read is never read from the CPU cache. Without this, flush() is not guaranteed to see the latest state of lastOperation if lastOperation was modified by another thread.
cc @wilhuff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This volatile is an indicator of broader problems within this implementation.
verifyNotClosedis called unlocked and accessesclosedwithout lockingwritesEnqueuedis written to byexecuteWritealso without locking
In Java, volatile is expensive relative to the underlying atomic operations that CPUs can perform--it's like a half lock. If you made all of these values volatile it would actually be more expensive than just synchronizing. In Java synchronization is very cheap in the uncontended case, and this doesn't seem to be a class that should have to optimize for that.
My advice:
- don't use volatile; just use synchronization consistently
- acquire the lock nearer the API boundary so you're not acquiring it multiple times per private method used to implement the API call
- name your internal methods to show you assume the lock is held (e.g.
verifyNotClosedLocked) - avoid trying to be fancy with multithreading at all
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is to say, at a high level, the way to fix the problem with flush accessing lastOperation outside a synchronized block is to add a synchronized block. You should also fix the other unsynchonized accesses by adding synchronized blocks. Once you do, you'll notice you're synchronizing multiple times per API invocation, in which case you can hoist the synchronized blocks.
Also, see go/atomic-danger internally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the answer, very helpful. I will look at this a bit more tomorrow (and will likely add more synchronization).
To "defend" the current state of the PR: Both this.closed and this.writesEnqueued is only accessed from the user thread. bulkCommitBatch, lastOperation and rateLimiter are accessed both from the user thread and as a result of a BatchCommit response, in which case they are accessed from the BulkWriter thread.
Time to read go/atomic-danger
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated this PR to use synchronization. PTAL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow my understanding of Java volatile was very mis-guided. I just read https://developer.android.com/reference/java/util/concurrent/atomic/package-summary and the documents to which it links.
I learned that the following code could loop forever if this.done is not declared volatile:
while (!this.done) {
Thread.sleep(100);
}The compiler could choose to read this.done exactly once and re-use that cached value forever, even if another thread comes in and sets this.done to true. Declaring this.done as volatile would preclude this optimization and the thread would eventually see this.done set to true by another thread. I did not know this and have definitely coded some buggy busy waits in the past!
thebrianchen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see how moving the synchronized calls up and using the locked method naming makes the code easier to understand, thanks!
Also, why were the @GuardedBy annotations removed? I think having them makes it much easier to quickly see which variables are accessed under lock.
For example, I'm looking at some Android source code, and they use the Locked method naming convention along with @GuardedBy annotations for the instance variables that are accessed under lock.
| private final ScheduledExecutorService bulkWriterExecutor; | ||
|
|
||
| /** | ||
| * Lock object for all mutable state in bulk writer. Bulk Writer state is accessed from the user |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/Bulk Writer/BulkWriter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| * Used to track when writes are enqueued. The user handler executors cannot be changed after a | ||
| * write has been enqueued. | ||
| */ | ||
| private boolean writesEnqueued = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since all accesses to writes enqueued are under lock, should we add a @GuardedBy annotation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed them since this now applies to all state. Based on the Android link, I added them back. There is also some tooling that catches violations when GuardedBy object is accessed under lock (i.e. in Android Studio's build), so it is likely always better to add this annotations.