KAFKA-9156: fix LazyTimeIndex & LazyOffsetIndex concurrency#7760
Conversation
Moved common logic of LazyOffsetIndex and LazyTimeIndex into a common abstract class. Moved file reference modification into a sealed trait to simplify "file" getters and setters.
|
Thanks for the update, looking better. I left a couple more comments. |
|
For some reason, I cannot push to your branch, so I submitted a PR: https://github.com/bookingcom/kafka/pull/1/files |
Probably due to org. settings. LGTM, so I merged it, thank you! Also, added a few missing type annotations to |
ijuma
left a comment
There was a problem hiding this comment.
LGTM, just a minor documentation suggestion below. It would be good to get a second opinion from @junrao or @hachikuji before we merge this to 2.3, 2.4 and trunk.
|
@alexandrfox In the PR description, it's worth clarifying that mutation operations are done under a lock in the caller classes. I think the gap is that |
Makes sense, I think LogCleaner also may call |
junrao
left a comment
There was a problem hiding this comment.
@alexandrfox : Thanks for the PR. LGTM. Just a minor comment below.
| def file_=(f: File) | ||
| } | ||
|
|
||
| private class IndexFile(@volatile var file: File) extends IndexWrapper |
There was a problem hiding this comment.
Does file need to be var? It seems that it's never updated.
There was a problem hiding this comment.
hey @junrao, yes, this reference is updated when LazyIndex.file_= setter is invoked before LazyIndex.get
There was a problem hiding this comment.
@alexandrfox : Thanks. That makes sense. So, we can keep the code as it is.
hachikuji
left a comment
There was a problem hiding this comment.
LGTM. Nice fix and thanks for the additional cleanup!
|
@ijuma looks like we have approvals, thanks to @junrao and @hachikuji. And thank you for taking your time guiding me through my first PR! Also, thanks to @lushilin for pin-pointing the problem! |
|
One job passed, two jobs failed with known flaky tests:
|
Race condition in concurrent `get` method invocation of lazy indexes might lead to multiple `OffsetIndex`/`TimeIndex` objects being concurrently created. When that happens position of `MappedByteBuffer` in `AbstractIndex` is advanced to the last entry which in turn leads to a critical `BufferOverflowException` being thrown whenever leader or replica tries to append a record to the segment. Moreover, `file_=` setter is seemingly also vulnerable to the race, since multiple threads can attempt to set a new file reference as well as create new Time/OffsetIndex objects at the same time. This might lead to the discrepant File references being held by e.g. LazyTimeIndex and its TimeIndex counterpart. This patch attempts to fix the issue by making sure that index objects are atomically constructed when loaded lazily via `get` method. Additionally, `file` reference modifications are also made atomic and thread safe. Note that the `Lazy*Index` mutation operations are executed with a lock held by the callers, but `get` can be called without a lock (e.g. from `Log.read`). Reviewers: Jun Rao <junrao@gmail.com>, Jason Gustafson <jason@confluent.io>, Shilin Lu, Ismael Juma <ismael@juma.me.uk>
Race condition in concurrent `get` method invocation of lazy indexes might lead to multiple `OffsetIndex`/`TimeIndex` objects being concurrently created. When that happens position of `MappedByteBuffer` in `AbstractIndex` is advanced to the last entry which in turn leads to a critical `BufferOverflowException` being thrown whenever leader or replica tries to append a record to the segment. Moreover, `file_=` setter is seemingly also vulnerable to the race, since multiple threads can attempt to set a new file reference as well as create new Time/OffsetIndex objects at the same time. This might lead to the discrepant File references being held by e.g. LazyTimeIndex and its TimeIndex counterpart. This patch attempts to fix the issue by making sure that index objects are atomically constructed when loaded lazily via `get` method. Additionally, `file` reference modifications are also made atomic and thread safe. Note that the `Lazy*Index` mutation operations are executed with a lock held by the callers, but `get` can be called without a lock (e.g. from `Log.read`). Reviewers: Jun Rao <junrao@gmail.com>, Jason Gustafson <jason@confluent.io>, Shilin Lu, Ismael Juma <ismael@juma.me.uk>
|
Merged to trunk, 2.4 and 2.3. |
Fix of https://issues.apache.org/jira/browse/KAFKA-9156
Race condition in concurrent
getmethod invocation of lazy indexes might lead to multipleOffsetIndex/TimeIndexobjects being concurrently created. When that happens position ofMappedByteBufferinAbstractIndexis advanced to the last entry which in turn leads to a criticalBufferOverflowExceptionbeing thrown whenever leader or replica tries to append a record to the segment.Moreover,
file_=setter is seemingly also vulnerable to the race, since multiple threads can attempt to set a new file reference as well as create new Time/OffsetIndex objects at the same time. This might lead to the discrepant File references being held by e.g. LazyTimeIndex and its TimeIndex counterpart.This patch attempts to fix the issue by making sure that index objects are atomically constructed when loaded lazily via
getmethod. Additionally,filereference modifications are also made atomic and thread safe.Committer Checklist (excluded from commit message)