datasketches extension updated to use the latest sketches-core-0.12.0 #6381
Conversation
sketches-core-0.12.0 uses memory-0.12.0, which forced most of the changes. In particular, the new memory library is sensitive to the byte order of the ByteBuffer received from Druid, which is not always set correctly. MemoryWrapper was introduced to force little-endian byte order.
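To see why the byte order matters at all, here is a small self-contained demonstration using only java.nio (the class and method names are illustrative, not code from this PR): the same four bytes decode to different integers depending on the order set on the ByteBuffer, and a freshly wrapped ByteBuffer defaults to big-endian.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteOrderDemo {
  // Reads the first 4 bytes as an int under the given byte order.
  public static int readInt(byte[] bytes, ByteOrder order) {
    ByteBuffer buf = ByteBuffer.wrap(bytes); // default order is BIG_ENDIAN
    buf.order(order);
    return buf.getInt(0);
  }

  public static void main(String[] args) {
    byte[] bytes = {1, 0, 0, 0};
    // Same bytes, two interpretations:
    System.out.println(readInt(bytes, ByteOrder.BIG_ENDIAN));    // 16777216
    System.out.println(readInt(bytes, ByteOrder.LITTLE_ENDIAN)); // 1
  }
}
```

If sketches are serialized little-endian but the ByteBuffer handed over by the caller is left at the big-endian default, every multi-byte field is misread, which is why the order has to be forced somewhere.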
The SynchronizedUnion wrapper in the Theta module was removed since it was not compatible with the new Union API. Synchronization was added to the aggregate() and get() methods in SketchAggregator instead.
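For illustration, a minimal sketch of that locking scheme (the class name and the long accumulator are hypothetical stand-ins; the real SketchAggregator holds a theta Union): the point is simply that aggregate() and get() share one exclusive lock, so a concurrent read cannot observe the union mid-update.

```java
// Hypothetical, simplified stand-in for SketchAggregator. A long
// accumulator replaces the theta Union so the example is self-contained;
// only the locking pattern is the subject here.
public class SyncAggregatorDemo {
  private long state; // stand-in for the Union

  public synchronized void aggregate(long value) {
    state += value; // cannot interleave with a concurrent get()
  }

  public synchronized long get() {
    return state;
  }
}
```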
I wonder why Memory.wrap(..) and WritableMemory.wrap(..) don't handle this conversion themselves instead of forcing all callers of those methods to do it?
Also, this is kind of error-prone in case we miss calling wrap via LittleEndianWrapper and call Memory.wrap(..) directly in some place.
I see that some direct calls to Memory.wrap() still remain, like in ArrayOfDoublesSketch.deserializeFromByteArray. Do all call sites need to go through LittleEndianWrapper?
Also, would it be simpler to use public static Memory wrap(final ByteBuffer byteBuf, final ByteOrder byteOrder) here?
This is about wrapping a ByteBuffer. The new version of the memory library handles the byte order of the incoming ByteBuffer correctly (before, it was simplistic and forgiving). But Druid sometimes does not set the byte order correctly, so now it is necessary to force the byte order of sketches to be little-endian. And it must be done externally to the memory library, because the problem is really Druid not setting it correctly.
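A minimal sketch of what such external forcing can look like, using only java.nio. The helper name mirrors the wrapper discussed here but this is illustrative code, not the PR's actual implementation; duplicate() is used so the caller's buffer keeps its own order setting.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper mirroring the role of the PR's wrapper: force the
// order on the incoming buffer before handing it to the memory library.
public final class LittleEndianWrapperDemo {
  private LittleEndianWrapperDemo() {}

  public static ByteBuffer forceLittleEndian(ByteBuffer buf) {
    // duplicate() shares the bytes but has independent order/position,
    // so the caller's buffer is left untouched.
    return buf.duplicate().order(ByteOrder.LITTLE_ENDIAN);
  }
}
```

Centralizing the call in one helper is also what makes the "error-prone if we miss a call site" concern above concrete: any code path that wraps a ByteBuffer directly bypasses the forced order.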
would it be simpler to use public static Memory wrap(final ByteBuffer byteBuf, final ByteOrder byteOrder) here?
This is a good point. I will think about it.
I see some direct calls to Memory.wrap() still remain
They still remain where the input is not a ByteBuffer.
So, are we saying that sketch objects are "always" serialized in little endian, and when Druid loads those bytes into a ByteBuffer the order is big endian (the JVM default), which needs to be corrected since the memory library doesn't know that it is getting a sketch object and otherwise wouldn't be able to handle the contents correctly?
Please add comments to these methods on why the wrapping is necessary and why it's only needed for ByteBuffer and not byte[].
We call union.update(..) in the updateUnion(..) method, defined later in the same file, which is prone to failure without SynchronizedUnion.
Also, I wouldn't want to put synchronization on that method itself, as it gets called in other situations where synchronization isn't necessary.
So it probably still makes sense to keep the SynchronizedUnion class around.
I don't see how synchronizing on Union objects would be avoided in those other situations where synchronization is not necessary.
Could you clarify what is error-prone? We obtain exclusive locks in aggregate() and get() in case they are called at the same time.
It probably still makes sense to keep SynchronizedUnion class around
Given the new API, it is impossible to make SynchronizedUnion a subclass of Union as it was before. But making it just a wrapper breaks a lot of other code, so it was much easier to get rid of it altogether.
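To illustrate the distinction (all names below are stand-ins, not the library's actual classes): when a class is final, or instances come only from a builder, a "synchronized" variant can only delegate, and a delegating wrapper is not type-compatible with code that expects the original class.

```java
// Stand-in for the library class: final, so it cannot be extended.
final class UnionDemo {
  private long count;
  void update(long v) { count += v; }
  long getCount() { return count; }
}

// A wrapper can serialize access, but it is NOT a UnionDemo subtype,
// so every call site typed against UnionDemo would have to change.
class SynchronizedUnionDemo {
  private final UnionDemo delegate = new UnionDemo();
  synchronized void update(long v) { delegate.update(v); }
  synchronized long getCount() { return delegate.getCount(); }
}
```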
You're right. Now I notice that updateUnion(..) is also called from a synchronized block. Never mind.
Can you please explain this change? Given that everything is backwards compatible, why did this have to change?
This was a mistake in the test. The number of values in the union and in the incoming sketches must match, which was not enforced before. Correct code should not be affected.
@AlexanderSaydakov yes, SynchronizedUnion is not needed in buffer aggregators, which are never used concurrently. Aggregator, on the other hand, gets used at "realtime" processes concurrently to index as well as query data.
That is great, if so. But I wonder why in previous reviews of other modules (quantiles in particular) I was advised to use that striped locking in buffer aggregators. Perhaps we should revise those to get rid of unnecessary complexity.
I see that conflicts appeared. I will take care of them in a couple of days.
Force-pushed from 3ff7999 to bd6067a.
Rebased to resolve conflicts.
@himanshug why do you think that buffer aggregators are never used concurrently? Aren't they in
We might want to remove However: groupBy v1 with the
My long-term intention is the opposite: remove
Does this imply that we still want to have synchronization in buffer aggregators?
@AlexanderSaydakov yes. But with people writing so many different aggregators now, it seems increasingly wrong to make people do this in aggregator code. I think #3956 should become a high-priority issue. I like the approach suggested by @himanshug here: #3956 (comment) with
Everything must be compatible