
datsketches extension updated to use the latest sketches-core-0.12.0#6381

Merged
himanshug merged 1 commit into apache:master from AlexanderSaydakov:datasketches_0_12_0
Oct 23, 2018

Conversation

@AlexanderSaydakov
Contributor

Everything must be compatible

@AlexanderSaydakov changed the title from "updated to use the latest sketches-core-0.12.0" to "datsketches extension updated to use the latest sketches-core-0.12.0" on Sep 26, 2018
@AlexanderSaydakov
Contributor Author

sketches-core-0.12.0 uses memory-0.12.0, which forced most of the changes. In particular, the new memory library is sensitive to the byte order of the ByteBuffer received from Druid, which is not always set correctly. MemoryWrapper was introduced to force little-endian byte order.
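The byte-order pitfall described above can be illustrated with plain java.nio (an illustrative sketch only; the actual change uses the DataSketches memory library, which is not shown here):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative only: shows why a ByteBuffer's byte order matters when
// reading back little-endian serialized data (as sketches use).
public class ByteOrderDemo {
    public static void main(String[] args) {
        // Serialize a value in little-endian order, as a sketch would be.
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES).order(ByteOrder.LITTLE_ENDIAN);
        buf.putLong(0, 42L);

        // A reader that leaves the buffer at the JVM default (big-endian)
        // sees garbage: the little-endian bytes decode to a huge number.
        ByteBuffer wrong = buf.duplicate().order(ByteOrder.BIG_ENDIAN);
        System.out.println(wrong.getLong(0));   // not 42

        // Forcing little-endian before reading recovers the value.
        ByteBuffer right = buf.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        System.out.println(right.getLong(0));   // 42
    }
}
```

This is why the wrapping has to set the order explicitly rather than trust whatever order the incoming buffer carries.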

@AlexanderSaydakov
Contributor Author

The SynchronizedUnion wrapper in the Theta module was removed since it was not compatible with the new Union API. Synchronization was added to the aggregate() and get() methods in SketchAggregator instead.
Also, I noticed that there is no synchronization in SketchBufferAggregator, unlike in other modules. Is this an oversight, or is the synchronization perhaps not needed in the other modules either?
@will-lauer, do you happen to know?
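The pattern just described can be sketched roughly as follows (a hypothetical stand-in, not the actual SketchAggregator or the theta Union from sketches-core): synchronizing aggregate() and get() on the shared union object provides the same mutual exclusion the SynchronizedUnion wrapper used to.

```java
// Minimal, hypothetical stand-in for the synchronization pattern:
// instead of a SynchronizedUnion wrapper, the aggregator itself
// locks the shared union object in aggregate() and get().
public class SyncAggregatorSketch {

    // Stand-in for the real theta Union, which is not thread-safe.
    static class ThetaUnionStub {
        private long count;
        void update(long value) { count += value; }
        long getResult() { return count; }
    }

    private final ThetaUnionStub union = new ThetaUnionStub();

    public void aggregate(long value) {
        synchronized (union) {   // exclusive lock, as described in the PR
            union.update(value);
        }
    }

    public long get() {
        synchronized (union) {   // readers take the same lock
            return union.getResult();
        }
    }
}
```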

@gianm gianm requested a review from jon-wei September 28, 2018 22:33
Contributor

I wonder why Memory.wrap(..) and WritableMemory.wrap(..) don't handle this conversion themselves instead of forcing it on every caller of those methods.
Also, this is somewhat error-prone: we could miss calling wrap via LittleEndianWrapper and call Memory.wrap(..) directly in some place.

Contributor

I see some direct calls to Memory.wrap() still remain, like in ArrayOfDoublesSketch.deserializeFromByteArray. Do all call sites need to go through the LittleEndianWrapper?

Contributor (@jon-wei, Oct 11, 2018)

Also, would it be simpler to use public static Memory wrap(final ByteBuffer byteBuf, final ByteOrder byteOrder) here?

Contributor Author

This is about wrapping a ByteBuffer. The new version of the memory library handles the byte order of the incoming ByteBuffer correctly (before, it was simplistic and forgiving). But Druid sometimes does not set the byte order correctly, so it is now necessary to force the byte order of sketches to little-endian. And it must be done externally to the memory library, because the root cause is Druid not setting it correctly.

Contributor Author

would it be simpler to use public static Memory wrap(final ByteBuffer byteBuf, final ByteOrder byteOrder) here?

This is a good point. I will think about it.

Contributor Author

I see some direct calls to Memory.wrap() still remain

They still remain where the input is not a ByteBuffer.

Contributor (@himanshug, Oct 12, 2018)

So, are we saying that sketch objects are always serialized in little-endian order, and that when Druid loads those bytes into a ByteBuffer the order is big-endian (the JVM default) and needs to be corrected, since the memory library doesn't know it is getting a sketch object and wouldn't be able to handle the contents correctly otherwise?

Contributor

Please add comments to these methods on why the wrapping is necessary and why it's only needed for ByteBuffer and not byte[].

Contributor

We call union.update(..) in the updateUnion(..) method defined later in the same file, which is prone to failure without SynchronizedUnion.
Also, I wouldn't want to put synchronization on that method itself, as it gets called in other situations where synchronization isn't necessary.
So it probably still makes sense to keep the SynchronizedUnion class around.

Contributor Author

I don't see how synchronizing on Union objects could be avoided in those other situations where synchronization is not necessary.
Could you clarify what is error-prone? We obtain exclusive locks in aggregate() and get() in case they are called at the same time.

Contributor Author

It probably still makes sense to keep SynchronizedUnion class around

Given the new API, it is impossible to make SynchronizedUnion a subclass of Union as it was before. Making it just a wrapper breaks a lot of other code, so it was much easier to get rid of it altogether.

Contributor

You're right. Now I notice that updateUnion(..) is also called from a synchronized block. Never mind.

Contributor

Can you please explain this change? Given that everything is backward compatible, why did this have to change?

Contributor Author

This was a mistake in the test. The number of values in the union and incoming sketches must match, which was not enforced before. Correct code should not be affected.

Contributor

ok

@himanshug
Contributor

@AlexanderSaydakov yes, SynchronizedUnion is not needed in buffer aggregators, which are never used concurrently. Aggregator, on the other hand, is used in "realtime" processes concurrently to index as well as query data.

@AlexanderSaydakov
Contributor Author

SynchronizedUnion is not needed in buffer aggregators which are never used concurrently

That is great, if so. But then I wonder why, in previous reviews of other modules (quantiles in particular), I was advised to use striped locking in buffer aggregators. Perhaps we should revise those to get rid of the unnecessary complexity.

@AlexanderSaydakov
Contributor Author

I see that conflicts appeared. I will take care of them in a couple of days.

@AlexanderSaydakov
Contributor Author

Rebased to resolve conflicts.

@himanshug himanshug merged commit ec9d182 into apache:master Oct 23, 2018
@jon-wei jon-wei removed their assignment Nov 5, 2018
@leventov
Member

leventov commented Jan 15, 2019

@himanshug why do you think that buffer aggregators are never used concurrently? Aren't they in OffheapIncrementalIndex? See #3956 that changes OffheapIncrementalIndex.

@gianm
Contributor

gianm commented Jan 15, 2019

We might want to remove OffheapIncrementalIndex. It's only used in one place: groupBy v1, if the useOffheap context parameter is set. groupBy v1 has been deprecated for a while, and imo the only reason it's still around is something related to #6743 -- buffer aggregators can't resize themselves currently, so groupBy v2, which uses offheap aggregations, allocates more space than necessary for ones that could grow in theory. groupBy v1 by default uses onheap aggregations and doesn't have this problem, so it can still be useful if your workload is mainly composed of groupBys with very growable sketch objects, and you're memory limited. (It has a bunch of other problems, though. This is really the only good thing it does relative to v2.)

However: groupBy v1 with the useOffheap parameter is, as far as I know, not useful anymore. groupBy v2 should be better in every way. So I think that does make a case for removing OffheapIncrementalIndex.

@leventov
Member

My long term intention is the opposite: remove OnheapIncrementalIndex, along with on-heap Aggregators. Leave only BufferAggregators. It's mentioned here: #5335 (comment), the last paragraph. The umbrella issue is #4622.

@AlexanderSaydakov
Contributor Author

Does this imply that we still want to have synchronization in buffer aggregators?

@leventov
Member

@AlexanderSaydakov yes. But with people writing so many different aggregators now, it seems increasingly wrong to make them handle this in aggregator code. I think #3956 should become a high-priority issue. I like the approach suggested by @himanshug here: #3956 (comment), with a boolean isThreadSafe() method on aggregators.
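The isThreadSafe() idea could look roughly like this (a hypothetical sketch of the suggested API, not actual Druid code; SimpleAggregator and Aggregators.threadSafe are invented names): aggregators declare whether they are thread-safe, and the framework wraps the unsafe ones instead of each aggregator hand-rolling its own locking.

```java
// Hypothetical sketch of the isThreadSafe() approach from #3956.
interface SimpleAggregator {
    void aggregate(long value);
    long get();
    default boolean isThreadSafe() { return false; }
}

class Aggregators {
    // Framework-side helper: wrap in a synchronized adapter only when
    // the aggregator declares that it is not thread-safe.
    static SimpleAggregator threadSafe(SimpleAggregator delegate) {
        if (delegate.isThreadSafe()) {
            return delegate;
        }
        return new SimpleAggregator() {
            @Override public synchronized void aggregate(long v) { delegate.aggregate(v); }
            @Override public synchronized long get() { return delegate.get(); }
            @Override public boolean isThreadSafe() { return true; }
        };
    }
}
```

This keeps the locking decision in one place, so individual aggregator authors no longer need to reason about concurrent use.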
