
datsketches extension updated to use the latest sketches-core-0.12.0#6381

Merged
himanshug merged 1 commit into apache:master from AlexanderSaydakov:datasketches_0_12_0
Oct 23, 2018

Conversation

@AlexanderSaydakov
Contributor

Everything must be compatible

@AlexanderSaydakov changed the title from "updated to use the latest sketches-core-0.12.0" to "datsketches extension updated to use the latest sketches-core-0.12.0" on Sep 26, 2018
@AlexanderSaydakov
Contributor Author

sketches-core-0.12.0 uses memory-0.12.0, which forced most of the changes. In particular, the new memory library is sensitive to the byte order of the ByteBuffer received from Druid, which is not always set correctly. MemoryWrapper was introduced to force little-endian byte order.
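The byte-order pitfall described above can be illustrated with plain java.nio (an illustrative sketch only; the actual change uses the DataSketches memory library, which is not shown here):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative only: shows why a ByteBuffer's byte order matters when
// reading back little-endian serialized data (as sketches use).
public class ByteOrderDemo {
    public static void main(String[] args) {
        // Serialize a value in little-endian order, as a sketch would be.
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES).order(ByteOrder.LITTLE_ENDIAN);
        buf.putLong(0, 42L);

        // A reader that leaves the buffer at the JVM default (big-endian)
        // sees garbage: the little-endian bytes decode to a huge number.
        ByteBuffer wrong = buf.duplicate().order(ByteOrder.BIG_ENDIAN);
        System.out.println(wrong.getLong(0));   // not 42

        // Forcing little-endian before reading recovers the value.
        ByteBuffer right = buf.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        System.out.println(right.getLong(0));   // 42
    }
}
```

This is why the wrapping has to set the order explicitly rather than trust whatever order the incoming buffer carries.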

@AlexanderSaydakov
Contributor Author

The SynchronizedUnion wrapper in the Theta module was removed since it was not compatible with the new Union API. Synchronization was added to the aggregate() and get() methods in SketchAggregator instead.
Also, I noticed that there is no synchronization in SketchBufferAggregator, unlike in other modules. Is this an oversight, or is the synchronization perhaps not needed in the other modules either?
@will-lauer, do you happen to know?
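The pattern just described can be sketched roughly as follows (a hypothetical stand-in, not the actual SketchAggregator or the theta Union from sketches-core): synchronizing aggregate() and get() on the shared union object provides the same mutual exclusion the SynchronizedUnion wrapper used to.

```java
// Minimal, hypothetical stand-in for the synchronization pattern:
// instead of a SynchronizedUnion wrapper, the aggregator itself
// locks the shared union object in aggregate() and get().
public class SyncAggregatorSketch {

    // Stand-in for the real theta Union, which is not thread-safe.
    static class ThetaUnionStub {
        private long count;
        void update(long value) { count += value; }
        long getResult() { return count; }
    }

    private final ThetaUnionStub union = new ThetaUnionStub();

    public void aggregate(long value) {
        synchronized (union) {   // exclusive lock, as described in the PR
            union.update(value);
        }
    }

    public long get() {
        synchronized (union) {   // readers take the same lock
            return union.getResult();
        }
    }
}
```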

@gianm gianm requested a review from jon-wei September 28, 2018 22:33
Contributor

I wonder why Memory.wrap(..) and WritableMemory.wrap(..) don't handle this conversion themselves instead of forcing it on every caller of those methods.
Also, this is somewhat error-prone: we could miss calling wrap via LittleEndianWrapper and call Memory.wrap(..) directly in some place.

Contributor

I see some direct calls to Memory.wrap() still remain, like in ArrayOfDoublesSketch.deserializeFromByteArray. Do all call sites need to go through the LittleEndianWrapper?

Contributor (@jon-wei, Oct 11, 2018)

Also, would it be simpler to use public static Memory wrap(final ByteBuffer byteBuf, final ByteOrder byteOrder) here?

Contributor Author

This is about wrapping a ByteBuffer. The new version of the memory library handles the byte order of the incoming ByteBuffer correctly (before, it was simplistic and forgiving). But Druid sometimes does not set the byte order correctly, so it is now necessary to force the byte order of sketches to little-endian. And it must be done externally to the memory library, because the root cause is Druid not setting it correctly.

Contributor Author

would it be simpler to use public static Memory wrap(final ByteBuffer byteBuf, final ByteOrder byteOrder) here?

This is a good point. I will think about it.

Contributor Author

I see some direct calls to Memory.wrap() still remain

They still remain where the input is not a ByteBuffer.

Contributor (@himanshug, Oct 12, 2018)

So, are we saying that sketch objects are always serialized in little-endian order, and that when Druid loads those bytes into a ByteBuffer the order is big-endian (the JVM default) and needs to be corrected, since the memory library doesn't know it is getting a sketch object and wouldn't be able to handle the contents correctly otherwise?

Contributor

Please add comments to these methods on why the wrapping is necessary and why it's only needed for ByteBuffer and not byte[].

Contributor

We call union.update(..) in the updateUnion(..) method defined later in the same file, which is prone to failure without SynchronizedUnion.
Also, I wouldn't want to put synchronization on that method itself, as it gets called in other situations where synchronization isn't necessary.
So it probably still makes sense to keep the SynchronizedUnion class around.

Contributor Author

I don't see how synchronizing on Union objects could be avoided in those other situations where synchronization is not necessary.
Could you clarify what is error-prone? We obtain exclusive locks in aggregate() and get() in case they are called at the same time.

Contributor Author

It probably still makes sense to keep SynchronizedUnion class around

Given the new API, it is impossible to make SynchronizedUnion a subclass of Union as it was before. Making it just a wrapper breaks a lot of other code, so it was much easier to get rid of it altogether.

Contributor

You're right. Now I notice that updateUnion(..) is also called from a synchronized block. Never mind.

Contributor

Can you please explain this change? Given that everything is backward compatible, why did this have to change?

Contributor Author

This was a mistake in the test. The number of values in the union and incoming sketches must match, which was not enforced before. Correct code should not be affected.

Contributor

ok

@himanshug
Contributor

@AlexanderSaydakov yes, SynchronizedUnion is not needed in buffer aggregators, which are never used concurrently. Aggregator, on the other hand, is used in "realtime" processes concurrently to index as well as query data.

@AlexanderSaydakov
Contributor Author

SynchronizedUnion is not needed in buffer aggregators which are never used concurrently

That is great, if so. But then I wonder why, in previous reviews of other modules (quantiles in particular), I was advised to use striped locking in buffer aggregators. Perhaps we should revise those to get rid of the unnecessary complexity.

@AlexanderSaydakov
Contributor Author

I see that conflicts appeared. I will take care of them in a couple of days.

@AlexanderSaydakov
Contributor Author

Rebased to resolve conflicts.

@himanshug himanshug merged commit ec9d182 into apache:master Oct 23, 2018
@jon-wei jon-wei removed their assignment Nov 5, 2018
@leventov
Member

leventov commented Jan 15, 2019

@himanshug why do you think that buffer aggregators are never used concurrently? Aren't they in OffheapIncrementalIndex? See #3956 that changes OffheapIncrementalIndex.

@gianm
Contributor

gianm commented Jan 15, 2019

We might want to remove OffheapIncrementalIndex. It's only used in one place: groupBy v1, if the useOffheap context parameter is set. groupBy v1 has been deprecated for a while, and imo the only reason it's still around is something related to #6743 -- buffer aggregators can't resize themselves currently, so groupBy v2, which uses offheap aggregations, allocates more space than necessary for ones that could grow in theory. groupBy v1 by default uses onheap aggregations and doesn't have this problem, so it can still be useful if your workload is mainly composed of groupBys with very growable sketch objects, and you're memory limited. (It has a bunch of other problems, though. This is really the only good thing it does relative to v2.)

However: groupBy v1 with the useOffheap parameter is, as far as I know, not useful anymore. groupBy v2 should be better in every way. So I think that does make a case for removing OffheapIncrementalIndex.

@leventov
Member

My long term intention is the opposite: remove OnheapIncrementalIndex, along with on-heap Aggregators. Leave only BufferAggregators. It's mentioned here: #5335 (comment), the last paragraph. The umbrella issue is #4622.

@AlexanderSaydakov
Contributor Author

Does this imply that we still want to have synchronization in buffer aggregators?

@leventov
Member

@AlexanderSaydakov yes. But with people writing so many different aggregators now, it seems increasingly wrong to make them handle this in aggregator code. I think #3956 should become a high-priority issue. I like the approach suggested by @himanshug here: #3956 (comment), with a boolean isThreadSafe() method on aggregators.
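The isThreadSafe() idea could look roughly like this (a hypothetical sketch of the suggested API, not actual Druid code; SimpleAggregator and Aggregators.threadSafe are invented names): aggregators declare whether they are thread-safe, and the framework wraps the unsafe ones instead of each aggregator hand-rolling its own locking.

```java
// Hypothetical sketch of the isThreadSafe() approach from #3956.
interface SimpleAggregator {
    void aggregate(long value);
    long get();
    default boolean isThreadSafe() { return false; }
}

class Aggregators {
    // Framework-side helper: wrap in a synchronized adapter only when
    // the aggregator declares that it is not thread-safe.
    static SimpleAggregator threadSafe(SimpleAggregator delegate) {
        if (delegate.isThreadSafe()) {
            return delegate;
        }
        return new SimpleAggregator() {
            @Override public synchronized void aggregate(long v) { delegate.aggregate(v); }
            @Override public synchronized long get() { return delegate.get(); }
            @Override public boolean isThreadSafe() { return true; }
        };
    }
}
```

This keeps the locking decision in one place, so individual aggregator authors no longer need to reason about concurrent use.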
