ARROW-5835: [Java] Support Dictionary Encoding for binary type #4792

tianchen92 · 2019-07-03T12:41:43Z

Related to ARROW-5835.
Dictionary encoding for the binary type is currently not implemented because a byte array cannot be used as a HashMap key.
One possible approach is to wrap the byte array in something that implements equals and hashCode.
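
To illustrate the limitation described above: Java arrays inherit identity-based equals and hashCode from Object, so two byte arrays with the same contents behave as different HashMap keys. The following is a minimal sketch, not the code from this PR; the names `ByteArrayKeyDemo` and `ByteArrayKey` are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class ByteArrayKeyDemo {

  /** Hypothetical wrapper that gives a byte[] content-based equals and hashCode. */
  static final class ByteArrayKey {
    private final byte[] data;

    ByteArrayKey(byte[] data) {
      this.data = data;
    }

    @Override
    public boolean equals(Object other) {
      return other instanceof ByteArrayKey
          && Arrays.equals(data, ((ByteArrayKey) other).data);
    }

    @Override
    public int hashCode() {
      return Arrays.hashCode(data);
    }
  }

  /** Looks up an equal-content array in a map keyed by raw byte[]. */
  static Integer lookupRaw() {
    Map<byte[], Integer> map = new HashMap<>();
    map.put(new byte[] {1, 2, 3}, 0);
    // Arrays use identity equals, so a different instance never matches.
    return map.get(new byte[] {1, 2, 3});
  }

  /** Performs the same lookup with wrapped keys. */
  static Integer lookupWrapped() {
    Map<ByteArrayKey, Integer> map = new HashMap<>();
    map.put(new ByteArrayKey(new byte[] {1, 2, 3}), 0);
    return map.get(new ByteArrayKey(new byte[] {1, 2, 3}));
  }

  public static void main(String[] args) {
    System.out.println("raw byte[] key:  " + lookupRaw());
    System.out.println("wrapped key:     " + lookupWrapped());
  }
}
```

The raw lookup misses while the wrapped lookup finds the stored index, which is the behavior the wrapper is meant to provide.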

codecov-io · 2019-07-03T21:05:39Z

Codecov Report

Merging #4792 into master will increase coverage by 2.16%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #4792      +/-   ##
==========================================
+ Coverage   87.43%   89.59%   +2.16%     
==========================================
  Files         996      661     -335     
  Lines      139677    96300   -43377     
  Branches     1418        0    -1418     
==========================================
- Hits       122124    86282   -35842     
+ Misses      17191    10018    -7173     
+ Partials      362        0     -362

Impacted Files	Coverage Δ
r/src/recordbatch.cpp
r/R/Table.R
js/src/util/fn.ts
go/arrow/array/bufferbuilder.go
r/src/symbols.cpp
rust/datafusion/src/execution/projection.rs
rust/datafusion/src/execution/filter.rs
rust/arrow/src/csv/writer.rs
rust/datafusion/src/bin/main.rs
go/arrow/ipc/file_reader.go
... and 325 more

liyafan82 · 2019-07-05T03:21:08Z

java/vector/src/main/java/org/apache/arrow/vector/dictionary/ByteArrayWrapper.java

It seems String is also a good wrapper for byte array, with hashCode & equals properly defined?

Thanks for your comments, @liyafan82. Sure, we could use String as the wrapper, but String would convert the byte[] to a char[], which I'm afraid would affect performance.

How about a ByteBuffer?

Seems reasonable, fixed now, thanks a lot!
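
As a rough illustration of the ByteBuffer suggestion: `ByteBuffer.wrap` does not copy the array, and `ByteBuffer` defines equals and hashCode over its remaining bytes, so it can serve as a map key. This is only a sketch under assumptions; the class and method names are hypothetical, plain `byte[][]` arrays stand in for Arrow vectors, and returning -1 for a missing value is an illustrative choice rather than what the encoder in this PR does.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class ByteBufferKeyDemo {

  /** Builds a value-to-index lookup from dictionary entries, keyed by ByteBuffer. */
  static Map<ByteBuffer, Integer> buildLookup(byte[][] dictionary) {
    Map<ByteBuffer, Integer> lookup = new HashMap<>();
    for (int i = 0; i < dictionary.length; i++) {
      // wrap() shares the array; equals/hashCode cover the remaining bytes.
      lookup.put(ByteBuffer.wrap(dictionary[i]), i);
    }
    return lookup;
  }

  /** Encodes each value as its dictionary index, or -1 if it is absent (assumption). */
  static int[] encode(byte[][] values, Map<ByteBuffer, Integer> lookup) {
    int[] indices = new int[values.length];
    for (int i = 0; i < values.length; i++) {
      indices[i] = lookup.getOrDefault(ByteBuffer.wrap(values[i]), -1);
    }
    return indices;
  }

  public static void main(String[] args) {
    byte[][] dictionary = {
      "foo".getBytes(StandardCharsets.UTF_8),
      "bar".getBytes(StandardCharsets.UTF_8)
    };
    byte[][] values = {
      "bar".getBytes(StandardCharsets.UTF_8),
      "foo".getBytes(StandardCharsets.UTF_8),
      "bar".getBytes(StandardCharsets.UTF_8)
    };
    System.out.println(Arrays.toString(encode(values, buildLookup(dictionary))));
  }
}
```

One caveat with this approach: because the buffer shares the array, mutating a dictionary array after it has been inserted changes the key's hash code and breaks lookups.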

emkornfield · 2019-07-05T17:38:09Z

java/vector/src/main/java/org/apache/arrow/vector/VarBinaryVector.java

is it possible to wrap the underlying data directly?

Hi Micah, do you mean create a wrapper class to hold byte[] like ByteArrayWrapper?

emkornfield · 2019-07-05T17:44:15Z

java/vector/src/main/java/org/apache/arrow/vector/dictionary/DictionaryEncoder.java

Explaining my comment more on the last review about the custom hash-table:
Instead of this approach, it might be nice to see whether we can somehow use the comparators that @liyafan82 has introduced and also introduce the concept of a Hasher. If we use those, we can avoid calling the getObject() method at all, which in some cases might be expensive (string copying?). What do you think?

@emkornfield , I agree with you.
There is significant performance overhead here, due to repeated memory copies and conversions between Java objects and binary bytes. So some significant rework is required before it can be applied to our scenario.
We are preparing for the rework; the sort/search functionality is a prerequisite for it.

Thanks for your comments, @emkornfield. Let me check my understanding: you suggest something like the following to remove the lookup HashMap, right?

    for (int i = 0; i < count; i++) {
      // vector is the vector to encode
      Object value = vector.getObject(i);
      int index = dictionary.getVector().search(value); // via comparators or a hasher
      encodedVector.setWithPossibleTruncate(i, index);
    }

In some cases, will search perform worse than getObject? Since comparators are not supported yet, and introducing the concept of a Hasher would bring many changes, I would prefer to test and work on this in a follow-up PR. What do you think?

Follow-up PR is fine because I think it . Not necessarily with search. I was thinking of still having a hash table. But something like:

    abstract class Hasher {
      ValueVector wrappedVector;

      // Specialized per vector type. Avoids returning an object,
      // so no object creation is required.
      abstract int getHash(int index);
    }

Then the HashTable would also have a comparator:

    class HashTable {
      Comparator comparator;
      Hasher hasher;

      int getIndex(int arrayToEncodeIndex) {
        entries = table.get(hasher.getHash(arrayToEncodeIndex));
        for (entry : entries) {
          if (comparator.compare(entry.index, arrayToEncodeIndex) == 0) {
            return entry.index;
          }
        }
        return -1;
      }
    }

A few open questions:

1. Do we want to move comparator to this module?

2. Does this really give a performance benefit?

3. Should there be either a new method on ValueVector or interface for direct equality checking instead of using comparator?
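
The proposal above can be sketched as runnable code under some loud assumptions: `ToyBinaryVector` is a hypothetical stand-in for a variable-width Arrow vector (one flat buffer plus offsets), the 31-based hash is an arbitrary choice, and `IndexHashTable` is not the class that was eventually merged. The point it tries to show is that the table stores only dictionary indices, and that hashing and equality read the buffers directly without materializing value objects. The bucket map still boxes integers for brevity; a tuned version would presumably use primitive arrays.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IndexHashTableDemo {

  /** Toy stand-in for a variable-width binary vector: flat buffer plus offsets. */
  static final class ToyBinaryVector {
    final byte[] data;
    final int[] offsets; // value i occupies data[offsets[i] .. offsets[i + 1])

    ToyBinaryVector(byte[][] values) {
      offsets = new int[values.length + 1];
      for (int i = 0; i < values.length; i++) {
        offsets[i + 1] = offsets[i] + values[i].length;
      }
      data = new byte[offsets[values.length]];
      for (int i = 0; i < values.length; i++) {
        System.arraycopy(values[i], 0, data, offsets[i], values[i].length);
      }
    }

    int size() {
      return offsets.length - 1;
    }

    /** Hashes value i straight from the buffer, without creating an object. */
    int hash(int i) {
      int h = 1;
      for (int p = offsets[i]; p < offsets[i + 1]; p++) {
        h = 31 * h + data[p];
      }
      return h;
    }

    /** Compares value i of this vector with value j of another, byte by byte. */
    boolean equalsAt(int i, ToyBinaryVector other, int j) {
      int len = offsets[i + 1] - offsets[i];
      if (len != other.offsets[j + 1] - other.offsets[j]) {
        return false;
      }
      for (int k = 0; k < len; k++) {
        if (data[offsets[i] + k] != other.data[other.offsets[j] + k]) {
          return false;
        }
      }
      return true;
    }
  }

  /** Hash table that stores only dictionary indices, bucketed by hash code. */
  static final class IndexHashTable {
    private final ToyBinaryVector dictionary;
    private final Map<Integer, List<Integer>> buckets = new HashMap<>();

    IndexHashTable(ToyBinaryVector dictionary) {
      this.dictionary = dictionary;
      for (int i = 0; i < dictionary.size(); i++) {
        buckets.computeIfAbsent(dictionary.hash(i), k -> new ArrayList<>()).add(i);
      }
    }

    /** Returns the dictionary index whose value equals toEncode[index], or -1. */
    int getIndex(ToyBinaryVector toEncode, int index) {
      List<Integer> candidates = buckets.get(toEncode.hash(index));
      if (candidates != null) {
        for (int candidate : candidates) {
          // Equal hashes are not enough; confirm with a byte-wise comparison.
          if (dictionary.equalsAt(candidate, toEncode, index)) {
            return candidate;
          }
        }
      }
      return -1;
    }
  }

  /** Encodes every value of toEncode as a dictionary index (-1 if absent). */
  static int[] encode(ToyBinaryVector dictionary, ToyBinaryVector toEncode) {
    IndexHashTable table = new IndexHashTable(dictionary);
    int[] indices = new int[toEncode.size()];
    for (int i = 0; i < indices.length; i++) {
      indices[i] = table.getIndex(toEncode, i);
    }
    return indices;
  }
}
```

The equality check inside `getIndex` matters because distinct values can share a hash code; with this particular hash, the byte sequences for "Aa" and "BB" collide.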

Thanks a lot for your prototype. We need to think carefully about the follow-up design and test the performance. BTW, I think it's a good start whichever method is used.

@emkornfield, thanks for your comments. Since dictionary encoding is key to the performance of our scenario, I would like to provide some comments:

We can move the dictionary related code to the algorithm module, but not vice versa, because a cyclic dependency will be created.

A big +1 for your Hasher interface. It will be of great help for 1) reducing the conversions between Java objects and Arrow buffers; 2) avoiding unnecessary memory copies.

In addition, another lower level hasher based on memory buffer should be provided. I will start a new issue to track that.

Equality can be determined in two ways: 1) by a comparator; 2) by hashCode + equals. I think it is OK to add a member for ValueVector to provide the default equality behavior.

However, it is also beneficial to provide some interface for calculating the hash code. According to our experience, different algorithms for computing the hash code manifest widely different behaviors and have significant performance implications, so they are suitable for different scenarios.

@liyafan82

We can move the dictionary related code to the algorithm module, but not vice versa, because a cyclic dependency will be created.

For now to avoid API breakages I would prefer to leave it here. Once we have more examples of dependencies we can figure out the best way to factor the code.

In addition, another lower level hasher based on memory buffer should be provided. I will start a new issue to track that.

SGTM

I think it is OK to add a member for ValueVector to provide the default equality behavior.

I tend to agree that an equality method would be helpful, I need to go back and look through the classes again because I'm surprised there isn't something already available.

However, it is also beneficial to provide some interface for calculating the hash code. According to our experience, different algorithms for computing the hash code manifest widely different behaviors and have significant performance implications, so they are suitable for different scenarios.

Can you elaborate on the different scenarios? Is it more than just varying by data type? How do you choose a hasher for any particular scenario?

For now to avoid API breakages I would prefer to leave it here. Once we have more examples of dependencies we can figure out the best way to factor the code.

Sure. Sounds reasonable.
My point is to leave this one alone and gradually deprecate it, and create a new one in the algorithm module. The reason is that the current one can hardly be used in any practical scenario, because of heavy performance overhead:

There are repeated conversions between Java objects and bytes (e.g. vector.getObject(i)).

Unnecessary memory copy (the vector data must be copied to the hash table).

The hash table cannot be reused for encoding multiple vectors (other data structure & results cannot be reused either).

The output vector should not be created/managed by the encoder (just like in the out-of-place sorter)

The hash table requires that the hashCode & equals methods be implemented appropriately, but this is not guaranteed.

I will start a new issue to redesign the dictionary encoder.

Can you elaborate on the different scenarios? Is it more than just varying by data type? How do you choose a hasher for any particular scenario?

Sure.
Generally, we want the hash codes to be uniformly distributed over the universe of possible values. This has great benefits in practice. For example, in a hash table with open addressing, a uniform hash function leads to a small number of collisions, short bucket clusters, etc., which makes insert/search/update operations on the hash table much faster.

A uniform hash function, however, is usually compute-intensive. On the other hand, a simple hash function can be easy to compute, but it causes severe problems in practice (e.g. in hash table operations). Both hash functions are valid in the sense that if two objects are equal, they must have identical hash codes.

We have had such an experience: our hash join operator ran at least two orders of magnitude slower, just because of a poorly selected hash function.

Therefore, the hash function has significant performance implications. The key is the balance between uniformity and computational cost, and this balance depends on the concrete scenario.

So we should not rely on a single hash function. We should give users the ability to plug in the hash function they like.
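
The trade-off described above can be sketched with a pluggable hasher. Everything here is an assumption made for illustration: the `ByteHasher` interface and its signature are hypothetical and not an Arrow API, and the two implementations (a plain byte sum and 32-bit FNV-1a) are simply examples of a cheap-but-weak and a costlier-but-better-spread function.

```java
import java.util.HashSet;
import java.util.Set;

class PluggableHasherDemo {

  /** Hypothetical plug-in point: hashes a range of bytes. */
  interface ByteHasher {
    int hash(byte[] data, int offset, int length);
  }

  /** Cheap but weak: sums the bytes, so any keys with equal sums collide. */
  static final ByteHasher SUM = (data, offset, length) -> {
    int h = 0;
    for (int i = offset; i < offset + length; i++) {
      h += data[i];
    }
    return h;
  };

  /** 32-bit FNV-1a: more work per byte, better spread of hash codes. */
  static final ByteHasher FNV1A = (data, offset, length) -> {
    int h = 0x811c9dc5; // FNV offset basis
    for (int i = offset; i < offset + length; i++) {
      h ^= data[i] & 0xff;
      h *= 0x01000193; // FNV prime; int overflow wraps as intended
    }
    return h;
  };

  /** Counts the distinct hash codes a hasher produces for the given keys. */
  static int distinctHashes(ByteHasher hasher, byte[][] keys) {
    Set<Integer> seen = new HashSet<>();
    for (byte[] key : keys) {
      seen.add(hasher.hash(key, 0, key.length));
    }
    return seen.size();
  }

  /** All two-byte keys (a, b) with 0 <= a, b < 16, i.e. 256 distinct keys. */
  static byte[][] sampleKeys() {
    byte[][] keys = new byte[256][];
    for (int a = 0; a < 16; a++) {
      for (int b = 0; b < 16; b++) {
        keys[a * 16 + b] = new byte[] {(byte) a, (byte) b};
      }
    }
    return keys;
  }

  public static void main(String[] args) {
    byte[][] keys = sampleKeys();
    System.out.println("SUM distinct codes:   " + distinctHashes(SUM, keys));
    System.out.println("FNV1A distinct codes: " + distinctHashes(FNV1A, keys));
  }
}
```

On these 256 sample keys the byte-sum hasher yields only 31 distinct codes (the possible sums 0 through 30), while FNV-1a yields 256. Both satisfy the equal-objects-equal-hashes contract, yet a table built on the first would see far more collisions.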

Conversions and memory copies could be avoided with a newly designed hash table and Hasher interface, plus new hash and equals APIs, without changing the decoder interface. See #4846, #4844.
+1 for providing multiple hash code implementations.

liyafan82 · 2019-07-08T01:55:54Z

java/vector/src/main/java/org/apache/arrow/vector/BaseBinaryVector.java

A general comment is that, we should be careful when introducing new interfaces, especially vector interfaces, since they are the core for Arrow.
New interfaces make the class hierarchy difficult to manage, and once an interface is added, it is difficult to remove it.

I agree that we should be careful when introducing new interfaces. If you all think this one is not needed, it can be removed. But in some cases we would then need two "if" checks to test the types separately, as in DictionaryEncoder#encode in this PR, which seems a little ugly.

emkornfield · 2019-07-10T06:19:56Z

+1, thanks

emkornfield · 2019-07-10T06:44:33Z

Actually, I would like to get a second opinion on the interface addition before merging @praveenbingo @pravindra?

emkornfield · 2019-07-12T07:36:56Z

@praveenbingo @pravindra any concerns about the new interface?

emkornfield · 2019-07-12T07:37:25Z

@tianchen92 you'll need to fix the conflicts from merging your other PR. Thanks

tianchen92 · 2019-07-12T07:48:20Z

@tianchen92 you'll need to fix the conflicts from merging your other PR. Thanks

Thanks for your efforts, fixed now.

pravindra · 2019-07-12T11:09:37Z

@praveenbingo @pravindra any concerns about the new interface?

I'll review this over the weekend.

java/vector/src/main/java/org/apache/arrow/vector/dictionary/DictionaryEncoder.java

pravindra

lgtm

emkornfield · 2019-07-16T02:55:54Z

+1, CI failure appears to be flight failure.

… dictionary encoding

As discussed in #4792
Implement a hash table to only store hash & index, meanwhile add check equal function in ValueVector API.

Author: tianchen <niki.lj@alibaba-inc.com>

Closes #4846 from tianchen92/hasher and squashes the following commits:

2db7302 <tianchen> fix
5facc2a <tianchen> resolve comments
175192a <tianchen> fix test and style
7a87526 <tianchen> implementation of equals and hashCode
c89608b <tianchen> fix
8f2e1a2 <tianchen> hash table prototype

Related to [ARROW-5835](https://issues.apache.org/jira/browse/ARROW-5835).
Now is not implemented because byte array is not supported to be HashMap key.
One possible way is that wrap them with something to implement equals and hashcode.

Author: tianchen <niki.lj@alibaba-inc.com>

Closes #4792 from tianchen92/ARROW-5835 and squashes the following commits:

f50a19e <tianchen> fix UNION regression
8267c2b <tianchen> fix style
a039bc1 <tianchen> Support Dictionary Encoding for binary type

kou force-pushed the master branch from e3ae1b0 to 03576af Compare July 4, 2019 05:51

tianchen92 force-pushed the ARROW-5835 branch from cdbcf41 to 6c1bc15 Compare July 4, 2019 06:40

liyafan82 reviewed Jul 5, 2019

emkornfield force-pushed the ARROW-5835 branch from e9e1f53 to 0b56ff2 Compare July 5, 2019 09:02

emkornfield reviewed Jul 5, 2019

liyafan82 reviewed Jul 8, 2019

tianchen92 force-pushed the ARROW-5835 branch from 0b56ff2 to d14ac28 Compare July 9, 2019 08:00

This was referenced Jul 10, 2019

ARROW-5902: [Java] Implement hash table and equals & hashCode API for dictionary encoding #4846

Closed

ARROW-1184: [Java] Dictionary.equals is not working correctly #4843

Closed

Support Dictionary Encoding for binary type

a039bc1

tianchen92 force-pushed the ARROW-5835 branch from d14ac28 to a039bc1 Compare July 12, 2019 07:46

fix style

8267c2b

fsaintjacques added the Component: Java label Jul 12, 2019

emkornfield reviewed Jul 13, 2019

java/vector/src/main/java/org/apache/arrow/vector/dictionary/DictionaryEncoder.java

fix UNION regression

f50a19e

pravindra approved these changes Jul 14, 2019

emkornfield closed this in a222c7d Jul 16, 2019

This was referenced Aug 1, 2019

[Java] Support Dictionary Encoding for binary type #22254

Closed

[Java] Implement hash table and equals & hashCode API for dictionary encoding #22315

Closed
