-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-5835: [Java] Support Dictionary Encoding for binary type #4792
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #4792 +/- ##
==========================================
+ Coverage 87.43% 89.59% +2.16%
==========================================
Files 996 661 -335
Lines 139677 96300 -43377
Branches 1418 0 -1418
==========================================
- Hits 122124 86282 -35842
+ Misses 17191 10018 -7173
+ Partials 362 0 -362Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems String is also a good wrapper for byte array, with hashCode & equals properly defined?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your comments, @liyafan82 . Sure we could use String as the wrapper, but String will convert byte[] to char[] which I'm afraid it will affect performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about a ByteBuffer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable, fixed now, thanks a lot!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible to wrap the underlying data directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Micah, do you mean create a wrapper class to hold byte[] like ByteArrayWrapper?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explaining my comment more on the last review about the custom hash-table:
Instead of this approach it might be nice to see we can somehow use either the comparators that @liyafan82 has introduced and also introduce a concept of a Hasher. If we use those, then we can avoid calling the getObject() method at all, which in some cases might be expensive? (string copying?). What do you thing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emkornfield , I agree with you.
There are significant performance overhead here, due to repeated memory copy, and conversions between java object & binary bytes. So some significant rework is required before it can be applied to our scenario.
We are preparing for the rework, the sort/search functionalities are prerequisites for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your comments. @emkornfield Let me understand, you suggest something like that to remove lookup HashMap, right?
for (int i = 0; i < count; i++) {
//vector represents the vector to encode
Object value = vector.getObject(i);
int index = dictionary.getVector().search(value); //comparators or hasher
encodedVector.setWithPossibleTruncate(i, index);
}
In some cases, will search perform worse than getObject? Since comparators are not supported supported or will introduce many changes with a concept of a Hasher, I would prefer to test and work for this in a follow-up PR, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow-up PR is fine because I think it . Not necessarily with search. I was thinking of still having a hash table. But something like:
class Hasher { ValueVector wrappedVector; // specialized per vector type. Avoids returning an object // so no object creation is required. abstract int getHash(int index); }
Then the HashTable would have a have comparator:
HashTable {
Comparator comparator;
Hasher hasher;
int getIndex(int arrayToEncodeIndex) {
entries = table.get(Hasher(arrayToEncodeindex));
for (entriy : entries) {
if (comparator(entry.index, arrayToEncodeIndex) == 0){
return entry.index
}
}
return -1;
}
}
`
A few open questions
- Do we want to move comparator to this module?
- Does this really give a performance benefit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And 3. Should there be either a new method on ValueVector or interface for direct equality checking instead of using comparator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for your prototype, we need carefully think of the follow-up design and test the perf, BWT, I think it's a good start whatever method is used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow-up PR is fine because I think it . Not necessarily with search. I was thinking of still having a hash table. But something like:
class Hasher { ValueVector wrappedVector; // specialized per vector type. Avoids returning an object // so no object creation is required. abstract int getHash(int index); }
Then the HashTable would have a have comparator:
HashTable {
Comparator comparator;
Hasher hasher;
int getIndex(int arrayToEncodeIndex) {
entries = table.get(Hasher(arrayToEncodeindex));
for (entriy : entries) {
if (comparator(entry.index, arrayToEncodeIndex) == 0){
return entry.index
}
}
return -1;
}
}
`
A few open questions
- Do we want to move comparator to this module?
- Does this really give a performance benefit?
@emkornfield , thanks for your comments. Since dictionary encoding is key to the performance of our scenario, I would like provide some comments:
-
We can move the dictionary related code to the algorithm module, but not vice versa, because a cyclic dependency will be created.
-
A big +1 for your Hash interface. It will be of great help for 1) reducing the conversions between Java objects and Arrow bufer; 2) avoiding unnecessary memory copy.
In addition, another lower level hasher based on memory buffer should be provided. I will start a new issue to track that.
- Equality can be determined in two ways: 1) by a comparator; 2) by hashCode + equals. I think it is OK to add a member for ValueVector to provide the default equality behavior.
However, it is also beneficial to provide some interface for calculating the hash code. According to our experience, different algorithms for computing the hash code manifest widely different behaviors and have significant performance implications, so they are suitable for different scenarios.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can move the dictionary related code to the algorithm module, but not vice versa, because a cyclic dependency will be created.
For now to avoid API breakages I would prefer to leave it here. Once we have more examples of dependencies we can figure out the best way to factor the code.
In addition, another lower level hasher based on memory buffer should be provided. I will start a new issue to track that.
SGTM
I think it is OK to add a member for ValueVector to provide the default equality behavior.
I tend to agree that an equality method would be helpful, I need to go back and look through the classes again because I'm surprised there isn't something already available.
However, it is also beneficial to provide some interface for calculating the hash code. According to our experience, different algorithms for computing the hash code manifest widely different behaviors and have significant performance implications, so they are suitable for different scenarios.
Can you elaborate on the different scenarios is it more than just varying by data type? How do you choose a hasher for any particular scenario
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now to avoid API breakages I would prefer to leave it here. Once we have more examples of dependencies we can figure out the best way to factor the code.
Sure. Sounds reasonable.
My point is to leave this one alone and gradually deprecated it, and create a new one in the algorithm module. The reason is that the current one can hardly be used in any practical scenario, because of heavy performance overhead:
- There are repeated conversions between Java objects and bytes (e.g. vector.getObject(i)).
- Unnecessary memory copy (the vector data must be copied to the hash table).
- The hash table cannot be reused for encoding multiple vectors (other data structure & results cannot be reused either).
- The output vector should not be created/managed by the encoder (just like in the out-of-place sorter)
- The hash table requires that the hashCode & equals methods be implemented appropriately, but this is not guaranteed.
I will start a new issue to redesign the dictionary encoder.
Can you elaborate on the different scenarios is it more than just varying by data type? How do you choose a hasher for any particular scenario.
Sure.
Generally, we want the hash code to be uniformly distributed in the universe. This will have great benefits in practice. For example, in a hash table with open addressing, a uniform hash function will lead to small number of collisions, short bucket clusters, etc., which makes insert/search/update operations for the hash table much faster.
A uniform hash function, however, is usually compute-intensive. On the other hand, a simple hash function can be easy to compute, but it causes severe problem in practice (e.g. in hash table operations). Both hash functions are valid in the sense that if two objects are equal, they must have identical hash code.
We have such a experience, our hash join operator performs at least two orders of magnitude slower, just because of a poorly selected hash function.
Therefore, the hash function has significant performance implications. The key is the balance between being uniform and computational cost, and this balance should depend on concrete scenario.
So we should not rely on a single hash function. We should give the user the ability to plug-in the hash function as they like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A general comment is that, we should be careful when introducing new interfaces, especially vector interfaces, since they are the core for Arrow.
New interfaces make the class hierarchy difficult to manage, and once an interface is added, it is difficult to remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you we should be careful introducing new interfaces, if you all think this is not needed, it can be removed. But in some case we might use two "if" to judge the type separately like DictionaryEncoder#encode in this PR which seems a little ugly.
|
+1, thanks |
|
Actually, I would like to get a second opinion on the interface addition before merging @praveenbingo @pravindra? |
|
@praveenbingo @pravindra any concerns about the new interface? |
|
@tianchen92 you'll need to fix the conflicts from merging your other PR. Thanks |
Thanks for your efforts, fixed now. |
I'll review this over the weekend. |
java/vector/src/main/java/org/apache/arrow/vector/dictionary/DictionaryEncoder.java
Show resolved
Hide resolved
pravindra
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
|
+1, CI failure appears to be flight failure. |
… dictionary encoding As discussed in #4792 Implement a hash table to only store hash & index, meanwhile add check equal function in ValueVector API. Author: tianchen <niki.lj@alibaba-inc.com> Closes #4846 from tianchen92/hasher and squashes the following commits: 2db7302 <tianchen> fix 5facc2a <tianchen> resolve comments 175192a <tianchen> fix test and style 7a87526 <tianchen> implementation of equals and hashCode c89608b <tianchen> fix 8f2e1a2 <tianchen> hash table prototype
Related to [ARROW-5835](https://issues.apache.org/jira/browse/ARROW-5835). Now is not implemented because byte array is not supported to be HashMap key. One possible way is that wrap them with something to implement equals and hashcode. Author: tianchen <niki.lj@alibaba-inc.com> Closes #4792 from tianchen92/ARROW-5835 and squashes the following commits: f50a19e <tianchen> fix UNION regression 8267c2b <tianchen> fix style a039bc1 <tianchen> Support Dictionary Encoding for binary type
… dictionary encoding As discussed in #4792 Implement a hash table to only store hash & index, meanwhile add check equal function in ValueVector API. Author: tianchen <niki.lj@alibaba-inc.com> Closes #4846 from tianchen92/hasher and squashes the following commits: 2db7302 <tianchen> fix 5facc2a <tianchen> resolve comments 175192a <tianchen> fix test and style 7a87526 <tianchen> implementation of equals and hashCode c89608b <tianchen> fix 8f2e1a2 <tianchen> hash table prototype
Related to [ARROW-5835](https://issues.apache.org/jira/browse/ARROW-5835). Now is not implemented because byte array is not supported to be HashMap key. One possible way is that wrap them with something to implement equals and hashcode. Author: tianchen <niki.lj@alibaba-inc.com> Closes apache#4792 from tianchen92/ARROW-5835 and squashes the following commits: f50a19e <tianchen> fix UNION regression 8267c2b <tianchen> fix style a039bc1 <tianchen> Support Dictionary Encoding for binary type
… dictionary encoding As discussed in apache#4792 Implement a hash table to only store hash & index, meanwhile add check equal function in ValueVector API. Author: tianchen <niki.lj@alibaba-inc.com> Closes apache#4846 from tianchen92/hasher and squashes the following commits: 2db7302 <tianchen> fix 5facc2a <tianchen> resolve comments 175192a <tianchen> fix test and style 7a87526 <tianchen> implementation of equals and hashCode c89608b <tianchen> fix 8f2e1a2 <tianchen> hash table prototype
Related to ARROW-5835.
Now is not implemented because byte array is not supported to be HashMap key.
One possible way is that wrap them with something to implement equals and hashcode.