-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-1710: [Java] Remove Non-Nullable Vectors #1341
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-1710: [Java] Remove Non-Nullable Vectors #1341
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slightly different API with the new BitVector
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since it's now a nullable BitVector, can't get values where a null is set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The value checked here seem wrong before, am I missing something? There are 4 values set out of 1024 total, so 1020 null values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My guess is old (non-nullable vector) checks how many values are "0" rather than how many values are "set", so when you set two values to "0", they are still counted in NullCount.
I think the new behavior makes more sense, although this could be backward incompatible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that would make sense, but sort of a misnomer imo. Should we include this change in the release notes somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this is an misnomer. I think this is fine as long as it doesn't break Dremio. cc @siddharthteotia.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For migration purpose, should me provide an alternative method to provide the old behavior? Something like getNullAndUnsetCount(). Seems like a weird API though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again, this value seemed wrong before
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't check that a null entry has any type of value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again this seemed wrong before... since 4 values are set earlier, then all 4 should be unset
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again, slightly different API, need to check range manually
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we consider add setRangeToOne back in BitVector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this correct? it is checking only the data buffer capacity i believe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this is checking DataBuffer and the check after reAlloc is checking OffsetBuffer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think checking the offset buffer was a typo on my part, this should be checking the data buffer capacity - fixed now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this what the test is checking for?
|
@siddharthteotia @icexelloss please take a look when you have a chance. There are still some things I was not sure of so I marked this as a WIP:
cc @jacques-n @wesm |
|
Given the scope of this, I would be nice to get it and resolve the above issues later as long as this doesn't break anything - looks like I already need to do a rebase.. |
|
@icexelloss I didn't see you were already working on #1330, which is also a sizable change, before I started on this. I'm not sure which one would be easier to merge first, any thoughts? |
|
@BryanCutler There might be some conflicts but I think we should be fine. |
|
If this is merged first. I would just delete different files, shouldn't be hard. If #1330 get merged first, you need to rename NullableTimestampVector -> TimestampVector. |
685cb46 to
e96c11f
Compare
|
@BryanCutler Good work! High level looks good. I will try to look more closely tomorrow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is NonNullableMapVector something new? Can you explain the change of class hierarchy here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand it correctly, the original MapVector has now become NonNullableMapVector and NullableMapVector (subclass of of MapVector) has now become MapVector. @BryanCutler can confirm.
I would prefer to not change the naming here since there are essentially two classes -- NullableMapVector and MapVector where the former one is a MapVector with a validity buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is slightly different from changing the name of NullableInt to Int since we are going to remove the latter one.
For the MapVector case, I don't think so we are removing anything and instead keeping around both -- base class MapVector and subclass NullableMapVector. So probably no need to change name here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the above is correct and what I was referencing in (1) from #1341 (comment). Maybe I missed the discussion, so I thought we would be removing the NonNullableMapVector and would combine the 2 classes. Why do we need to keep both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not "MinorType"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think this must have been auto-corrected, let me check on it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the other vector classes also use Types.MinorType. I agree, it would be better to just say MinorType
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BryanCutler I think this is missed. Can you please fix this?
|
@BryanCutler High level looks good to me. Two major comments:
Happy thanksgiving! |
I was thinking this is what we were going to do, maybe as a followup though, so +1 for me. One thing I'm not sure of is ullability useful for this vector? For example, Spark StructType doesn't have a nullable param, it's up to the child type definitions. But maybe it's different in the Arrow sense, I'll have to think about that.. Happy Thanksgiving to you all too! You guys work too hard, I don't want to see any PRs coming through for a couple days :) |
I think in Spark, nullable ability is part of |
|
Needs rebase |
|
Oh yeah, you're right about Spark struct, duh! So should I change the naming of |
b7a6930 to
ff2120d
Compare
|
I am still not sure why do we need both nullable and non-nullable map vectors. @siddharthteotia can you elaborate? If this is the only thing blocking the PR. Maybe we can merge this and address it later. |
|
I removed the WIP tag because I think this is ok to merge unless @siddharthteotia objects to the changes of |
|
This build is passing, @jacques-n @siddharthteotia ? |
|
LGTM except for changing |
This wasn't added by me and it's the same for the other vector classes, maybe change all in a followup? |
I see. Yeah this is fine. |
|
What are we doing with NonNullableMapVector, MapVector? (1) Either we leave MapVector and NullableMapVector as is where the latter is a subclass. In this case, there are no code changes needed in these two modules. OR (2) We can combine them and just have a MapVector with a validity buffer. I don't think there is a need to introduce a new class NonNullableMapVector in the hierarchy. My assumption with this patch was all nullables will become non-nullables and we will get rid of FixedValueVectors.java template -- this is for scalars. The only complex vector that has 2 different versions is MapVector and I feel we can handle them in either of the two ways mentioned above. |
|
@siddharthteotia , I will do (2) in a followup and remove |
|
Merging tomorrow morning if no objections |
|
+1. Is there a JIRA already for the follow up work? |
|
Thanks all! I will make the follow up JIRAs now |
|
Ok I made these for followup:
|
This removes non-nullable vectors that are no longer part of the vector class hierarchy and renames Nullable*Vector classes to remove the Nullable prefix. Author: Bryan Cutler <cutlerb@gmail.com> Closes #1341 from BryanCutler/java-nullable-vector-rename-ARROW-1710 and squashes the following commits: 7d930dc [Bryan Cutler] fixed realloc test ff2120d [Bryan Cutler] clean up test 374dfcc [Bryan Cutler] properly rename BitVector file 6b7a85e [Bryan Cutler] remove old BitVector.java before rebase 089f7fc [Bryan Cutler] some minor cleanup 4e580d9 [Bryan Cutler] removed legacy BitVector 74f771f [Bryan Cutler] fixed remaining tests 8c5dfef [Bryan Cutler] fix naming in support classes 6e498e5 [Bryan Cutler] removed nullable prefix dfed444 [Bryan Cutler] removed non-nullable vectors
Changes to MapVector were not part of the intended goal for ARROW-1710 and should be restored to `NullableMapVector` -> `MapVector` -> `AbstractMapVector` as existed prior to #1341. Author: Bryan Cutler <cutlerb@gmail.com> Closes #1389 from BryanCutler/java-restore-MapVector-class-ARROW-1885 and squashes the following commits: 9ff503a [Bryan Cutler] restore MapVector class names 3d818f0 [Bryan Cutler] renamed NonNullableMapVector to MapVector ced1705 [Bryan Cutler] renamed MapVector to NullableMapVector
This removes non-nullable vectors that are no longer part of the vector class hierarchy and renames Nullable*Vector classes to remove the Nullable prefix. Author: Bryan Cutler <cutlerb@gmail.com> Closes apache#1341 from BryanCutler/java-nullable-vector-rename-ARROW-1710 and squashes the following commits: 7d930dc [Bryan Cutler] fixed realloc test ff2120d [Bryan Cutler] clean up test 374dfcc [Bryan Cutler] properly rename BitVector file 6b7a85e [Bryan Cutler] remove old BitVector.java before rebase 089f7fc [Bryan Cutler] some minor cleanup 4e580d9 [Bryan Cutler] removed legacy BitVector 74f771f [Bryan Cutler] fixed remaining tests 8c5dfef [Bryan Cutler] fix naming in support classes 6e498e5 [Bryan Cutler] removed nullable prefix dfed444 [Bryan Cutler] removed non-nullable vectors
Changes to MapVector were not part of the intended goal for ARROW-1710 and should be restored to `NullableMapVector` -> `MapVector` -> `AbstractMapVector` as existed prior to apache#1341. Author: Bryan Cutler <cutlerb@gmail.com> Closes apache#1389 from BryanCutler/java-restore-MapVector-class-ARROW-1885 and squashes the following commits: 9ff503a [Bryan Cutler] restore MapVector class names 3d818f0 [Bryan Cutler] renamed NonNullableMapVector to MapVector ced1705 [Bryan Cutler] renamed MapVector to NullableMapVector
This removes non-nullable vectors that are no longer part of the vector class hierarchy and renames Nullable*Vector classes to remove the Nullable prefix.