-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-1816: [Java] Resolve new vector classes structure for timestamp, date and maybe interval #1330
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
This PR is a RFC. I think the resulting timestamp vector (NullableTimestampVector) is still branch-free at the cell level. cc @jacques-n can you take a look and let me know what you think? |
|
@wesm @jacques-n @siddharthteotia Timing wise I should be able to make this into 0.8 if we are happy about this approach. |
|
On first look this makes sense. However, if I'm reading this right, the reader/writer implementations have lost the ability to set the right unit. How does one do this? |
|
@jacques-n I am fixing that part. |
f10b834 to
1e87e9d
Compare
BryanCutler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM as long as there is no performance degredation. My biggest concern is that maybe there are some operations that now need to check for compatible time units.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These were previously flipped in ARROW-1091 and then flipped again in ARROW-1092. I thought they were right already, are they not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reverse ordering the these typeParams are driving me nuts... I changed it such that they follow the same order everywhere. Also see: https://github.com/apache/arrow/pull/1330/files#r152436539
The end results are the same because flipped twice..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the friendlyType param still used any more? If so then will LocalDateTime work with a timezone set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is used. The way it currently works is if timezone is specified, if will return the LocalDateTime in the specified time zone.
For example:
value = 0, timezone = null
returns LocalDateTime(1970-01-01 00:00:00)
value = 0, timezone = "America/New_York"
returns LocalDateTime(1969-12-31 19:00:00)
java/vector/pom.xml
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How come this needs to be added now? If it does can it be scoped to compile phase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should just be build dependency. Before it was using 2.3.21.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So without specifying the version it defaults to 2.3.21? Was that version causing an issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think that's the version comes with the plugin.
I used <#sep> notation in the template to created comma separated arg list. The notation is introduced in 2.3.23
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why change this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making param order more sane.. It's driving me nuts we are flipping the ordering twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
forgot to remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: maybe move this import up with DateTimeZone, and move the TimeUnit down with other arrow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We really need that checkstyle :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these set methods need to check that the type params are compatible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Although it does add branching on the array. Maybe don't check for performance reason, not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to leave this as it was and just put a switch statement in getTimeUnit to return the corresponding java TimeUnit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why that's better...Why do you think so? Maybe I missed sth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well before the enum was defined by the flatbufId, but here it is the flatbufId and the Java TimeUnit which is redundant information. So a check for equality would have to look at both of these fields. Maybe not a big deal though..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Enum equality should just be identity check:
https://stackoverflow.com/questions/533922/is-it-ok-to-use-on-enums-in-java
|
@jacques-n During the refactoring, I discovered that union vector doesn't support types with type params (i.e. Decimal and Timestamp), so I fixed that as well. Now union vectors support decimal as well. The PR now includes two major changes:
Most of the template change is to support types with type params, most of them replaces: to Please take a look when you have chance. Thanks! Hopefully we can clean this up and merge the week after Thanksgiving. |
|
@BryanCutler Thanks for the code review. I will clean this up. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making type params order more sane...
|
@BryanCutler I am a bit reluctant to check |
Yeah it doesn't make sense to do all these checks each access, so I just wanted to pose the question to make sure it wasn't a blocker for doing this refactor. I don't use the holder APIs so it's fine with me but maybe @siddharthteotia has some thoughts on this? |
dc04f92 to
9732486
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be more efficient to have a toMillis method on TimeUnit (and encapsulate this detail per law of demeter)? I guess we need to run some microbenchmarks to be able to judge performance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I encapsulated java TimeUnit detail.
Also I ran a microbench toMilliis for 1B long values. Performance is the same.
|
Needs rebase after ARROW-1710. If we are going to change this class structure, 0.8.0 would be the right time to do it. I'm concerned about delaying the release much further though -- if we can't get the Java work wrapped up this week, we can probably take another week, but we're getting into the red zone as far as the Spark 2.3.0 timeline is concerned |
9ab65d5 to
b20f3d0
Compare
|
@wesm I rebased after ARROW-1770. This should be ready. @jacques-n can you take a look? I'd really like this to get in 0.8 if this looks good. Thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is is possible to just have getWriter(ArrowType)? Having 2 methods so similar is a little confusing.. This is because you need a writer for timestamp type params too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree ideally we should have just one. However currently, this is needed to support this:
TimestampWriter timestampWriter = rootWriter.timestamp("a", TimeUnit.SECOND, "America/New_York");
timestampWriter.writeTimestamp(1000)
The second line here, the type of "timestampWriter" is a promotable writer and it needs to lookup the writer by MinorType.TIMESTAMP
We need to do more refactoring to support this API without abstract protected FieldWriter getWriter(MinorType type), but since this is going to be internal change and this PR is already quite complicated, I prefer this to be a follow up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, sounds good
compiles. Need to fix tests. Fix tests Tests passed Fix error Remove println Rebase Fix union reader/writer. Add more tests Add testTimestampVector test; Modify unionWriter template Hide java TimeUnit in arrow TimeUnit
c04f15b to
6c33285
Compare
Overview
This PR includes these major changes:
The second is needed because currently Union vector doesn't support non-simple minor type such as decimal. Because we make timestamp type also a non-simple minor type, we need this change to support timestamp and decimal in Union.
Proposed API Changes
MinorType.TIMESTAMP*to a singleMinorType.TIMESTAMPthat doesn't supportgetType()(same asDECIMALminorType)TimeStamp*Writerto a singleTimestampWriterTimeStamp*Readerto a singleTimestampReaderUnionWriter.writeTimestamp*(long)toUnionWriter.asTimestamp(unit, timezone).writeTimestamp(long)ListWriter.timeStamp*()toListWriter.timestamp(unit, timezone)MapWriter.timeStamp*(name)toMapWriter.timestamp(name, unit, timezone)UnionVector.getTimeStamp*VectortoUnionVector.getTimestampVector()andUnionVector.getTimestampVector(unit, timezone).UnionVector.getTimestampVector()will return the existing timestamp vector or throw exception if it doesn't exist.UnionVector.getTimestampVector(unit, timezone)will return the existing timestamp vector or create one if it doesn't exist.Added API
UnionVector.getDecimalVector()andUnionVector.getDecimalVector(precision, scale)ListWriter.decimal(precision, scale)MapWriter.decimal(name, precision, scale)UnionWriter.asDecimal(precision, scale)andUnionWriter.getDecimalWriter(precision, scale)TODO: