Skip to content

Conversation

@icexelloss
Copy link
Contributor

@icexelloss icexelloss commented Nov 18, 2017

Overview

This PR includes these major changes:

  • Consolidate all timestamp vectors into a single vector class
  • Add support for non-simple minor type (i.e., decimal, timestamp) in Union

The second is needed because currently Union vector doesn't support non-simple minor type such as decimal. Because we make timestamp type also a non-simple minor type, we need this change to support timestamp and decimal in Union.

Proposed API Changes

  • Merge TimeStamp*Vector to a single class TimestampVector
  • Rename "timeStamp" to "timestamp" and "TimeStamp" to "Timestamp" across the API for better consistency with Java naming conversions of "time stamp"
  • Merge MinorType.TIMESTAMP* to a single MinorType.TIMESTAMP that doesn't support getType() (same as DECIMAL minorType)
  • Merge TimeStamp*Writer to a single TimestampWriter
  • Merge TimeStamp*Reader to a single TimestampReader
  • Change UnionWriter.writeTimestamp*(long) to UnionWriter.asTimestamp(unit, timezone).writeTimestamp(long)
  • Change ListWriter.timeStamp*() to ListWriter.timestamp(unit, timezone)
  • Change MapWriter.timeStamp*(name) to MapWriter.timestamp(name, unit, timezone)
  • Change UnionVector.getTimeStamp*Vector to UnionVector.getTimestampVector() and UnionVector.getTimestampVector(unit, timezone). UnionVector.getTimestampVector() will return the existing timestamp vector or throw exception if it doesn't exist. UnionVector.getTimestampVector(unit, timezone) will return the existing timestamp vector or create one if it doesn't exist.

Added API

  • Add UnionVector.getDecimalVector() and UnionVector.getDecimalVector(precision, scale)
  • Add ListWriter.decimal(precision, scale)
  • Add MapWriter.decimal(name, precision, scale)
  • Add UnionWriter.asDecimal(precision, scale) and UnionWriter.getDecimalWriter(precision, scale)

TODO:

  • Add more test

@icexelloss
Copy link
Contributor Author

icexelloss commented Nov 18, 2017

This PR is a RFC.

I think the resulting timestamp vector (NullableTimestampVector) is still branch-free at the cell level. cc @jacques-n can you take a look and let me know what you think?

@icexelloss
Copy link
Contributor Author

@wesm @jacques-n @siddharthteotia Timing wise I should be able to make this into 0.8 if we are happy about this approach.

@jacques-n
Copy link
Contributor

On first look this makes sense. However, if I'm reading this right, the reader/writer implementations have lost the ability to set the right unit. How does one do this?

@icexelloss
Copy link
Contributor Author

@jacques-n I am fixing that part.

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as long as there is no performance degredation. My biggest concern is that maybe there are some operations that now need to check for compatible time units.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These were previously flipped in ARROW-1091 and then flipped again in ARROW-1092. I thought they were right already, are they not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reverse ordering the these typeParams are driving me nuts... I changed it such that they follow the same order everywhere. Also see: https://github.com/apache/arrow/pull/1330/files#r152436539

The end results are the same because flipped twice..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the friendlyType param still used any more? If so then will LocalDateTime work with a timezone set?

Copy link
Contributor Author

@icexelloss icexelloss Nov 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is used. The way it currently works is if timezone is specified, if will return the LocalDateTime in the specified time zone.

For example:

value = 0, timezone = null

returns LocalDateTime(1970-01-01 00:00:00)

value = 0, timezone = "America/New_York"

returns LocalDateTime(1969-12-31 19:00:00)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come this needs to be added now? If it does can it be scoped to compile phase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should just be build dependency. Before it was using 2.3.21.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So without specifying the version it defaults to 2.3.21? Was that version causing an issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think that's the version comes with the plugin.

I used <#sep> notation in the template to created comma separated arg list. The notation is introduced in 2.3.23

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making param order more sane.. It's driving me nuts we are flipping the ordering twice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

forgot to remove?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: maybe move this import up with DateTimeZone, and move the TimeUnit down with other arrow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We really need that checkstyle :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these set methods need to check that the type params are compatible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Although it does add branching on the array. Maybe don't check for performance reason, not sure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to leave this as it was and just put a switch statement in getTimeUnit to return the corresponding java TimeUnit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why that's better...Why do you think so? Maybe I missed sth.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well before the enum was defined by the flatbufId, but here it is the flatbufId and the Java TimeUnit which is redundant information. So a check for equality would have to look at both of these fields. Maybe not a big deal though..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@icexelloss
Copy link
Contributor Author

@jacques-n During the refactoring, I discovered that union vector doesn't support types with type params (i.e. Decimal and Timestamp), so I fixed that as well. Now union vectors support decimal as well.

The PR now includes two major changes:

  • Consolidate all timestamp vectors into a single vector class
  • Add support for non-simple minor type (i.e., decimal, timestamp) in Union

Most of the template change is to support types with type params, most of them replaces:

<#if !minor.typeParams?? >
// handle types with no type params
</#if>

to

<#if !minor.typeParams?? >
// handle types with no type params
<#else>
// handle types with type params
</#if>

Please take a look when you have chance. Thanks! Hopefully we can clean this up and merge the week after Thanksgiving.

@icexelloss
Copy link
Contributor Author

@BryanCutler Thanks for the code review. I will clean this up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making type params order more sane...

@icexelloss
Copy link
Contributor Author

icexelloss commented Nov 22, 2017

@BryanCutler I am a bit reluctant to check unit and timezone in value holders because of performance reasons. This is currently not checked with other type with type params either, such as decimal. We should maybe visit this problem as a whole as follow up?

@icexelloss icexelloss changed the title wip: ARROW-1816: [Java] Resolve new vector classes structure for timestamp, date and maybe interval ARROW-1816: [Java] Resolve new vector classes structure for timestamp, date and maybe interval Nov 22, 2017
@BryanCutler
Copy link
Member

@BryanCutler I am a bit reluctant to check unit and timezone in value holders because of performance reasons.

Yeah it doesn't make sense to do all these checks each access, so I just wanted to pose the question to make sure it wasn't a blocker for doing this refactor. I don't use the holder APIs so it's fine with me but maybe @siddharthteotia has some thoughts on this?

@icexelloss icexelloss force-pushed the timestamp-vector-refactor branch from dc04f92 to 9732486 Compare November 22, 2017 19:34
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be more efficient to have a toMillis method on TimeUnit (and encapsulate this detail per law of demeter)? I guess we need to run some microbenchmarks to be able to judge performance

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I encapsulated java TimeUnit detail.

Also I ran a microbench toMilliis for 1B long values. Performance is the same.

@wesm
Copy link
Member

wesm commented Nov 28, 2017

Needs rebase after ARROW-1710. If we are going to change this class structure, 0.8.0 would be the right time to do it. I'm concerned about delaying the release much further though -- if we can't get the Java work wrapped up this week, we can probably take another week, but we're getting into the red zone as far as the Spark 2.3.0 timeline is concerned

@icexelloss icexelloss force-pushed the timestamp-vector-refactor branch from 9ab65d5 to b20f3d0 Compare November 28, 2017 19:57
@icexelloss
Copy link
Contributor Author

@wesm I rebased after ARROW-1770. This should be ready.

@jacques-n can you take a look? I'd really like this to get in 0.8 if this looks good. Thanks

Copy link
Member

@BryanCutler BryanCutler Nov 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is is possible to just have getWriter(ArrowType)? Having 2 methods so similar is a little confusing.. This is because you need a writer for timestamp type params too?

Copy link
Contributor Author

@icexelloss icexelloss Nov 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree ideally we should have just one. However currently, this is needed to support this:

TimestampWriter timestampWriter = rootWriter.timestamp("a", TimeUnit.SECOND, "America/New_York");
timestampWriter.writeTimestamp(1000)

The second line here, the type of "timestampWriter" is a promotable writer and it needs to lookup the writer by MinorType.TIMESTAMP

We need to do more refactoring to support this API without abstract protected FieldWriter getWriter(MinorType type), but since this is going to be internal change and this PR is already quite complicated, I prefer this to be a follow up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, sounds good

compiles. Need to fix tests.

Fix tests

Tests passed

Fix error

Remove println

Rebase

Fix union reader/writer. Add more tests

Add testTimestampVector test; Modify unionWriter template

Hide java TimeUnit in arrow TimeUnit
@wesm wesm force-pushed the timestamp-vector-refactor branch from c04f15b to 6c33285 Compare December 1, 2017 16:54
@icexelloss icexelloss closed this Dec 7, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants