Skip to content

Conversation

@rok
Copy link
Member

@rok rok commented Jun 28, 2021

This is to resolve ARROW-13033.

@github-actions
Copy link

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given all the discussion on the JIRA issue, can you a bit more clearly state what exactly you implemented and what part it covers?
For example, the current kernel returns int64 and not timestamp with timezone?

It also seems that the implementation is converting system time (UTC epoch) to local (naive) time, and not the other way around as the docstring says.

(also, given all the confusion also on the JIRA, we should probably try to come up with a more explicit/descriptive name ...)

@rok rok force-pushed the ARROW-13033 branch 2 times, most recently from 3b14063 to 7fa50a6 Compare July 2, 2021 00:37
@rok
Copy link
Member Author

rok commented Jul 2, 2021

Thanks for the review @jorisvandenbossche :).
I've gone through your feedback and done some of the boilerplate except for the correct output timestamp type. I'll try to do it over the weekend.

(also, given all the confusion also on the JIRA, we should probably try to come up with a more explicit/descriptive name ...)

I think there is no consensus on this yet. tz_convert, tz_localize were mentioned. I'd add tz_normalize. Any other?

@jorisvandenbossche
Copy link
Member

@rok can you explain the rationale of adding a source_timezone and destination_timezone to the API? What's the reason for doing it like this, instead of having a single target timezone as argument?

@rok
Copy link
Member Author

rok commented Jul 2, 2021

@rok can you explain the rationale of adding a source_timezone and destination_timezone to the API? What's the reason for doing it like this, instead of having a single target timezone as argument?

In case we start with "naive" a timestamp we don't know the source and target timezone so both need to be specified. Or am I missing something here?

@jorisvandenbossche
Copy link
Member

What's the difference between both timezones? In my mind this is the same time zone? If you want to convert your localized timezone-aware timestamp to another timezone after localizing, you can use another kernel to do that? (a kernel we don't have yet, but one we should also add, although this is a trivial, metadata-only change)

@rok
Copy link
Member Author

rok commented Jul 2, 2021

If you want to convert your localized timezone-aware timestamp to another timezone after localizing, you can use another kernel to do that? (a kernel we don't have yet, but one we should also add, although this is a trivial, metadata-only change)

I was thinking only of the metadata change not "materialization" of a target timezone. Having another kernel to only change metadata seems redundant.

@jorisvandenbossche
Copy link
Member

Having another kernel to only change metadata seems redundant.

I don't think that's redundant, as a user will regularly want to change the timezone of a timestamp column that is already localized (already has a timezone).

I am of course coming from the point of view of pandas, where those two operations are defined as two distinct methods: tz_localize to convert from "timestamp without timezone" to "timestamp with timezone" (i.e. from naive clock time to tz-aware time) and tz_convert to convert between "timestamp with timezone" with different timezones (a metadata-only change).
From a user point of view (and also implementation-wise), those two are distinct concepts, IMO, and so I would keep them as separate kernels. (also eg R's lubridate has distinct force_tz and with_tz, respectively)

@jorisvandenbossche
Copy link
Member

(if we keep both source and destination timezone in the "localize" kernel, I think at least the destination timezone, if not specified, should default to be the same as the source timezone, and not "UTC", so that a user doesn't have to specify the timezone they want twice, like localize(timestamp, source_timezone="Europe/Brussels", destination_timezone="Europe/Brussels") to end up with a timestamp(unit, tz="Europe/Brussels) type column)

@jorisvandenbossche
Copy link
Member

I opened https://issues.apache.org/jira/browse/ARROW-13247 for the discussion whether we want a separate "change timezone" kernel

@rok
Copy link
Member Author

rok commented Jul 2, 2021

Having another kernel to only change metadata seems redundant.

I don't think that's redundant, as a user will regularly want to change the timezone of a timestamp column that is already localized (already has a timezone).

I am of course coming from the point of view of pandas, where those two operations are defined as two distinct methods: tz_localize to convert from "timestamp without timezone" to "timestamp with timezone" (i.e. from naive clock time to tz-aware time) and tz_convert to convert between "timestamp with timezone" with different timezones (a metadata-only change).
From a user point of view (and also implementation-wise), those two are distinct concepts, IMO, and so I would keep them as separate kernels. (also eg R's lubridate has distinct force_tz and with_tz, respectively)

I suppose these are two different operations indeed. I've changed this to only do "localization" now.

I like force_tz as it indicates the operation is a bit "dangerous". Candidates are then:

  • force_tz
  • tz_localize
  • tz_normalize

@adamhooper
Copy link
Contributor

Since the Arrow spec is going to refer to "LocalDateTime", "Instant" and "ZonedDateTime", how about reusing terms from the spec? Something along the lines of local_datetime_to_zoned_datetime()?

@rok
Copy link
Member Author

rok commented Jul 2, 2021

Since the Arrow spec is going to refer to "LocalDateTime", "Instant" and "ZonedDateTime", how about reusing terms from the spec? Something along the lines of local_datetime_to_zoned_datetime()?

It's nice that it's idiomatic if a bit verbose :)

Let's add it to the list. If we don't get to consensus here we can always call a vote on the ML.

  • force_tz
  • tz_localize
  • tz_normalize
  • local_datetime_to_zoned_datetime

@westonpace
Copy link
Member

I like all except tz_normalize. "normalize" just has too many meanings for me already.

@jorisvandenbossche
Copy link
Member

@rok I was checking out this PR, and in the process wrote python bindings for the option class to be able to test it in Python, so thought to directly push that here as well. (and at the same time also fixed (I think) the initialization of the timezone, which was causing the C++ tests to fail here as well)

And from testing, you will still need to handle the errors raised by date.h:

In [21]: arr = pa.array([pd.Timestamp("2021-03-28 02:30:00")], type=pa.timestamp("ns"))

In [22]: pc.localize(arr, options=pc.TemporalLocalizationOptions("Europe/Brussels"))
terminate called after throwing an instance of 'arrow_vendored::date::nonexistent_local_time'
  what():  2021-03-28 02:30:00.000000000 is in a gap between
2021-03-28 02:00:00 CET and
2021-03-28 03:00:00 CEST which are both equivalent to
2021-03-28 01:00:00 UTC
Aborted (core dumped)

@rok
Copy link
Member Author

rok commented Jul 6, 2021

@rok I was checking out this PR, and in the process wrote python bindings for the option class to be able to test it in Python, so thought to directly push that here as well. (and at the same time also fixed (I think) the initialization of the timezone, which was causing the C++ tests to fail here as well)

Thanks @jorisvandenbossche! The new option initialization was indeed problematic.

@rok rok force-pushed the ARROW-13033 branch 8 times, most recently from 7c1ebb5 to c5a62a9 Compare July 7, 2021 12:16
@pitrou
Copy link
Member

pitrou commented Aug 25, 2021

I'm going to suggest assume_timezone.

@rok
Copy link
Member Author

rok commented Aug 25, 2021

What would be a good way to reach consensus here? A doodle poll to ML to see what people find most intuitive?

@pitrou
Copy link
Member

pitrou commented Aug 26, 2021

Well, I don't think anything is intuitive here, because the semantics of timestamps in Arrow are slightly weird and surprising :-)

You may still ask for opinions on the ML, but I would personally go with assume_timezone. Users will have to read the documentation in any case.

@rok
Copy link
Member Author

rok commented Aug 26, 2021

Well, I don't think anything is intuitive here, because the semantics of timestamps in Arrow are slightly weird and surprising :-)

You mean timestamps and timezones in general are weird (because the underlying problem is counter intuitive)? If not we should really make this as simple as the underlying complexity is.

@rok rok force-pushed the ARROW-13033 branch 3 times, most recently from 551cb2c to 76ad88f Compare September 2, 2021 11:03
@rok rok requested a review from pitrou September 2, 2021 14:56
@rok
Copy link
Member Author

rok commented Sep 2, 2021

@pitrou I've changed this to assume_timezone (AssumeTimezone, AssumeTimezoneOptions).
Discussion on Zulip seems to be in favor of assume_timestamp.

Would still be good to hear @jorisvandenbossche @adamhooper @westonpace

@adamhooper
Copy link
Contributor

@rok Thanks for asking! I prefer assume_timezone, too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should be here. Is it deliberate or just a leftover?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. Probably a bad rebase.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand what this is for. Presumably this should be unreachable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was just to satisfy the compiler.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the status but kept the return 0 to keep he compiler happy. Please check.

@rok rok force-pushed the ARROW-13033 branch 3 times, most recently from 008bf23 to 87ddd31 Compare September 6, 2021 17:12
@rok rok requested a review from pitrou September 6, 2021 19:04
@rok
Copy link
Member Author

rok commented Sep 6, 2021

CI fail seems unrelated.

@pitrou
Copy link
Member

pitrou commented Sep 9, 2021

Ok, I've pushed some minor changes and will merge once CI is green. Thanks a lot for this @rok !

@rok
Copy link
Member Author

rok commented Sep 9, 2021

Let me rebase that.

@pitrou
Copy link
Member

pitrou commented Sep 9, 2021

@rok I'm on it already.

Work.

Removing destination_timezone.

fix initialization of timezone + add python bindings

Renaming localize to tz_localize.

Adding ambiguous and nonexistent handling.

Removing R wrapper.

Reintroducing R wrapper.

Issue with date.h building on rtools4.0.

Review feedback.

tz_localize -> assume_timezone

Review feedback.

Changes to nonexistent handling.

More changes to nonexistent handling.

* Documentation changes
* Simplify kernel generation
@rok
Copy link
Member Author

rok commented Sep 9, 2021

ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021
…(preserving clock-time)

This is to resolve [ARROW-13033](https://issues.apache.org/jira/browse/ARROW-13033).

Closes apache#10610 from rok/ARROW-13033

Authored-by: Rok <rok@mihevc.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants