Skip to content

Conversation

@rok
Copy link
Member

@rok rok commented Jul 2, 2021

This is to resolve ARROW-13174.

@github-actions
Copy link

github-actions bot commented Jul 2, 2021

@rok rok force-pushed the ARROW-13174 branch 3 times, most recently from 2436df1 to 45acfcd Compare July 9, 2021 00:06
@rok rok force-pushed the ARROW-13174 branch 7 times, most recently from 652d46d to 78d3ffc Compare July 9, 2021 22:31
@rok
Copy link
Member Author

rok commented Jul 9, 2021

@westonpace strftime kernel is almost ready for review.
I do need to take care of the string encoding.

@rok rok force-pushed the ARROW-13174 branch 6 times, most recently from cfa23ee to 9e8b775 Compare July 12, 2021 23:44
@rok rok marked this pull request as ready for review July 13, 2021 00:21
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this. Looks great.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the default format string include the trailing %z so that it is ISO-8601 compliant?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. Will add.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that %z adds a trailing z as in the ISO format. It adds a numeric offset (eg https://www.cplusplus.com/reference/ctime/strftime/):

In [44]: datetime.datetime(2012, 1, 1, tzinfo=datetime.timezone.utc).strftime("%Y-%m-%dT%H:%M:%S%z")
Out[44]: '2012-01-01T00:00:00+0000'

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to wikipedia these are valid formats:

  1. 2021-07-15T11:24:54+00:00
  2. 2021-07-15T11:24:54Z
  3. 20210715T112454Z

%z will give us option 1. I don't have a strong preference and pandas doesn't seem to have a default.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is significant effort I agree that +00:00 is good enough but I personally prefer Z if there is a quick option to enable it when the time zone is UTC.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to %Y-%m-%dT%H:%M:%SZ (2.) as default.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: How precise is this 30? Is it based on the default format string? If I look at 2021-07-12T23:55:41+00:00 I would think you only need ~25, maybe more if subsecond resolution. Could we base this on the resolution? It's not a big deal so if it seems like too much don't worry.

Copy link
Member Author

@rok rok Jul 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about a random date's string size times a fudge factor? :)

int expected_string_size = round(get_timestamp<Duration>(0, &options).size() * 1.1);
RETURN_NOT_OK(string_builder->Reserve(in.length * expected_string_size));

@rok rok force-pushed the ARROW-13174 branch 5 times, most recently from 2033695 to 8ce4ab4 Compare July 13, 2021 14:22
@rok
Copy link
Member Author

rok commented Jul 13, 2021

Thanks for the review @westonpace!
I've gone through it and pushed appropriate changes.

One thing I'm not sure about is if we should be using date.h for formatting here or would it be better to use system / c++ equivalents.

@pitrou
Copy link
Member

pitrou commented Aug 17, 2021

Yes, you can. Make it return a Result<const time_zone*>

@rok
Copy link
Member Author

rok commented Aug 17, 2021

Yes, you can. Make it return a Result<const time_zone*>

@pitrou Done. Please review.

Comment on lines +420 to +436
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note this is only testing for C locale because that is all that is available in CI at the moment. Ideally we would cover several.

@pitrou pitrou force-pushed the ARROW-13174 branch 2 times, most recently from ff49e17 to 4b1d68a Compare August 18, 2021 14:00
@pitrou
Copy link
Member

pitrou commented Aug 18, 2021

Will merge if green.

@pitrou pitrou closed this in 2ef0edc Aug 18, 2021
@rok
Copy link
Member Author

rok commented Aug 18, 2021

Nice! Thanks @westonpace @jorisvandenbossche @pitrou

@jorisvandenbossche
Copy link
Member

Sorry for the slow response here, but I think there are still a few behavioural aspects to fix/clarify:

  • Related to @westonpace's comment above (ARROW-13174: [C++][Compute] Add strftime kernel #10647 (comment)), you added a "Z" to the default format. However, this is only correct if you have a UTC timezone, and not for any other timezone. For example:
    >>> ts = pd.to_datetime(["2018-03-10 09:00"]).tz_localize("US/Eastern")
    >>> ts
    DatetimeIndex(['2018-03-10 09:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None)
    >>> tsa = pa.array(ts)
    >>> tsa
    <pyarrow.lib.TimestampArray object at 0x7f7350b087c0>
    [
      2018-03-10 14:00:00.000000000
    ]
    
    >>> pc.strftime(tsa)
    <pyarrow.lib.StringArray object at 0x7f7350b74a60>
    [
      "2018-03-10T09:00:00.000000000Z"
    ]
    So it's correctly showing the timestamp in the timezone's local time, but thus the "Z" indicator for UTC is wrong (the correct UTC time is 14:00, not 09:00).
    I think we should only add the "Z" indicator if the timezone is UTC. I am not fully sure what we should then use as default format for non-UTC timezones though: don't show any timezone information, include a numeric offset, or error.
    That would also mean that the "default" format string would depend on the input type of the data, which might not be easy / desirable.
  • I commented about the timezone handling when the initial PR had a keyword for this, but I forgot to reply after you removed that keyword (and support for local timestamps) altogether. But, what's the reasoning for disallowing local timestamps without timezone? I don't think there is any ambiguity in how they would be formatted? (after all, they represent "clock" time, which in the end is kind of a formatted string)

  • There was some discussion above about the behaviour of %S (ARROW-13174: [C++][Compute] Add strftime kernel #10647 (comment)), where date.h / C++ handles it differently as Python or R (i.e. we are including the fractional sub-second decimals, and there is no easy way to only show integer seconds apart from casting to timestamp("s") first AFAIK).
    Since there are conflicting standards vs language implementations, there is no easy way to solve this. But I think it would be good to at least document this difference (it will be surprising for Python/R users) and how to work-around it.

@rok
Copy link
Member Author

rok commented Aug 20, 2021

Thanks for the review @jorisvandenbossche !
I think we need to address these points - I'll open a follow-up JIRA and do some work in a week or so.

  1. Indeed, current situation (Z after non UTC string) is wrong. I think your suggestion (check input timezone and append Z to default format if UTC) would be ok.
  2. I forget what the thought was at the time. But looking at our discussion it might have been a date.h limitation. I'll look into it.
  3. Agreed! We could also look into upstreaming a behavior flag to date.h but I think we'd need stronger motivation.

@rok
Copy link
Member Author

rok commented Aug 20, 2021

@jorisvandenbossche
Copy link
Member

Thanks!

@rok
Copy link
Member Author

rok commented Aug 25, 2021

Follow up PR #10998

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants