
Conversation

@lidavidm (Member) commented Oct 7, 2021

For ISO8601, this seems to have a small (~5%) impact on benchmarks.

For strptime, %z is only supported on platforms exposing tm_gmtoff in struct tm. %Z is still ignored; it seems implementations don't really support it anyway. (For instance, GNU libc will skip over the time zone, omitting it from the result.)
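
A rough sketch of the intended %z behaviour from Python (illustrative only; the example string and the expected result are assumptions based on the discussion below, and %z parsing requires a platform with tm_gmtoff):

>>> import pyarrow.compute as pc
>>> # The +0100 offset is consumed by %z and the value normalized,
>>> # instead of the offset being skipped over as GNU libc does.
>>> pc.strptime(["2021-01-01 09:00:00+0100"], format="%Y-%m-%d %H:%M:%S%z", unit="s")
>>> # expected: a timestamp array holding 2021-01-01 08:00:00 (UTC)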

@github-actions bot commented Oct 7, 2021

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@lidavidm (Member Author) commented Oct 7, 2021

For any R experts, I'm not quite sure what to do about the R bindings here:

# ParseTimestampStrptime currently ignores the timezone information (ARROW-12820).

To me it seems like what R does is convert the timezone after parsing, i.e. we need a timezone conversion kernel, and it's not related to actually parsing the value as implied by the comment.

@lidavidm marked this pull request as ready for review October 7, 2021 21:18
@rok (Member) commented Oct 8, 2021

To me it seems like what R does is convert the timezone after parsing, i.e. we need a timezone conversion kernel, and it's not related to actually parsing the value as implied by the comment.

Probably we can solve this by casting to the desired timezone after parsing. See strftime.
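
For instance (a minimal sketch of the parse-then-cast idea; it assumes parsing with %z yields timestamp[s, tz=UTC], in which case the follow-up cast to another zone only relabels the metadata):

>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> ts = pc.strptime(["2021-01-01 09:00:00+0100"], format="%Y-%m-%d %H:%M:%S%z", unit="s")
>>> # timestamp[tz] -> timestamp[tz] is a metadata-only cast; the stored
>>> # values stay UTC-based, only the interpretation zone changes.
>>> ts.cast(pa.timestamp("s", tz="Europe/Brussels"))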

Member

Should the expected type include timezone="UTC"?
That would preserve the fact that the strings actually were timezone-aware, and were shifted to UTC.

@jorisvandenbossche (Member) Oct 8, 2021

Although I suppose a complication is that we should only do this if all strings in the array have an offset

Member Author

I was debating this. I suppose if you have %z, then you expect an offset so we should return a timestamp with timezone. Do we also want to do this for the ISO8601 parser? That would have implications in a lot of places, e.g. CSV type inference/parsing, or (I think) pyarrow.array inferring timestamps from strings.

Member

I suppose ISO8601 parser should be as close to the standard as possible to avoid surprises?

Member Author

What do you mean? I'm talking about whether the return type of the ISO8601 parser (or really, the places that use the parser) should reflect whether there was a zone offset in the input string(s) or not.

Member

Error by default is fine and correct if there are a mix of naive and aware timestamps.

If it is a matter of varying offsets (but all aware timestamps) then it would probably be also correct to just pick a timezone (e.g. UTC) and use that for everything. In fact, it would probably even be valid to just always use UTC if all timestamps are aware. The timezone is really more of a consumption-time concern than a production-time concern (i.e. it is probably most likely going to be converted to the local timezone of the consuming user).
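
A sketch of what "always use UTC when all values are aware" would mean for CSV inference (illustrative input; the expected output reflects the behaviour being proposed here, not current master):

>>> import io
>>> import pyarrow.csv as csv
>>> s = """col
... 2021-06-01 09:00:00+02:00
... 2021-12-01 09:00:00+01:00
... """
>>> csv.read_csv(io.BytesIO(s.encode()))
>>> # expected: col inferred as timestamp[s, tz=UTC], holding
>>> # 2021-06-01 07:00:00 and 2021-12-01 08:00:00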

Member Author

Makes sense, thanks all for the comments.

I'm OOO this week but when I get a chance next, I'll update the CSV parser to track the zone offset and return either "UTC", no timezone, or an error. (If a user wants to preserve a consistent non-UTC offset that can be tackled later.) I think casting is another place that needs to be updated, as well as Python pyarrow.array inference (though that may just also use casting? Not sure off the top of my head).

Member

If it is a matter of varying offsets (but all aware timestamps) then it would probably be also correct to just pick a timezone (e.g. UTC) and use that for everything. In fact, it would probably even be valid to just always use UTC if all timestamps are aware.

Yes, I agree we could always use UTC, even if the offsets are all the same. Having varying offsets is quite normal, if you have data across a DST, so I think we should handle that by default (and return in UTC).

Member

UTC default sounds good!

Member Author

Updated the CSV reader and added a doc blurb.

I ignored casting since the user specifies the timezone (or lack thereof) so presumably it's up to them to do any adjustments they want, and pyarrow.array doesn't infer timestamps from strings (I thought it did, apparently not).
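
For reference, a quick check (not from the PR itself) of the pyarrow.array behaviour mentioned above: timestamp-looking strings stay strings.

>>> import pyarrow as pa
>>> pa.array(["2021-01-01 09:00:00"]).type
DataType(string)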

@lidavidm (Member Author)

There are some errors here for R/MacOS that I'll fix when I get a chance.

Member

Reading this now, I am not sure this is a good idea. At least my initial expectation, if parsing a string like "2021-01-01 09:00" and saying the type of that column should be timestamp("us", tz="Europe/Brussels"), would be that the string is interpreted in the timezone I am explicitly passing.

Member Author

This is what currently happens without this PR, but I hear you. I'll give this a fix

Member

Ah, I didn't try it with actual code on current master. I was looking at the last changes, which gave me the impression you had changed this compared to how it was working before ;)

Hmm, so if we change that, that's another change in behaviour.

@jorisvandenbossche (Member) Oct 20, 2021

So currently we have this behaviour:

In [26]: s = """col
    ...: 2021-01-01 09:00:00
    ...: """

In [27]: csv.read_csv(io.BytesIO(s.encode()))
Out[27]: 
pyarrow.Table
col: timestamp[s]
----
col: [[2021-01-01 09:00:00]]

In [28]: s2 = """col
    ...: 2021-01-01 09:00:00+01:00
    ...: """

In [29]: csv.read_csv(io.BytesIO(s2.encode()))
Out[29]: 
pyarrow.Table
col: string
----
col: [["2021-01-01 09:00:00+01:00"]]

So with an offset the "inference" doesn't actually infer timestamp (does this PR change that?).

And when explicitly mentioning the type for values without a timezone offset:

In [35]: csv.read_csv(io.BytesIO(s.encode()), convert_options=csv.ConvertOptions(column_types={"col": pa.timestamp('s')}))
Out[35]: 
pyarrow.Table
col: timestamp[s]
----
col: [[2021-01-01 09:00:00]]

In [36]: csv.read_csv(io.BytesIO(s.encode()), convert_options=csv.ConvertOptions(column_types={"col": pa.timestamp('s', tz="Europe/Brussels")}))
Out[36]: 
pyarrow.Table
col: timestamp[s, tz=Europe/Brussels]
----
col: [[2021-01-01 09:00:00]]

So here we indeed kind of "ignore" the timezone of the specified type and kind of assume the naive strings are in UTC.

Member

Now, that is the same for casting strings to timestamp, though:

In [46]: arr = pa.array(["2021-01-01 09:00:00"])

In [47]: arr.cast(pa.timestamp('s', tz='Europe/Brussels'))
Out[47]: 
<pyarrow.lib.TimestampArray object at 0x7f95c9d72fa0>
[
  2021-01-01 09:00:00
]

CSV parsing and casting strings should probably behave the same in this respect. But personally I would argue that both are "wrong" (or at least unexpected to me; I would rather have it error than silently interpret the strings as UTC).

Member

So this means that read_csv(...).column("start_time").cast(pa.timestamp('s', 'Europe/Brussels')) would give a different answer than read_csv(..., types={'start_time': pa.timestamp('s', 'Europe/Brussels')}).column("start_time").

I think we should ensure those two are equivalent. If we interpret naive strings as local time when specifying a timezone-aware type in the CSV parsing, I think casting should have the same behaviour.

Member Author

Talking to Neal offline it looks like the test isn't really meant to check this case, plus he noted we could always start with an error and make it more implicit later - so I'll roll these changes back (and update the table below as noted by Joris).

Member Author

Oh, I missed the casting comment - I thought casting to a different timezone was always assumed to be a metadata-only operation? i.e. it wouldn't change the values.

Member

Yes, but I assumed that it was about a string -> timestamp[tz] cast (and so this should IMO be consistent for csv parsing vs explicit casting).

But, if the CSV reader infers timestamp, the example of Weston is actually doing a timestamp -> timestamp[tz] cast.
Now, personally, I also think that such a cast is ambiguous (it's only for a timestamp[tz] -> timestamp[tz] cast that I find it clear that this will be a metadata-only operation)

Member Author

Ah, yes, string->timestamp[tz] should be consistent with the CSV reader, I agree.

In this case, the CSV reader would normally infer timestamp, yes. I would argue the conversion should be handled by assume_timezone and that the cast should be metadata-only. (In general, our casts are a mix of conversions and "reinterpretations" of data, which gets a little confusing...)
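
A minimal sketch of that division of responsibility, assuming the assume_timezone kernel's Python binding is used for the wall-clock interpretation while the cast stays metadata-only:

>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> naive = pa.array(["2021-01-01 09:00:00"]).cast(pa.timestamp("s"))
>>> # assume_timezone interprets naive values as wall-clock time in the given
>>> # zone and returns zone-aware, UTC-based timestamps (here shifted by +5h).
>>> pc.assume_timezone(naive, timezone="America/New_York")
>>> # a plain cast to pa.timestamp("s", tz="America/New_York") would instead
>>> # only attach the zone without touching the stored values.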

Member

If we keep the behaviour as is now (related to my comment above), I would certainly add the case of timestamp[s, non-UTC-tz] in this table, to clearly document that behaviour as well.

@jorisvandenbossche (Member)

I ignored casting since the user specifies the timezone (or lack thereof) so presumably it's up to them to do any adjustments they want, and pyarrow.array doesn't infer timestamps from strings (I thought it did, apparently not).

What do you mean exactly by "ignored casting"?
I haven't checked out this branch yet, but currently on master this actually fails:

In [17]: arr = pa.array(["2021-01-01 09:00:00+01:00"])

In [18]: arr.cast(pa.timestamp("s", tz="UTC"))
...
ArrowInvalid: Failed to parse string: '2021-01-01 09:00:00+01:00' as a scalar of type timestamp[s, tz=UTC]

Will this work and take into account the timezone offset?

@lidavidm (Member Author)

I ignored casting since the user specifies the timezone (or lack thereof) so presumably it's up to them to do any adjustments they want, and pyarrow.array doesn't infer timestamps from strings (I thought it did, apparently not).

What do you mean exactly by "ignored casting"? I haven't checked out this branch yet, but currently on master this actually fails:

Sorry, I meant that I didn't really test it.

In [17]: arr = pa.array(["2021-01-01 09:00:00+01:00"])

In [18]: arr.cast(pa.timestamp("s", tz="UTC"))
...
ArrowInvalid: Failed to parse string: '2021-01-01 09:00:00+01:00' as a scalar of type timestamp[s, tz=UTC]

Will this work and take into account the timezone offset?

I'll test this more thoroughly.

@lidavidm (Member Author)

This is what happens on this branch:

>>> pa.array(["2021-01-01 09:00:00"]).cast(pa.timestamp("s"))
<pyarrow.lib.TimestampArray object at 0x7f312c111760>
[
  2021-01-01 09:00:00
]
>>> pa.array(["2021-01-01 09:00:00"]).cast(pa.timestamp("s", tz="UTC"))
<pyarrow.lib.TimestampArray object at 0x7f312c111700>
[
  2021-01-01 09:00:00
]
>>> pa.array(["2021-01-01 09:00:00"]).cast(pa.timestamp("s", tz="America/New_York"))
<pyarrow.lib.TimestampArray object at 0x7f312c111760>
[
  2021-01-01 09:00:00
]
>>> pa.array(["2021-01-01 09:00:00-0500"]).cast(pa.timestamp("s", tz="America/New_York"))
<pyarrow.lib.TimestampArray object at 0x7f312c111700>
[
  2021-01-01 14:00:00
]
>>> pa.array(["2021-01-01 09:00:00+0500"]).cast(pa.timestamp("s"))
<pyarrow.lib.TimestampArray object at 0x7f312c111520>
[
  2021-01-01 04:00:00
]

so as with CSV, this needs to be fixed - I'll take a look.

@lidavidm (Member Author)

Ah, we probably then want an option (for both cast/CSV parsing), much like the assume_timezone kernel, that controls what to do with ambiguous or nonexistent local times.

I also realize this needs to account for what to do with custom timestamp parsers in CSV…
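
For context, the corresponding knobs on the existing assume_timezone kernel look roughly like this (a sketch; whatever option cast/CSV parsing grow would presumably mirror these, but those names were not settled here):

>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> naive = pa.array(["2021-11-07 01:30:00"]).cast(pa.timestamp("s"))
>>> # 01:30 occurs twice on 2021-11-07 in America/New_York (DST fall-back),
>>> # so the caller must choose how to resolve the ambiguity.
>>> pc.assume_timezone(naive, timezone="America/New_York", ambiguous="earliest")
>>> # nonexistent="raise"/"earliest"/"latest" similarly handles spring-forward gaps.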

@lidavidm (Member Author)

Casts are fixed, now to go update the CSV parser as well.

Member

cur < options.format.size() - 1 perhaps?

Member

I don't think the templating is useful. Parsing the timestamp should be more costly than a mostly predictable branch.

Member Author

Removed the templating.

Filed ARROW-14581 for the Travis test failure.

Member

Similarly, I don't think templating is terribly useful here.

@lidavidm force-pushed the arrow-12820 branch 2 times, most recently from fdf33a8 to 98b1b52, on November 5, 2021 13:30
@lidavidm (Member Author) commented Nov 5, 2021

Rebased & fixed conflicts.

Member

The first condition should be superfluous now :-)

@jorisvandenbossche (Member) left a comment

Looks good!

Member

Also test strings here for the case of fully unzoned? (although in code that probably triggers the same check as the mixed case?)

Member

Also add the case of parsing a string with trailing Z?

That now seems to work with this PR:

In [26]: pc.strptime(["2012-01-01 09:00:00Z"], format="%Y-%m-%d %H:%M:%S%z", unit="s")
Out[26]: 
<pyarrow.lib.TimestampArray object at 0x7fb709a65760>
[
  2012-01-01 09:00:00
]

Member

Ah, so this actually already worked before as well, but the resulting type is different (timezone-naive vs. aware). So it might still be worth checking that change in type explicitly, unless that is already covered elsewhere in the tests.

@lidavidm (Member Author) Nov 8, 2021

A couple things here:

  • BSD strptime doesn't support "Z" as noted in the comment. It only supports the syntax here, making it hard to test.
  • The result type is based on the presence of a "%z" so that's covered here already.
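
The type difference can be sketched like this (illustrative; the offset is written as ±HHMM since that is the syntax BSD strptime accepts):

>>> import pyarrow.compute as pc
>>> pc.strptime(["2012-01-01 09:00:00"], format="%Y-%m-%d %H:%M:%S", unit="s").type
>>> # timestamp[s]           (no %z in the format -> timezone-naive)
>>> pc.strptime(["2012-01-01 09:00:00+0100"], format="%Y-%m-%d %H:%M:%S%z", unit="s").type
>>> # timestamp[s, tz=UTC]   (%z present -> zone-aware, values normalized to UTC)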

@pitrou (Member) left a comment

+1, thank you :-)

@pitrou closed this in a9f2091 Nov 10, 2021
@ursabot commented Nov 10, 2021

Benchmark runs are scheduled for baseline = 2b10648 and contender = a9f2091. a9f2091 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.51% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.66% ⬆️0.13%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True
