ARROW-12820: [C++] Support zone offset in ISO8601, strptime parser #11358
Conversation
For any R experts, I'm not quite sure what to do about the R bindings here: Line 701 in 3cacc85
To me it seems like what R does is convert the timezone after parsing, i.e. we need a timezone conversion kernel, and it's not related to actually parsing the value as implied by the comment.
Probably we can solve this by casting to the desired timezone after parsing. See strftime.
Should the expected type include timezone="UTC"?
That would preserve the fact that the strings actually were timezone-aware, and were shifted to UTC.
Although I suppose a complication is that we should only do this if all strings in the array have an offset
I was debating this. I suppose if you have %z, then you expect an offset so we should return a timestamp with timezone. Do we also want to do this for the ISO8601 parser? That would have implications in a lot of places, e.g. CSV type inference/parsing, or (I think) pyarrow.array inferring timestamps from strings.
I suppose the ISO8601 parser should be as close to the standard as possible to avoid surprises?
What do you mean? I'm talking about whether the return type of the ISO8601 parser (or really, the places that use the parser) should reflect whether there was a zone offset in the input string(s) or not.
Error by default is fine and correct if there is a mix of naive and aware timestamps.
If it is a matter of varying offsets (but all aware timestamps) then it would probably be also correct to just pick a timezone (e.g. UTC) and use that for everything. In fact, it would probably even be valid to just always use UTC if all timestamps are aware. The timezone is really more of a consumption-time concern than a production-time concern (i.e. it is probably most likely going to be converted to the local timezone of the consuming user).
Makes sense, thanks all for the comments.
I'm OOO this week but when I get a chance next, I'll update the CSV parser to track the zone offset and return either "UTC", no timezone, or an error. (If a user wants to preserve a consistent non-UTC offset that can be tackled later.) I think casting is another place that needs to be updated, as well as Python pyarrow.array inference (though that may just also use casting? Not sure off the top of my head).
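For illustration, a rough sketch of the inference behaviour described above, assuming it lands as stated (the schema and values are shown as expected, not verified output):
import io
from pyarrow import csv

# Every row carries a zone offset, so the column should be inferred as
# timestamp with timezone="UTC" and the values shifted to UTC.
data = b"col\n2021-01-01 09:00:00+01:00\n2021-01-01 10:00:00+02:00\n"
table = csv.read_csv(io.BytesIO(data))
# expected schema: col: timestamp[s, tz=UTC]
# expected values: 2021-01-01 08:00:00 for both rows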
If it is a matter of varying offsets (but all aware timestamps) then it would probably be also correct to just pick a timezone (e.g. UTC) and use that for everything. In fact, it would probably even be valid to just always use UTC if all timestamps are aware.
Yes, I agree we could always use UTC, even if the offsets are all the same. Having varying offsets is quite normal if you have data spanning a DST transition, so I think we should handle that by default (and return in UTC).
UTC default sounds good!
Updated the CSV reader and added a doc blurb.
I ignored casting since the user specifies the timezone (or lack thereof) so presumably it's up to them to do any adjustments they want, and pyarrow.array doesn't infer timestamps from strings (I thought it did, apparently not).
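A quick check of that last point, for reference:
import pyarrow as pa

# pyarrow.array leaves timestamp-looking strings as strings; no inference happens here.
pa.array(["2021-01-01 09:00:00"]).type
# -> string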
There are some errors here for R on macOS that I'll fix when I get a chance.
docs/source/cpp/csv.rst
Reading this now, I am not sure this is a good idea. At least my initial expectation, if parsing a string like "2021-01-01 09:00" and saying the type of that column should be timestamp("us", tz="Europe/Brussels"), would be that the string is interpreted in the timezone I am explicitly passing.
This is what currently happens without this PR, but I hear you. I'll give this a fix.
Ah, I didn't try it with actual code on current master. I was looking at the latest changes, which gave me the impression you changed this compared to how it was working before ;)
Hmm, so if we change that, that's another change in behaviour.
So currently we have this behaviour:
In [26]: s = """col
...: 2021-01-01 09:00:00
...: """
In [27]: csv.read_csv(io.BytesIO(s.encode()))
Out[27]:
pyarrow.Table
col: timestamp[s]
----
col: [[2021-01-01 09:00:00]]
In [28]: s2 = """col
...: 2021-01-01 09:00:00+01:00
...: """
In [29]: csv.read_csv(io.BytesIO(s2.encode()))
Out[29]:
pyarrow.Table
col: string
----
col: [["2021-01-01 09:00:00+01:00"]]So with a offset the "inference" doesn't actually infer timestamp (does this PR change that?).
And when explicitly mentioning the type for values without a timezone offset:
In [35]: csv.read_csv(io.BytesIO(s.encode()), convert_options=csv.ConvertOptions(column_types={"col": pa.timestamp('s')}))
Out[35]:
pyarrow.Table
col: timestamp[s]
----
col: [[2021-01-01 09:00:00]]
In [36]: csv.read_csv(io.BytesIO(s.encode()), convert_options=csv.ConvertOptions(column_types={"col": pa.timestamp('s', tz="Europe/Brussels")}))
Out[36]:
pyarrow.Table
col: timestamp[s, tz=Europe/Brussels]
----
col: [[2021-01-01 09:00:00]]
So here we indeed kind of "ignore" the timezone of the specified type and kind of assume the naive strings are in UTC.
Now, that is the same for casting strings to timestamp, though:
In [46]: arr = pa.array(["2021-01-01 09:00:00"])
In [47]: arr.cast(pa.timestamp('s', tz='Europe/Brussels'))
Out[47]:
<pyarrow.lib.TimestampArray object at 0x7f95c9d72fa0>
[
2021-01-01 09:00:00
]
CSV parsing and casting strings should probably behave the same in this aspect. But personally I would argue that both are "wrong" (or at least unexpected to me); I would rather prefer it to error than silently interpret the strings as UTC.
So this means that
read_csv(...).column("start_time").cast(pa.timestamp('s', 'Europe/Brussels')) would give a different answer than read_csv(..., types={'start_time': pa.timestamp('s', 'Europe/Brussels')}).column("start_time").
I think we should ensure those two are equivalent. If we interpret naive strings as local time when specifying a timezone-aware type in the CSV parsing, I think casting should have the same behaviour.
Talking to Neal offline it looks like the test isn't really meant to check this case, plus he noted we could always start with an error and make it more implicit later - so I'll roll these changes back (and update the table below as noted by Joris).
Oh, I missed the casting comment - I thought casting to a different timezone was always assumed to be a metadata-only operation? I.e. it wouldn't change the values.
Yes, but I assumed that it was about a string -> timestamp[tz] cast (and so this should IMO be consistent for csv parsing vs explicit casting).
But, if the CSV reader infers timestamp, the example of Weston is actually doing a timestamp -> timestamp[tz] cast.
Now, personally, I also think that such a cast is ambiguous (it's only for a timestamp[tz] -> timestamp[tz] cast that I find it clear that this will be a metadata-only operation)
Ah, yes, string->timestamp[tz] should be consistent with the CSV reader, I agree.
In this case, the CSV reader would normally infer timestamp, yes. I would argue the conversion should be handled by assume_timezone and that the cast should be metadata-only. (In general, our casts are a mix of conversions and "reinterpretations" of data, which gets a little confusing...)
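For reference, a minimal sketch of the assume_timezone route mentioned here (kernel and option names as exposed in pyarrow.compute; treat the exact signature as an assumption):
import pyarrow as pa
import pyarrow.compute as pc

# Naive timestamps, e.g. as the CSV reader infers them for unzoned strings.
naive = pa.array(["2021-01-01 09:00:00"]).cast(pa.timestamp("s"))

# assume_timezone interprets the naive wall-clock values as local time in the
# given zone (a real conversion), while a timestamp[tz] -> timestamp[tz] cast
# would only change metadata.
aware = pc.assume_timezone(naive, timezone="Europe/Brussels")
# aware.type -> timestamp[s, tz=Europe/Brussels]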
docs/source/cpp/csv.rst
If we keep the behaviour as is now (related to my comment above), I would certainly add the case of timestamp[s, non-UTC-tz] in this table, to clearly document that behaviour as well.
What do you mean exactly by "ignored casting"? Will this work and take the timezone offset into account?
Sorry, I meant that I didn't really test it.
I'll test this more thoroughly.
This is what happens on this branch: as with CSV, this needs to be fixed - I'll take a look.
Ah, we probably then want an option (for both casting and CSV parsing), much like the assume_timezone kernel, that controls what to do with ambiguous or nonexistent local times. I also realize this needs to account for what to do with custom timezone parsers in CSV…
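As a point of comparison, the assume_timezone kernel already exposes such options; something along these lines (parameter names taken from pyarrow.compute, so treat them as an assumption for any future cast/CSV option):
import pyarrow as pa
import pyarrow.compute as pc

# 02:30 local time occurs twice in Europe/Brussels on the 2021 DST fall-back.
naive = pa.array(["2021-10-31 02:30:00"]).cast(pa.timestamp("s"))
pc.assume_timezone(naive, timezone="Europe/Brussels", ambiguous="raise")     # errors (the default)
pc.assume_timezone(naive, timezone="Europe/Brussels", ambiguous="earliest")  # picks the earlier instant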
Casts are fixed; now to update the CSV parser as well.
cur < options.format.size() - 1 perhaps?
I don't think the templating is useful. Parsing the timestamp should be more costly than a mostly predictable branch.
Removed the templating.
Filed ARROW-14581 for the Travis test failure.
cpp/src/arrow/csv/converter.cc
Similarly, I don't think templating is terribly useful here.
Rebased & fixed conflicts. |
The first condition should be superfluous now :-)
jorisvandenbossche left a comment
Looks good!
Also test strings here for the case of fully unzoned? (although in code that probably triggers the same check as the mixed case?)
Also add the case of parsing a string with trailing Z?
That now seems to work with this PR:
In [26]: pc.strptime(["2012-01-01 09:00:00Z"], format="%Y-%m-%d %H:%M:%S%z", unit="s")
Out[26]:
<pyarrow.lib.TimestampArray object at 0x7fb709a65760>
[
2012-01-01 09:00:00
]
Ah, so this actually already worked before as well, but the resulting type is different (timezone naive vs aware). So might still be worth checking that change in type explicitly, unless that is already covered elsewhere in the tests.
A couple of things here:
- BSD strptime doesn't support "Z" as noted in the comment. It only supports the syntax here, making it hard to test.
- The result type is based on the presence of a "%z", so that's covered here already (see the snippet below).
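For instance, something like the following should show the type difference (a sketch, assuming the behaviour added in this PR; outputs are expected, not verified):
import pyarrow.compute as pc

pc.strptime(["2012-01-01 09:00:00+0100"], format="%Y-%m-%d %H:%M:%S%z", unit="s").type
# -> timestamp[s, tz=UTC]   (offset consumed by %z, value shifted to UTC)
pc.strptime(["2012-01-01 09:00:00"], format="%Y-%m-%d %H:%M:%S", unit="s").type
# -> timestamp[s]           (no %z, the result stays timezone-naive)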
pitrou left a comment
+1, thank you :-)
Benchmark runs are scheduled for baseline = 2b10648 and contender = a9f2091. a9f2091 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
For ISO8601, this seems to have a small (~5%) impact on benchmarks.
For strptime, this is only supported on platforms exposing tm_gmtoff in struct tm. %Z is still ignored; it seems implementations don't really support it anyway. (For instance, GNU libc will skip over the time zone, omitting it from the result.)