ARROW-12820: [C++] Support zone offset in ISO8601, strptime parser #11358
Conversation
For any R experts, I'm not quite sure what to do about the R bindings here: Line 701 in 3cacc85
To me it seems like what R does is convert the timezone after parsing, i.e. we need a timezone conversion kernel, and it's not related to actually parsing the value as implied by the comment.
Probably we can solve this by casting to the desired timezone after parsing. See strftime.
Should the expected type include timezone="UTC"?
That would preserve the fact that the strings actually were timezone-aware, and were shifted to UTC.
Although I suppose a complication is that we should only do this if all strings in the array have an offset
I was debating this. I suppose if you have %z, then you expect an offset so we should return a timestamp with timezone. Do we also want to do this for the ISO8601 parser? That would have implications in a lot of places, e.g. CSV type inference/parsing, or (I think) pyarrow.array inferring timestamps from strings.
I suppose the ISO8601 parser should be as close to the standard as possible to avoid surprises?
What do you mean? I'm talking about whether the return type of the ISO8601 parser (or really, the places that use the parser) should reflect whether there was a zone offset in the input string(s) or not.
Error by default is fine and correct if there is a mix of naive and aware timestamps.
If it is a matter of varying offsets (but all aware timestamps) then it would probably be also correct to just pick a timezone (e.g. UTC) and use that for everything. In fact, it would probably even be valid to just always use UTC if all timestamps are aware. The timezone is really more of a consumption-time concern than a production-time concern (i.e. it is probably most likely going to be converted to the local timezone of the consuming user).
Makes sense, thanks all for the comments.
I'm OOO this week but when I get a chance next, I'll update the CSV parser to track the zone offset and return either "UTC", no timezone, or an error. (If a user wants to preserve a consistent non-UTC offset that can be tackled later.) I think casting is another place that needs to be updated, as well as Python pyarrow.array inference (though that may just also use casting? Not sure off the top of my head).
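For illustration, a rough sketch of the inference behaviour described above, assuming it lands as stated (the schema and values are shown as expected, not verified output):
import io
from pyarrow import csv

# Every row carries a zone offset, so the column should be inferred as
# timestamp with timezone="UTC" and the values shifted to UTC.
data = b"col\n2021-01-01 09:00:00+01:00\n2021-01-01 10:00:00+02:00\n"
table = csv.read_csv(io.BytesIO(data))
# expected schema: col: timestamp[s, tz=UTC]
# expected values: 2021-01-01 08:00:00 for both rows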
If it is a matter of varying offsets (but all aware timestamps) then it would probably be also correct to just pick a timezone (e.g. UTC) and use that for everything. In fact, it would probably even be valid to just always use UTC if all timestamps are aware.
Yes, I agree we could always use UTC, even if the offsets are all the same. Having varying offsets is quite normal if you have data spanning a DST transition, so I think we should handle that by default (and return in UTC).
UTC default sounds good!
Updated the CSV reader and added a doc blurb.
I ignored casting since the user specifies the timezone (or lack thereof) so presumably it's up to them to do any adjustments they want, and pyarrow.array doesn't infer timestamps from strings (I thought it did, apparently not).
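A quick check of that last point, for reference:
import pyarrow as pa

# pyarrow.array leaves timestamp-looking strings as strings; no inference happens here.
pa.array(["2021-01-01 09:00:00"]).type
# -> string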
There are some errors here for R on macOS that I'll fix when I get a chance.
docs/source/cpp/csv.rst
Reading this now, I am not sure this is a good idea. At least my initial expectation, if parsing a string like "2021-01-01 09:00" and saying the type of that column should be timestamp("us", tz="Europe/Brussels"), would be that the string is interpreted in the timezone I am explicitly passing.
This is what currently happens without this PR, but I hear you. I'll give this a fix.
Ah, I didn't try it with actual code on current master. I was looking at the latest changes, which gave me the impression you changed this compared to how it was working before ;)
Hmm, so if we change that, that's another change in behaviour.
So currently we have this behaviour:
In [26]: s = """col
...: 2021-01-01 09:00:00
...: """
In [27]: csv.read_csv(io.BytesIO(s.encode()))
Out[27]:
pyarrow.Table
col: timestamp[s]
----
col: [[2021-01-01 09:00:00]]
In [28]: s2 = """col
...: 2021-01-01 09:00:00+01:00
...: """
In [29]: csv.read_csv(io.BytesIO(s2.encode()))
Out[29]:
pyarrow.Table
col: string
----
col: [["2021-01-01 09:00:00+01:00"]]So with a offset the "inference" doesn't actually infer timestamp (does this PR change that?).
And when explicitly mentioning the type for values without a timezone offset:
In [35]: csv.read_csv(io.BytesIO(s.encode()), convert_options=csv.ConvertOptions(column_types={"col": pa.timestamp('s')}))
Out[35]:
pyarrow.Table
col: timestamp[s]
----
col: [[2021-01-01 09:00:00]]
In [36]: csv.read_csv(io.BytesIO(s.encode()), convert_options=csv.ConvertOptions(column_types={"col": pa.timestamp('s', tz="Europe/Brussels")}))
Out[36]:
pyarrow.Table
col: timestamp[s, tz=Europe/Brussels]
----
col: [[2021-01-01 09:00:00]]
So here we indeed kind of "ignore" the timezone of the specified type and kind of assume the naive strings are in UTC.
Now, that is the same for casting strings to timestamp, though:
In [46]: arr = pa.array(["2021-01-01 09:00:00"])
In [47]: arr.cast(pa.timestamp('s', tz='Europe/Brussels'))
Out[47]:
<pyarrow.lib.TimestampArray object at 0x7f95c9d72fa0>
[
2021-01-01 09:00:00
]
CSV parsing and casting strings should probably behave the same in this aspect. But personally I would argue that both are "wrong" (or at least unexpected to me); I would rather prefer it to error than silently interpret the strings as UTC.
So this means that
read_csv(...).column("start_time").cast(pa.timestamp('s', 'Europe/Brussels')) would give a different answer than read_csv(..., types={'start_time': pa.timestamp('s', 'Europe/Brussels')}).column("start_time").
I think we should ensure those two are equivalent. If we interpret naive strings as local time when specifying a timezone-aware type in the CSV parsing, I think casting should have the same behaviour.
Talking to Neal offline it looks like the test isn't really meant to check this case, plus he noted we could always start with an error and make it more implicit later - so I'll roll these changes back (and update the table below as noted by Joris).
Oh, I missed the casting comment - I thought casting to a different timezone was always assumed to be a metadata-only operation? I.e. it wouldn't change the values.
Yes, but I assumed that it was about a string -> timestamp[tz] cast (and so this should IMO be consistent for csv parsing vs explicit casting).
But, if the CSV reader infers timestamp, the example of Weston is actually doing a timestamp -> timestamp[tz] cast.
Now, personally, I also think that such a cast is ambiguous (it's only for a timestamp[tz] -> timestamp[tz] cast that I find it clear that this will be a metadata-only operation)
Ah, yes, string->timestamp[tz] should be consistent with the CSV reader, I agree.
In this case, the CSV reader would normally infer timestamp, yes. I would argue the conversion should be handled by assume_timezone and that the cast should be metadata-only. (In general, our casts are a mix of conversions and "reinterpretations" of data, which gets a little confusing...)
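For reference, a minimal sketch of the assume_timezone route mentioned here (kernel and option names as exposed in pyarrow.compute; treat the exact signature as an assumption):
import pyarrow as pa
import pyarrow.compute as pc

# Naive timestamps, e.g. as the CSV reader infers them for unzoned strings.
naive = pa.array(["2021-01-01 09:00:00"]).cast(pa.timestamp("s"))

# assume_timezone interprets the naive wall-clock values as local time in the
# given zone (a real conversion), while a timestamp[tz] -> timestamp[tz] cast
# would only change metadata.
aware = pc.assume_timezone(naive, timezone="Europe/Brussels")
# aware.type -> timestamp[s, tz=Europe/Brussels]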
docs/source/cpp/csv.rst
If we keep the behaviour as is now (related to my comment above), I would certainly add the case of timestamp[s, non-UTC-tz] in this table, to clearly document that behaviour as well.
What do you mean exactly by "ignored casting"? Will this work and take the timezone offset into account?
Sorry, I meant that I didn't really test it.
I'll test this more thoroughly.
This is what happens on this branch: as with CSV, this needs to be fixed - I'll take a look.
Ah, we probably then want an option (for both casting and CSV parsing), much like the assume_timezone kernel, that controls what to do with ambiguous or nonexistent local times. I also realize this needs to account for what to do with custom timezone parsers in CSV…
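As a point of comparison, the assume_timezone kernel already exposes such options; something along these lines (parameter names taken from pyarrow.compute, so treat them as an assumption for any future cast/CSV option):
import pyarrow as pa
import pyarrow.compute as pc

# 02:30 local time occurs twice in Europe/Brussels on the 2021 DST fall-back.
naive = pa.array(["2021-10-31 02:30:00"]).cast(pa.timestamp("s"))
pc.assume_timezone(naive, timezone="Europe/Brussels", ambiguous="raise")     # errors (the default)
pc.assume_timezone(naive, timezone="Europe/Brussels", ambiguous="earliest")  # picks the earlier instant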
Casts are fixed; now to update the CSV parser as well.
cur < options.format.size() - 1 perhaps?
I don't think the templating is useful. Parsing the timestamp should be more costly than a mostly predictable branch.
Removed the templating.
Filed ARROW-14581 for the Travis test failure.
cpp/src/arrow/csv/converter.cc
Similarly, I don't think templating is terribly useful here.
Rebased & fixed conflicts. |
The first condition should be superfluous now :-)
jorisvandenbossche left a comment
Looks good!
Also test strings here for the case of fully unzoned? (although in code that probably triggers the same check as the mixed case?)
Also add the case of parsing a string with trailing Z?
That now seems to work with this PR:
In [26]: pc.strptime(["2012-01-01 09:00:00Z"], format="%Y-%m-%d %H:%M:%S%z", unit="s")
Out[26]:
<pyarrow.lib.TimestampArray object at 0x7fb709a65760>
[
2012-01-01 09:00:00
]
Ah, so this actually already worked before as well, but the resulting type is different (timezone naive vs aware). So might still be worth checking that change in type explicitly, unless that is already covered elsewhere in the tests.
A couple of things here:
- BSD strptime doesn't support "Z" as noted in the comment. It only supports the syntax here, making it hard to test.
- The result type is based on the presence of a "%z", so that's covered here already (see the snippet below).
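For instance, something like the following should show the type difference (a sketch, assuming the behaviour added in this PR; outputs are expected, not verified):
import pyarrow.compute as pc

pc.strptime(["2012-01-01 09:00:00+0100"], format="%Y-%m-%d %H:%M:%S%z", unit="s").type
# -> timestamp[s, tz=UTC]   (offset consumed by %z, value shifted to UTC)
pc.strptime(["2012-01-01 09:00:00"], format="%Y-%m-%d %H:%M:%S", unit="s").type
# -> timestamp[s]           (no %z, the result stays timezone-naive)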
pitrou left a comment
+1, thank you :-)
Benchmark runs are scheduled for baseline = 2b10648 and contender = a9f2091. a9f2091 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
For ISO8601, this seems to have a small (~5%) impact on benchmarks.
For strptime, this is only supported on platforms exposing tm_gmtoff in struct tm. %Z is still ignored; it seems implementations don't really support it anyway. (For instance, GNU libc will skip over the time zone, omitting it from the result.)