fix date and datetime parsing#448
fix date and datetime parsing#448srowen merged 5 commits intodatabricks:masterfrom seddonm1:date-datetime-cast
Conversation
srowen
left a comment
There was a problem hiding this comment.
Yeah it makes some sense. The semantics here aren't necessarily that of an XSD, but that's a reasonable standard to follow. Right now I guess the problem is mostly that it's timezone-dependent when no timezone is specified?
Really it would be better to use Spark's time/date handling code here, like in DateTimeUtils, but it might not be fully exposed.
It's a behavior change but probably worth making to make this a little more robust.
This is actually the first PR for better handling of XML data with XSD provided. I have code that needs some cleanup that will generate the correct Spark StructType from the XSD.
Yes, this seems to be an oversight in the standard as you can't represent an Instant without a timezone. My view it is better to have a deterministic outcome (UTC) when the user has not provided the timezone information then at least date math can be used to override if needed.
Maybe but for the complexity of the code involved I am not sure it is beneficial.
To minimise impact the previous |
srowen
left a comment
There was a problem hiding this comment.
To confirm this should continue to work as before when a timezone is supplied right? will just default to UTC when not specified? that's good.
Are there any inputs that fail to parse now that would parse before, that you know of? I'd favor keeping as much behavior as possible without supporting the ambiguous local timezone-dependent parsing of course.
| } | ||
|
|
||
| @scala.annotation.tailrec | ||
| def parseXmlDate(formatters: List[DateTimeFormatter], value: String): Date = { |
There was a problem hiding this comment.
I don't mind this construction with tailrec; I usually just write
for (formatter <- formatters) {
try {
return ...valueOf...
} catch {
_: Exception => // continue
}
}
throw new IllegalArgumentException(...)
There was a problem hiding this comment.
This is a personal choice and as you are the maintainer I am happy to change to your style if you would like?
There was a problem hiding this comment.
Eh, I don't feel strongly about it. Won't matter. I find the for loop simpler but I get the Scala construct here
- whilst not technically XSD valid it was the previous behaviour - also moved out the list of valid formatters as creating them can be quite expensive and doing per value did not make sense
Correct. You can see examples in the tests.
I have re-added support for |
Codecov Report
@@ Coverage Diff @@
## master #448 +/- ##
==========================================
- Coverage 88.76% 88.61% -0.15%
==========================================
Files 17 17
Lines 810 826 +16
Branches 79 77 -2
==========================================
+ Hits 719 732 +13
- Misses 91 94 +3
Continue to review full report at Codecov.
|
|
Again just FYI @HyukjinKwon , no action needed unless you spot anything that jumps out as problematic here. |
Hi,
Currently the parsing of Date and Timestamp types does not comply with the the XSD schema definition: https://www.w3.org/TR/xmlschema-2/#dateTime
This PR adds the official date and datetime parsing when using a user-defined schema.