This repository was archived by the owner on Mar 24, 2025. It is now read-only.

fix date and datetime parsing #448

Merged
srowen merged 5 commits into databricks:master from seddonm1:date-datetime-cast
May 27, 2020

Conversation

@seddonm1
Contributor

Hi,
Currently the parsing of Date and Timestamp types does not comply with the XSD schema definition: https://www.w3.org/TR/xmlschema-2/#dateTime

This PR adds the official date and datetime parsing when using a user-defined schema.
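As a rough illustration of what XSD-compliant parsing means here (a sketch, not the PR's actual code): an `xs:date` value is an ISO local date with an optional timezone suffix, which `java.time`'s built-in `ISO_DATE` formatter already accepts.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object XsdDateSketch {
  // xs:date permits "2002-09-24", "2002-09-24Z", "2002-09-24+06:00";
  // ISO_DATE is ISO_LOCAL_DATE with an optional offset, matching these shapes.
  private val xsdDate: DateTimeFormatter = DateTimeFormatter.ISO_DATE

  def parseXsdDate(value: String): LocalDate =
    LocalDate.parse(value, xsdDate)
}
```

The offset, when present, is parsed but discarded, since a `LocalDate` carries no zone.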

Collaborator

@srowen srowen left a comment


Yeah it makes some sense. The semantics here aren't necessarily that of an XSD, but that's a reasonable standard to follow. Right now I guess the problem is mostly that it's timezone-dependent when no timezone is specified?

Really it would be better to use Spark's time/date handling code here, like in DateTimeUtils, but it might not be fully exposed.

It's a behavior change but probably worth making to make this a little more robust.

@seddonm1
Contributor Author

> Yeah it makes some sense. The semantics here aren't necessarily that of an XSD, but that's a reasonable standard to follow.

This is actually the first PR for better handling of XML data with XSD provided. I have code that needs some cleanup that will generate the correct Spark StructType from the XSD.

> Right now I guess the problem is mostly that it's timezone-dependent when no timezone is specified?

Yes, this seems to be an oversight in the standard, as you can't represent an Instant without a timezone. My view is that it is better to have a deterministic outcome (UTC) when the user has not provided timezone information; then at least date math can be used to override it if needed.
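The deterministic-UTC idea can be sketched like this (assumed behaviour for illustration, not the PR's exact code): honour an explicit offset when the input has one, otherwise interpret the local datetime as UTC.

```scala
import java.time.{Instant, LocalDateTime, OffsetDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter
import java.time.temporal.TemporalQueries

object UtcDefaultSketch {
  private val fmt = DateTimeFormatter.ISO_DATE_TIME

  // Parse an xs:dateTime; without an offset, fall back to UTC so the
  // result does not depend on the machine's local timezone.
  def toInstantUtc(value: String): Instant = {
    val parsed = fmt.parse(value)
    if (parsed.query(TemporalQueries.offset()) != null)
      OffsetDateTime.from(parsed).toInstant
    else
      LocalDateTime.from(parsed).toInstant(ZoneOffset.UTC)
  }
}
```

With this, `"2002-05-30T09:30:10"` and `"2002-05-30T09:30:10Z"` map to the same `Instant`.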

> Really it would be better to use Spark's time/date handling code here, like in DateTimeUtils, but it might not be fully exposed.

Maybe, but given the complexity of the code involved I am not sure it is beneficial.

> It's a behavior change but probably worth making to make this a little more robust.

To minimise impact, the previous Timestamp.valueOf and Date.valueOf behaviors could be appended to the valid formatter list?
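The fallback idea could look roughly like this (illustrative names, not the PR's actual identifiers): try the strict XSD parsers first, then fall back to the legacy `java.sql` parsing the connector accepted before this change.

```scala
import java.sql.Timestamp

object FallbackSketch {
  // Try each strict parser in order; append the legacy Timestamp.valueOf
  // behaviour ("yyyy-[m]m-[d]d hh:mm:ss[.f...]") as the last resort.
  def parseWithFallback(value: String,
                        strictParsers: Seq[String => Timestamp]): Timestamp = {
    val all = strictParsers :+ ((s: String) => Timestamp.valueOf(s))
    all.view
      .flatMap { p =>
        try Some(p(value)) catch { case _: Exception => None }
      }
      .headOption
      .getOrElse(throw new IllegalArgumentException(s"cannot parse '$value'"))
  }
}
```

The lazy `view` stops parsing after the first success, so the legacy path only runs when the strict formats fail.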

Collaborator

@srowen srowen left a comment


To confirm: this should continue to work as before when a timezone is supplied, right? It will just default to UTC when not specified? That's good.

Are there any inputs that fail to parse now that would parse before, that you know of? I'd favor keeping as much behavior as possible without supporting the ambiguous local timezone-dependent parsing of course.

}

@scala.annotation.tailrec
def parseXmlDate(formatters: List[DateTimeFormatter], value: String): Date = {
Collaborator


I don't mind this construction with tailrec; I usually just write

for (formatter <- formatters) {
  try {
    return ...valueOf...
  } catch {
    case _: Exception => // continue
  }
}
throw new IllegalArgumentException(...)
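For reference, a self-contained version of that loop style (the `parseXmlDate` name follows the diff above; the body is a sketch, not the merged code):

```scala
import java.sql.Date
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object DateLoopSketch {
  // Try each formatter in turn and return on the first success;
  // fail only when every formatter has rejected the value.
  def parseXmlDate(formatters: Seq[DateTimeFormatter], value: String): Date = {
    for (formatter <- formatters) {
      try {
        return Date.valueOf(LocalDate.parse(value, formatter))
      } catch {
        case _: Exception => // try the next formatter
      }
    }
    throw new IllegalArgumentException(s"cannot parse '$value' as Date")
  }
}
```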

Contributor Author


This is a personal choice, and as you are the maintainer I am happy to change to your style if you would like.

Collaborator


Eh, I don't feel strongly about it; won't matter. I find the for loop simpler, but I get the Scala construct here.

@srowen srowen added the bug label May 25, 2020
@srowen srowen added this to the 0.10.0 milestone May 25, 2020
- whilst not technically XSD valid it was the previous behaviour
- also moved out the list of valid formatters as creating them can be quite expensive and doing per value did not make sense
@seddonm1
Contributor Author

> To confirm this should continue to work as before when a timezone is supplied right? will just default to UTC when not specified? that's good.

Correct. You can see examples in the tests.

> Are there any inputs that fail to parse now that would parse before, that you know of? I'd favor keeping as much behavior as possible without supporting the ambiguous local timezone-dependent parsing of course.

I have re-added support for 2002-05-30 21:46:54 style dates. These are almost ISO_LOCAL_DATE_TIME format but are missing the T separator. This required a minor refactor to avoid creating the formatter objects for each value.
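A formatter for that space-separated shape can be composed from the ISO building blocks and constructed once up front, since building `DateTimeFormatter` instances is comparatively expensive (a sketch of the approach, not the merged code):

```scala
import java.time.LocalDateTime
import java.time.format.{DateTimeFormatter, DateTimeFormatterBuilder}

object SpaceSeparatedSketch {
  // "2002-05-30 21:46:54" is ISO_LOCAL_DATE_TIME with ' ' instead of 'T'.
  // Build the formatter once and reuse it for every value.
  val spaceSeparated: DateTimeFormatter =
    new DateTimeFormatterBuilder()
      .append(DateTimeFormatter.ISO_LOCAL_DATE)
      .appendLiteral(' ')
      .append(DateTimeFormatter.ISO_LOCAL_TIME)
      .toFormatter

  def parse(value: String): LocalDateTime =
    LocalDateTime.parse(value, spaceSeparated)
}
```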

@codecov-commenter

codecov-commenter commented May 25, 2020

Codecov Report

Merging #448 into master will decrease coverage by 0.14%.
The diff coverage is 83.33%.


@@            Coverage Diff             @@
##           master     #448      +/-   ##
==========================================
- Coverage   88.76%   88.61%   -0.15%     
==========================================
  Files          17       17              
  Lines         810      826      +16     
  Branches       79       77       -2     
==========================================
+ Hits          719      732      +13     
- Misses         91       94       +3     
Impacted Files Coverage Δ
...scala/com/databricks/spark/xml/util/TypeCast.scala 82.69% <83.33%> (-0.27%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 97fa657...9c34c41.

@srowen
Collaborator

srowen commented May 25, 2020

Again just FYI @HyukjinKwon , no action needed unless you spot anything that jumps out as problematic here.

