Allow custom timestamp with Spark timezone property#621
Allow custom timestamp with Spark timezone property#621srowen merged 17 commits intodatabricks:masterfrom
Conversation
| try { | ||
| return Timestamp.from(ZonedDateTime.parse(value, format).toInstant) | ||
| // If format is not in supported and no timezone in format, use default Spark timezone | ||
| if (!supportedXmlTimestampFormatters.contains(format) && Option(format.getZone).isEmpty) { |
There was a problem hiding this comment.
Rather than do this, just break this method into two checks - one loop over built-in formats, then the custom format with special TZ handling
There was a problem hiding this comment.
Do you mean we should use two methods? If I understand correctly,
val formatters = options.timestampFormat.map(DateTimeFormatter.ofPattern).
map(supportedXmlTimestampFormatters :+ _).getOrElse(supportedXmlTimestampFormatters)
formatters.foreach { format =>
This block of code means that the current parseXmlTimestamp() method already loops over built-in formats + the custom format. It is just that if the specified format does not contains a timezone, the custom format does not work. So why break it if the original idea is to loop over all formats?
If you meant just break in the method, doesn't the if branch does it?
There was a problem hiding this comment.
It's all about the same, but why combine two types of input that need different processing and then re-differentiate in a loop? handle the loop, then handle the special case. I think it's more straightforward, maybe not.
| // If format is not in supported and no timezone in format, use default Spark timezone | ||
| if (!supportedXmlTimestampFormatters.contains(format) && Option(format.getZone).isEmpty) { | ||
| return Timestamp.from( | ||
| ZonedDateTime.parse(value, format.withZone(ZoneId.of(options.timezone.get))).toInstant |
There was a problem hiding this comment.
Need to deal with timezone not being specified - I suppose throw an error explicitly?
There was a problem hiding this comment.
With the current approach to put the timezone in XmlRelation of createRelation(), isn't it better the create a default timezone in case spark.sql.session.timeZone is not set?
There was a problem hiding this comment.
Hm, maybe, I just feel like that's an error. You are parsing times that are ambiguous without a timzeone, and didn't give a timezone - just assuming "UTC" or something doesn't quite seem right vs highlighting the error.
… format processing
| // Custom format without timezone or offset | ||
| return Timestamp.from( | ||
| ZonedDateTime.parse(value, format.withZone(ZoneId.of(options.timezone.get))).toInstant | ||
| ) |
There was a problem hiding this comment.
We can check if there is a timezone but there is no method to determine if an offset is defined
There was a problem hiding this comment.
Hm. What if we parse, and if it fails with the error you are facing, add 'withZone'? the only problem is that may be a huge amount of perf overhead, so we'd have to have some way to test and store the format with the zone, only if it's needed. Not pretty but that would be the way forward
There was a problem hiding this comment.
When parsing fails, it will always throw a DateTimeParseException, no matter if it is because of the problem we are talking about or not. Rather than parsing the error message, I wrote a isParseableAsZonedDateTime() method and used it during the loop.
| DateTimeFormatter.ISO_OFFSET_DATE_TIME, | ||
| // 2002-05-30T21:46:54.1234Z | ||
| DateTimeFormatter.ISO_INSTANT | ||
| DateTimeFormatter.ISO_INSTANT.withZone(ZoneId.of("UTC")) |
There was a problem hiding this comment.
This should already be UTC; I don't think we want to change this
https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html#ISO_INSTANT
There was a problem hiding this comment.
I haven't been able to run Timestamp.from(ZonedDateTime.parse(value, DateTimeFormatter.ISO_INSTANT).toInstant). The doc also says:
As such, an Instant cannot be formatted as a date or time without providing some form of time-zone.
I feel like that ISO_INSTANT should not be in supportedXmlTimestampFormatters.
Same issue here: https://stackoverflow.com/questions/25612129/java-8-datetimeformatter-and-iso-instant-issues-with-zoneddatetime
There was a problem hiding this comment.
I think we need to support the format; I think I copied these from a list of standard formats according to XSD specs or something. It's super standard.
We're parsing rather than formatting here. What's the issue - does it not work with ZonedDateTime? maybe never did, if so.
There was a problem hiding this comment.
Hm OK I see why you made that change, it doesn't seem to think it's "UTC" otherwise, when it should be by nature. OK leave it in
There was a problem hiding this comment.
OK different idea - what if we write the parsing without ZonedDateTime? Timestamp.from(Instant.from(format.parse(value))) Does that help? seems to be simpler, not sure why it wasn't written that way in the first place
| } | ||
| } | ||
|
|
||
| private[xml] def isParseableAsZonedDateTime(value: String, |
There was a problem hiding this comment.
For later, we might have to figure out a way to cache the parsing of a custom format into a pattern, and maybe this check, because it'll happen for every single row
|
Take a look at this change -- I think the core of this works? maybe adapt this approach |
|
I think you have the best answer; I added some more tests in the pull request. I'll try to look into why tests are failing though |
|
I think I figured out the test failure - tiny but subtle issue in handling the param map. See my latest push |
srowen
left a comment
There was a problem hiding this comment.
Looking good to me otherwise
| } | ||
| options.timestampFormat.foreach { formatString => | ||
| // Check if there is offset or timezone and apply Spark timeZone if not | ||
| val hasTemporalInformation = formatString.indexOf("V") + |
There was a problem hiding this comment.
I'm not sure we need this - I found that the docs for "withZone" say that it's ignored if the pattern contains timezone info. So it seems like it will just be a default. OK to write a test for that though!
There was a problem hiding this comment.
Yes I also saw this, but then I encountered some problems between Java 8 and Java 11. We can see in Java 8 here that zone is prioritized over offset, but in Java 9 here, offset is prioritized over zone. I made an example below to see the differences:
So I don't think we can always apply withZone(), and that's why I wanted to use hasTemporalInformation. unfortunately, I haven't been able to think of a cleaner way to do this
There was a problem hiding this comment.
OK, let's leave it this way for now. If you have a sec, throw in a comment about the Java version


Related to issue #612 and to previous pull request #616.
There are still some issues as
spark.sql.session.timeZoneuses Java'sTimeZone.getDefault.getIDaccording to the source code here, and it can result in a null value.As a result, it will be mandatory to set
spark.sql.session.timeZone, otherwisespark-xmlwill throw anNoSuchElementExceptionwhen trying to retrieve the Spark property withspark.conf.get()method. Can reproduce this when running theXmlPartitioningSuite.We may still need a default value for the timezone.