Description
I'm currently encountering an issue when parsing timestamps with a specific format. I have XML data with timestamps in the format `yyyy/MM/dd HH:mm:ss`, and I am trying to read them using the `timestampFormat` option.
Looking at the `parseXmlTimestamp()` function in TypeCast.scala, it seems that spark-xml uses `Timestamp.from(ZonedDateTime.parse(value, format).toInstant)` to parse the timestamp. However, a format like `yyyy/MM/dd HH:mm:ss` carries no timezone information, which causes the function to fail:

It will then return null values for the whole timestamp column.
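The failure can be reproduced outside of spark-xml entirely; a minimal sketch (using only `java.time`, no spark-xml code) of why `ZonedDateTime.parse` rejects a zone-less pattern:

```scala
import java.time.ZonedDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

object ZoneLessParseDemo extends App {
  // A pattern with no zone/offset field, like the timestampFormat above
  val format = DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm:ss")

  // ZonedDateTime needs a zone to resolve, so parsing a zone-less value
  // throws java.time.format.DateTimeParseException
  val result = Try(ZonedDateTime.parse("2011/03/12 10:15:30", format))
  println(result.isFailure) // true
}
```

Since `parseXmlTimestamp()` swallows the exception per formatter and eventually gives up, the caller only sees a `null` in the column.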
To reproduce
`time.xml` file:
```xml
<book>
    <author>John Smith</author>
    <time>2011-12-03T10:15:30Z</time>
    <time2>2011/03/12 10:15:30</time2>
</book>
```
PySpark code to read:
```python
df = spark.read \
    .format("com.databricks.spark.xml") \
    .options(rowTag="book", timestampFormat="yyyy/MM/dd HH:mm:ss") \
    .schema("author STRING, time TIMESTAMP, time2 TIMESTAMP") \
    .load("<path_to_file>")
display(df)
df.printSchema()
```
Result:
| author     | time                         | time2 |
|------------|------------------------------|-------|
| John Smith | 2011-12-03T10:15:30.000+0000 | null  |
Suggestion
We may want to add a default timezone in the `parseXmlTimestamp()` function, like this (any thoughts about adding a `timeZone` option?):
```scala
private[xml] class XmlOptions(
    @transient private val parameters: Map[String, String])
  extends Serializable {

  def this() = this(Map.empty)

  val charset = parameters.getOrElse("charset", XmlOptions.DEFAULT_CHARSET)
  ...
  val timestampFormat = parameters.get("timestampFormat")
  val timeZone = parameters.getOrElse("timeZone", XmlOptions.DEFAULT_TIME_ZONE) // <----- HERE
  val dateFormat = parameters.get("dateFormat")
}

private[xml] object XmlOptions {
  val DEFAULT_ATTRIBUTE_PREFIX = "_"
  ...
  val DEFAULT_TIME_ZONE = "UTC" // <----- HERE

  def apply(parameters: Map[String, String]): XmlOptions = new XmlOptions(parameters)
}
```
```scala
private def parseXmlTimestamp(value: String, options: XmlOptions): Timestamp = {
  val timeZone = options.timeZone // <----- HERE
  val formatters = options.timestampFormat.map(DateTimeFormatter.ofPattern).
    map(supportedXmlTimestampFormatters :+ _).getOrElse(supportedXmlTimestampFormatters)
  formatters.foreach { format =>
    try {
      return Timestamp.from(
        ZonedDateTime.parse(value, format.withZone(ZoneId.of(timeZone))).toInstant) // <----- HERE
    } catch {
      case _: Exception => // continue
    }
  }
  throw new IllegalArgumentException(s"cannot convert value $value to Timestamp")
}
```
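As a quick standalone sanity check (a sketch using only `java.time`, with `"UTC"` standing in for the proposed default), attaching a zone override via `withZone` lets the zone-less value parse:

```scala
import java.sql.Timestamp
import java.time.{ZoneId, ZonedDateTime}
import java.time.format.DateTimeFormatter

object WithZoneDemo extends App {
  // Same pattern as before, but with a default zone attached to the formatter
  val format = DateTimeFormatter
    .ofPattern("yyyy/MM/dd HH:mm:ss")
    .withZone(ZoneId.of("UTC"))

  // ZonedDateTime.parse now resolves the zone from the formatter's override
  val ts = Timestamp.from(ZonedDateTime.parse("2011/03/12 10:15:30", format).toInstant)
  println(ts.toInstant) // 2011-03-12T10:15:30Z
}
```

Note that `withZone` only supplies a zone when the parsed text doesn't contain one, so values that already carry an offset (like the `time` column above) are unaffected.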
This change will make the `parseXmlTimestamp()` function return this:

which then will return this DataFrame when reading:

| author     | time                         | time2                        |
|------------|------------------------------|------------------------------|
| John Smith | 2011-12-03T10:15:30.000+0000 | 2011-12-03T10:15:30.000+0000 |