Skip to content
This repository was archived by the owner on Mar 24, 2025. It is now read-only.
This repository was archived by the owner on Mar 24, 2025. It is now read-only.

XML Timestamp parsing without timezone #612

@JorisTruong

Description

@JorisTruong

Description

Currently encountering some issue when parsing with a specific timestamp format. I have XML data with timestamp of this format: yyyy/MM/dd HH:mm:ss, and I am trying to read it using the timestampFormat option.

Looking at the parseXmlTimestamp() function in TypeCast.scala, it seems that spark-xml will be using Timestamp.from(ZonedDateTime.parse(value, format).toInstant) to parse the timestamp. However, with a format like yyyy/MM/dd HH:mm:ss, there is no timezone, which causes the function to fail:
image
It will then return null values for the whole timestamp column.

To reproduce

time.xml file:

<book>
    <author>John Smith</author>
    <time>2011-12-03T10:15:30Z</time>
    <time2>2011/03/12 10:15:30</time2>
</book>

PySpark code to read:

df = spark.read \
        .format("com.databricks.spark.xml") \
        .options(rowTag="book", timestampFormat="yyyy/MM/dd HH:mm:ss") \
        .schema("author STRING, time TIMESTAMP, time2 TIMESTAMP") \
        .load("<path_to_file>")

display(df)
df.printSchema()

Result:

author time time2
John Smith 2011-12-03T10:15:30.000+0000 null

Suggestion

We may want to add a default timezone in the parseXmlTimestamp() function, like this:
image

Any thoughts about adding a timeZone option?

private[xml] class XmlOptions(
    @transient private val parameters: Map[String, String])
  extends Serializable {

  def this() = this(Map.empty)

  val charset = parameters.getOrElse("charset", XmlOptions.DEFAULT_CHARSET)
  ...
  val timestampFormat = parameters.get("timestampFormat")
  val timeZone = parameters.getOrElse("timeZone", XmlOptions.DEFAULT_TIME_ZONE) // <----- HERE
  val dateFormat = parameters.get("dateFormat")
}

private[xml] object XmlOptions {
  val DEFAULT_ATTRIBUTE_PREFIX = "_"
  ...
  val DEFAULT_TIME_ZONE = "UTC" // <----- HERE

  def apply(parameters: Map[String, String]): XmlOptions = new XmlOptions(parameters)
}
  private def parseXmlTimestamp(value: String, options: XmlOptions): Timestamp = {
    val timeZone = options.timeZone // <----- HERE
    val formatters = options.timestampFormat.map(DateTimeFormatter.ofPattern).
      map(supportedXmlTimestampFormatters :+ _).getOrElse(supportedXmlTimestampFormatters)
    formatters.foreach { format =>
      try {
        return Timestamp.from(ZonedDateTime.parse(value, format.withZone(ZoneId.of(timeZone))).toInstant) // <----- HERE
      } catch {
        case _: Exception => // continue
      }
    }
    throw new IllegalArgumentException(s"cannot convert value $value to Timestamp")
  }

The following will make the parseXmlTimestamp() function returns this:
image

which then will return this DataFrame when reading:

author time time2
John Smith 2011-12-03T10:15:30.000+0000 2011-12-03T10:15:30.000+0000

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions