XML Timestamp parsing without timezone

# Description

Currently encountering some issue when parsing with a specific timestamp format. I have XML data with timestamp of this format: `yyyy/MM/dd HH:mm:ss`, and I am trying to read it using the `timestampFormat` option.

Looking at the `parseXmlTimestamp()` function in [TypeCast.scala](https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/util/TypeCast.scala), it seems that `spark-xml` will be using `Timestamp.from(ZonedDateTime.parse(value, format).toInstant)` to parse the timestamp. However, with a format like `yyyy/MM/dd HH:mm:ss`, there is no timezone, which causes the function to fail:
![image](https://user-images.githubusercontent.com/29867169/208391764-8ddd1af1-a122-45b7-b204-fe130004fd76.png)
It will then return null values for the whole timestamp column.

# To reproduce
`time.xml` file:
```
<book>
    <author>John Smith</author>
    <time>2011-12-03T10:15:30Z</time>
    <time2>2011/03/12 10:15:30</time2>
</book>
```

PySpark code to read:
```
df = spark.read \
        .format("com.databricks.spark.xml") \
        .options(rowTag="book", timestampFormat="yyyy/MM/dd HH:mm:ss") \
        .schema("author STRING, time TIMESTAMP, time2 TIMESTAMP") \
        .load("<path_to_file>")

display(df)
df.printSchema()
```

Result:
|author|time|time2|
|-------|-----|------|
|John Smith|2011-12-03T10:15:30.000+0000|null|

# Suggestion
We may want to add a default timezone in the `parseXmlTimestamp()` function, like this:
![image](https://user-images.githubusercontent.com/29867169/208391973-518872fe-f1be-4238-8d60-a4e6452c77db.png)

Any thoughts about adding a `timeZone` option?
```
private[xml] class XmlOptions(
    @transient private val parameters: Map[String, String])
  extends Serializable {

  def this() = this(Map.empty)

  val charset = parameters.getOrElse("charset", XmlOptions.DEFAULT_CHARSET)
  ...
  val timestampFormat = parameters.get("timestampFormat")
  val timeZone = parameters.getOrElse("timeZone", XmlOptions.DEFAULT_TIME_ZONE) // <----- HERE
  val dateFormat = parameters.get("dateFormat")
}

private[xml] object XmlOptions {
  val DEFAULT_ATTRIBUTE_PREFIX = "_"
  ...
  val DEFAULT_TIME_ZONE = "UTC" // <----- HERE

  def apply(parameters: Map[String, String]): XmlOptions = new XmlOptions(parameters)
}
```

```
  private def parseXmlTimestamp(value: String, options: XmlOptions): Timestamp = {
    val timeZone = options.timeZone // <----- HERE
    val formatters = options.timestampFormat.map(DateTimeFormatter.ofPattern).
      map(supportedXmlTimestampFormatters :+ _).getOrElse(supportedXmlTimestampFormatters)
    formatters.foreach { format =>
      try {
        return Timestamp.from(ZonedDateTime.parse(value, format.withZone(ZoneId.of(timeZone))).toInstant) // <----- HERE
      } catch {
        case _: Exception => // continue
      }
    }
    throw new IllegalArgumentException(s"cannot convert value $value to Timestamp")
  }
```

The following will make the `parseXmlTimestamp()` function returns this:
![image](https://user-images.githubusercontent.com/29867169/208398584-08aa5840-87b4-4ad9-8044-157da446806a.png)

which then will return this DataFrame when reading:
|author|time|time2|
|-------|-----|------|
|John Smith|2011-12-03T10:15:30.000+0000|2011-12-03T10:15:30.000+0000|


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

XML Timestamp parsing without timezone #612

Description

To reproduce

Suggestion

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

XML Timestamp parsing without timezone #612

Description

Description

To reproduce

Suggestion

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions