Hi.
I've created a simple DataFrame:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, TimestampType
from datetime import datetime, timezone
spark = SparkSession.builder.config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.17.0").getOrCreate()
schema = StructType([StructField("created-at", TimestampType())])
df = spark.createDataFrame([{"created-at": datetime.now(tz=timezone.utc)}], schema=schema)
df.show(10, False)
+--------------------------+
|created-at                |
+--------------------------+
|2023-10-09 09:05:24.269352|
+--------------------------+
Then I try to save it as XML:
df.repartition(1).write \
    .format("xml") \
    .mode("overwrite") \
    .option("compression", None) \
    .option("rowTag", "item") \
    .save("2.xml")
Resulting XML:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ROWS>
    <item>
        <created-at>2023-10-09T09:05:24.269352Z</created-at>
    </item>
</ROWS>
Then I want to change the timestamp format:
df.repartition(1).write \
    .format("xml") \
    .mode("overwrite") \
    .option("compression", None) \
    .option("rowTag", "item") \
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSXXX") \
    .save("2.xml")
I got an exception:
23/10/09 09:13:31 ERROR Utils: Aborting task
java.time.temporal.UnsupportedTemporalTypeException: Unsupported field: YearOfEra
at java.time.Instant.getLong(Instant.java:603)
at java.time.format.DateTimePrintContext.getValue(DateTimePrintContext.java:298)
at java.time.format.DateTimeFormatterBuilder$NumberPrinterParser.format(DateTimeFormatterBuilder.java:2551)
at java.time.format.DateTimeFormatterBuilder$CompositePrinterParser.format(DateTimeFormatterBuilder.java:2190)
at java.time.format.DateTimeFormatter.formatTo(DateTimeFormatter.java:1746)
at java.time.format.DateTimeFormatter.format(DateTimeFormatter.java:1720)
at com.databricks.spark.xml.parsers.StaxXmlGenerator$.writeElement$1(StaxXmlGenerator.scala:89)
at com.databricks.spark.xml.parsers.StaxXmlGenerator$.writeChildElement$1(StaxXmlGenerator.scala:57)
at com.databricks.spark.xml.parsers.StaxXmlGenerator$.writeChild$1(StaxXmlGenerator.scala:79)
at com.databricks.spark.xml.parsers.StaxXmlGenerator$.$anonfun$apply$12(StaxXmlGenerator.scala:130)
at com.databricks.spark.xml.parsers.StaxXmlGenerator$.$anonfun$apply$12$adapted(StaxXmlGenerator.scala:128)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at com.databricks.spark.xml.parsers.StaxXmlGenerator$.writeElement$1(StaxXmlGenerator.scala:128)
at com.databricks.spark.xml.parsers.StaxXmlGenerator$.apply(StaxXmlGenerator.scala:155)
at com.databricks.spark.xml.util.XmlFile$$anon$1.next(XmlFile.scala:134)
at com.databricks.spark.xml.util.XmlFile$$anon$1.next(XmlFile.scala:111)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$executeTask$1(SparkHadoopWriter.scala:137)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1563)
at org.apache.spark.internal.io.SparkHadoopWriter$.executeTask(SparkHadoopWriter.scala:135)
at org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$write$1(SparkHadoopWriter.scala:88)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:139)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
23/10/09 09:13:31 ERROR SparkHadoopWriter: Task attempt_20231009091331224191220077987097_0471_m_000000_0 aborted.
23/10/09 09:13:31 ERROR Executor: Exception in task 0.0 in stage 79.0 (TID 131)
org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.internal.io.SparkHadoopWriter$.executeTask(SparkHadoopWriter.scala:163)
at org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$write$1(SparkHadoopWriter.scala:88)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:139)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
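The failure can be reproduced outside Spark with plain java.time; a minimal sketch (class name is my own, the pattern and timestamp value are taken from the repro above):

```java
import java.time.Instant;
import java.time.format.DateTimeFormatter;
import java.time.temporal.UnsupportedTemporalTypeException;

public class ReproduceError {
    public static void main(String[] args) {
        // The same pattern passed via the timestampFormat option above:
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSXXX");
        Instant ts = Instant.parse("2023-10-09T09:05:24.269352Z");
        try {
            // An Instant carries no calendar fields, only seconds/nanos since
            // the epoch, so the 'yyyy' (YearOfEra) part of the pattern cannot
            // be resolved and formatting throws:
            fmt.format(ts);
        } catch (UnsupportedTemporalTypeException e) {
            System.out.println(e.getMessage()); // Unsupported field: YearOfEra
        }
    }
}
```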
This looks like https://stackoverflow.com/a/27483371 and is caused by this line (spark-xml/src/main/scala/com/databricks/spark/xml/parsers/StaxXmlGenerator.scala, line 87 at b2611bd):

val formatter = options.timestampFormat.map(DateTimeFormatter.ofPattern).

There is no such error if I pass a custom timestampFormat during reading, and this is probably fixed here (spark-xml/src/main/scala/com/databricks/spark/xml/util/TypeCast.scala, line 155 at b2611bd):

DateTimeFormatter.ofPattern(formatString).withZone(options.timezone.map(ZoneId.of).orNull)
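The withZone call in the read path is exactly what makes the difference: attaching a zone lets the formatter convert the Instant to a ZonedDateTime before resolving calendar fields. A minimal sketch (class name is my own; UTC is assumed here in place of the option-driven zone):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class WithZoneFix {
    public static void main(String[] args) {
        Instant ts = Instant.parse("2023-10-09T09:05:24.269352Z");
        // With a zone attached, fields like the year become resolvable and
        // the same pattern that crashed the writer formats cleanly:
        DateTimeFormatter fmt = DateTimeFormatter
                .ofPattern("yyyy-MM-dd HH:mm:ss.SSSXXX")
                .withZone(ZoneOffset.UTC);
        System.out.println(fmt.format(ts)); // 2023-10-09 09:05:24.269Z
    }
}
```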