This repository was archived by the owner on Mar 24, 2025. It is now read-only.

Generated files do not have .xml extension #664

@dolfinus


Hi.

I've created a simple DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, TimestampType
from datetime import datetime, timezone

spark = SparkSession.builder.config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.17.0").getOrCreate()
schema = StructType([StructField("created-at", TimestampType())])

df = spark.createDataFrame([{"created-at": datetime.now(tz=timezone.utc)}], schema=schema)
df.show(10, False)

df.write.format("xml").option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSXXX").mode("overwrite").save("2.xml")

created-at
2023-10-09 09:05:24.269352

Then I saved it as XML:

df.repartition(1).write \
  .format("xml") \
  .mode("overwrite") \
  .option("compression", None) \
  .option("rowTag", "item") \
  .save("2.xml")

This is the content of the 2.xml folder:

> ls -la 2.xml
drwxr-xr-x  2 maxim maxim   84 Oct  9 09:18 ./
drwxr-xr-x 19 maxim maxim 4096 Oct  9 09:18 ../
-rw-r--r--  1 maxim maxim  156 Oct  9 09:18 part-00000
-rw-r--r--  1 maxim maxim   12 Oct  9 09:18 .part-00000.crc
-rw-r--r--  1 maxim maxim    0 Oct  9 09:18 _SUCCESS
-rw-r--r--  1 maxim maxim    8 Oct  9 09:18 ._SUCCESS.crc

File 2.xml/part-00000 has the following content:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ROWS>
    <item>
        <created-at>2023-10-09T09:05:24.269352Z</created-at>
    </item>
</ROWS>

But it does not have an .xml extension. Is that expected behavior?
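
In the meantime, a possible workaround is to rename the part files after the write finishes. This is only a minimal sketch: `add_xml_extension` is a hypothetical helper, not part of spark-xml, and it assumes the output directory is on the local filesystem (as in the listing above), not on HDFS or object storage.

```python
from pathlib import Path


def add_xml_extension(output_dir, extension=".xml"):
    """Rename extensionless part files in output_dir so they end with `extension`.

    Hypothetical helper for illustration; works on the local filesystem only.
    Returns the new file names in sorted order.
    """
    renamed = []
    for part in sorted(Path(output_dir).glob("part-*")):
        # Skip directories and files that already carry an extension (e.g. part-00000.gz).
        if not part.is_file() or part.suffix:
            continue
        target = part.with_name(part.name + extension)
        part.rename(target)
        # The checksum sidecar (.part-00000.crc) is named after the old file name;
        # remove it so it is not left orphaned next to the renamed file.
        crc = part.with_name(f".{part.name}.crc")
        if crc.exists():
            crc.unlink()
        renamed.append(target.name)
    return renamed
```

Calling `add_xml_extension("2.xml")` after the `save("2.xml")` step would turn `part-00000` into `part-00000.xml` and leave `_SUCCESS` untouched.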
