This repository was archived by the owner on Mar 24, 2025. It is now read-only.

Timestamps not matching format are replaced with nulls #662

@dolfinus

Hi.

I'm trying to parse a simple XML file:

<item>
  <created-at>2021-01-01T01:01:01+00:00</created-at>
</item>

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, TimestampType

spark = SparkSession.builder.config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.17.0").getOrCreate()
schema = StructType([StructField("created-at", TimestampType())])
spark.read.format("xml").options(rowTag='item').schema(schema).load("1.xml").show()

Result:

created-at
2021-01-01 01:01:01

But if the timestamp does not match the expected format, e.g. the T is replaced with a space:

<item>
  <created-at>2021-01-01 01:01:01+00:00</created-at>
</item>

It is read as null:

created-at
null

I see that there is a mode option, with PERMISSIVE as the default: when the parser encounters a field of the wrong datatype, it sets the offending field to null. But the malformed value is not added to the _corrupt_record column, because there is nothing wrong with the XML structure.
So there is no way to tell whether the input file contains a tag with an invalid field value or with a nullValue, unless the user sets a different mode.
Is this the desired behavior?
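
To make the distinction concrete, here is a minimal sketch of a pre-check done outside Spark, using only the Python standard library. It is not part of spark-xml and does not reproduce its parsing rules. The helper name classify_timestamps and the strict format string (ISO 8601 with a literal T and a UTC offset) are assumptions made for illustration. It reports separately whether each created-at value is present and valid, present but malformed, or missing, which is the information lost once both cases become null.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Assumed strict format for this sketch: literal "T" separator plus UTC offset.
TS_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def classify_timestamps(xml_text, row_tag="item", field="created-at"):
    """Return one (raw_value, status) pair per row element.

    status is "ok", "malformed" or "missing". This is a hypothetical helper,
    not an API of spark-xml.
    """
    root = ET.fromstring(xml_text)
    results = []
    # iter() also yields the root itself when it matches row_tag.
    for row in root.iter(row_tag):
        raw = row.findtext(field)
        if raw is None or not raw.strip():
            # Tag absent or empty: a genuine null.
            results.append((raw, "missing"))
            continue
        try:
            datetime.strptime(raw.strip(), TS_FORMAT)
            results.append((raw, "ok"))
        except ValueError:
            # Tag present, but the value does not match the format.
            results.append((raw, "malformed"))
    return results


good = "<item><created-at>2021-01-01T01:01:01+00:00</created-at></item>"
bad = "<item><created-at>2021-01-01 01:01:01+00:00</created-at></item>"
empty = "<item></item>"

for doc in (good, bad, empty):
    print(classify_timestamps(doc))
```

With the two files from this issue, the first value is classified as ok and the second as malformed, while a row without the tag is classified as missing.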
