I have a question similar to issue #227.
I have tried to load following XML File using pyspark. The data contain a tag with an empty value for the numeric volume field.
I would like to this field to be filled with null, but I receive 0.0.
Here is my sample data:
<?xml version="1.0" encoding="UTF-8" ?>
<TEST>
<Header Name="X" SequenceNumber="123"/>
<T Number="1" Volume="20"/>
<T Number="2" Volume=""/>
</TEST>
I have tried following code
from pyspark.sql.types import StringType, LongType, DoubleType, StructType, StructField, ArrayType
from pyspark.sql.functions import explode
XMLSchema = ( StructType (
[StructField("Header", StructType(
[StructField("_Name", StringType(), True)
,StructField("_SequenceNumber", LongType(), True)
]), True)
,StructField("T", ArrayType(StructType(
[StructField("_Number", LongType(), True)
,StructField("_Volume", DoubleType(), True)
,StructField("_VALUE", DoubleType(), True)
]), True), True)
]))
df = ( spark.read.format('com.databricks.spark.xml')
.option("rowTag", "TEST")
.option("nullValue", "")
.schema(XMLSchema)
.load("mnt/test/Alfred/A.xml")
.select("Header",explode("T"))
)
df.printSchema()
df.show()
With following result:
root
|-- Header: struct (nullable = true)
| |-- _Name: string (nullable = true)
| |-- _SequenceNumber: long (nullable = true)
|-- col: struct (nullable = true)
| |-- _Number: long (nullable = true)
| |-- _Volume: double (nullable = true)
| |-- _VALUE: double (nullable = true)
+--------+---------------+
| Header| col|
+--------+---------------+
|{X, 123}|{1, 20.0, null}|
|{X, 123}| {2, 0.0, null}|
+--------+---------------+
If I simply omit the "nullValue" option in my code, the line is skipped completely.
How can I achieve that.
Expected result:
+--------+---------------+
| Header| col|
+--------+---------------+
|{X, 123}|{1, 20.0, null}|
|{X, 123}|{2, null, null}|
+--------+---------------+
I have a question similar to issue #227.
I have tried to load following XML File using pyspark. The data contain a tag with an empty value for the numeric volume field.
I would like to this field to be filled with null, but I receive 0.0.
Here is my sample data:
I have tried following code
With following result:
If I simply omit the "nullValue" option in my code, the line is skipped completely.
How can I achieve that.
Expected result: