Hi all,
What is the correct way to detect and handle corrupted records in an XML file?
We have an XML file that we're trying to parse. It contains data that can be one of the following types: Double, String, or Array<Struct>. We can parse the data correctly with our custom schema. However, when we try to detect corrupted records by adding a new column called _corrupt_record to our custom schema and setting the option columnNameOfCorruptRecord to _corrupt_record, we get seemingly random exceptions such as: The value (1.0) of the type (java.lang.Double) cannot be converted to an array of struct<_VALUE:string, _m:int>, or The value (-10000.0) of the type (java.lang.Double) cannot be converted to the string type.
We've also noticed that once we add the columnNameOfCorruptRecord option and the new _corrupt_record column to our custom schema, the order of the columns in the resulting DataFrame no longer matches the submitted schema.
Here's a sample of our XML file data:
<row>
<c1>data1</c1>
<c2>data2</c2>
<c2 m="2">data2_m2</c2>
<c3 m="2">1234</c3>
</row>
The schema corresponding to this sample should be:
root
 |-- c1: string (nullable = true)
 |-- c2: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _m: integer (nullable = true)
 |-- c3: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _VALUE: double (nullable = true)
 |    |    |-- _m: integer (nullable = true)
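To make the expected mapping concrete, here is a minimal plain-Python sketch (standard library only, not spark-xml) of how the sample row should land in the schema above, and what we mean by a corrupted record. The names parse_row and ARRAY_COLUMNS are hypothetical, the attribute values are quoted so the sample is well-formed XML, and the all-or-nothing fallback (null the typed columns, keep the raw text in _corrupt_record) is an assumption about the behavior we're after, not a description of what the library does internally.

```python
import xml.etree.ElementTree as ET

# Expected type of _VALUE for each array column, following the schema above.
# c1 is a plain string; c2 and c3 are arrays of struct<_VALUE, _m>.
ARRAY_COLUMNS = {"c2": str, "c3": float}


def parse_row(xml_text):
    """Map one <row> element to a dict shaped like the schema.

    If the XML is malformed or a value cannot be converted, every typed
    column is left as None and the raw text goes into _corrupt_record.
    """
    record = {"c1": None, "c2": None, "c3": None, "_corrupt_record": None}
    try:
        row = ET.fromstring(xml_text)
        for child in row:
            if child.tag == "c1":
                record["c1"] = child.text
            elif child.tag in ARRAY_COLUMNS:
                cast = ARRAY_COLUMNS[child.tag]
                m = child.get("m")  # None when the attribute is absent
                struct = {
                    "_VALUE": cast(child.text) if child.text is not None else None,
                    "_m": int(m) if m is not None else None,
                }
                if record[child.tag] is None:
                    record[child.tag] = []
                record[child.tag].append(struct)
    except (ET.ParseError, ValueError):
        # Assumed fallback: drop the partial result and keep the raw record.
        record = {"c1": None, "c2": None, "c3": None, "_corrupt_record": xml_text}
    return record


sample = """<row>
<c1>data1</c1>
<c2>data2</c2>
<c2 m="2">data2_m2</c2>
<c3 m="2">1234</c3>
</row>"""

good = parse_row(sample)
bad = parse_row('<row><c3 m="2">not_a_number</c3></row>')
print(good)
print(bad)
```

For the sample row, c2 should come back as two structs (the first with _m unset) and c3 as a single struct holding a double, with _corrupt_record empty. For the second row, only _corrupt_record should be populated. This is the outcome we'd like to get from the XML reader with our custom schema.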