Detection of corrupted rows in XML

Hi all, 

What is the correct way to detect and handle corrupted records in an XML file?

We have an XML file that we're trying to parse. Its fields can be one of the following types: `Double`, `String`, or `Array<Struct>`. We've been able to parse the data correctly with our custom schema. However, when we try to detect corrupted records by adding a new column called `_corrupt_record` to our custom schema and setting the option `columnNameOfCorruptRecord` to `_corrupt_record`, we get seemingly random exceptions such as: `The value (1.0) of the type (java.lang.Double) cannot be converted to an array of struct<_VALUE:string, _m:int>`, or `The value (-10000.0) of the type (java.lang.Double) cannot be converted to the string type`.

We've also noticed that once we set `columnNameOfCorruptRecord` and add the `_corrupt_record` column to our custom schema, the order of the columns in the resulting DataFrame no longer matches the submitted schema.

Here's a sample of our XML file data:

```
<row>
    <c1>data1</c1>
    <c2>data2</c2>
    <c2 m="2">data2_m2</c2>
    <c3 m="2">1234</c3>
</row>
```

The corresponding schema for this sample should be the following:

```
root
 |-- c1: string (nullable = true)
 |-- c2: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _m: integer (nullable = true)
 |-- c3: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _VALUE: double (nullable = true)
 |    |    |-- _m: integer (nullable = true)
```
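
To make the outcome we're hoping for concrete, here is a minimal plain-Python sketch using only the standard library. It is not how spark-xml works internally, just an illustration of the behaviour we expect: a row that fits the schema is mapped to `c1`, `c2`, `c3`, and a row that is malformed or has a value of the wrong type ends up whole in `_corrupt_record`. The names `parse_row` and `ARRAY_FIELDS` are hypothetical and exist only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping for this sketch: array<struct<_VALUE, _m>> columns
# and the Python type used to convert their _VALUE field.
ARRAY_FIELDS = {"c2": str, "c3": float}


def parse_row(raw):
    """Map one <row> string to a dict shaped like the schema above.

    If the row is not well-formed XML, or a value does not fit its declared
    type, all data columns are None and the raw text goes to _corrupt_record.
    """
    record = {"c1": None, "c2": None, "c3": None, "_corrupt_record": None}
    try:
        row = ET.fromstring(raw)
        for child in row:
            if child.tag == "c1":
                record["c1"] = child.text
            elif child.tag in ARRAY_FIELDS:
                convert = ARRAY_FIELDS[child.tag]
                m = child.get("m")
                struct = {
                    "_VALUE": convert(child.text) if child.text is not None else None,
                    "_m": int(m) if m is not None else None,
                }
                if record[child.tag] is None:
                    record[child.tag] = []
                record[child.tag].append(struct)
    except (ET.ParseError, ValueError):
        # Malformed XML, or a value that cannot be converted to the declared type.
        return {"c1": None, "c2": None, "c3": None, "_corrupt_record": raw}
    return record
```

With this sketch, a row such as `<row><c3 m="2">abc</c3></row>` would be kept whole in `_corrupt_record` instead of raising a conversion exception, which is what we'd like to achieve when reading with our custom schema.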
 
 
