This repository was archived by the owner on Mar 24, 2025. It is now read-only.

Detection of corrupted rows in XML #517

@mahkhalil

Hi all,

What is the correct way to detect and handle corrupted records in an XML file?

We have an XML file whose elements can hold one of the following types: Double, String, or Array<Struct>. With our custom schema, the data parses correctly. However, when we try to detect corrupted records by adding a _corrupt_record column to the custom schema and setting the option columnNameOfCorruptRecord to _corrupt_record, we get seemingly random exceptions such as "The value (1.0) of the type (java.lang.Double) cannot be converted to an array of struct<_VALUE:string, _m:int>" or "The value (-10000.0) of the type (java.lang.Double) cannot be converted to the string type".

We have also noticed that, once we set columnNameOfCorruptRecord and add the _corrupt_record column to our custom schema, the column order of the resulting DataFrame no longer matches the schema we submitted.
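For reference, the setup described above looks roughly like the following. This is a minimal PySpark sketch, not our exact code: the function name read_with_corrupt_column, the path argument, and the DDL form of the schema are assumptions made for illustration, and the rowTag value is taken from the sample in this issue. It assumes a SparkSession with the spark-xml package on the classpath.

```python
# Reader options for spark-xml. PERMISSIVE is the mode in which malformed
# rows are kept and their raw text is stored in the corrupt-record column.
READ_OPTIONS = {
    "rowTag": "row",
    "mode": "PERMISSIVE",
    "columnNameOfCorruptRecord": "_corrupt_record",
}

# Custom schema as a DDL string, with the extra string column that should
# receive the raw text of any row that fails to parse.
SCHEMA_DDL = (
    "c1 STRING, "
    "c2 ARRAY<STRUCT<_VALUE: STRING, _m: INT>>, "
    "c3 ARRAY<STRUCT<_VALUE: DOUBLE, _m: INT>>, "
    "_corrupt_record STRING"
)


def read_with_corrupt_column(spark, path):
    """Load an XML file with the custom schema and corrupt-record column.

    `spark` is an existing SparkSession; `path` is a hypothetical input path.
    """
    return (
        spark.read.format("xml")
        .options(**READ_OPTIONS)
        .schema(SCHEMA_DDL)
        .load(path)
    )
```

Rows that parsed cleanly would then be expected to have a null _corrupt_record, so filtering on that column being non-null is how we intended to isolate the bad rows.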

Here's a sample of our XML file data:

<row>
    <c1>data1</c1>
    <c2>data2</c2>
    <c2 m="2">data2_m2</c2>
    <c3 m="2">1234</c3>
</row>

The schema corresponding to this sample should be:

root
 |-- c1: string (nullable = true)
 |-- c2: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _m: integer (nullable = true)
 |-- c3: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _VALUE: double (nullable = true)
 |    |    |-- _m: integer (nullable = true)
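To make the intended mapping from the sample to this schema concrete, here is a small standard-library Python sketch. It is not how spark-xml works internally; the names SAMPLE, ARRAY_FIELDS and row_to_record are hypothetical, the attribute values are quoted so that the sample is well-formed XML, and treating any row that fails to parse as a corrupt record is an assumption about the behavior we expect.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<row>
    <c1>data1</c1>
    <c2>data2</c2>
    <c2 m="2">data2_m2</c2>
    <c3 m="2">1234</c3>
</row>"""

# Fields declared as array<struct<_VALUE, _m>> and the type of their _VALUE.
ARRAY_FIELDS = {"c2": str, "c3": float}


def row_to_record(xml_text):
    """Turn one <row> into a dict shaped like the schema above.

    A row that is not well-formed XML is returned untouched under
    _corrupt_record, with every other field left as None.
    """
    record = {"c1": None, "c2": None, "c3": None, "_corrupt_record": None}
    try:
        row = ET.fromstring(xml_text)
    except ET.ParseError:
        record["_corrupt_record"] = xml_text
        return record

    for child in row:
        if child.tag in ARRAY_FIELDS:
            cast = ARRAY_FIELDS[child.tag]
            m = child.get("m")  # the "m" attribute becomes the _m field
            element = {
                "_VALUE": cast(child.text),
                "_m": int(m) if m is not None else None,
            }
            if record[child.tag] is None:
                record[child.tag] = []
            record[child.tag].append(element)
        else:
            record[child.tag] = child.text
    return record


if __name__ == "__main__":
    print(row_to_record(SAMPLE))
```

In this sketch the two c2 elements end up in one array, the first with a null _m, and c3 becomes a one-element array whose _VALUE is a double.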
