This repository was archived by the owner on Mar 24, 2025. It is now read-only.

No error raised if xml has only one failed row #436

@Mimetis

Hi.

We are running some failure tests with the spark-xml package in Databricks.
We discovered that an XML file containing only one malformed row raises no error at all.

Let me explain it with a simple example.
I've used your /tests samples for reproduction:

Here is a malformed XML file that fails as expected:

<?xml version="1.0"?>
<catalog>
   <book id="Malformed attribute with " caracter ">
      <author>Kress, Peter</author>
      <title>Paradox Lost</title>
      <genre>Science Fiction</genre>
      <price>6.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems
      of being quantum.</description>
   </book>
   <book id='I'm malformed too!'>
      <author>O'Brien, Tim</author>
      <title>Microsoft .NET: The Programming Bible</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-09</publish_date>
      <description>Microsoft's .NET initiative is explored in
      detail in this deep programmer's reference.</description>
   </book>
   <book id="bk111">
      <author>O'Brien, Tim</author>
      <title>MSXML3: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-01</publish_date>
      <description>The Microsoft MSXML3 parser is covered in
      detail, with attention to XML DOM interfaces, XSLT processing,
      SAX and more.</description>
   </book>
   <book id="bk112">
      <author>Galos, Mike</author>
      <title>Visual Studio 7: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>49.95</price>
      <publish_date>2001-04-16</publish_date>
      <description>Microsoft Visual Studio 7 is explored in depth,
      looking at how Visual Basic, Visual C++, C#, and ASP+ are
      integrated into a comprehensive development
      environment.</description>
   </book>
</catalog>

The test uses the FAILFAST mode, but the default PERMISSIVE mode shows the same behavior in the end.

So far, the test is:

# COMMAND ----------
file_location = "/FileStore/tables/books_malformed_attributes-345be.xml"

df = (spark.read.format("xml")
  .option("rowTag", "book")
  .option("mode", "FAILFAST")
  .load(file_location))

display(df)

and the expected result is:

SparkException: Job aborted due to stage failure: Task 0 in stage 42.0 failed 4 times, most recent failure: Lost task 0.3 in stage 42.0 (TID 73, 10.139.64.4, executor 0): java.lang.IllegalArgumentException: Malformed line in FAILFAST mode: <book id="Malformed attribute with " caracter ">
      <author>Kress, Peter</author>
      <title>Paradox Lost</title>
      <genre>Science Fiction</genre>
      <price>6.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems
      of being quantum.</description>

But now, if I have a file with only one failing row, like this:

<?xml version="1.0"?>
<catalog>
   <book id="Malformed attribute with " caracter ">
</catalog>

With the same test, I would expect pretty much the same result as in the previous test, but unfortunately the result is:

(2) Spark Jobs
df:pyspark.sql.dataframe.DataFrame
OK
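
As a possible stopgap while this case goes unreported, the file can be checked for well-formedness before it is handed to Spark. The following is only a minimal sketch, assuming the file is small enough to be read into a string on the driver; `check_well_formed` is a hypothetical helper name, and it uses Python's standard `xml.etree.ElementTree` parser, not spark-xml itself, so it says nothing about how spark-xml splits rows.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text is well-formed, else the parser's error message.

    Hypothetical helper: it relies on the standard-library parser only
    and is independent of spark-xml.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return str(err)
    return None


# The single failing row from this report.
single_bad_row = """<?xml version="1.0"?>
<catalog>
   <book id="Malformed attribute with " caracter ">
</catalog>
"""

# A well-formed document for comparison (shortened from the sample above).
well_formed = """<?xml version="1.0"?>
<catalog>
   <book id="bk111"><title>MSXML3: A Comprehensive Guide</title></book>
</catalog>
"""

for name, text in [("single_bad_row", single_bad_row), ("well_formed", well_formed)]:
    error = check_well_formed(text)
    print(name, "->", "well-formed" if error is None else "malformed: " + error)
```

A strict parser rejects the whole document rather than a single row, so this check is coarser than FAILFAST, but it does flag the one-row file that currently passes silently.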

Any thoughts?

Additional question: do you support the Databricks option "badRecordsPath"?
