No error raised if XML has only one failed row #436

Hi.

We are running some *failure* tests with the `spark-xml` package on **Databricks**.
We discovered that an XML file containing only one failing row raises no error at all.

Let me explain with a simple example.  
I've used the samples from your */tests* folder for reproduction.

Here is a *good* malformed XML file, one that fails as it should:

```xml
<?xml version="1.0"?>
<catalog>
   <book id="Malformed attribute with " caracter ">
      <author>Kress, Peter</author>
      <title>Paradox Lost</title>
      <genre>Science Fiction</genre>
      <price>6.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems
      of being quantum.</description>
   </book>
   <book id='I'm malformed too!'>
      <author>O'Brien, Tim</author>
      <title>Microsoft .NET: The Programming Bible</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-09</publish_date>
      <description>Microsoft's .NET initiative is explored in
      detail in this deep programmer's reference.</description>
   </book>
   <book id="bk111">
      <author>O'Brien, Tim</author>
      <title>MSXML3: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-01</publish_date>
      <description>The Microsoft MSXML3 parser is covered in
      detail, with attention to XML DOM interfaces, XSLT processing,
      SAX and more.</description>
   </book>
   <book id="bk112">
      <author>Galos, Mike</author>
      <title>Visual Studio 7: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>49.95</price>
      <publish_date>2001-04-16</publish_date>
      <description>Microsoft Visual Studio 7 is explored in depth,
      looking at how Visual Basic, Visual C++, C#, and ASP+ are
      integrated into a comprehensive development
      environment.</description>
   </book>
</catalog>

```

The test uses the `FAILFAST` mode, but the default `PERMISSIVE` mode ultimately shows the same behavior.

So far, the test is:

``` py
# COMMAND ----------
file_location = "/FileStore/tables/books_malformed_attributes-345be.xml"

df = (spark.read.format("xml")
  .option("rowTag", "book")
  .option("mode", "FAILFAST")    
  .load(file_location))

display(df)
```
and the result is the expected failure:

``` txt
SparkException: Job aborted due to stage failure: Task 0 in stage 42.0 failed 4 times, most recent failure: Lost task 0.3 in stage 42.0 (TID 73, 10.139.64.4, executor 0): java.lang.IllegalArgumentException: Malformed line in FAILFAST mode: <book id="Malformed attribute with " caracter ">
      <author>Kress, Peter</author>
      <title>Paradox Lost</title>
      <genre>Science Fiction</genre>
      <price>6.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems
      of being quantum.</description>
```

But now, if I have a file with only one *failing* row, like this:

``` xml
<?xml version="1.0"?>
<catalog>
   <book id="Malformed attribute with " caracter ">
</catalog>
```

With the same test, the result should be much the same as in the previous case, but unfortunately it is:

``` txt
(2) Spark Jobs
df:pyspark.sql.dataframe.DataFrame
OK
```
![image](https://user-images.githubusercontent.com/4592555/74345746-d61b8f00-4dae-11ea-9551-58c875f2bf81.png)
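
As a sanity check independent of `spark-xml`, here is a minimal sketch using Python's standard-library `xml.etree.ElementTree`. It only confirms that a strict parser rejects the one-row file, which is why I would expect an error here too. The `is_well_formed` helper, the inlined string, and the well-formed control document are mine, for illustration only; they are not part of the Spark test above.

```python
import xml.etree.ElementTree as ET

# The one-row file from this issue, inlined as a string (illustration only).
ONE_ROW_XML = """<?xml version="1.0"?>
<catalog>
   <book id="Malformed attribute with " caracter ">
</catalog>
"""

# A made-up well-formed document, used as a control.
CONTROL_XML = '<catalog><book id="bk111"/></catalog>'


def is_well_formed(xml_text: str) -> bool:
    """Return True if the strict stdlib parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # Raised for any well-formedness violation, e.g. the stray quote above.
        return False


print(is_well_formed(ONE_ROW_XML))   # False
print(is_well_formed(CONTROL_XML))   # True
```

So the file itself is clearly invalid; the question is only why reading it in `FAILFAST` mode does not report that.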

Any thoughts?

**Additional question**: do you support the Databricks option *"badRecordsPath"*?


