Hi.
We are running some failure tests with the spark-xml package on Databricks.
We discovered that an XML file containing only a single failing row throws nothing.
Let me explain with a simple example.
I used the samples from your /tests directory for reproduction.
Here is a deliberately malformed XML file:
<?xml version="1.0"?>
<catalog>
<book id="Malformed attribute with " caracter ">
<author>Kress, Peter</author>
<title>Paradox Lost</title>
<genre>Science Fiction</genre>
<price>6.95</price>
<publish_date>2000-11-02</publish_date>
<description>After an inadvertant trip through a Heisenberg
Uncertainty Device, James Salway discovers the problems
of being quantum.</description>
</book>
<book id='I'm malformed too!'>
<author>O'Brien, Tim</author>
<title>Microsoft .NET: The Programming Bible</title>
<genre>Computer</genre>
<price>36.95</price>
<publish_date>2000-12-09</publish_date>
<description>Microsoft's .NET initiative is explored in
detail in this deep programmer's reference.</description>
</book>
<book id="bk111">
<author>O'Brien, Tim</author>
<title>MSXML3: A Comprehensive Guide</title>
<genre>Computer</genre>
<price>36.95</price>
<publish_date>2000-12-01</publish_date>
<description>The Microsoft MSXML3 parser is covered in
detail, with attention to XML DOM interfaces, XSLT processing,
SAX and more.</description>
</book>
<book id="bk112">
<author>Galos, Mike</author>
<title>Visual Studio 7: A Comprehensive Guide</title>
<genre>Computer</genre>
<price>49.95</price>
<publish_date>2001-04-16</publish_date>
<description>Microsoft Visual Studio 7 is explored in depth,
looking at how Visual Basic, Visual C++, C#, and ASP+ are
integrated into a comprehensive development
environment.</description>
</book>
</catalog>
The test uses the FAILFAST option, but the default PERMISSIVE mode ultimately shows the same behavior.
So far, the test is:
# COMMAND ----------
file_location = "/FileStore/tables/books_malformed_attributes-345be.xml"
df = (spark.read.format("xml")
.option("rowTag", "book")
.option("mode", "FAILFAST")
.load(file_location))
display(df)
and the expected result is:
SparkException: Job aborted due to stage failure: Task 0 in stage 42.0 failed 4 times, most recent failure: Lost task 0.3 in stage 42.0 (TID 73, 10.139.64.4, executor 0): java.lang.IllegalArgumentException: Malformed line in FAILFAST mode: <book id="Malformed attribute with " caracter ">
<author>Kress, Peter</author>
<title>Paradox Lost</title>
<genre>Science Fiction</genre>
<price>6.95</price>
<publish_date>2000-11-02</publish_date>
<description>After an inadvertant trip through a Heisenberg
Uncertainty Device, James Salway discovers the problems
of being quantum.</description>
But now, if I have a file containing only one failing row, like this:
<?xml version="1.0"?>
<catalog>
<book id="Malformed attribute with " caracter ">
</catalog>
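As a sanity check independent of spark-xml, Python's standard-library parser agrees that this single-row file is not well-formed, so I would expect the record to be treated as malformed rather than silently dropped (a minimal sketch that just inlines the file contents):

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the document parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

single_row_file = (
    '<?xml version="1.0"?>\n'
    '<catalog>\n'
    '<book id="Malformed attribute with " caracter ">\n'
    '</catalog>'
)

# Stray text after the closing quote of the attribute value makes this not well-formed
print(is_well_formed(single_row_file))  # False
```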
Now, with the same test, the expected result should be essentially the same as in the previous test, but unfortunately the actual result is:
(2) Spark Jobs
df:pyspark.sql.dataframe.DataFrame
OK
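For now, a crude cross-check we use to detect silently dropped rows is to compare the DataFrame's row count against a naive count of opening rowTag occurrences in the raw file. This is a workaround sketch, not spark-xml API, and the regex assumes the tag never appears inside comments or CDATA:

```python
import re

def count_row_tags(xml_text: str, row_tag: str = "book") -> int:
    # Naive count of opening <book ...> tags; ignores comments and CDATA
    return len(re.findall(rf"<{row_tag}[\s>/]", xml_text))

sample = (
    '<?xml version="1.0"?>\n'
    '<catalog>\n'
    '<book id="Malformed attribute with " caracter ">\n'
    '</catalog>'
)

# The raw file advertises 1 row, but spark-xml returns 0 rows here,
# so a mismatch with df.count() flags the silent drop.
print(count_row_tags(sample))  # 1
```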

Any thoughts?
Additional question: do you support the Databricks option "badRecordsPath"?