This repository was archived by the owner on Mar 24, 2025. It is now read-only.

Reading multiple XML files in parallel results in an invalid schema #581

@sandeep-katta0102

Description

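The XML connector has a concurrency issue: if two XML files are read at the same time with different rowTag values, there is a high chance that one of them is parsed with the wrong row tag and ends up with an empty schema. The cause is the code below, which writes the per-read row tags and charset into the shared context.hadoopConfiguration: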

context.hadoopConfiguration.set(XmlInputFormat.START_TAG_KEY, s"<$rowTag>")
context.hadoopConfiguration.set(XmlInputFormat.END_TAG_KEY, s"</$rowTag>")
context.hadoopConfiguration.set(XmlInputFormat.ENCODING_KEY, charset)
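To make the suspected interleaving concrete, here is a minimal, dependency-free sketch. It does not use the connector or Hadoop: a plain mutable map stands in for the shared configuration, and the names `SharedConfRace`, `StartTagKey`, `EndTagKey`, `setTags` and `startTag` are hypothetical, chosen only for illustration. It replays one bad ordering deterministically (both reads set their tags before either consumes them) and contrasts it with giving each read its own copy, which is one possible direction for a fix rather than the project's actual change.

```scala
import scala.collection.mutable

object SharedConfRace {
  // Hypothetical stand-ins for the connector's configuration keys.
  val StartTagKey = "start.tag"
  val EndTagKey = "end.tag"

  // A plain map models the configuration object; this is an assumption, not the Hadoop API.
  type Conf = mutable.Map[String, String]

  // Mirrors the snippet above: writes the row tags into whatever conf it is given.
  def setTags(conf: Conf, rowTag: String): Unit = {
    conf(StartTagKey) = s"<$rowTag>"
    conf(EndTagKey) = s"</$rowTag>"
  }

  // What a record reader would see when it later looks up its start tag.
  def startTag(conf: Conf): String = conf(StartTagKey)

  // Both reads write to one shared conf before either one consumes its tags,
  // so the "person" read observes the tags written by the "book" read.
  def sharedConfRun(): (String, String) = {
    val shared: Conf = mutable.Map.empty
    setTags(shared, "person") // read of ages.xml
    setTags(shared, "book")   // read of books.xml overwrites the same keys
    (startTag(shared), startTag(shared))
  }

  // Each read gets its own copy of the base conf, so the writes cannot collide.
  def copiedConfRun(): (String, String) = {
    val base: Conf = mutable.Map.empty
    val agesConf = base.clone()
    val booksConf = base.clone()
    setTags(agesConf, "person")
    setTags(booksConf, "book")
    (startTag(agesConf), startTag(booksConf))
  }

  def main(args: Array[String]): Unit = {
    println(sharedConfRun()) // (<book>,<book>)
    println(copiedConfRun()) // (<person>,<book>)
  }
}
```

In the shared case the ages.xml read ends up with the `<book>` tag, which matches the incorrect log reported in this issue.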

Steps to reproduce

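import scala.collection.mutable

// `spark` (SparkSession) and `resDir` (test resources directory) come from the test suite.

// Threads 1 to 10 read ages.xml with rowTag "person"; an empty schema marks a failure.
val failedAgesSet = mutable.Set[Long]()
val threads_ages = (1 to 10).map { i =>
  new Thread {
    override def run(): Unit = {
      val df = spark.read.option("rowTag", "person").format("xml")
        .load(resDir + "ages.xml")
      if (df.schema.fields.isEmpty) {
        // mutable.Set is not thread-safe, so guard concurrent updates
        failedAgesSet.synchronized { failedAgesSet.add(i) }
      }
    }
  }
}

// Threads 11 to 20 read books.xml with rowTag "book" at the same time.
val failedBooksSet = mutable.Set[Long]()
val threads_books = (11 to 20).map { i =>
  new Thread {
    override def run(): Unit = {
      val df = spark.read.option("rowTag", "book").format("xml")
        .load(resDir + "books.xml")
      if (df.schema.fields.isEmpty) {
        failedBooksSet.synchronized { failedBooksSet.add(i) }
      }
    }
  }
}

// Start both groups together so their reads overlap, then wait for all of them.
threads_ages.foreach(_.start())
threads_books.foreach(_.start())
threads_ages.foreach(_.join())
threads_books.foreach(_.join())
assert(failedBooksSet.isEmpty)
assert(failedAgesSet.isEmpty)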

Correct Log

22/05/31 20:53:12 INFO |Executor task launch worker for task 0.0 in stage 6.0 (TID 6)| xml.XmlRecordReader: file is file:/Users/sandeep.katta/sourcecode/databricks/spark-xml/spark-xml/src/test/resources/books.xml:0+5542 and startTag is <book> and endTag is </book>

Incorrect Log: ages.xml is parsed with the tag book, but the tag should be person

22/05/31 20:53:12 INFO |Executor task launch worker for task 0.0 in stage 5.0 (TID 5)| xml.XmlRecordReader: file is file:/Users/sandeep.katta/sourcecode/databricks/spark-xml/spark-xml/src/test/resources/ages.xml:0+265 and startTag is <book> and endTag is </book>
