This repository was archived by the owner on Mar 24, 2025. It is now read-only.

Clone HadoopConf to avoid cross usage of tags while parsing the xml#582

Merged
srowen merged 1 commit into databricks:master from sandeep-katta0102:master on Jun 2, 2022

Contributor

sandeep-katta0102 commented Jun 2, 2022

This PR fixes issue #581.
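
To make the idea in the title concrete, here is a minimal sketch of why cloning the Hadoop conf helps. It is not the actual patch: the key name `example.xml.rowTag` and the helper functions are hypothetical, and the only real API used is Hadoop's `Configuration` (its copy constructor, `set` and `get`). It assumes the row tag ends up as a key on a `Configuration` object that several reads can see.

```scala
import org.apache.hadoop.conf.Configuration

// Hypothetical key name, standing in for wherever the row tag is stored.
val RowTagKey = "example.xml.rowTag"

// One Configuration visible to every read, like a session-wide Hadoop conf.
val shared = new Configuration(false)

// Without cloning: every read writes its tag into the same object.
def unsafeConfFor(rowTag: String): Configuration = {
  shared.set(RowTagKey, rowTag)
  shared
}

// With cloning: the copy constructor gives each read a private Configuration.
def clonedConfFor(rowTag: String): Configuration = {
  val conf = new Configuration(shared)
  conf.set(RowTagKey, rowTag)
  conf
}

val agesUnsafe = unsafeConfFor("person")
val booksUnsafe = unsafeConfFor("book")
// The "ages" read now sees the tag written by the "books" read.
println(agesUnsafe.get(RowTagKey))  // prints "book"

val agesCloned = clonedConfFor("person")
val booksCloned = clonedConfFor("book")
println(agesCloned.get(RowTagKey))  // prints "person"
println(booksCloned.get(RowTagKey)) // prints "book"
```

The sketch runs the two "reads" one after the other so the overwrite is deterministic; with real threads the same overwrite happens only under unlucky interleaving, which is why the manual check below uses many concurrent reads.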

Added unit tests and also verified the fix manually in spark-shell using the code below:

import scala.collection.mutable

// Job-group IDs of runs where the inferred schema came back empty (the bug).
val jobGroupId_ages = mutable.Set[Long]()

// Ten concurrent reads of ages.xml with rowTag "person".
val threads_ages = (1001 to 1010).map { i =>
  new Thread {
    override def run(): Unit = {
      sc.setJobGroup(s"$i", s"$i")
      val df = spark.read.option("rowTag", "person").format("xml").load("file:/Users/XXXX/spark-xml/src/test/resources/ages.xml")
      if (df.schema.fields.isEmpty) {
        println(s"found repro for the ages run $i **********************")
        // mutable.Set is not thread-safe, so guard the concurrent add.
        jobGroupId_ages.synchronized { jobGroupId_ages.add(i) }
      }
    }
  }
}


val jobGroupId_books = mutable.Set[Long]()

// Ten concurrent reads of books.xml with rowTag "book".
val threads = (1 to 10).map { i =>
  new Thread {
    override def run(): Unit = {
      sc.setJobGroup(s"$i", s"$i")
      val df = spark.read.option("rowTag", "book").format("xml").load("file:/Users/XXXX/spark-xml/src/test/resources/books.xml")
      if (df.schema.fields.isEmpty) {
        println(s"found repro for the book run $i **********************")
        jobGroupId_books.synchronized { jobGroupId_books.add(i) }
      }
    }
  }
}

// Start both groups together so the reads overlap, then wait for all of them.
threads_ages.foreach(_.start())
threads.foreach(_.start())
threads_ages.foreach(_.join())
threads.foreach(_.join())
// Both sizes should be 0 once the fix is applied.
println(s" jobGroupId_books is ${jobGroupId_books.size} ")
println(s" jobGroupId_ages is ${jobGroupId_ages.size} ")

Before fix
image

After fix
image

Member

@HyukjinKwon HyukjinKwon left a comment

LGTM, thanks @sandeep-katta0102

@HyukjinKwon
Member

cc @srowen FYI if you find some time to take a look 🙏

@srowen srowen merged commit 1e25d7b into databricks:master Jun 2, 2022
@srowen srowen added the bug label Jun 2, 2022
@srowen srowen added this to the 0.15.0 milestone Jun 2, 2022
@HyukjinKwon
Member

@srowen just out of curiosity, when do we roughly plan to have the next release?

@srowen
Collaborator

srowen commented Jun 3, 2022

No particular schedule -- on demand. Is this a sorta important fix? It's easy to roll a new release, and it has been 7 months or so since the last one, so seems OK to me.

@HyukjinKwon
Member

HyukjinKwon commented Jun 3, 2022

Not super critical, but I think it's good to have one ... could we make a release maybe? I will take a look and try the release around next week if you can't find time to take a look 👍

@srowen
Collaborator

srowen commented Jun 3, 2022

OK I can do it tomorrow I think

@HyukjinKwon
Member

Thank you

@srowen
Collaborator

srowen commented Jun 3, 2022

Done, 0.15.0 is released with this change

@HyukjinKwon
Member

Thanks!!!!
