Hello spark-xml team,
We have been using spark-xml heavily; thanks for the package!
Recently we tried spark-xml on ISO-8859-1 encoded XML that contains Nordic characters, e.g.
name="Företagsinteckningar"
The Nordic characters would not parse correctly, despite setting the charset option to ISO-8859-1.
We traced the issue to this method in com.databricks.spark.xml.util.XmlFile:
def withCharset(
    context: SparkContext,
    location: String,
    charset: String,
    rowTag: String): RDD[String] = {
  // This just checks the charset's validity early, to keep behavior
  Charset.forName(charset)
  context.hadoopConfiguration.set(XmlInputFormat.START_TAG_KEY, s"<$rowTag>")
  context.hadoopConfiguration.set(XmlInputFormat.END_TAG_KEY, s"</$rowTag>")
  context.hadoopConfiguration.set(XmlInputFormat.ENCODING_KEY, charset)
  context.newAPIHadoopFile(location,
    classOf[XmlInputFormat],
    classOf[LongWritable],
    classOf[Text]).map { case (_, text) => new String(text.getBytes, 0, text.getLength, charset) }
}
The problem with (_, text) => new String(text.getBytes, 0, text.getLength, charset) seems to be that the Text value already contains UTF-8 encoded bytes, populated by the XML input format; it is not encoded in the user-supplied charset, so re-decoding those bytes with that charset corrupts any non-ASCII characters.
After changing the code above to (_, text) => text.toString(), the Nordic characters parsed correctly.
This seems to be a common problem when parsing XML with non-ASCII characters, so it would be great to get the fix into the package.
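To make the double-decoding concrete, here is a minimal standalone sketch (no Spark or Hadoop required, just plain JVM string APIs) that mimics what happens: bytes that are actually UTF-8 get decoded with ISO-8859-1, which mangles "ö", while decoding with the encoding the bytes actually use round-trips correctly:

```scala
object EncodingDemo extends App {
  val original = "Företagsinteckningar"

  // Hadoop's Text always stores its payload as UTF-8, so this models
  // what text.getBytes returns after the input format has filled it in.
  val utf8Bytes = original.getBytes("UTF-8")

  // Buggy path: re-decode the UTF-8 bytes with the user-supplied charset.
  // "ö" is two bytes in UTF-8, so ISO-8859-1 turns it into two characters.
  val buggy = new String(utf8Bytes, "ISO-8859-1")

  // Correct path: decode with the encoding the bytes actually use,
  // which is what text.toString() does internally.
  val fixed = new String(utf8Bytes, "UTF-8")

  println(s"buggy: $buggy") // mojibake
  println(s"fixed: $fixed") // Företagsinteckningar
}
```

The charset option is still needed so XmlInputFormat can decode the raw file bytes; the bug is only in decoding a second time after the bytes are already UTF-8.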
Thanks