Hello spark-xml team,
We have been using spark-xml heavily; thanks for the package!
Recently we tried spark-xml on ISO-8859-1 encoded XML that contains Nordic characters, e.g.
name="Företagsinteckningar"
The Nordic characters would not parse correctly, despite setting the charset option to ISO-8859-1.
We traced the issue to this method in com.databricks.spark.xml.util.XmlFile:
def withCharset(
    context: SparkContext,
    location: String,
    charset: String,
    rowTag: String): RDD[String] = {
  // This just checks the charset's validity early, to keep behavior
  Charset.forName(charset)
  context.hadoopConfiguration.set(XmlInputFormat.START_TAG_KEY, s"<$rowTag>")
  context.hadoopConfiguration.set(XmlInputFormat.END_TAG_KEY, s"</$rowTag>")
  context.hadoopConfiguration.set(XmlInputFormat.ENCODING_KEY, charset)
  context.newAPIHadoopFile(location,
    classOf[XmlInputFormat],
    classOf[LongWritable],
    classOf[Text]).map { case (_, text) => new String(text.getBytes, 0, text.getLength, charset) }
}
The problem with (_, text) => new String(text.getBytes, 0, text.getLength, charset) seems to be that the Text value already contains UTF-8 encoded bytes, populated by the XML input format; it is not encoded in the user-supplied charset, so re-decoding those bytes with that charset corrupts any non-ASCII characters.
After changing the code above to (_, text) => text.toString(), the Nordic characters parsed correctly.
This seems to be a common problem when parsing XML with non-ASCII characters, so it would be great to get the fix into the package.
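To make the double-decoding concrete, here is a minimal standalone sketch (no Spark or Hadoop required, just plain JVM string APIs) that mimics what happens: bytes that are actually UTF-8 get decoded with ISO-8859-1, which mangles "ö", while decoding with the encoding the bytes actually use round-trips correctly:

```scala
object EncodingDemo extends App {
  val original = "Företagsinteckningar"

  // Hadoop's Text always stores its payload as UTF-8, so this models
  // what text.getBytes returns after the input format has filled it in.
  val utf8Bytes = original.getBytes("UTF-8")

  // Buggy path: re-decode the UTF-8 bytes with the user-supplied charset.
  // "ö" is two bytes in UTF-8, so ISO-8859-1 turns it into two characters.
  val buggy = new String(utf8Bytes, "ISO-8859-1")

  // Correct path: decode with the encoding the bytes actually use,
  // which is what text.toString() does internally.
  val fixed = new String(utf8Bytes, "UTF-8")

  println(s"buggy: $buggy") // mojibake
  println(s"fixed: $fixed") // Företagsinteckningar
}
```

The charset option is still needed so XmlInputFormat can decode the raw file bytes; the bug is only in decoding a second time after the bytes are already UTF-8.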
Thanks