This repository was archived by the owner on Mar 24, 2025. It is now read-only.

Can't Parse Encoding With Non Ascii Characters #510

@xshenigx1

Description

Hello Spark-xml team,

We have been using spark-xml a lot, thanks for the package!

Recently we tried spark-xml with an ISO-8859-1 encoded XML file that contains Nordic characters, e.g.

name="Företagsinteckningar"

We could not get the Nordic characters to parse correctly, despite setting the charset option to ISO-8859-1.

Traced the issue to this method in com.databricks.spark.xml.util.XmlFile:

def withCharset(
    context: SparkContext,
    location: String,
    charset: String,
    rowTag: String): RDD[String] = {
  // This just checks the charset's validity early, to keep behavior
  Charset.forName(charset)
  context.hadoopConfiguration.set(XmlInputFormat.START_TAG_KEY, s"<$rowTag>")
  context.hadoopConfiguration.set(XmlInputFormat.END_TAG_KEY, s"</$rowTag>")
  context.hadoopConfiguration.set(XmlInputFormat.ENCODING_KEY, charset)
  context.newAPIHadoopFile(location,
    classOf[XmlInputFormat],
    classOf[LongWritable],
    classOf[Text]).map { case (_, text) =>
      new String(text.getBytes, 0, text.getLength, charset)
  }
}

The problem with (_, text) => new String(text.getBytes, 0, text.getLength, charset) seems to be that the Text value already contains UTF-8 encoded bytes, populated by the XML parser; it is not encoded in the source charset. Decoding those UTF-8 bytes with the source charset corrupts any non-ASCII characters.

After changing the above code to (_, text) => text.toString(), the Nordic characters parsed correctly.
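The double-decode can be reproduced outside Spark with a minimal, standalone sketch (plain Java, no Hadoop dependency; the class name CharsetDemo is made up for illustration). Hadoop's Text stores its payload as UTF-8 regardless of the charset of the source file, so re-decoding text.getBytes with the source charset mangles multi-byte characters, while decoding as UTF-8 (which is what Text.toString() does) round-trips correctly:

```java
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    public static void main(String[] args) {
        // The bytes that Text.getBytes would hand back: always UTF-8,
        // even when the original file was ISO-8859-1.
        String original = "Företagsinteckningar";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Buggy path: interpret the UTF-8 bytes as ISO-8859-1,
        // as the current map function effectively does.
        String buggy = new String(utf8Bytes, StandardCharsets.ISO_8859_1);

        // Fixed path: decode as UTF-8, equivalent to text.toString().
        String fixed = new String(utf8Bytes, StandardCharsets.UTF_8);

        System.out.println(buggy); // FÃ¶retagsinteckningar (mojibake)
        System.out.println(fixed); // Företagsinteckningar
    }
}
```

Each two-byte UTF-8 sequence (e.g. 0xC3 0xB6 for "ö") becomes two separate ISO-8859-1 characters ("Ã¶") on the buggy path, which matches the corruption we observed.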

This seems to be a common problem when parsing XML with non-ASCII characters; it would be great to get the fix into the package.

Thanks
