Hi,
I am trying to infer the schema of a complex XML column in a PySpark DataFrame using the function below.
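from pyspark.sql.types import _parse_datatype_json_string

# `spark` is assumed to be an active SparkSession with spark-xml on the classpath
def ext_schema_of_xml_df(df, options={}):
    # The DataFrame must contain exactly one column (the XML strings)
    assert len(df.columns) == 1
    scala_options = spark._jvm.PythonUtils.toScalaMap(options)
    java_xml_module = getattr(getattr(
        spark._jvm.com.databricks.spark.xml, "package$"), "MODULE$")
    java_schema = java_xml_module.schema_of_xml_df(df._jdf, scala_options)
    return _parse_datatype_json_string(java_schema.json())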
When I infer the schema with the field value '8E9N', the data type is returned as StringType. But when the value ends with the letter 'D' or 'F' (e.g. '8E9D', '8E8F'), the data type is returned as DoubleType. Ideally these values should also be treated as StringType.
Please find attached the screenshot and the code to reproduce the issue.
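My guess is that the inference ends up relying on JVM double parsing: Java's Double.valueOf accepts an optional trailing type suffix (f, F, d, D), so "8E9D" would parse as 8e9. I have not confirmed this in the spark-xml source. The pure-Python sketch below only approximates that decimal grammar (the helper name looks_like_java_double is my own, and NaN, Infinity, hex literals, and whitespace handling are ignored) to show which of my values such a parser would accept:

```python
import re

# Simplified approximation of the decimal strings accepted by Java's
# Double.valueOf: optional sign, digits with an optional fraction, an
# optional exponent, and an optional float-type suffix (f, F, d, D).
_JAVA_DOUBLE_RE = re.compile(
    r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?[fFdD]?"
)


def looks_like_java_double(value: str) -> bool:
    """Return True if the whole string matches the simplified grammar."""
    return _JAVA_DOUBLE_RE.fullmatch(value) is not None


for sample in ["8E9N", "8E9D", "8E8F"]:
    print(sample, looks_like_java_double(sample))
```

Under this approximation '8E9N' is rejected while '8E9D' and '8E8F' are accepted, which lines up with the StringType / DoubleType split I am seeing.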



from pyspark.sql.functions import expr

# Repro with '8E9N' (inferred as StringType)
# Create a DataFrame with a single column
df = spark.createDataFrame([(1,)], ["id"])
# Add an XML column containing the value under test
df = df.withColumn("XML_Column", expr(
    'concat_ws("", '
    ' "<Root>", '
    ' "<contract>", '
    ' "<contract_num>8E9N</contract_num>", '
    ' "</contract>", '
    ' "</Root>"'
    ')'
))
payloadSchema = ext_schema_of_xml_df(df.select("XML_Column"))
print(payloadSchema)
# Repro with '8E9D' (inferred as DoubleType)
# Create a DataFrame with a single column
df = spark.createDataFrame([(1,)], ["id"])
# Add an XML column containing the value under test
df = df.withColumn("XML_Column", expr(
    'concat_ws("", '
    ' "<Root>", '
    ' "<contract>", '
    ' "<contract_num>8E9D</contract_num>", '
    ' "</contract>", '
    ' "</Root>"'
    ')'
))
payloadSchema = ext_schema_of_xml_df(df.select("XML_Column"))
print(payloadSchema)
# Repro with '8E8F' (inferred as DoubleType)
# Create a DataFrame with a single column
df = spark.createDataFrame([(1,)], ["id"])
# Add an XML column containing the value under test
df = df.withColumn("XML_Column", expr(
    'concat_ws("", '
    ' "<Root>", '
    ' "<contract>", '
    ' "<contract_num>8E8F</contract_num>", '
    ' "</contract>", '
    ' "</Root>"'
    ')'
))
payloadSchema = ext_schema_of_xml_df(df.select("XML_Column"))
print(payloadSchema)