Fix for xml expression to not parse arbitrary strings by xanderbailey · Pull Request #679 · databricks/spark-xml

xanderbailey · 2024-04-06T14:07:58Z

Previously it was the case that we would parse the string of the column as the column expression argument for the expression. This leads to being able to execute arbitrary spark SQL if you parse the expression a string literal. It also means that string literals don't work with this expression and it doesn't work within an array transform expression either.

srowen · 2024-04-06T14:21:47Z

This project is now incorporated into Apache Spark. Could you open the pull request there? if it's accepted, I will back-port it here just in case.

Can you give an example of what no longer works (and should not), and/or what didn't work before and does now?

xanderbailey · 2024-04-08T08:56:57Z

Thanks for the quick response @srowen! Looks like it's being added to spark 4.0 here https://github.com/apache/spark/blame/29d077fbbd5464f64e0eeb495f7a955850915cc5/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L7146? Reading this path it seems like it isn't using the same constructor as this library so I think we should be good to just back port this here?

I can't think of an example of something that doesn't work now that would have worked before...

Cases that previously didn't work and do now:

from_xml(new Literal("<root><name>foo</name></root>"), schema) <- string literal is read by the SQL parser as sql which is clearly isn't.
functions.transform(new Column("array"), (element, _index) -> from_xml(element, schema)) <- the variable reference is removed here when you use the SQL parser so this throws when planning.

xanderbailey · 2024-04-10T10:06:45Z

@srowen is there a way for us to tag a new version for this fix?

srowen · 2024-04-10T12:36:35Z

Possibly, but does this block anything? it seems like the issues it avoids are things the caller can just not do, or do I misunderstand

xanderbailey · 2024-04-10T12:54:01Z

I got around this by under the underlying spark expression rather than using functions so yes there's a work around but it's causing unexpected errors at the moment when using string literals and when used within an array transform.

srowen · 2024-04-10T16:07:23Z

I just released 0.18.0 with this change: https://github.com/databricks/spark-xml/releases/tag/v0.18.0

Fix for xml expression to not parse arbitrary strings

d0235c5

srowen approved these changes Apr 8, 2024

View reviewed changes

srowen merged commit c42d6bc into databricks:master Apr 8, 2024

srowen added this to the 0.18.0 milestone Apr 10, 2024

srowen added the bug label Apr 10, 2024

srowen assigned xanderbailey Apr 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix for xml expression to not parse arbitrary strings#679

Fix for xml expression to not parse arbitrary strings#679
srowen merged 1 commit intodatabricks:masterfrom
xanderbailey:xb/fix_xml_sql

xanderbailey commented Apr 6, 2024

Uh oh!

srowen commented Apr 6, 2024

Uh oh!

xanderbailey commented Apr 8, 2024

Uh oh!

xanderbailey commented Apr 10, 2024

Uh oh!

srowen commented Apr 10, 2024

Uh oh!

xanderbailey commented Apr 10, 2024

Uh oh!

srowen commented Apr 10, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

xanderbailey commented Apr 6, 2024

Uh oh!

srowen commented Apr 6, 2024

Uh oh!

xanderbailey commented Apr 8, 2024

Uh oh!

xanderbailey commented Apr 10, 2024

Uh oh!

srowen commented Apr 10, 2024

Uh oh!

xanderbailey commented Apr 10, 2024

Uh oh!

srowen commented Apr 10, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants