Skip to content
This repository was archived by the owner on Mar 24, 2025. It is now read-only.

Fix for xml expression to not parse arbitrary strings#679

Merged
srowen merged 1 commit intodatabricks:masterfrom
xanderbailey:xb/fix_xml_sql
Apr 8, 2024
Merged

Fix for xml expression to not parse arbitrary strings#679
srowen merged 1 commit intodatabricks:masterfrom
xanderbailey:xb/fix_xml_sql

Conversation

@xanderbailey
Copy link
Copy Markdown
Contributor

Previously it was the case that we would parse the string of the column as the column expression argument for the expression. This leads to being able to execute arbitrary spark SQL if you parse the expression a string literal. It also means that string literals don't work with this expression and it doesn't work within an array transform expression either.

@srowen
Copy link
Copy Markdown
Collaborator

srowen commented Apr 6, 2024

This project is now incorporated into Apache Spark. Could you open the pull request there? if it's accepted, I will back-port it here just in case.

Can you give an example of what no longer works (and should not), and/or what didn't work before and does now?

@xanderbailey
Copy link
Copy Markdown
Contributor Author

Thanks for the quick response @srowen! Looks like it's being added to spark 4.0 here https://github.com/apache/spark/blame/29d077fbbd5464f64e0eeb495f7a955850915cc5/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L7146? Reading this path it seems like it isn't using the same constructor as this library so I think we should be good to just back port this here?

I can't think of an example of something that doesn't work now that would have worked before...

Cases that previously didn't work and do now:

from_xml(new Literal("<root><name>foo</name></root>"), schema) <- string literal is read by the SQL parser as sql which is clearly isn't.
functions.transform(new Column("array"), (element, _index) -> from_xml(element, schema)) <- the variable reference is removed here when you use the SQL parser so this throws when planning.

@srowen srowen merged commit c42d6bc into databricks:master Apr 8, 2024
@xanderbailey
Copy link
Copy Markdown
Contributor Author

@srowen is there a way for us to tag a new version for this fix?

@srowen
Copy link
Copy Markdown
Collaborator

srowen commented Apr 10, 2024

Possibly, but does this block anything? it seems like the issues it avoids are things the caller can just not do, or do I misunderstand

@xanderbailey
Copy link
Copy Markdown
Contributor Author

I got around this by under the underlying spark expression rather than using functions so yes there's a work around but it's causing unexpected errors at the moment when using string literals and when used within an array transform.

@srowen srowen added this to the 0.18.0 milestone Apr 10, 2024
@srowen srowen added the bug label Apr 10, 2024
@srowen
Copy link
Copy Markdown
Collaborator

srowen commented Apr 10, 2024

I just released 0.18.0 with this change: https://github.com/databricks/spark-xml/releases/tag/v0.18.0

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants