Skip to content

Fix JSON parser to allow control characters in JSON string input#11433

Closed
philo-he wants to merge 1 commit intofacebookincubator:mainfrom
philo-he:fix-escaped-char
Closed

Fix JSON parser to allow control characters in JSON string input#11433
philo-he wants to merge 1 commit intofacebookincubator:mainfrom
philo-he:fix-escaped-char

Conversation

@philo-he
Copy link
Copy Markdown
Contributor

@philo-he philo-he commented Nov 5, 2024

JSON string input can contain control characters, e.g., new line character \n. It can be
correctly handled by Presto, but not allowed by simdjson lib. If it exists in input, simdjson
will return UNESCAPED_CHARS error, then Velox will output null. Simdjson only allows literal
\n (represented by \\n). See its code link.

Here is a test result with presto.

SELECT json_extract('{"c1":"ab\ncd"}', '$.c1');
  _col0
----------
 "ab\ncd"
(1 row)

Just created one discussion thread in simdjson community: simdjson/simdjson#2287.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 5, 2024
@netlify
Copy link
Copy Markdown

netlify Bot commented Nov 5, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 960f35d
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/6729afcfa84e260008816259

@kgpai
Copy link
Copy Markdown
Contributor

kgpai commented Nov 7, 2024

Hi @philo-he , I am working on canonicalization of jsons (#11284); Ideally we aim to handle escaping during canonicalization (i.e from json_parse or when say cast(x to json)).
Looking into the example you raised, I am wondering if this is another way we should handle canonicalization (cc: @gggrace14 )

@philo-he
Copy link
Copy Markdown
Contributor Author

philo-he commented Nov 8, 2024

@kgpai, if simdjson accepts the proposal for JSON canonicalization or some other required functionalities like this, I think it would be better.

@lemire
Copy link
Copy Markdown

lemire commented Jan 15, 2025

You cannot have an unescaped line feed character in a string in JSON.

The representation of strings is similar to conventions used in the C
family of programming languages. A string begins and ends with
quotation marks. All Unicode characters may be placed within the
quotation marks, except for the characters that MUST be escaped:
quotation mark, reverse solidus, and the control characters (U+0000
through U+001F).

https://www.rfc-editor.org/rfc/rfc8259

Here is what you get in JavaScript:

Screenshot 2025-01-14 at 9 25 25 PM

Here is with the example from the issue:

Screenshot 2025-01-14 at 9 26 40 PM

@philo-he
Copy link
Copy Markdown
Contributor Author

@lemire, thank you so much for your feedback!

It appears that Spark's or Presto's JSON parser doesn't strictly meet JSON standard. I understand this proposed code change cannot be accepted by simdjson lib. @kgpai, do you have any thought? Should we create some patch file in Velox to fix the inconsistency like what this pr does?

@stale
Copy link
Copy Markdown

stale Bot commented May 19, 2025

This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!

@lemire
Copy link
Copy Markdown

lemire commented Jul 3, 2025

@philo-he

I understand this proposed code change cannot be accepted by simdjson lib.

I never answered this comment.

That's not correct. We have secret and undocumented extensions within simdjson that are similar to what you describe. What we won't do is change the default behaviour which is tied to the JSON specification. However, we allow 'secret' customization. Of course, before you submit a patch to simdjson (which would need to be guarded my macros and disabled by default), you have to think hard: do you really want to deviate from the specification ?

As I have illustrated, your requested extensions won't pass in standard systems like JavaScript runtimes. It seems to me that you are dealing with an actual bug. Someone is serializing data to JSON improperly.

Instead of propagating the issue, wouldn't it be simpler to file an issue ? Who is generating this broken JSON ?

@philo-he
Copy link
Copy Markdown
Contributor Author

philo-he commented Jul 4, 2025

As I have illustrated, your requested extensions won't pass in standard systems like JavaScript runtimes. It seems to me that you are dealing with an actual bug. Someone is serializing data to JSON improperly.
Instead of propagating the issue, wouldn't it be simpler to file an issue ? Who is generating this broken JSON ?

@lemire, we need to ensure that the results align with the Spark JSON parser, even when handling malformed JSON. Many of our users cannot control the data producer to prevent such JSON data.

@lemire
Copy link
Copy Markdown

lemire commented Jul 4, 2025

@philo-he

I have a different take on the question...

To reiterate, there are hidden/undocumented ways in simdjson to support broken JSON, but for the reasons that I make clear in the article, I strongly discourage it.

@philo-he
Copy link
Copy Markdown
Contributor Author

@philo-he

I have a different take on the question...

To reiterate, there are hidden/undocumented ways in simdjson to support broken JSON, but for the reasons that I make clear in the article, I strongly discourage it.

@lemire, thanks for sharing this insightful article! I will take more considerations and possibly have a discussion with Spark community to align on handling such malformed JSON.

@stale
Copy link
Copy Markdown

stale Bot commented Oct 11, 2025

This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. stale

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants