Fix JSON parser to allow control characters in JSON string input#11433
Fix JSON parser to allow control characters in JSON string input#11433philo-he wants to merge 1 commit intofacebookincubator:mainfrom
Conversation
✅ Deploy Preview for meta-velox canceled.
|
901ea92 to
960f35d
Compare
|
Hi @philo-he , I am working on canonicalization of jsons (#11284); Ideally we aim to handle escaping during canonicalization (i.e from json_parse or when say cast(x to json)). |
|
@kgpai, if simdjson accepts the proposal for JSON canonicalization or some other required functionalities like this, I think it would be better. |
|
You cannot have an unescaped line feed character in a string in JSON.
https://www.rfc-editor.org/rfc/rfc8259 Here is what you get in JavaScript:
Here is with the example from the issue:
|
|
@lemire, thank you so much for your feedback! It appears that Spark's or Presto's JSON parser doesn't strictly meet JSON standard. I understand this proposed code change cannot be accepted by simdjson lib. @kgpai, do you have any thought? Should we create some patch file in Velox to fix the inconsistency like what this pr does? |
|
This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions! |
I never answered this comment. That's not correct. We have secret and undocumented extensions within simdjson that are similar to what you describe. What we won't do is change the default behaviour which is tied to the JSON specification. However, we allow 'secret' customization. Of course, before you submit a patch to simdjson (which would need to be guarded my macros and disabled by default), you have to think hard: do you really want to deviate from the specification ? As I have illustrated, your requested extensions won't pass in standard systems like JavaScript runtimes. It seems to me that you are dealing with an actual bug. Someone is serializing data to JSON improperly. Instead of propagating the issue, wouldn't it be simpler to file an issue ? Who is generating this broken JSON ? |
@lemire, we need to ensure that the results align with the Spark JSON parser, even when handling malformed JSON. Many of our users cannot control the data producer to prevent such JSON data. |
|
I have a different take on the question...
To reiterate, there are hidden/undocumented ways in simdjson to support broken JSON, but for the reasons that I make clear in the article, I strongly discourage it. |
@lemire, thanks for sharing this insightful article! I will take more considerations and possibly have a discussion with Spark community to align on handling such malformed JSON. |
|
This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions! |


JSON string input can contain control characters, e.g., new line character
\n. It can becorrectly handled by Presto, but not allowed by simdjson lib. If it exists in input, simdjson
will return
UNESCAPED_CHARSerror, then Velox will output null. Simdjson only allows literal\n(represented by\\n). See its code link.Here is a test result with presto.
Just created one discussion thread in simdjson community: simdjson/simdjson#2287.