-
Notifications
You must be signed in to change notification settings - Fork 8.3k
Output of toJSONString is not recognized as valid JSON #75017
Description
Company or project name
No response
Describe the unexpected behaviour
When a string contains ASCII characters whose code points are >=128, toJSONString returns a string that doesn't pass isValidJSON, and which raises an exception when trying to cast as ::JSON.
For example, the string unhex(hex(255)) is a single-character string, where that character has code point 255.
SELECT toJSONString(unhex(hex(255))) returns "�", which is a pair of quotes surrounding the raw character 0xFF.
SELECT toJSONString(unhex(hex(255))) | jq returns the same string, and exits successfully, so I want to believe this is valid JSON.
However, SELECT isValidJSON(toJSONString(unhex(hex(255)))) returns 0.
Also, SELECT toJSONString(unhex(hex(255)))::JSON raises an exception:
Code: 117. DB::Exception: Cannot parse JSON object here: "�": In scope SELECT CAST(toJSONString(unhex(hex(255))), 'JSON') SETTINGS enable_json_type = 1. (INCORRECT_DATA)
How to reproduce
$ clickhouse local --version
ClickHouse local version 24.12.3.47 (official build).
$ clickhouse local -q 'SELECT toJSONString(unhex(hex(255)))::JSON SETTINGS enable_json_type=1'
<ERROR, pasted below>
See also https://fiddle.clickhouse.com/2acc3321-c402-49dd-b203-cfbf468a38c2
Expected behavior
Either:
toJSONStringescapes the ASCII characters, so the string only contains valid ASCII characters < 128isValidJSONand::JSONaccept strings containing ASCII characters >=128
Error message and/or stacktrace
Code: 117. DB::Exception: Cannot parse JSON object here: "�": In scope SELECT CAST(toJSONString(unhex(hex(255))), 'JSON') SETTINGS enable_json_type = 1. (INCORRECT_DATA)
Additional context
I am working on a project where users provide key-value pairs, and the values can be arbitrary <=260 char strings. I receive these as arrays, so I am using
mapFromArrays(keys, values)::JSON
to convert them to JSON. However, this fails when a string contains these special characters, so I am unable to import the dataset and I cannot make progress.
I'm happy to implement a workaround that can sanitize these input strings, either manually escaping the special characters, or replacing them with nulls, or something else.