Implement read_json and improve JSON parse errors #5992

Mytherin merged 36 commits into duckdb:master
Conversation
Thanks for the fixes! Looks great

Some questions — not sure if this PR is the wrong place to ask. I don't require these right now, but I've run across these issues while performing JSON ingestion in the past, so there may be some design choices around these questions.
@sa1 |
This PR implements the `read_json` function, analogous to `read_csv`. This table function requires specifying the output column names and types, and cannot auto-detect them yet (`read_json_auto` will be added in a future PR).

For example, the `read_json`/`read_ndjson` function can be used like so:

Here `lineitem.json` is a newline-delimited JSON file containing the TPC-H SF1 lineitem table.

I've benchmarked this example query against `read_csv` with the same parameters. I enabled … to make the comparison fairer, as `read_json` is parallel. Here are the results:

| `read_csv` | `read_json` |
| --- | --- |
| … | … |

As we can see, the results are very close. The CSV file is ~0.7GB, while the JSON file is ~2.0GB, because each JSON record specifies the column names.
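For readers unfamiliar with the syntax, a call with an explicit schema might look like the sketch below. The column subset and types are illustrative (taken from the TPC-H lineitem schema), not necessarily the exact query benchmarked above:

```sql
-- Illustrative only: a read_ndjson call with explicit column names/types.
-- The column list here is an assumption, not the PR's exact example.
SELECT count(*)
FROM read_ndjson('lineitem.json',
                 columns={l_orderkey: 'INTEGER',
                          l_extendedprice: 'DOUBLE',
                          l_shipdate: 'DATE',
                          l_comment: 'VARCHAR'});
```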
I've also implemented projection pushdown. To benchmark this, I ran TPC-H Q1 straight on the JSON/CSV file. Here are the results:

The CSV reader does not have projection pushdown (yet), so the JSON reader has the edge here. In the future, filter pushdown can be added to the JSON reader as well to save some more conversions.
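To illustrate what projection pushdown buys here: Q1 only touches a handful of lineitem columns, so the reader can skip converting the rest. A simplified Q1-style sketch (not the full benchmark query, and the column list is an assumption):

```sql
-- Simplified Q1-style aggregation. With projection pushdown, only the columns
-- actually referenced by the query are parsed and converted, even if the full
-- lineitem schema were declared in the columns parameter.
SELECT l_returnflag,
       l_linestatus,
       sum(l_quantity)      AS sum_qty,
       avg(l_extendedprice) AS avg_price
FROM read_ndjson('lineitem.json',
                 columns={l_returnflag: 'VARCHAR',
                          l_linestatus: 'VARCHAR',
                          l_quantity: 'INTEGER',
                          l_extendedprice: 'DOUBLE'})
GROUP BY l_returnflag, l_linestatus;
```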
I've also added better error reporting when JSON parsing fails. This is easier for newline-delimited JSON, so we get a bit more information in our error:

For newline-delimited JSON we get the line number and byte offset, while for unstructured JSON we get only the byte offset within the file where the error occurs. Getting accurate line number information is tricky when scans are parallel, but I think the implementation should cover this too.
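As a rough illustration of the idea (a simplified, single-threaded sketch in Python — not DuckDB's actual implementation): for newline-delimited JSON, the failing record's line number and its byte position within the whole file can be recovered by tracking how many bytes precede each record.

```python
import json

def find_json_error(data: bytes):
    """Return (line_number, byte_offset) of the first malformed NDJSON record,
    or None if every record parses. Simplified single-threaded illustration."""
    offset = 0  # byte offset of the current record within the file
    for line_no, raw in enumerate(data.split(b"\n"), start=1):
        if raw:
            try:
                json.loads(raw)
            except json.JSONDecodeError as e:
                # byte position in the file = start of record + position inside it
                return line_no, offset + e.pos
        offset += len(raw) + 1  # +1 for the newline separator
    return None

print(find_json_error(b'{"a": 1}\n{"a": 2}\n{"a": oops}\n'))  # → (3, 24)
```

In a parallel scan, each worker would additionally need the newline count and byte offset of the start of its chunk to report file-global positions, which is where the tricky part mentioned above comes in.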
Happy to receive feedback!
Edit/Update 1: I've implemented the `json_keys` function (#5522) and made some performance improvements (the numbers above are updated).

Edit/Update 2: I've implemented schema detection in this PR as well. Here's a benchmark/comparison of reading lineitem SF1 as newline-delimited JSON with different systems:

For pandas/pyarrow/polars I've used their `read_json`/`read_ndjson` functions, and for duckdb I've used …

It's not a perfect comparison, but all of the approaches materialize the result fully, so it's reasonable. As a bonus, DuckDB is able to detect that some of the string values in `lineitem.json` are of type `DATE`, which the other systems fail to detect.
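With schema detection in place, the explicit column list becomes optional. Assuming the auto-detecting variant uses the `read_json_auto` name mentioned earlier in this description, usage would be along these lines:

```sql
-- Sketch: column names and types (including DATE columns) are inferred
-- from the file itself, so no columns parameter is needed.
SELECT * FROM read_json_auto('lineitem.json');
```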