
Implement read_json and improve JSON parse errors #5992

Merged
Mytherin merged 36 commits into duckdb:master from lnkuiper:read_json
Feb 2, 2023
Conversation

@lnkuiper (Collaborator) commented Jan 25, 2023

This PR implements the read_json function, analogous to read_csv. This table function requires specifying the output column names and types, and cannot auto-detect yet (read_json_auto will be added in a future PR).

For example, the read_json / read_ndjson function can be used like so:

create temporary table lineitem as
select *
from read_ndjson('lineitem.json', 
                 columns={l_orderkey: 'INTEGER',
                          l_partkey: 'INTEGER',
                          l_suppkey: 'INTEGER',
                          l_linenumber: 'INTEGER',
                          l_quantity: 'INTEGER',
                          l_extendedprice: 'DECIMAL(15,2)',
                          l_discount: 'DECIMAL(15,2)',
                          l_tax: 'DECIMAL(15,2)',
                          l_returnflag: 'VARCHAR',
                          l_linestatus: 'VARCHAR',
                          l_shipdate: 'DATE',
                          l_commitdate: 'DATE',
                          l_receiptdate: 'DATE',
                          l_shipinstruct: 'VARCHAR',
                          l_shipmode: 'VARCHAR',
                          l_comment: 'VARCHAR'});

Here lineitem.json is a newline-delimited JSON file containing the TPC-H SF1 lineitem table.
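For context, newline-delimited JSON stores one complete object per line with no enclosing array, which is what lets each line be parsed independently. A minimal sketch of the format using Python's standard library (the field names follow the lineitem example above, but the values are made up):

```python
import json

# Two hypothetical lineitem-like records (values are illustrative, not real TPC-H data)
rows = [
    {"l_orderkey": 1, "l_quantity": 17, "l_shipdate": "1996-03-13"},
    {"l_orderkey": 1, "l_quantity": 36, "l_shipdate": "1996-04-12"},
]

# Newline-delimited JSON: one json.dumps() per line, '\n'-separated
ndjson = "\n".join(json.dumps(row) for row in rows) + "\n"

# Each line parses on its own, which is what makes parallel scans straightforward
parsed = [json.loads(line) for line in ndjson.splitlines()]
assert parsed == rows
```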

I've benchmarked this example query against read_csv with the same parameters. Since read_json is parallel, I enabled

SET experimental_parallel_csv=true;

to make the comparison fairer. Here are the results:

| read_csv | read_json |
|----------|-----------|
| 1.25s    | 1.26s     |

As we can see, the results are very close. The CSV file is ~0.7GB, while the JSON file is ~2.0GB, because each JSON record specifies the column names.

I've also implemented projection pushdown, and to benchmark this I ran TPC-H Q1 directly on the JSON/CSV file. Here are the results:

| CSV   | JSON  |
|-------|-------|
| 0.86s | 0.57s |

The CSV reader does not have projection pushdown (yet), so the JSON reader has the edge here. In the future, filter pushdown can be added to the JSON reader as well to save some more conversions.
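For reference, TPC-H Q1 (abridged slightly here) references only 7 of the 16 lineitem columns, which is exactly where projection pushdown pays off; in the benchmark the FROM clause was the read_ndjson call from the example above rather than the materialized table:

```sql
-- Q1 touches l_returnflag, l_linestatus, l_quantity, l_extendedprice,
-- l_discount, l_tax, and l_shipdate; with projection pushdown the JSON
-- reader never converts the other nine columns
select l_returnflag,
       l_linestatus,
       sum(l_quantity)                                       as sum_qty,
       sum(l_extendedprice * (1 - l_discount))               as sum_disc_price,
       sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
       count(*)                                              as count_order
from lineitem
where l_shipdate <= date '1998-09-02'
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;
```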

I've also improved error reporting when JSON parsing fails. This is easier for newline-delimited JSON, so there we get a bit more information in the error:

-- ndjson error
Invalid Input Error: Malformed JSON in file "data/json/unterminated_quotes.ndjson", line 3 at byte 26: unexpected end of data.
-- unstructured json error
Invalid Input Error: Malformed JSON in file "data/json/unterminated_quotes.ndjson" at byte 111: unexpected control character in string.

For newline-delimited JSON we get line number and byte information, and for unstructured JSON we get only the byte within the file where the error occurs. Getting accurate line number information is tricky when scans are parallel, but I think the implementation should cover this too.
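One way to recover a line number from a byte offset is to count the newlines that precede it; a sketch of the general idea, not the PR's actual implementation, with an illustrative buffer and offset:

```python
def line_of_byte(data: bytes, offset: int) -> int:
    # 1-based line number: count newlines strictly before the error offset
    return data[:offset].count(b"\n") + 1

# Illustrative ndjson buffer with an unterminated string on line 3
ndjson = b'{"a": 1}\n{"a": 2}\n{"a": "unterminated\n'
assert line_of_byte(ndjson, 26) == 3
```

In a parallel scan, each thread only sees its own block, so it would need the newline count of all preceding blocks before it can report an absolute line number; that bookkeeping is the tricky part mentioned above.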

Happy to receive feedback!

Edit/Update 1: I've implemented the json_keys function (#5522), and done some performance improvements (numbers above are updated)

Edit/Update 2: I've implemented schema detection in this PR as well. Here's a benchmark/comparison of reading lineitem SF1 as newline-delimited JSON with different systems:

| System  | Median Time [s] |
|---------|-----------------|
| duckdb  | 1.39            |
| pandas  | 94.87           |
| pyarrow | 1.98            |
| polars  | 1.96            |

For pandas/pyarrow/polars I've used the read_json/read_ndjson functions, and for duckdb I've used

create temporary table lineitem as select * from 'lineitem.json';

It's not a perfect comparison, but all of the approaches materialize the result fully, so it's reasonable. As a bonus, DuckDB is able to detect that some of the string values in lineitem.json are of type DATE, which the other systems do not.
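The DATE detection mentioned here comes down to checking whether every sampled string value parses under a date format; a simplified sketch of the idea (not DuckDB's actual detection code):

```python
from datetime import datetime

def all_parse_as_dates(values, fmt="%Y-%m-%d"):
    # Promote a candidate VARCHAR column to DATE only if every sampled value parses
    try:
        for v in values:
            datetime.strptime(v, fmt)
        return True
    except (ValueError, TypeError):
        return False

assert all_parse_as_dates(["1996-03-13", "1996-04-12"])      # detected as DATE
assert not all_parse_as_dates(["1996-03-13", "not a date"])  # stays VARCHAR
```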

Laurens Kuiper added 30 commits January 24, 2023 15:59
@Mytherin Mytherin merged commit cbf4343 into duckdb:master Feb 2, 2023
@Mytherin (Collaborator) commented Feb 2, 2023

Thanks for the fixes! Looks great

@sa1 commented Feb 13, 2023

Some questions (not sure if this PR is the wrong place to ask):

- For ndjson, can a malformed line be skipped?
- With read_json_auto, can a line that doesn't match the inferred schema be skipped or sent to another table?
- If a schema is provided, is there an attempt to coerce types for some JSON values?
- I don't see an example for nested JSON data; is specifying a nested schema possible?

I don't need these right now, but I've run into these issues while doing JSON ingestion in the past, so there may be some design choices around these questions.

@lnkuiper (Collaborator, Author) replied:

@sa1

- You can skip malformed JSON by passing ignore_errors=true.
- You will have to send it to another table yourself; we can't have conditional inserts into multiple tables.
- You can coerce types when supplying the columns=... parameter.
- Nested schemas are possible: we support mapping from JSON objects/arrays to DuckDB STRUCT/LIST.
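For illustration, the two parameters mentioned above can be combined; a sketch assuming a hypothetical events.json file whose records contain a nested user object:

```sql
-- ignore_errors=true skips malformed lines instead of aborting the scan;
-- a STRUCT type in columns=... maps a nested JSON object to a DuckDB STRUCT
select *
from read_ndjson('events.json',
                 ignore_errors=true,
                 columns={ts: 'TIMESTAMP',
                          user: 'STRUCT(id INTEGER, name VARCHAR)'});
```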
