Implement read_json and improve JSON parse errors #5992

Mytherin merged 36 commits into duckdb:master
Conversation
Thanks for the fixes! Looks great

Some questions — not sure if this PR is the wrong place to ask. I don't require these right now, but I've run across these issues while performing JSON ingestion in the past, so there may be some design choices around these questions.
@sa1 |
This PR implements the `read_json` function, analogous to `read_csv`. This table function requires specifying the output column names and types, and cannot auto-detect them yet (`read_json_auto` will be added in a future PR).

For example, the `read_json`/`read_ndjson` function can be used like so:

Here `lineitem.json` is a newline-delimited JSON file containing the TPC-H SF1 lineitem table.

I've benchmarked this example query against `read_csv` with the same parameters. I enabled … to make the comparison fairer, as `read_json` is parallel. Here are the results:

| `read_csv` | `read_json` |
| --- | --- |
| … | … |

As we can see, the results are very close. The CSV file is ~0.7GB, while the JSON file is ~2.0GB, because each JSON record specifies the column names.
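For readers unfamiliar with the syntax, a call with an explicit schema might look like the sketch below. The column subset and types are illustrative (taken from the TPC-H lineitem schema), not necessarily the exact query benchmarked above:

```sql
-- Illustrative only: a read_ndjson call with explicit column names/types.
-- The column list here is an assumption, not the PR's exact example.
SELECT count(*)
FROM read_ndjson('lineitem.json',
                 columns={l_orderkey: 'INTEGER',
                          l_extendedprice: 'DOUBLE',
                          l_shipdate: 'DATE',
                          l_comment: 'VARCHAR'});
```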
I've also implemented projection pushdown. To benchmark this, I ran TPC-H Q1 straight on the JSON/CSV file. Here are the results:

The CSV reader does not have projection pushdown (yet), so the JSON reader has the edge here. In the future, filter pushdown can be added to the JSON reader as well to save some more conversions.
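To illustrate what projection pushdown buys here: Q1 only touches a handful of lineitem columns, so the reader can skip converting the rest. A simplified Q1-style sketch (not the full benchmark query, and the column list is an assumption):

```sql
-- Simplified Q1-style aggregation. With projection pushdown, only the columns
-- actually referenced by the query are parsed and converted, even if the full
-- lineitem schema were declared in the columns parameter.
SELECT l_returnflag,
       l_linestatus,
       sum(l_quantity)      AS sum_qty,
       avg(l_extendedprice) AS avg_price
FROM read_ndjson('lineitem.json',
                 columns={l_returnflag: 'VARCHAR',
                          l_linestatus: 'VARCHAR',
                          l_quantity: 'INTEGER',
                          l_extendedprice: 'DOUBLE'})
GROUP BY l_returnflag, l_linestatus;
```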
I've also added better error reporting when JSON parsing fails. This is easier for newline-delimited JSON, so we get a bit more information in our error:

For newline-delimited JSON we get the line number and byte offset, while for unstructured JSON we get only the byte offset within the file where the error occurs. Getting accurate line number information is tricky when scans are parallel, but I think the implementation should cover this too.
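As a rough illustration of the idea (a simplified, single-threaded sketch in Python — not DuckDB's actual implementation): for newline-delimited JSON, the failing record's line number and its byte position within the whole file can be recovered by tracking how many bytes precede each record.

```python
import json

def find_json_error(data: bytes):
    """Return (line_number, byte_offset) of the first malformed NDJSON record,
    or None if every record parses. Simplified single-threaded illustration."""
    offset = 0  # byte offset of the current record within the file
    for line_no, raw in enumerate(data.split(b"\n"), start=1):
        if raw:
            try:
                json.loads(raw)
            except json.JSONDecodeError as e:
                # byte position in the file = start of record + position inside it
                return line_no, offset + e.pos
        offset += len(raw) + 1  # +1 for the newline separator
    return None

print(find_json_error(b'{"a": 1}\n{"a": 2}\n{"a": oops}\n'))  # → (3, 24)
```

In a parallel scan, each worker would additionally need the newline count and byte offset of the start of its chunk to report file-global positions, which is where the tricky part mentioned above comes in.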
Happy to receive feedback!
Edit/Update 1: I've implemented the `json_keys` function (#5522) and made some performance improvements (the numbers above are updated).

Edit/Update 2: I've implemented schema detection in this PR as well. Here's a benchmark/comparison of reading lineitem SF1 as newline-delimited JSON with different systems:

For pandas/pyarrow/polars I've used their `read_json`/`read_ndjson` functions, and for duckdb I've used …

It's not a perfect comparison, but all of the approaches materialize the result fully, so it's reasonable. As a bonus, DuckDB is able to detect that some of the string values in `lineitem.json` are of type `DATE`, which the other systems fail to detect.
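With schema detection in place, the explicit column list becomes optional. Assuming the auto-detecting variant uses the `read_json_auto` name mentioned earlier in this description, usage would be along these lines:

```sql
-- Sketch: column names and types (including DATE columns) are inferred
-- from the file itself, so no columns parameter is needed.
SELECT * FROM read_json_auto('lineitem.json');
```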