Speed up JSON schema inference by ~2.8x #9494

Open
Rafferty97 wants to merge 10 commits into apache:main from Rafferty97:json-schema

Conversation

@Rafferty97
Contributor

@Rafferty97 Rafferty97 commented Feb 28, 2026

Which issue does this PR close?

This PR fixes #9484, and also lays the groundwork for implementing #9482. It also delivers an approximately 2.8x speedup in JSON schema inference.

I have refactored the code that infers the schema of JSON sources, specifically:

  • Simplified the type inference logic, removing special cases
  • Schema inference now consumes TapeDecoder directly, eliminating the need to first materialise rows into serde_json::Values
  • Used arena allocation for efficiency
  • Removed scalar-to-array coercion, as the actual JSON reader doesn't support it
  • Moved ValueIter into its own module
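The core idea of inference by type unification can be sketched in isolation. The following is an illustrative stand-alone sketch, not the PR's actual code: `JsonType` and `merge` here are simplified hypothetical stand-ins, and the error string mirrors the "Expected {expected}, found {got}" template described later in this PR.

```rust
// Hypothetical sketch: unify the types observed for one field across records.
#[derive(Debug, Clone, PartialEq)]
enum JsonType {
    Null,
    Bool,
    Int,
    Float,
    Str,
}

/// Merge two observed types for the same field.
/// Null unifies with anything; Int and Float unify to Float;
/// any other mismatch is an inference error.
fn merge(a: JsonType, b: JsonType) -> Result<JsonType, String> {
    use JsonType::*;
    match (a, b) {
        (Null, t) | (t, Null) => Ok(t),
        (Int, Float) | (Float, Int) => Ok(Float),
        (x, y) if x == y => Ok(x),
        (x, y) => Err(format!("Expected {:?}, found {:?}", x, y)),
    }
}

fn main() {
    // Observing 1 then 2.5 for a field infers Float.
    assert_eq!(merge(JsonType::Int, JsonType::Float), Ok(JsonType::Float));
    // A string mixed with a number is an inference error.
    assert!(merge(JsonType::Str, JsonType::Int).is_err());
}
```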

Rationale for this change

While working on #9482, I saw a need and opportunity to refactor the schema inference code for JSON schemas. I also discovered the bug detailed in #9484.

These changes not only make the code more readable and predictable by eliminating a lot of special-case handling, but also make it trivial to create a new inference function for "single field" JSON reading.

They have also provided a significant performance boost to the schema inference functions. I added a simple benchmark for infer_json_schema, which yielded the following results on my machine, reflecting an approximately 2.8x speedup:

Before changes:
infer_json_schema/1000 time: [1.4443 ms 1.4616 ms 1.4793 ms]
thrpt: [85.336 MiB/s 86.366 MiB/s 87.401 MiB/s]

After changes:
infer_json_schema/1000 time: [517.79 µs 519.10 µs 520.54 µs]
thrpt: [242.51 MiB/s 243.18 MiB/s 243.80 MiB/s]
change:
time: [−64.919% −64.485% −64.043%] (p = 0.00 < 0.05)
thrpt: [+178.11% +181.57% +185.06%]

What changes are included in this PR?

At a glance:

  • An overhaul of arrow-json/src/reader/schema.rs
  • Removed mixed_arrays.json as it's no longer valid, and replaced mixed_arrays.json.gz with arrays.json.gz
  • Added a dependency on Bumpalo for arena allocation

Because this is a somewhat sizeable PR, I've done my best to break it into a logical sequence of commits to hopefully assist with the review.

Are these changes tested?

Yes, the changes pass all existing unit tests - except for one intentionally removed due to the change in behaviour related to #9484 (removing scalar-to-array promotion).

I have also added an additional benchmark for the schema inference performance.

Are there any user-facing changes?

There are no API changes, except for the addition of the record_count method on ValueIter.

However, the error messages returned by infer_json_schema and its cousins will significantly change, with most of them condensed to a single "Expected {expected}, found {got}" template.

Finally, some files that used to generate a valid schema will now return errors. However, this is desirable because those files would have failed to be read by the actual JSON reader anyway - due to the lack of support for scalar-to-array promotion in the JSON reader. (See #9484)
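As a hypothetical illustration (not taken from the repository's test files), consider an NDJSON input where a field is sometimes a scalar and sometimes an array:

```json
{"a": 1}
{"a": [2, 3]}
```

Previously, inference would coerce `a` to a list type and succeed, only for the reader to fail on the scalar row; with this change, inference itself reports the mismatch up front.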

@github-actions github-actions bot added the arrow Changes to the arrow crate label Feb 28, 2026
@Rafferty97 Rafferty97 changed the title Refactor and improve performance of JSON schema inference Speed up JSON schema inference by ~2.8x Mar 2, 2026
Dandandan pushed a commit that referenced this pull request Mar 13, 2026
# Which issue does this PR close?

Split out from #9494 to make review easier. It simply adds a benchmark
for JSON schema inference.

# Rationale for this change

I have an open PR that significantly refactors the JSON schema inference
code, so I want confidence that not only is the new code correct, but
also has better performance than the existing code.

# What changes are included in this PR?

Adds a benchmark.

# Are these changes tested?

N/A

# Are there any user-facing changes?

No
alamb pushed a commit that referenced this pull request Mar 18, 2026
…ion (#9557)

# Which issue does this PR close?

Another smaller PR extracted from #9494.

# Rationale for this change

I've moved `ValueIter` into its own module because it's already
self-contained, and because that will make it easier to review the
changes I have made to `arrow-json/src/reader/schema.rs`.

I've also added a public `record_count` function to `ValueIter` - which
can be used to simplify consuming code in Datafusion which is currently
tracking it separately.

# What changes are included in this PR?

* Moved `ValueIter` into own module
* Added `record_count` method to `ValueIter`

# Are these changes tested?

Yes.

# Are there any user-facing changes?

Addition of one new public method, `ValueIter::record_count`.
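The idea behind `record_count` can be sketched with a generic counting wrapper. This is a hypothetical illustration of the concept, not the actual `ValueIter` implementation; `CountingIter` is an invented name.

```rust
// Hypothetical sketch: an iterator wrapper that counts the records it has
// yielded, so callers no longer need to track the count separately.
struct CountingIter<I> {
    inner: I,
    count: usize,
}

impl<I: Iterator> CountingIter<I> {
    fn new(inner: I) -> Self {
        Self { inner, count: 0 }
    }

    /// Number of records yielded so far.
    fn record_count(&self) -> usize {
        self.count
    }
}

impl<I: Iterator> Iterator for CountingIter<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<Self::Item> {
        let item = self.inner.next();
        if item.is_some() {
            self.count += 1;
        }
        item
    }
}

fn main() {
    let mut it = CountingIter::new(["{}", "{}", "{}"].into_iter());
    // Consume two records, then ask the iterator how many it has yielded.
    it.by_ref().take(2).for_each(drop);
    assert_eq!(it.record_count(), 2);
}
```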
@alamb
Contributor

alamb commented Mar 20, 2026

@Rafferty97, can you please merge up this PR to resolve the conflicts and then we can run the benchmarks again to confirm the results

@Rafferty97
Contributor Author

@Rafferty97, can you please merge up this PR to resolve the conflicts and then we can run the benchmarks again to confirm the results

Done :)


@alamb
Contributor

alamb commented Mar 20, 2026

infer_json_schema/1000 1.00 733.2±1.71µs 172.2 MB/sec 2.11 1547.1±11.79µs 81.6 MB/sec

That is certainly a nice result ❤️

Contributor

@alamb alamb left a comment


Thanks again for this @Rafferty97 and for your patience

I took a look at the PR. My major comments are:

  1. Can you please document the design / rationale (and why are there LazyLocks being used for what seem to be very small enums)
  2. Can you ensure the behavior is the same as the existing code?

If we want to change the inference behavior I recommend proposing those changes in a separate PR so that we can evaluate the potential impact.

@@ -1,4 +0,0 @@
{"a":1, "b":[2.0, 1.3, -6.1], "c":[false, true], "d":4.1}
Contributor


What is the purpose of removing this file?

Contributor Author


This file was used by tests related to an inference rule that coerces a mix of scalar and array values into an array type. I've removed this rule because the JSON reader can't actually do this coercion, so I figured it was better to error out instead.

I could reinstate these files and test that they cause schema inference to fail, but I'm unsure how useful that would actually be.

Contributor


Given this file is so small (133 bytes), can you please unzip it to make the contents more explicit and easier to review and track changes

Contributor Author


The contents are identical to arrays.json. I was following the pattern set by mixed_arrays.json(.gz). I needed to create this file for tests that previously used mixed_arrays.json(.gz) which were deleted. Those files were deleted because they aren't readable by the JSON reader - they rely on coercion semantics that no longer exist.

assert_eq!(small_field.data_type(), &DataType::Float64);
}

#[test]
Contributor


why is this test removed?

Contributor Author


These tests pertain to coercion logic that I've removed, due to them being inconsistent with the JSON reader, which is incapable of doing these coercions.

}

/// The type of a JSON value
pub enum JsonType {
Contributor


I found it strange that the Json type and tape value are now in the infer module -- they seem more widely applicable than just for schema inference

Contributor Author


That's fair. I've moved them to a separate module within schema for better code organisation.

@Rafferty97
Contributor Author

Thanks again for this @Rafferty97 and for your patience

I took a look at the PR. My major comments are:

  1. Can you please document the design / rationale (and why are there LazyLocks being used for what seem to be very small enums)
  2. Can you ensure the behavior is the same as the existing code?

If we want to change the inference behavior I recommend proposing those changes in a separate PR so that we can evaluate the potential impact.

Hi @alamb, thank you for taking a look over the PR and for the detailed feedback.

The LazyLocks are an optimisation to avoid allocating a bunch of identical Arcs for the primitive types. You're right that this warrants some explanatory comments.

The behaviour intentionally diverges from the existing code, because the existing code would perform coercions that the actual JSON reader itself doesn't do. So, when such a JSON file is encountered, the previous code would infer successfully but the actual reading into record batches would fail. This new code would return an error at inference time, which I think is more useful and less surprising to the end user.
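The LazyLock pattern described above can be illustrated with a minimal sketch. This is not the PR's actual code: `TypeNode` and `infer_int` are hypothetical names, standing in for whatever small type-description values the inference code shares.

```rust
use std::sync::{Arc, LazyLock};

// Hypothetical stand-in for a small type-description value that
// inference would otherwise allocate once per field.
#[derive(Debug, PartialEq)]
#[allow(dead_code)]
enum TypeNode {
    Int64,
    Float64,
}

// One shared Arc for the primitive type, created lazily on first use.
static INT64: LazyLock<Arc<TypeNode>> = LazyLock::new(|| Arc::new(TypeNode::Int64));

fn infer_int() -> Arc<TypeNode> {
    // Cloning an Arc only bumps a reference count; no new allocation.
    Arc::clone(&INT64)
}

fn main() {
    let a = infer_int();
    let b = infer_int();
    // Both handles point at the same shared allocation.
    assert!(Arc::ptr_eq(&a, &b));
}
```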

@Rafferty97
Contributor Author

@alamb This one's ready for review again :)

I've removed bumpalo and just used Arcs instead. I've also cleaned up the use of lazy locks and added comments where appropriate.

If you're concerned about the removal of the scalar-to-array coercion logic, I can add it back in, but I think it's better not to unless we plan to implement that coercion logic in the JSON reader itself.


@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4173202098-654-4hztd 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing json-schema (469b437) to 322f9ce (merge-base) diff
BENCH_NAME=json_reader
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench json_reader
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner


@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)


group                                        json-schema                            main
-----                                        -----------                            ----
decode_binary_hex_json                       1.00     13.7±0.16ms        ? ?/sec    1.01     13.8±0.17ms        ? ?/sec
decode_binary_view_hex_json                  1.01     14.5±0.08ms        ? ?/sec    1.00     14.4±0.13ms        ? ?/sec
decode_fixed_binary_hex_json                 1.01     14.0±0.08ms        ? ?/sec    1.00     13.9±0.13ms        ? ?/sec
decode_list_long_i64_json/131072             1.00    305.3±2.05ms   256.5 MB/sec    1.01    309.2±2.13ms   253.3 MB/sec
decode_list_long_i64_serialize               1.00    189.6±5.84ms        ? ?/sec    1.02    193.5±6.16ms        ? ?/sec
decode_list_short_i64_json/131072            1.00     19.8±0.04ms   264.1 MB/sec    1.01     19.9±0.03ms   262.0 MB/sec
decode_list_short_i64_serialize              1.00     11.6±0.61ms        ? ?/sec    1.02     11.9±0.73ms        ? ?/sec
decode_wide_object_i64_json                  1.00    470.6±7.98ms        ? ?/sec    1.02    481.6±6.34ms        ? ?/sec
decode_wide_object_i64_serialize             1.00   441.5±16.55ms        ? ?/sec    1.01   445.0±15.87ms        ? ?/sec
decode_wide_projection_full_json/131072      1.00   782.7±10.98ms   222.3 MB/sec    1.02   797.8±10.18ms   218.1 MB/sec
decode_wide_projection_narrow_json/131072    1.01    450.2±2.28ms   386.5 MB/sec    1.00    446.2±2.47ms   390.0 MB/sec
infer_json_schema/1000                       1.00    780.4±2.35µs   161.7 MB/sec    1.99  1551.3±11.73µs    81.4 MB/sec
large_bench_primitive                        1.00   1527.9±2.28µs        ? ?/sec    1.00   1530.6±3.86µs        ? ?/sec
small_bench_list                             1.01      8.1±0.02µs        ? ?/sec    1.00      8.0±0.01µs        ? ?/sec
small_bench_primitive                        1.01      4.5±0.01µs        ? ?/sec    1.00      4.4±0.01µs        ? ?/sec
small_bench_primitive_with_utf8view          1.02      4.5±0.02µs        ? ?/sec    1.00      4.5±0.01µs        ? ?/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 306.5s
Peak memory 3.5 GiB
Avg memory 2.9 GiB
CPU user 288.9s
CPU sys 17.4s
Disk read 4.0 KiB
Disk write 617.7 MiB

branch

Metric Value
Wall time 308.1s
Peak memory 3.5 GiB
Avg memory 2.9 GiB
CPU user 291.1s
CPU sys 16.9s
Disk read 0 B
Disk write 980.0 KiB


@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)


group                                        json-schema                            main
-----                                        -----------                            ----
decode_binary_hex_json                       1.00     13.8±0.02ms        ? ?/sec    1.01     14.0±0.03ms        ? ?/sec
decode_binary_view_hex_json                  1.01     14.4±0.03ms        ? ?/sec    1.00     14.3±0.02ms        ? ?/sec
decode_fixed_binary_hex_json                 1.02     14.1±0.02ms        ? ?/sec    1.00     13.9±0.03ms        ? ?/sec
decode_list_long_i64_json/131072             1.00    303.4±0.33ms   258.1 MB/sec    1.01    307.6±0.45ms   254.5 MB/sec
decode_list_long_i64_serialize               1.00    184.7±4.69ms        ? ?/sec    1.02    189.2±4.45ms        ? ?/sec
decode_list_short_i64_json/131072            1.00     19.8±0.04ms   264.0 MB/sec    1.01     19.9±0.02ms   262.2 MB/sec
decode_list_short_i64_serialize              1.00     11.2±0.17ms        ? ?/sec    1.03     11.6±0.17ms        ? ?/sec
decode_wide_object_i64_json                  1.00    466.4±5.38ms        ? ?/sec    1.03    480.0±4.39ms        ? ?/sec
decode_wide_object_i64_serialize             1.01   432.7±13.65ms        ? ?/sec    1.00   428.8±13.08ms        ? ?/sec
decode_wide_projection_full_json/131072      1.07   860.3±99.44ms   202.2 MB/sec    1.00    801.0±6.84ms   217.2 MB/sec
decode_wide_projection_narrow_json/131072    1.01    449.4±0.30ms   387.2 MB/sec    1.00    444.6±0.38ms   391.4 MB/sec
infer_json_schema/1000                       1.00    781.5±3.03µs   161.5 MB/sec    2.01  1572.4±33.64µs    80.3 MB/sec
large_bench_primitive                        1.00   1528.0±2.39µs        ? ?/sec    1.01   1539.4±5.47µs        ? ?/sec
small_bench_list                             1.00      8.0±0.02µs        ? ?/sec    1.02      8.1±0.05µs        ? ?/sec
small_bench_primitive                        1.00      4.5±0.03µs        ? ?/sec    1.00      4.5±0.03µs        ? ?/sec
small_bench_primitive_with_utf8view          1.00      4.5±0.02µs        ? ?/sec    1.00      4.5±0.01µs        ? ?/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 310.5s
Peak memory 3.5 GiB
Avg memory 2.9 GiB
CPU user 292.2s
CPU sys 18.1s
Disk read 0 B
Disk write 1.5 GiB

branch

Metric Value
Wall time 313.4s
Peak memory 3.5 GiB
Avg memory 2.9 GiB
CPU user 296.3s
CPU sys 16.9s
Disk read 0 B
Disk write 2.0 MiB


Development

Successfully merging this pull request may close these issues.

JSON reader doesn't support scalar-to-list promotion, even though schema inference does
