Integrate caching filesystem into json reader #20575
lnkuiper merged 9 commits into duckdb:v1.5-variegata
Thanks for the PR! I've done a quick test and this works well :) Do you think the bugfix needs to be in the v1.5 release? If so, we should consider targeting
90a52f3 to 51920de
51920de to d5fe3f9
Thanks for the quick reply! I've changed the base branch to v1.5.
This test in httpfs checks how many GET requests are made. Could you send a patch to fix this? @dentiny
EDIT: Looks like it now performs one fewer GET :D
Definitely, two changes made:
HTTP requests made with cached json reader:
memory D SELECT request FROM duckdb_logs_parsed('HTTP') WHERE request.type = 'GET';
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ request │
│ struct("type" varchar, url varchar, start_time timestamp with time zone, duration_ms bigint, headers map(varchar, varchar)) │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ { │
│ 'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:34:31.026982+00', 'duration_ms': 40, │
│ 'headers': {Range='bytes=0-1048575', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'} │
│ } │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ { │
│ 'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:34:31.102178+00', 'duration_ms': 47, │
│ 'headers': {Range='bytes=1048576-3145727', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'} │
│ } │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ { │
│ 'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:34:31.371647+00', 'duration_ms': 74, │
│ 'headers': {Range='bytes=3145728-7340031', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'} │
│ } │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ { │
│ 'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:34:32.328241+00', 'duration_ms': 142, │
│ 'headers': {Range='bytes=7340032-15728639', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'} │
│ } │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ { │
│ 'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:34:32.927187+00', 'duration_ms': 320, │
│ 'headers': {Range='bytes=15728640-32505855', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'} │
│ } │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ { │
│ 'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:34:36.28581+00', 'duration_ms': 529, │
│ 'headers': {Range='bytes=32505856-66060287', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'} │
│ } │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ { │
│ 'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:34:38.340129+00', 'duration_ms': 499, │
│ 'headers': {Range='bytes=66060288-99614719', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'} │
│ } │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ { │
│ 'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:34:40.021854+00', 'duration_ms': 126, │
│ 'headers': {Range='bytes=99614720-105159016', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'} │
│ } │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘HTTP requests made without cached json reader: memory D SELECT request FROM duckdb_logs_parsed('HTTP') WHERE request.type = 'GET';
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ request │
│ struct("type" varchar, url varchar, start_time timestamp with time zone, duration_ms bigint, headers map(varchar, varchar)) │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ { │
│ 'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:42:23.377027+00', 'duration_ms': 41, │
│ 'headers': {Range='bytes=0-1048575', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'} │
│ } │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ { │
│ 'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:42:23.432911+00', 'duration_ms': 46, │
│ 'headers': {Range='bytes=1048576-3145727', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'} │
│ } │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ { │
│ 'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:42:23.515931+00', 'duration_ms': 71, │
│ 'headers': {Range='bytes=3145728-7340031', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'} │
│ } │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ { │
│ 'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:42:24.034297+00', 'duration_ms': 127, │
│ 'headers': {Range='bytes=7340032-15728639', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'} │
│ } │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ { │
│ 'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:42:24.481478+00', 'duration_ms': 284, │
│ 'headers': {Range='bytes=15728640-32505855', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'} │
│ } │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ { │
│ 'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:42:24.800335+00', 'duration_ms': 232, │
│ 'headers': {Range='bytes=0-16777215', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'} │
│ } │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ { │
│ 'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:42:26.553262+00', 'duration_ms': 575, │
│ 'headers': {Range='bytes=16777216-50331647', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'} │
│ } │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ { │
│ 'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:42:28.204751+00', 'duration_ms': 508, │
│ 'headers': {Range='bytes=50331648-83886079', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'} │
│ } │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ { │
│ 'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:42:29.8826+00', 'duration_ms': 303, │
│ 'headers': {Range='bytes=83886080-105159016', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'} │
│ } │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
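The difference between the two logs is easier to see when the `Range` headers are added up. The following is a minimal Python sketch, not part of the PR: the helper names `parse_range` and `summarize` are made up for illustration, and the header values are copied by hand from the two tables above. It counts the requests, sums the bytes requested, and works out how many of those bytes had already been fetched by an earlier request.

```python
import re

# Range header values copied from the log output above.
WITH_CACHE = [
    "bytes=0-1048575", "bytes=1048576-3145727", "bytes=3145728-7340031",
    "bytes=7340032-15728639", "bytes=15728640-32505855",
    "bytes=32505856-66060287", "bytes=66060288-99614719",
    "bytes=99614720-105159016",
]
WITHOUT_CACHE = [
    "bytes=0-1048575", "bytes=1048576-3145727", "bytes=3145728-7340031",
    "bytes=7340032-15728639", "bytes=15728640-32505855",
    "bytes=0-16777215", "bytes=16777216-50331647",
    "bytes=50331648-83886079", "bytes=83886080-105159016",
]


def parse_range(header: str) -> tuple[int, int]:
    """Parse 'bytes=<first>-<last>' into inclusive byte offsets."""
    match = re.fullmatch(r"bytes=(\d+)-(\d+)", header)
    if match is None:
        raise ValueError(f"unsupported Range header: {header!r}")
    return int(match.group(1)), int(match.group(2))


def summarize(headers: list[str]) -> dict[str, int]:
    """Count requests, bytes requested, distinct bytes, and re-fetched bytes."""
    spans = [parse_range(h) for h in headers]
    requested = sum(last - first + 1 for first, last in spans)

    # Sweep the spans in offset order and count each byte only once.
    distinct = 0
    covered_until = -1  # highest byte offset already counted
    for first, last in sorted(spans):
        if last > covered_until:
            distinct += last - max(first, covered_until + 1) + 1
            covered_until = last

    return {
        "requests": len(spans),
        "requested": requested,
        "distinct": distinct,
        "refetched": requested - distinct,
    }


if __name__ == "__main__":
    print("with cache:   ", summarize(WITH_CACHE))
    print("without cache:", summarize(WITHOUT_CACHE))
```

For these logs, the cached run makes 8 requests that cover bytes 0 to 105159016 exactly once (105,159,017 bytes). The uncached run makes 9 requests totalling 137,664,873 bytes, because after the fifth request it starts again from byte 0 and fetches the first 32,505,856 bytes a second time.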
Awesome, thanks! |
a8e04d2 to c085b2a
The failed assertion comes from:
ABORT THROWN BY INTERNAL EXCEPTION: Assertion triggered in file "/home/runner/work/duckdb/duckdb/src/storage/compression/validity_uncompressed.cpp" on line 229: result_mask.RowIsValid(result_offset + i)
build/relassert/test/unittest(+0x14058db2) [0x55b2bc103db2]
<omit others>
It doesn't seem to be related to my change :(
There have been some issues in CI - nothing to worry about though. Could you merge with
…on-reader-cache-integration
The test fails on a CSV read, which I don't even touch in this PR. I will take a look.
This is unrelated, I've seen it break in other PRs. Thanks for all the changes, this can be merged :) |
Date: 2026-02-02 13:44:38 +0100
Integrate caching filesystem into json reader (duckdb/duckdb#20575)
[chore] Skip building spatial, needs a patch, needs to be reverted (duckdb/duckdb#20772)
Add back spatial to V1.5, patch ducklake (duckdb/duckdb#20734)
This PR does two things: