Integrate caching filesystem into json reader #20575

Merged
lnkuiper merged 9 commits into duckdb:v1.5-variegata from dentiny:hjiang/json-reader-cache-integration
Feb 2, 2026

@dentiny
Contributor

@dentiny dentiny commented Jan 18, 2026

This PR does two things:

  • Fix a bug in the caching filesystem: a read should be truncated when the requested number of bytes exceeds the file size
  • Integrate the caching filesystem into the JSON reader, so that remote JSON files can be cached as well
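
As a rough illustration of the first point, here is a minimal sketch of clamping a read request to the bytes that remain in a file. `clamp_read` is a hypothetical helper written for this example; it is not the function used in DuckDB's caching filesystem, and the real fix may be structured differently.

```python
def clamp_read(offset: int, requested: int, file_size: int) -> int:
    """Return how many bytes can actually be read starting at `offset`.

    Hypothetical helper for illustration only; not DuckDB's API.
    """
    if offset < 0 or requested < 0 or file_size < 0:
        raise ValueError("offset, requested and file_size must be non-negative")
    if offset >= file_size:
        # Nothing is left to read at or past the end of the file.
        return 0
    # Truncate the request so that it never runs past the end of the file.
    return min(requested, file_size - offset)


# A request larger than the file is truncated to the file size.
print(clamp_read(0, 4096, 1276))
# A request that fits inside the file is left unchanged.
print(clamp_read(100, 200, 1276))
```

The design choice here is to truncate rather than raise an error, which matches the behaviour described in the bullet above.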

@lnkuiper
Collaborator

Thanks for the PR! I've done a quick test and this works well :) Do you think the bugfix needs to be in the v1.5 release? If so, we should consider targeting v1.5-variegata. If not, we can just merge this into main and release this in v1.6.

@lnkuiper lnkuiper self-requested a review January 20, 2026 08:37
@dentiny dentiny force-pushed the hjiang/json-reader-cache-integration branch from 90a52f3 to 51920de Compare January 20, 2026 09:13
@dentiny dentiny force-pushed the hjiang/json-reader-cache-integration branch from 51920de to d5fe3f9 Compare January 20, 2026 09:15
@dentiny dentiny changed the base branch from main to v1.5-variegata January 20, 2026 09:16
@dentiny
Contributor Author

dentiny commented Jan 20, 2026

> Do you think the bugfix needs to be in the v1.5 release? If so, we should consider targeting v1.5-variegata. If not, we can just merge this into main and release this in v1.6.

Thanks for the quick reply!

I changed the base branch to v1.5-variegata:

  • The bug should be fixed in v1.5, since there could be extensions relying on the caching filesystem
  • Caching is a good feature for JSON reads

@dentiny dentiny marked this pull request as draft January 20, 2026 09:18
@dentiny dentiny marked this pull request as ready for review January 20, 2026 09:18
@duckdb-draftbot duckdb-draftbot marked this pull request as draft January 20, 2026 09:59
@Mytherin Mytherin marked this pull request as ready for review January 20, 2026 10:52
@lnkuiper
Collaborator

lnkuiper commented Jan 21, 2026

This test in httpfs tests how many GET requests are being done:

test/sql/json/table/internal_issue_6807.test_slow

Could you send a patch to fix this? @dentiny

EDIT: Looks like it now performs 1 less GET :D

@duckdb-draftbot duckdb-draftbot marked this pull request as draft January 22, 2026 09:56
@dentiny dentiny marked this pull request as ready for review January 22, 2026 09:56
@dentiny
Contributor Author

dentiny commented Jan 22, 2026

> Could you send a patch to fix this?

Definitely, two changes made:

  • I found that a thread-local file handle is also created, which is used to avoid TCP connection-level throttling; I made it a caching handle as well
  • Updated the expected results of the httpfs SQL test; I checked the HTTP requests and confirmed that the change does appear to be related to our caching

HTTP requests made with cached json reader:

memory D SELECT request FROM duckdb_logs_parsed('HTTP') WHERE request.type = 'GET';
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                                   request                                                                                    │
│                         struct("type" varchar, url varchar, start_time timestamp with time zone, duration_ms bigint, headers map(varchar, varchar))                          │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {                                                                                                                                                                            │
│   'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:34:31.026982+00', 'duration_ms': 40,                                   │
│   'headers': {Range='bytes=0-1048575', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'}                            │
│ }                                                                                                                                                                            │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {                                                                                                                                                                            │
│   'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:34:31.102178+00', 'duration_ms': 47,                                   │
│   'headers': {Range='bytes=1048576-3145727', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'}                      │
│ }                                                                                                                                                                            │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {                                                                                                                                                                            │
│   'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:34:31.371647+00', 'duration_ms': 74,                                   │
│   'headers': {Range='bytes=3145728-7340031', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'}                      │
│ }                                                                                                                                                                            │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {                                                                                                                                                                            │
│   'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:34:32.328241+00', 'duration_ms': 142,                                  │
│   'headers': {Range='bytes=7340032-15728639', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'}                     │
│ }                                                                                                                                                                            │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {                                                                                                                                                                            │
│   'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:34:32.927187+00', 'duration_ms': 320,                                  │
│   'headers': {Range='bytes=15728640-32505855', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'}                    │
│ }                                                                                                                                                                            │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {                                                                                                                                                                            │
│   'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:34:36.28581+00', 'duration_ms': 529,                                   │
│   'headers': {Range='bytes=32505856-66060287', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'}                    │
│ }                                                                                                                                                                            │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {                                                                                                                                                                            │
│   'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:34:38.340129+00', 'duration_ms': 499,                                  │
│   'headers': {Range='bytes=66060288-99614719', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'}                    │
│ }                                                                                                                                                                            │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {                                                                                                                                                                            │
│   'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:34:40.021854+00', 'duration_ms': 126,                                  │
│   'headers': {Range='bytes=99614720-105159016', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'}                   │
│ }                                                                                                                                                                            │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

HTTP requests made without cached json reader:

memory D SELECT request FROM duckdb_logs_parsed('HTTP') WHERE request.type = 'GET';
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                                   request                                                                                    │
│                         struct("type" varchar, url varchar, start_time timestamp with time zone, duration_ms bigint, headers map(varchar, varchar))                          │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {                                                                                                                                                                            │
│   'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:42:23.377027+00', 'duration_ms': 41,                                   │
│   'headers': {Range='bytes=0-1048575', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'}                            │
│ }                                                                                                                                                                            │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {                                                                                                                                                                            │
│   'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:42:23.432911+00', 'duration_ms': 46,                                   │
│   'headers': {Range='bytes=1048576-3145727', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'}                      │
│ }                                                                                                                                                                            │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {                                                                                                                                                                            │
│   'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:42:23.515931+00', 'duration_ms': 71,                                   │
│   'headers': {Range='bytes=3145728-7340031', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'}                      │
│ }                                                                                                                                                                            │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {                                                                                                                                                                            │
│   'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:42:24.034297+00', 'duration_ms': 127,                                  │
│   'headers': {Range='bytes=7340032-15728639', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'}                     │
│ }                                                                                                                                                                            │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {                                                                                                                                                                            │
│   'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:42:24.481478+00', 'duration_ms': 284,                                  │
│   'headers': {Range='bytes=15728640-32505855', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'}                    │
│ }                                                                                                                                                                            │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {                                                                                                                                                                            │
│   'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:42:24.800335+00', 'duration_ms': 232,                                  │
│   'headers': {Range='bytes=0-16777215', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'}                           │
│ }                                                                                                                                                                            │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {                                                                                                                                                                            │
│   'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:42:26.553262+00', 'duration_ms': 575,                                  │
│   'headers': {Range='bytes=16777216-50331647', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'}                    │
│ }                                                                                                                                                                            │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {                                                                                                                                                                            │
│   'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:42:28.204751+00', 'duration_ms': 508,                                  │
│   'headers': {Range='bytes=50331648-83886079', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'}                    │
│ }                                                                                                                                                                            │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {                                                                                                                                                                            │
│   'type': GET, 'url': 'https://data.gharchive.org/2023-02-08-0.json.gz', 'start_time': '2026-01-22 09:42:29.8826+00', 'duration_ms': 303,                                    │
│   'headers': {Range='bytes=83886080-105159016', Authorization=Bearersk_test_not_valid_key, User-Agent='duckdb/v1.5.0-dev5863(linux_arm64) cpp d5fe3f9287'}                   │
│ }                                                                                                                                                                            │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
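
To compare the two logs above, here is a small sketch that counts the GET requests and sums the bytes covered by their `Range` headers. The header values are copied from the two query results; the helper names (`range_length`, `summarize`) are made up for this example, and the script only does arithmetic on the logged ranges, so it says nothing about why the uncached reader issues the extra request.

```python
import re

# Matches a single-range request header value such as "bytes=0-1048575".
RANGE_RE = re.compile(r"bytes=(\d+)-(\d+)")


def range_length(header: str) -> int:
    """Number of bytes covered by one inclusive Range header value."""
    match = RANGE_RE.fullmatch(header)
    if match is None:
        raise ValueError(f"unsupported Range header: {header!r}")
    start, end = int(match.group(1)), int(match.group(2))
    return end - start + 1  # both ends are inclusive


def summarize(headers: list[str]) -> tuple[int, int]:
    """Return (number of requests, total bytes requested)."""
    return len(headers), sum(range_length(h) for h in headers)


# Range headers from the run with the cached JSON reader.
WITH_CACHE = [
    "bytes=0-1048575", "bytes=1048576-3145727", "bytes=3145728-7340031",
    "bytes=7340032-15728639", "bytes=15728640-32505855",
    "bytes=32505856-66060287", "bytes=66060288-99614719",
    "bytes=99614720-105159016",
]

# Range headers from the run without the cached JSON reader.
WITHOUT_CACHE = [
    "bytes=0-1048575", "bytes=1048576-3145727", "bytes=3145728-7340031",
    "bytes=7340032-15728639", "bytes=15728640-32505855",
    "bytes=0-16777215", "bytes=16777216-50331647",
    "bytes=50331648-83886079", "bytes=83886080-105159016",
]

print("with cache:   ", summarize(WITH_CACHE))
print("without cache:", summarize(WITHOUT_CACHE))
```

By this count, the cached run issues 8 GETs covering 105,159,017 bytes, each byte of the file exactly once. The uncached run issues 9 GETs covering 137,664,873 bytes, because the sixth request starts again at byte 0 and the first 32,505,856 bytes are requested a second time.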

@lnkuiper
Collaborator

Awesome, thanks!

@dentiny dentiny marked this pull request as draft January 22, 2026 12:07
@dentiny dentiny marked this pull request as ready for review January 22, 2026 12:07
@duckdb-draftbot duckdb-draftbot marked this pull request as draft January 22, 2026 12:13
@dentiny dentiny marked this pull request as ready for review January 22, 2026 17:54
@duckdb-draftbot duckdb-draftbot marked this pull request as draft January 24, 2026 12:40
@Mytherin Mytherin marked this pull request as ready for review January 24, 2026 12:42
@duckdb-draftbot duckdb-draftbot marked this pull request as draft January 24, 2026 13:49
@dentiny dentiny marked this pull request as ready for review January 24, 2026 13:50
@dentiny dentiny force-pushed the hjiang/json-reader-cache-integration branch from a8e04d2 to c085b2a Compare January 24, 2026 13:53
@duckdb-draftbot duckdb-draftbot marked this pull request as draft January 24, 2026 13:54
@dentiny dentiny marked this pull request as ready for review January 24, 2026 13:58
@dentiny
Contributor Author

dentiny commented Jan 25, 2026

The failed assertion comes from test/sql/storage/types/variant/append_shredded.test,
with the following error message and stack trace:

ABORT THROWN BY INTERNAL EXCEPTION: Assertion triggered in file "/home/runner/work/duckdb/duckdb/src/storage/compression/validity_uncompressed.cpp" on line 229: result_mask.RowIsValid(result_offset + i)

build/relassert/test/unittest(+0x14058db2) [0x55b2bc103db2]
<omit others>

This does not seem to be related to my change :(

@lnkuiper
Collaborator

There have been some issues in CI - nothing to worry about though. Could you merge with v1.5-variegata again and re-trigger CI?

@duckdb-draftbot duckdb-draftbot marked this pull request as draft January 29, 2026 08:37
@dentiny dentiny marked this pull request as ready for review January 29, 2026 08:38
@dentiny
Contributor Author

dentiny commented Jan 29, 2026

  1. /home/runner/work/duckdb/duckdb/build/release/_deps/httpfs_extension_fc-src/test/sql/logging/http_logging.test:25
    ================================================================================
    Wrong result in query! (/home/runner/work/duckdb/duckdb/build/release/_deps/httpfs_extension_fc-src/test/sql/logging/http_logging.test:25)!
    ================================================================================
    SELECT request.headers['Range'], response.headers['Content-Range']
    FROM duckdb_logs_parsed('HTTP')
    WHERE request.type='GET';
    ================================================================================
    Mismatch on row 1, column response.headers['Content-Range'](index 2)
    NULL <> bytes 0-1275/1276
    ================================================================================
    Expected result:
    ================================================================================
    bytes=0-1275 bytes 0-1275/1276
    ================================================================================
    Actual result:
    ================================================================================
    bytes=0-1275 NULL

The test fails on a CSV read, which I don't even touch in this PR. I will take a look.
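
For readers unfamiliar with the two header formats in the failing test, here is a minimal sketch of the relationship the test expects: the `Content-Range` response header should echo the byte range asked for in the `Range` request header. `content_range_matches` is a hypothetical helper for illustration; it is not part of DuckDB or the httpfs test suite, and it only handles single byte ranges.

```python
import re
from typing import Optional


def content_range_matches(range_header: str, content_range: Optional[str]) -> bool:
    """Check that a Content-Range value echoes the requested byte range.

    Hypothetical helper for illustration; handles single ranges only.
    """
    if content_range is None:
        # The failing run logged NULL here, i.e. no Content-Range was recorded.
        return False
    request = re.fullmatch(r"bytes=(\d+)-(\d+)", range_header)
    response = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+|\*)", content_range)
    if request is None or response is None:
        return False
    requested = (int(request.group(1)), int(request.group(2)))
    returned = (int(response.group(1)), int(response.group(2)))
    return requested == returned


# Expected result from the test: the response echoes the requested range.
print(content_range_matches("bytes=0-1275", "bytes 0-1275/1276"))
# Actual result from the failing run: Content-Range is missing.
print(content_range_matches("bytes=0-1275", None))
```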

@lnkuiper
Collaborator

lnkuiper commented Feb 2, 2026

This is unrelated, I've seen it break in other PRs. Thanks for all the changes, this can be merged :)

@lnkuiper lnkuiper merged commit a12b16c into duckdb:v1.5-variegata Feb 2, 2026
101 of 102 checks passed
krlmlr added a commit to krlmlr/duckdb-r that referenced this pull request Feb 28, 2026
Date: 2026-02-02 13:44:38 +0100

Integrate caching filesystem into json reader (duckdb/duckdb#20575)
[chore] Skip building spatial, needs a patch, needs to be reverted (duckdb/duckdb#20772)
Add back spatial to V1.5, patch ducklake (duckdb/duckdb#20734)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Feb 28, 2026
Date: 2026-02-02 13:44:38 +0100

Integrate caching filesystem into json reader (duckdb/duckdb#20575)
[chore] Skip building spatial, needs a patch, needs to be reverted (duckdb/duckdb#20772)
Add back spatial to V1.5, patch ducklake (duckdb/duckdb#20734)
3 participants