
GEOMETRY Rework: Part 6 - Geometry "Shredding"#20281

Merged
Mytherin merged 31 commits into duckdb:v1.5-variegata from Maxxen:core-geom-step-5-shredding
Jan 15, 2026

Conversation


@Maxxen Maxxen commented Dec 22, 2025

This is a follow-up PR that builds on top of #19848 (although orthogonal to #20143). Please have a look at #19136 for the context behind this PR.

This PR enables support for "shredding" geometry columns in our storage format by internally decomposing the geometries into multiple separate column segments when all the rows within a row-group are of the same geometry sub-type. For example, if a row group only contains POINT (XY) geometries, the column segment for that row-group gets rewritten and stored internally as STRUCT(x DOUBLE, y DOUBLE) instead of BLOB when checkpointing. This encoding is similar to how GeoArrow encodes typed geometries.
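As a rough illustration of what "shredding" means mechanically, here is a minimal Python sketch that decomposes a column of uniform point blobs into separate coordinate arrays; the blob layout here is a stand-in for illustration only, not DuckDB's actual serialization format:

```python
import struct

# A stand-in "blob" encoding for an XY point: two little-endian doubles.
def encode_point_blob(x, y):
    return struct.pack("<dd", x, y)

# "Shred" a column of point blobs into STRUCT(x DOUBLE, y DOUBLE)-style
# columnar storage: one flat array per coordinate axis.
def shred_points(blobs):
    xs, ys = [], []
    for blob in blobs:
        x, y = struct.unpack("<dd", blob)
        xs.append(x)
        ys.append(y)
    return {"x": xs, "y": ys}

blobs = [encode_point_blob(1.0, 2.0), encode_point_blob(3.0, 4.0)]
shredded = shred_points(blobs)
print(shredded["x"])  # [1.0, 3.0]
```

Each resulting array of doubles can then be stored and compressed as an ordinary primitive column segment.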

The major benefit of "decomposing" or "shredding" geometries from blobs into columnar lists and structs of doubles like this is that, at the storage level, they compress significantly better in the decomposed form. By decomposing the geometries into column segments of primitive types we automatically benefit from DuckDB's existing (and future!) best-in-class adaptive compression algorithms for each type of component. E.g. the coordinate vertices get ALP-compressed, and polygon ring-offsets get RLE/delta/dictionary-encoded automatically based on the distribution of the data.
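To see why the decomposed components compress so well, consider polygon ring offsets: they grow monotonically, so a delta pass turns them into a highly repetitive stream that run-length encoding then collapses. This is a conceptual sketch of that effect, not DuckDB's actual encoders:

```python
# Ring offsets grow monotonically; delta-encoding turns them into a
# stream of small, often-repeated increments that RLE handles well.
def delta_encode(offsets):
    return [offsets[0]] + [b - a for a, b in zip(offsets, offsets[1:])]

# Collapse repeated values into [value, run_length] pairs.
def run_length_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

offsets = [0, 5, 10, 15, 20, 25]   # six rings of 5 vertices each
deltas = delta_encode(offsets)     # [0, 5, 5, 5, 5, 5]
print(run_length_encode(deltas))   # [[0, 1], [5, 5]]
```

In the blob form this regularity is interleaved with coordinates and headers, so general-purpose compressors see far less of it.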

In my (limited) testing, a fully shredded column is about half the size of the blob equivalent, i.e. shredding results in a 2x compression ratio. This means that you get much faster reads from disk and will be able to keep more data in memory (as we now also compress in-memory segments).

There are currently three caveats:

  • We don't shred segments that contain GEOMETRYCOLLECTIONs, as they can be recursive and don't have a fixed layout.
  • We only shred if the entire row-group is of a single geometry sub-type. So if you have 10000 points, insert a linestring, and checkpoint, the column segment will revert to physical type BLOB.
  • We don't shred segments containing EMPTY geometries, either at the root or in sub-geometries.
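Taken together, the caveats above amount to a simple eligibility check at checkpoint time, which could be sketched like this (the function and the string tags are hypothetical, not DuckDB's actual types):

```python
def can_shred(row_group_types):
    """Decide whether a row-group's geometry segment may be shredded.

    Shred only when every row has the same sub-type, and that sub-type
    is neither a GEOMETRYCOLLECTION (recursive, no fixed layout) nor an
    EMPTY geometry.
    """
    kinds = set(row_group_types)
    if len(kinds) != 1:
        return False  # mixed sub-types: segment stays as BLOB
    (kind,) = kinds
    return kind not in {"GEOMETRYCOLLECTION", "EMPTY"}

print(can_shred(["POINT"] * 10000))                   # True
print(can_shred(["POINT"] * 10000 + ["LINESTRING"]))  # False
```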

We may want to consider "partial" shredding in the future, where we keep multiple shredded segments around, but I'm not sure the performance/complexity hit would be worth it.

Despite those limitations, geometry shredding still opens up some interesting future optimization opportunities:

  • Push down certain filters (like bounding-box-intersections) much deeper into the storage as they are faster to evaluate on separate coordinate-arrays axis-by-axis.
  • Emit "shredded" geometries directly from storage when casting or exporting to GeoArrow.
    • Or, if the optimizer realizes that all function expressions reading from a fully shredded column have overloads that take the shredded representation instead (e.g. spatial's POINT_2D), it could replace those functions (or insert casts where needed) and emit the shredded representation from storage to "specialize" queries automatically.
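For the filter-pushdown point above, here is a sketch of how a bounding-box intersection test becomes a cheap per-axis min/max comparison over shredded coordinate arrays (conceptual only; the function name and zone-map-style pruning are assumptions, not DuckDB's implementation):

```python
def bbox_intersects(xs, ys, xmin, xmax, ymin, ymax):
    """Axis-by-axis bounding-box test over shredded coordinate arrays.

    With separate x/y arrays, a whole segment can be pruned by comparing
    its per-axis min/max against the query box, without parsing a single
    geometry blob.
    """
    return (min(xs) <= xmax and max(xs) >= xmin and
            min(ys) <= ymax and max(ys) >= ymin)

xs, ys = [1.0, 2.0, 3.0], [10.0, 11.0, 12.0]
print(bbox_intersects(xs, ys, 0.0, 1.5, 9.0, 13.0))  # True
print(bbox_intersects(xs, ys, 5.0, 6.0, 9.0, 13.0))  # False: x range misses
```

In practice the per-axis min/max would come from precomputed segment statistics rather than a scan, making the pruning effectively free.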

As it stands, when no shredding takes place there is no additional storage overhead. In other words, a non-shredded geometry column is serialized exactly the same way as it used to be (before this PR lands).

I've made some changes to how ColumnData is serialized, though. There is now an abstract unique_ptr<ExtraPersistentColumnData> in the PersistentColumnData struct that separates out the extra info required by the VariantColumnData and GeoColumnData. As they are the only two column data types where the type/layout can differ between segments within the same column, they need to store some information to reconstruct the layout and can't just rely on the top-level column logical type. In the case of the variant, this is the "shredded" type, but since geometries have a fixed number of layouts, we only store the geometry and vertex sub-type (as enums) instead of the equivalent logical type, to save space. I expect that we may want to generalize this further in the future if we start implementing shredding for other types as well (e.g. strings/JSON).
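The space saving from storing two small enums rather than a full nested logical type can be sketched as follows; the enum members, names, and two-byte layout are illustrative assumptions for this sketch, not DuckDB's actual on-disk format:

```python
import struct
from enum import IntEnum

# Hypothetical enums standing in for the geometry and vertex sub-types.
class GeometryType(IntEnum):
    POINT = 0
    LINESTRING = 1
    POLYGON = 2

class VertexType(IntEnum):
    XY = 0
    XYZ = 1
    XYM = 2
    XYZM = 3

def serialize_geo_extra_data(geom, vert):
    # Two one-byte enums suffice to reconstruct the shredded layout,
    # instead of serializing a full nested logical type like
    # STRUCT(x DOUBLE, y DOUBLE).
    return struct.pack("<BB", geom, vert)

def deserialize_geo_extra_data(data):
    g, v = struct.unpack("<BB", data)
    return GeometryType(g), VertexType(v)

blob = serialize_geo_extra_data(GeometryType.POINT, VertexType.XY)
print(len(blob))  # 2
print(deserialize_geo_extra_data(blob))  # round-trips to (POINT, XY)
```

Since the set of geometry layouts is fixed, the enums fully determine the segment's physical layout; a variant has an open-ended shredded type and so must serialize the type itself.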

@Maxxen Maxxen force-pushed the core-geom-step-5-shredding branch from 1c72f93 to 87a28a5 on December 22, 2025 16:04
@Maxxen Maxxen force-pushed the core-geom-step-5-shredding branch from abc6674 to 6a786b5 on December 23, 2025 17:06
@Maxxen Maxxen marked this pull request as ready for review January 5, 2026 14:04
@duckdb-draftbot duckdb-draftbot marked this pull request as draft January 8, 2026 14:15
@Maxxen Maxxen marked this pull request as ready for review January 8, 2026 21:43
@duckdb-draftbot duckdb-draftbot marked this pull request as draft January 9, 2026 07:09
@Maxxen Maxxen marked this pull request as ready for review January 9, 2026 07:09

Maxxen commented Jan 9, 2026

@Mytherin All green!


@Mytherin Mytherin left a comment


Thanks for the PR! Looks good - some comments


Mytherin commented Jan 9, 2026

Also this should probably target v1.5

@duckdb-draftbot duckdb-draftbot marked this pull request as draft January 13, 2026 13:54
@Maxxen Maxxen force-pushed the core-geom-step-5-shredding branch from 826d130 to 10c8fc0 on January 13, 2026 14:12
@Maxxen Maxxen changed the base branch from main to v1.5-variegata January 13, 2026 14:12

Maxxen commented Jan 13, 2026

@Mytherin I've addressed your feedback, implemented the remaining borked methods, and rebased onto v1.5 - I'm gonna let it simmer on my own CI for a bit, but let me know if there's anything else you think should be done.

@Maxxen Maxxen marked this pull request as ready for review January 13, 2026 23:37

@Mytherin Mytherin left a comment


Thanks! Looks good - some minor comments, otherwise this is good to go.

@duckdb-draftbot duckdb-draftbot marked this pull request as draft January 14, 2026 20:34
@Maxxen Maxxen marked this pull request as ready for review January 14, 2026 20:43

Maxxen commented Jan 15, 2026

@Mytherin all green!

"name": "geometry_minimum_shredding_size",
"description": "Minimum size of a rowgroup to enable GEOMETRY shredding, or set to -1 to disable entirely. Defaults to 1/4th of a rowgroup",
"type": "BIGINT",
"default_scope": "global",

I think this should be "scope": "global" now since setting it client-side has no effect, as we're only using this with the DBConfig and not with a client context. Can be fixed in a follow-up.

@Mytherin Mytherin merged commit 9d44bdd into duckdb:v1.5-variegata Jan 15, 2026
59 checks passed
@Mytherin

Thanks!

d-justen pushed a commit to d-justen/duckdb that referenced this pull request Jan 19, 2026