
GEOMETRY Rework: Part 6 - Geometry "Shredding"#20281

Merged
Mytherin merged 31 commits into duckdb:v1.5-variegata from Maxxen:core-geom-step-5-shredding
Jan 15, 2026

Conversation


@Maxxen Maxxen commented Dec 22, 2025

This is a follow-up PR that builds on top of #19848 (although orthogonal to #20143). Please have a look at #19136 for the context behind this PR.

This PR enables support for "shredding" geometry columns in our storage format by internally decomposing the geometries into multiple separate column segments when all the rows within a row-group are of the same geometry sub-type. For example, if a row group only contains POINT (XY) geometries, the column segment for that row-group gets rewritten and stored internally as STRUCT(x DOUBLE, y DOUBLE) instead of BLOB when checkpointing. This encoding is similar to how GeoArrow encodes typed geometries.
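As a rough illustration of what "shredding" means mechanically, here is a minimal Python sketch that decomposes a column of uniform point blobs into separate coordinate arrays; the blob layout here is a stand-in for illustration only, not DuckDB's actual serialization format:

```python
import struct

# A stand-in "blob" encoding for an XY point: two little-endian doubles.
def encode_point_blob(x, y):
    return struct.pack("<dd", x, y)

# "Shred" a column of point blobs into STRUCT(x DOUBLE, y DOUBLE)-style
# columnar storage: one flat array per coordinate axis.
def shred_points(blobs):
    xs, ys = [], []
    for blob in blobs:
        x, y = struct.unpack("<dd", blob)
        xs.append(x)
        ys.append(y)
    return {"x": xs, "y": ys}

blobs = [encode_point_blob(1.0, 2.0), encode_point_blob(3.0, 4.0)]
shredded = shred_points(blobs)
print(shredded["x"])  # [1.0, 3.0]
```

Each resulting array of doubles can then be stored and compressed as an ordinary primitive column segment.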

The major benefit of "decomposing" or "shredding" geometries from blobs into columnar lists and structs of doubles like this is that, at the storage level, they compress significantly better in the decomposed form. By decomposing the geometries into column segments of primitive types we automatically benefit from DuckDB's existing (and future!) best-in-class adaptive compression algorithms for each type of component. E.g. the coordinate vertices get ALP-compressed, and polygon ring-offsets get RLE/delta/dictionary-encoded automatically based on the distribution of the data.
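To see why the decomposed components compress so well, consider polygon ring offsets: they grow monotonically, so a delta pass turns them into a highly repetitive stream that run-length encoding then collapses. This is a conceptual sketch of that effect, not DuckDB's actual encoders:

```python
# Ring offsets grow monotonically; delta-encoding turns them into a
# stream of small, often-repeated increments that RLE handles well.
def delta_encode(offsets):
    return [offsets[0]] + [b - a for a, b in zip(offsets, offsets[1:])]

# Collapse repeated values into [value, run_length] pairs.
def run_length_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

offsets = [0, 5, 10, 15, 20, 25]   # six rings of 5 vertices each
deltas = delta_encode(offsets)     # [0, 5, 5, 5, 5, 5]
print(run_length_encode(deltas))   # [[0, 1], [5, 5]]
```

In the blob form this regularity is interleaved with coordinates and headers, so general-purpose compressors see far less of it.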

In my (limited) testing, a fully shredded column is about half the size of the blob equivalent, i.e. shredding results in a 2x compression ratio. This means that you get much faster reads from disk and will be able to keep more data in memory (as we now also compress in-memory segments).

There are currently three caveats:

  • We don't shred segments that contain GEOMETRYCOLLECTIONs, as they can be recursive and don't have a fixed layout.
  • We only shred if the entire row-group is of a single geometry sub-type. So if you have 10000 points, insert a linestring, and checkpoint, the column segment will revert to physical type BLOB.
  • We don't shred segments containing EMPTY geometries, either at the root or in sub-geometries.
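Taken together, the caveats above amount to a simple eligibility check at checkpoint time, which could be sketched like this (the function and the string tags are hypothetical, not DuckDB's actual types):

```python
def can_shred(row_group_types):
    """Decide whether a row-group's geometry segment may be shredded.

    Shred only when every row has the same sub-type, and that sub-type
    is neither a GEOMETRYCOLLECTION (recursive, no fixed layout) nor an
    EMPTY geometry.
    """
    kinds = set(row_group_types)
    if len(kinds) != 1:
        return False  # mixed sub-types: segment stays as BLOB
    (kind,) = kinds
    return kind not in {"GEOMETRYCOLLECTION", "EMPTY"}

print(can_shred(["POINT"] * 10000))                   # True
print(can_shred(["POINT"] * 10000 + ["LINESTRING"]))  # False
```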

We may want to consider "partial" shredding in the future, where we keep multiple shredded segments around, but I'm not sure the performance/complexity hit would be worth it.

Despite those limitations, geometry shredding still opens up some interesting future optimization opportunities:

  • Push down certain filters (like bounding-box-intersections) much deeper into the storage as they are faster to evaluate on separate coordinate-arrays axis-by-axis.
  • Emit "shredded" geometries directly from storage when casting or exporting to GeoArrow.
    • Or, if the optimizer realizes that all function expressions reading from a fully shredded column have overloads that take the shredded representation instead (e.g. spatial's POINT_2D), it could replace those functions (or insert casts where needed) and emit the shredded representation from storage to "specialize" queries automatically.
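For the filter-pushdown point above, here is a sketch of how a bounding-box intersection test becomes a cheap per-axis min/max comparison over shredded coordinate arrays (conceptual only; the function name and zone-map-style pruning are assumptions, not DuckDB's implementation):

```python
def bbox_intersects(xs, ys, xmin, xmax, ymin, ymax):
    """Axis-by-axis bounding-box test over shredded coordinate arrays.

    With separate x/y arrays, a whole segment can be pruned by comparing
    its per-axis min/max against the query box, without parsing a single
    geometry blob.
    """
    return (min(xs) <= xmax and max(xs) >= xmin and
            min(ys) <= ymax and max(ys) >= ymin)

xs, ys = [1.0, 2.0, 3.0], [10.0, 11.0, 12.0]
print(bbox_intersects(xs, ys, 0.0, 1.5, 9.0, 13.0))  # True
print(bbox_intersects(xs, ys, 5.0, 6.0, 9.0, 13.0))  # False: x range misses
```

In practice the per-axis min/max would come from precomputed segment statistics rather than a scan, making the pruning effectively free.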

As it stands, when no shredding takes place there is no additional storage overhead. In other words, a non-shredded geometry column is serialized exactly the same way as it used to be (before this PR lands).

I've made some changes to how ColumnData is serialized, though. There is now an abstract unique_ptr<ExtraPersistentColumnData> in the PersistentColumnData struct that separates out the extra info required by the VariantColumnData and GeoColumnData. As they are the only two column data types where the type/layout can differ between segments within the same column, they need to store some information to reconstruct the layout and can't just rely on the top-level column logical type. In the case of the variant, this is the "shredded" type, but since geometries have a fixed number of layouts, we only store the geometry and vertex sub-type (as enums) instead of the equivalent logical type, to save space. I expect that we may want to generalize this further in the future if we start implementing shredding for other types as well (e.g. strings/JSON).
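The space saving from storing two small enums rather than a full nested logical type can be sketched as follows; the enum members, names, and two-byte layout are illustrative assumptions for this sketch, not DuckDB's actual on-disk format:

```python
import struct
from enum import IntEnum

# Hypothetical enums standing in for the geometry and vertex sub-types.
class GeometryType(IntEnum):
    POINT = 0
    LINESTRING = 1
    POLYGON = 2

class VertexType(IntEnum):
    XY = 0
    XYZ = 1
    XYM = 2
    XYZM = 3

def serialize_geo_extra_data(geom, vert):
    # Two one-byte enums suffice to reconstruct the shredded layout,
    # instead of serializing a full nested logical type like
    # STRUCT(x DOUBLE, y DOUBLE).
    return struct.pack("<BB", geom, vert)

def deserialize_geo_extra_data(data):
    g, v = struct.unpack("<BB", data)
    return GeometryType(g), VertexType(v)

blob = serialize_geo_extra_data(GeometryType.POINT, VertexType.XY)
print(len(blob))  # 2
print(deserialize_geo_extra_data(blob))  # round-trips to (POINT, XY)
```

Since the set of geometry layouts is fixed, the enums fully determine the segment's physical layout; a variant has an open-ended shredded type and so must serialize the type itself.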

@Maxxen Maxxen force-pushed the core-geom-step-5-shredding branch from 1c72f93 to 87a28a5 on December 22, 2025 16:04
@Maxxen Maxxen force-pushed the core-geom-step-5-shredding branch from abc6674 to 6a786b5 on December 23, 2025 17:06
@Maxxen Maxxen marked this pull request as ready for review January 5, 2026 14:04
@duckdb-draftbot duckdb-draftbot marked this pull request as draft January 8, 2026 14:15
@Maxxen Maxxen marked this pull request as ready for review January 8, 2026 21:43
@duckdb-draftbot duckdb-draftbot marked this pull request as draft January 9, 2026 07:09
@Maxxen Maxxen marked this pull request as ready for review January 9, 2026 07:09

Maxxen commented Jan 9, 2026

@Mytherin All green!


@Mytherin Mytherin left a comment


Thanks for the PR! Looks good - some comments


Mytherin commented Jan 9, 2026

Also this should probably target v1.5

@duckdb-draftbot duckdb-draftbot marked this pull request as draft January 13, 2026 13:54
@Maxxen Maxxen force-pushed the core-geom-step-5-shredding branch from 826d130 to 10c8fc0 on January 13, 2026 14:12
@Maxxen Maxxen changed the base branch from main to v1.5-variegata January 13, 2026 14:12

Maxxen commented Jan 13, 2026

@Mytherin I've addressed your feedback, implemented the remaining borked methods, and rebased onto v1.5 - I'm gonna let it simmer on my own CI for a bit, but let me know if there's anything else you think should be done.

@Maxxen Maxxen marked this pull request as ready for review January 13, 2026 23:37

@Mytherin Mytherin left a comment


Thanks! Looks good - some minor comments, otherwise this is good to go.

@duckdb-draftbot duckdb-draftbot marked this pull request as draft January 14, 2026 20:34
@Maxxen Maxxen marked this pull request as ready for review January 14, 2026 20:43

Maxxen commented Jan 15, 2026

@Mytherin all green!

"name": "geometry_minimum_shredding_size",
"description": "Minimum size of a rowgroup to enable GEOMETRY shredding, or set to -1 to disable entirely. Defaults to 1/4th of a rowgroup",
"type": "BIGINT",
"default_scope": "global",

I think this should be "scope": "global" now since setting it client-side has no effect, as we're only using this with the DBConfig and not with a client context. Can be fixed in a follow-up.

@Mytherin Mytherin merged commit 9d44bdd into duckdb:v1.5-variegata Jan 15, 2026
59 checks passed
@Mytherin

Thanks!

d-justen pushed a commit to d-justen/duckdb that referenced this pull request Jan 19, 2026