GEOMETRY Rework: Part 6 - Geometry "Shredding" #20281

Merged

Mytherin merged 31 commits into duckdb:v1.5-variegata on Jan 15, 2026
Conversation
Member (Author)

@Mytherin All green!
Mytherin reviewed Jan 9, 2026

Collaborator
Mytherin left a comment

Thanks for the PR! Looks good - some comments

Collaborator

Also this should probably target
Member (Author)

@Mytherin I've adapted to your feedback, implemented the remaining borked methods, and rebased onto v1.5. I'm going to let it simmer on my own CI for a bit, but let me know if there is anything else you think should be done.
Mytherin reviewed Jan 14, 2026

Collaborator
Mytherin left a comment

Thanks! Looks good - some minor comments, otherwise this is good to go
Member (Author)

@Mytherin all green!
Mytherin reviewed Jan 15, 2026
    "name": "geometry_minimum_shredding_size",
    "description": "Minimum size of a rowgroup to enable GEOMETRY shredding, or set to -1 to disable entirely. Defaults to 1/4th of a rowgroup",
    "type": "BIGINT",
    "default_scope": "global",
Collaborator

I think this should be "scope": "global" now, since setting it client-side has no effect, as we're only using this with the DBConfig and not with a client context. Can be fixed in a follow-up.
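As a rough sketch of the semantics the setting's description implies (the helper name and the threshold value below are hypothetical illustrations, not DuckDB's implementation):

```python
# Hypothetical sketch of the geometry_minimum_shredding_size semantics as
# described: shredding kicks in only once a row-group reaches a minimum
# size, and a value of -1 disables shredding entirely.
def shredding_enabled(row_group_size: int, minimum_shredding_size: int) -> bool:
    """Decide whether a row-group qualifies for GEOMETRY shredding."""
    if minimum_shredding_size == -1:  # -1 disables shredding entirely
        return False
    return row_group_size >= minimum_shredding_size

shredding_enabled(40_000, 30_720)  # row-group large enough -> True
shredding_enabled(40_000, -1)      # shredding disabled      -> False
```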
Collaborator

Thanks!
d-justen pushed a commit to d-justen/duckdb that referenced this pull request on Jan 19, 2026
This is a followup PR that builds on top of duckdb#19848 (although orthogonal to duckdb#20143). Please have a look at duckdb#19136 for the context behind this PR.

This PR enables support for "shredding" geometry columns in our storage format by internally decomposing the geometries into multiple separate column segments when all the rows within a row-group are of the same geometry sub-type. For example, if a row group only contains `POINT` (XY) geometries, the column segment for that row-group gets rewritten and stored internally as `STRUCT(x DOUBLE, y DOUBLE)` instead of `BLOB` when checkpointing. This encoding is similar to how GeoArrow encodes typed geometries.

The major benefit of "decomposing" or "shredding" geometries from blobs into columnar lists and structs of doubles like this at the storage level is that they compress _significantly_ better in the decomposed form. By decomposing the geometries into column segments of primitive types we automatically benefit from DuckDB's existing (and future!) best-in-class adaptive compression algorithms for each type of component. E.g. the coordinate vertices get ALP-compressed, and polygon ring-offsets get RLE/delta/dictionary-encoded automatically based on the distribution of the data.

In my (limited) testing, a fully shredded column is about half the size of the blob equivalent, i.e. shredding results in a 2x compression ratio. This means that you get much faster reads from disk, and will be able to keep more data in memory (as we now also compress in-memory segments).

There are currently a few caveats:

- We don't shred segments that contain `GEOMETRYCOLLECTION`s, as they can be recursive and don't have a fixed layout.
- We only shred if the _entire_ row-group is of a single geometry sub-type. So if you have 10000 points, insert a linestring and checkpoint, the column segment will be reverted back to physical type `BLOB`.
- We don't shred segments containing EMPTY geometries, either at the root or in sub-geometries.

We may want to consider "partial" shredding in the future, where we keep multiple shredded segments around, but I'm not sure if the performance/complexity hit would be worth it.

Despite those limitations, geometry shredding still opens up some interesting future optimization opportunities:

- Push down certain filters (like bounding-box intersections) much deeper into the storage, as they are faster to evaluate on separate coordinate arrays axis-by-axis.
- Emit "shredded" geometries directly from storage when casting or exporting to GeoArrow.
- Or, if the optimizer realizes that all function expressions that read from a fully shredded column have overloads that take the shredded representation instead (e.g. spatial's `POINT_2D`), replace all functions (or insert casts where needed) and emit the shredded representation from storage to "specialize" queries automatically.

As it stands, in the case where there is no shredding, there is no additional storage overhead. In other words, a non-shredded geometry column is serialized exactly the same way as it used to be (before this PR lands).

I've made some changes to how `ColumnData` is serialized though. There is now an abstract `unique_ptr<ExtraPersistentColumnData>` in the `PersistentColumnData` struct that separates out the extra info required by the `VariantColumnData` and `GeoColumnData`. As they are the only two column data types where the type/layout can differ between segments within the same column, they need to store some information to reconstruct the layout and can't just rely on the top-level column logical type. In the case of the variant, this is the "shredded" type, but since geometries have a fixed number of layouts, we only store the geometry and vertex sub-type (enums) instead of the equivalent logical type, to save space.
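The enum-based layout metadata described above could be sketched roughly like this. The enum names, member values, and the single-byte packing are illustrative assumptions, not DuckDB's actual serialization:

```python
# Sketch of the space-saving idea: because geometries have a fixed set of
# layouts, the per-column "extra" metadata can be two small enums
# (geometry sub-type + vertex type) instead of a full serialized logical type.
from enum import IntEnum

class GeometryType(IntEnum):  # hypothetical subset of sub-types
    POINT = 0
    LINESTRING = 1
    POLYGON = 2

class VertexType(IntEnum):  # coordinate layouts
    XY = 0
    XYZ = 1
    XYM = 2
    XYZM = 3

def encode_layout(geom: GeometryType, vert: VertexType) -> bytes:
    """Pack both enums into a single metadata byte (high/low nibble)."""
    return bytes([(geom << 4) | vert])

def decode_layout(b: bytes) -> tuple:
    """Recover the (geometry sub-type, vertex type) pair from one byte."""
    return GeometryType(b[0] >> 4), VertexType(b[0] & 0x0F)

encoded = encode_layout(GeometryType.POINT, VertexType.XY)
assert decode_layout(encoded) == (GeometryType.POINT, VertexType.XY)
```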
I expect that we may want to generalize this further in the future if we start implementing shredding for other types as well (e.g. strings/JSON).
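As a closing illustration, the core decomposition idea for a row-group of `POINT` (XY) geometries can be sketched like this. The blob encoding (two packed little-endian doubles) and the helper names are made up for illustration; DuckDB's real storage layout differs:

```python
# Illustrative sketch of "shredding": a row-group of per-row point blobs is
# decomposed into two columnar double arrays (one per axis), which is the
# form that compresses well, and can be losslessly reassembled on read.
import struct

def shred_points(blobs):
    """Decompose (x, y) double-pair blobs into separate x and y columns."""
    xs, ys = [], []
    for blob in blobs:
        x, y = struct.unpack("<2d", blob)
        xs.append(x)
        ys.append(y)
    return xs, ys

def unshred_points(xs, ys):
    """Reassemble the columnar representation back into per-row blobs."""
    return [struct.pack("<2d", x, y) for x, y in zip(xs, ys)]

points = [struct.pack("<2d", float(i), float(i) * 2.0) for i in range(4)]
xs, ys = shred_points(points)
assert unshred_points(xs, ys) == points  # lossless round-trip
```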