GEOMETRY Rework: Part 4 - Step 2 - Adjust WKB conversion functions #19848
Merged
Mytherin merged 1 commit into duckdb:main on Nov 20, 2025
Conversation
Collaborator
> Thanks!
lnkuiper added a commit that referenced this pull request on Dec 30, 2025
This is a followup PR that builds on top of #19848. Please have a look at #19136 for the context behind this PR.

This PR adds initial support for parameterizing the `GEOMETRY` type by a _coordinate reference system_, along with a pair of `ST_CRS` and `ST_GetCRS` scalar functions to set/get this parameter at the type level.

The coordinate reference system type parameter is an arbitrary string, but we try to parse it to identify whether it is in WKT2, PROJJSON or AUTH:CODE format. If it is, we extract and cache the "id" and "name" fields, and use them instead of the full string when printing the type name. When parsing PROJJSON or WKT2 we only verify that the PROJJSON is valid JSON, and that the WKT2 is syntactically valid... in whatever encoding it uses. We don't check whether keywords/fields other than those required to extract the name and id are set correctly.

Two parameterized geometry types are treated as equal if the "id" fields of their coordinate system strings are equal; otherwise we compare the "name" fields, and finally fall back to comparing the full strings character by character.

This PR also updates the parquet extension so that the CRS is propagated to/from the Parquet type and the GeoParquet JSON metadata.

# What is a coordinate reference system (CRS)?

Because geometries are made up of arbitrary coordinates in planar space, it's important that a dataset can be associated with a coordinate system so that you can meaningfully interpret what the coordinates actually represent. Coordinate reference systems are basically the spatial equivalent of time zones in the temporal domain. While it would be convenient if the whole world always operated on coordinates in degrees of [longitude, latitude] (or always used UTC for timestamps), in practice it is very common that geospatial data is provided in a coordinate system defined for a specific local area of use.

## _Where_ are CRSs stored?
DuckDB (in duckdb-spatial) has previously not provided a built-in way to associate geometries with their coordinate system, leaving it up to the user to either keep this sort of metadata in a separate column or track it "out of band" outside of DuckDB itself. Both solutions are somewhat cumbersome and error-prone. Other databases that support geometries tend to do some combination of:

1. Inline a "spatial reference identifier" integer (SRID) into each geometry _value_, which can be used to perform a look-up in a separate (system) table that maps SRIDs to coordinate system definitions.
2. Keep a separate metadata table/view (e.g. SPATIAL_REF_SYS) which defines the coordinate reference system for each geometry _column_ in the database.

For DuckDB, it feels natural that the coordinate system would be set for a whole column, and not per value. But neither option seems suitable, as they rely on specific system tables being populated and present, and on all coordinate systems being known and defined _a priori_, which would add a lot of friction to the more ephemeral and multi-data-source workflows that DuckDB is typically used for. Therefore we are instead putting the coordinate system into the geometry _type_ itself, similar to how _data frame_ libraries tend to treat geometry columns. E.g. you can now create a table for a geometry column in the coordinate system defined by the _European Petroleum Survey Group_ with id `4326` as:

```sql
CREATE TABLE t1(g GEOMETRY('EPSG:4326'));
INSERT INTO t1 VALUES ('POINT (0 1)');
SELECT g, ST_CRS(g) FROM t1;
----
┌─────────────────────┬───────────┐
│          g          │ st_crs(g) │
│ geometry(epsg:4326) │  varchar  │
├─────────────────────┼───────────┤
│ POINT (0 1)         │ EPSG:4326 │
└─────────────────────┴───────────┘
```

Moving the coordinate system into the type system also has other benefits, like being able to error out at bind-time if users attempt to operate on geometries that belong to two different coordinate systems, and making it possible to infer the coordinate system of an arbitrary geometry expression when importing/exporting to different geospatial data formats, without having to do a separate table lookup. E.g. trying to mix geometries with different coordinate systems throws a binder exception:

```sql
INSERT INTO t1 VALUES (ST_SetCRS('POINT (0 1)', 'EPSG:3857'));
----
Binder Error: Cannot cast GEOMETRY with CRS 'EPSG:3857' to GEOMETRY with different CRS 'EPSG:4326'
```

However, you can always implicitly cast a geometry column _with_ a CRS both to and from a column _without_ a CRS, because at the end of the day the CRS is just metadata.

## _How_ are CRSs stored?

Here comes the difficult part. The modern data model of a coordinate reference system definition is conceptually defined in `ISO 19111`, but is mostly encoded as strings in `WKT1`, `WKT2` and `PROJJSON` format. These strings can be quite large and unwieldy for users to pass around directly, which is why e.g. PostGIS uses integer ids (SRIDs) to identify coordinate systems within a Postgres installation instead. Coordinate systems are also commonly referred to by users using a shorthand "AUTH:CODE" format, like e.g. `EPSG:4326` in the example above.
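As a rough illustration of the `AUTH:CODE` shorthand format (this is a hypothetical sketch, not DuckDB's actual parser), a recognizer only needs to split the string into an authority and a code:

```python
import re

# Hypothetical helper: recognize the "AUTH:CODE" shorthand (e.g. 'EPSG:4326').
# Illustrative sketch only; not the parser used by duckdb-spatial.
AUTH_CODE_RE = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*):([A-Za-z0-9_.]+)$")

def parse_auth_code(crs: str):
    """Return (authority, code) if the string looks like AUTH:CODE, else None."""
    match = AUTH_CODE_RE.match(crs.strip())
    if match is None:
        return None
    return match.group(1).upper(), match.group(2)
```

Anything that doesn't match this shape would then be tried as PROJJSON or WKT2 instead.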
Similar to integer SRIDs, the `AUTH:CODE` shorthand is lossy in that it's not possible to extract the actual "definition" of the coordinate system without having a separate database or table to resolve the auth code. Since `duckdb-spatial` embeds the `PROJ` library and its accompanying database, it can be used to resolve almost any `AUTH:CODE` (or any arbitrary string, really) to a complete coordinate system definition. Therefore, it makes a lot of sense for us to simply allow any sort of string in the CRS field of a geometry type. This is also how the (new) parquet geometry spec, as well as Iceberg and Delta, deal with this problem: just store a string and interpret it when needed, possibly by referring to some external data source (e.g. the PROJ database, or a table property).

Since core DuckDB doesn't need to perform any coordinate transformations, why would we need the actual full definition of a coordinate system? Well, the problem arises when external formats that we may want to import/export to do have requirements on the format of the CRS. GeoParquet (v1) requires that the CRS is provided in PROJJSON format, GPKG (the SQLite geo-format) requires WKT1 (or WKT2), and PostGIS requires WKT1. Therefore, the following currently raises an error, which is pretty bad UX (IMO):

```sql
-- Create a table with some random AUTH:CODE CRS
CREATE TABLE t1 (g GEOMETRY('DUCKDB:1337'));
...
COPY t1 TO 'test_random_crs.parquet';
----
Invalid Input Error: Cannot write GeoParquet V1 metadata for column 'g': GeoParquet only supports PROJJSON CRS definitions
```

While the following is OK, it comes with its own clunky UX (having to pass the full PROJJSON string in the type definition):

```sql
CREATE TABLE t1 (g GEOMETRY('{
  "$schema": "https://proj.org/schemas/v0.7/projjson.schema.json",
  "type": "ProjectedCRS",
  "name": "WGS 84 / Pseudo-Mercator",
  "base_crs": {
  ... 200 more lines...
  ...
}'));
-- Success!
COPY t1 TO 'test_ok_crs.parquet'
```

# Future work

In practice `spatial` can solve almost all of the aforementioned problems thanks to the `PROJ` library, but we can't (and don't want to) always assume the `spatial` extension is loaded. Therefore, in a future PR the plan is to introduce a `CoordinateReferenceSystemUtil` class that by default can "identify" and expand some of the most common coordinate systems from their auth code to a full PROJJSON definition. The implementation of the `CoordinateReferenceSystemUtil` can then be overridden by `spatial` when it is loaded. We may also want to consider splitting out the `PROJ` part from spatial into its own smaller extension that can be auto-loaded on demand, much like how e.g. the `icu` extension is auto-loaded for timezones/collations.

One open question I'm still not sure about is whether we want to "eagerly" try to identify/expand auth codes/user input to PROJJSON on e.g. table creation or dataset import, or whether we should fully defer any interpretation lazily "until needed", i.e. when exporting to an external dataset that imposes requirements on the CRS string. Maybe we want to make this configurable through a setting.
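The CRS equality rules described above (compare "id" fields if present, then "name" fields, then the full strings character by character) can be sketched as follows. This is an illustrative Python sketch under assumed names, not the actual DuckDB implementation, and it only handles the PROJJSON case of id/name extraction:

```python
import json

def extract_id_and_name(crs: str):
    """Try to pull the "id" and "name" fields out of a PROJJSON CRS string.

    Returns (None, None) for non-JSON input; WKT2 and AUTH:CODE handling
    are omitted from this sketch.
    """
    try:
        doc = json.loads(crs)
    except ValueError:
        return None, None
    ident = doc.get("id")
    crs_id = f'{ident["authority"]}:{ident["code"]}' if ident else None
    return crs_id, doc.get("name")

def crs_equal(a: str, b: str) -> bool:
    """Compare by "id" if both sides have one, then by "name", then verbatim."""
    a_id, a_name = extract_id_and_name(a)
    b_id, b_name = extract_id_and_name(b)
    if a_id and b_id:
        return a_id == b_id
    if a_name and b_name:
        return a_name == b_name
    return a == b
```

The point of the cascade is that two syntactically different PROJJSON documents describing the same registered CRS (same authority and code) still compare equal, while opaque strings fall back to an exact comparison.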
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request on Dec 31, 2025
`GEOMETRY` Rework: Part 4 - Step 2 - Adjust WKB conversion functions (duckdb/duckdb#19848)
github-actions bot added a commit to duckdb/duckdb-r that referenced this pull request on Dec 31, 2025
`GEOMETRY` Rework: Part 4 - Step 2 - Adjust WKB conversion functions (duckdb/duckdb#19848) Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>
Mytherin added a commit that referenced this pull request on Jan 15, 2026
This is a followup PR that builds on top of #19848 (although orthogonal to #20143). Please have a look at #19136 for the context behind this PR.

This PR enables support for "shredding" geometry columns in our storage format by internally decomposing the geometries into multiple separate column segments when all the rows within a row-group are of the same geometry sub-type. For example, if a row group only contains `POINT` (XY) geometries, the column segment for that row-group gets rewritten and stored internally as `STRUCT(x DOUBLE, y DOUBLE)` instead of `BLOB` when checkpointing. This encoding is similar to how GeoArrow encodes typed geometries.

The major benefit of "decomposing" or "shredding" geometries from blobs into columnar lists and structs of doubles at the storage level is that they compress _significantly_ better in the decomposed form. By decomposing the geometries into column segments of primitive types we automatically benefit from DuckDB's existing (and future!) best-in-class adaptive compression algorithms for each type of component. E.g. the coordinate vertices get ALP-compressed, and polygon ring-offsets get RLE/delta/dictionary encoded automatically based on the distribution of the data. In my (limited) testing, a fully shredded column is about half the size of the blob equivalent, i.e. shredding results in a 2x compression ratio. This means that you get much faster reads from disk, and will be able to keep more data in memory (as we now also compress in-memory segments).

There are currently a few caveats:

- We don't shred segments that contain `GEOMETRYCOLLECTION`s, as they can be recursive and don't have a fixed layout.
- We only shred if the _entire_ row-group is of a single geometry sub-type. So if you have 10000 points, insert a linestring and checkpoint, the column segment will be reverted back to physical type `BLOB`.
- We don't shred segments containing EMPTY geometries, either at the root or in sub-geometries.

We may want to consider "partial" shredding in the future, where we keep multiple shredded segments around, but I'm not sure the performance/complexity hit would be worth it. Despite those limitations, geometry shredding still opens up some interesting future optimization opportunities:

- Push down certain filters (like bounding-box intersections) much deeper into the storage, as they are faster to evaluate on separate coordinate arrays axis-by-axis.
- Emit "shredded" geometries directly from storage when casting or exporting to GeoArrow.
- Or, if the optimizer realizes that all function expressions reading from a fully shredded column have overloads that take the shredded representation instead (e.g. spatial's `POINT_2D`), replace all functions (or insert casts where needed) and emit the shredded representation from storage to "specialize" queries automatically.

As it stands, in the case where there is no shredding, there is no additional storage overhead. In other words, a non-shredded geometry column is serialized exactly the same way as it used to be (before this PR lands).

I've made some changes to how `ColumnData` is serialized, though. There is now an abstract `unique_ptr<ExtraPersistentColumnData>` in the `PersistentColumnData` struct that separates out the extra info required by the `VariantColumnData` and `GeoColumnData`. As they are the only two column data types where the type/layout can differ between segments within the same column, they need to store some information to reconstruct the layout and can't just rely on the top-level column logical type. In the case of the variant, this is the "shredded" type, but since geometries have a fixed number of layouts, we only store the geometry and vertex sub-type (enums) instead of the equivalent logical type to save space. I expect that we may want to generalize this further in the future if we start applying shredding to other types as well (e.g. strings/JSON).
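The all-or-nothing shredding decision described above can be sketched in a few lines. This is an illustrative Python model with hypothetical names (geometries as `(type, coords)` tuples), not the actual `GeoColumnData` code: a row group is shredded into separate x/y arrays only when every row is a `POINT`, and otherwise falls back to the blob layout.

```python
# Illustrative sketch of row-group "shredding" (hypothetical names, not the
# actual DuckDB storage code). Geometries are modeled as (type, coords) tuples.

def shred_row_group(geoms):
    """Shred into columnar x/y arrays if every row is a POINT (XY), else keep blobs."""
    if not geoms:
        return {"layout": "BLOB", "rows": []}
    sub_types = {g[0] for g in geoms}
    if sub_types != {"POINT"}:
        # Mixed sub-types (or collections/EMPTY): fall back to the blob layout.
        return {"layout": "BLOB", "rows": list(geoms)}
    xs = [g[1][0] for g in geoms]
    ys = [g[1][1] for g in geoms]
    # Each coordinate array can now be compressed independently (e.g. ALP).
    return {"layout": "STRUCT(x DOUBLE, y DOUBLE)", "x": xs, "y": ys}
```

This mirrors the caveat above: adding a single linestring to an otherwise point-only row group reverts the whole segment to `BLOB`.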
d-justen pushed a commit to d-justen/duckdb that referenced this pull request on Jan 19, 2026
This is a followup PR that builds on top of #19476. Please have a look at #19136 for the context behind this PR.
I realized that the `Geometry::FromBinary`/`Geometry::ToBinary` helper functions need to be adjusted slightly so that they can be used to implement the cast functions provided in `duckdb-spatial`. These casts may move to core eventually, but for now this is required to integrate the spatial extension with the new geometry type smoothly.