
GEOMETRY Rework: Part 4 - Step 2 - Adjust WKB conversion functions #19848

Merged: Mytherin merged 1 commit into duckdb:main from Maxxen:core-geom-step-4-fixup-parquet-step-2 on Nov 20, 2025
Conversation

@Maxxen (Member) commented Nov 19, 2025

This is a followup PR that builds on top of #19476. Please have a look at #19136 for the context behind this PR.

I realized the `Geometry::FromBinary`/`Geometry::ToBinary` helper functions need to be adjusted slightly so that they can be used to implement the cast functions provided in duckdb-spatial. These casts may move to core eventually, but for now this is required to integrate the spatial extension with the new geometry type smoothly.

@Maxxen Maxxen marked this pull request as ready for review November 19, 2025 16:49
@Mytherin Mytherin merged commit d6dce8b into duckdb:main Nov 20, 2025
99 checks passed
@Mytherin (Collaborator) commented:

Thanks!

lnkuiper added a commit that referenced this pull request Dec 30, 2025
This is a followup PR that builds on top of
#19848. Please have a look at
#19136 for the context behind this
PR.

This PR adds initial support for parameterizing the `GEOMETRY` type by a
_coordinate reference system_, along with a pair of `ST_SetCRS` and
`ST_CRS` scalar functions to set/get this parameter at the type level,
as used in the examples below.

The coordinate reference system type parameter is an arbitrary string,
but we try to parse it to determine whether it is in WKT2, PROJJSON or
AUTH:CODE format. If it is, we extract and cache the "id" and "name"
fields, and use them instead of the full string when printing the type
name.

When parsing PROJJSON or WKT2 we only verify that the PROJJSON is valid
JSON and that the WKT2 is syntactically valid. We don't check whether
keywords/fields, other than those required to extract the name and id,
are set correctly.

Two parameterized geometry types are treated as equal if the "id" fields
of their coordinate system strings are equal; otherwise we compare the
"name" fields, and finally fall back to comparing the full strings
character by character.

This PR also updates the parquet extension so that the CRS is propagated
to/from the Parquet type and GeoParquet json metadata.
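
As a minimal sketch of what this looks like from SQL (the file name here is hypothetical, and `ST_CRS` is the getter shown in the examples below):

```sql
-- Hypothetical round-trip: the CRS travels with the column into the
-- GeoParquet metadata on write, and is restored on read.
CREATE TABLE t (g GEOMETRY('EPSG:4326'));
INSERT INTO t VALUES ('POINT (0 1)');
COPY t TO 'crs_roundtrip.parquet';
SELECT ST_CRS(g) FROM 'crs_roundtrip.parquet';
```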

# What is a coordinate reference system (CRS)?

Because geometries are made up of arbitrary coordinates in planar space,
it's important that a dataset can be associated with a coordinate system
so that you can meaningfully interpret what the coordinates actually
represent. Coordinate reference systems are basically the spatial
equivalent of time-zones in the temporal domain. While it would be
convenient if the whole world always operated on coordinates in degrees
of [longitude, latitude] (or always used UTC for timestamps), in
practice it is very common that geospatial data is provided in a
coordinate system defined for a specific local area of use.

## _Where_ are CRSs stored?

DuckDB (in duckdb-spatial) has previously not provided a built-in way to
associate geometries with their coordinate system, leaving it up to the
user to either keep this sort of metadata in a separate column or track
it "out of band" outside of DuckDB itself. Both solutions are somewhat
cumbersome and error-prone.

Other databases that support geometries tend to do some combination of:

1. Inline a "spatial reference identifier" integer (SRID) into each
geometry _value_, which can be used to perform a look-up in a separate
(system) table that maps SRIDs to coordinate system definitions.
2. Keep a separate metadata table/view (e.g. SPATIAL_REF_SYS) which
defines the coordinate reference system for each geometry _column_ in
the database.

For DuckDB, it feels natural that the coordinate system is set for a
whole column, and not per value. But neither option seems suitable, as
both rely on specific system tables being populated and present, and on
all coordinate systems being known and defined _a priori_, which would
add a lot of friction to the more ephemeral and multi-data-source
workflows that DuckDB is typically used for.

Therefore we are instead putting the coordinate system into the geometry
_type_ itself, similar to how _data frame_ libraries tend to treat
geometry columns.

E.g. you can now create a table with a geometry column in the coordinate
system defined by the _European Petroleum Survey Group_ with id `4326`:

```sql
CREATE TABLE t1(g GEOMETRY('EPSG:4326'));

INSERT INTO t1 VALUES ('POINT (0 1)');

SELECT g, ST_CRS(g) FROM t1;
----
┌─────────────────────┬───────────┐
│          g          │ st_crs(g) │
│ geometry(epsg:4326) │  varchar  │
├─────────────────────┼───────────┤
│ POINT (0 1)         │ EPSG:4326 │
└─────────────────────┴───────────┘
```

Moving the coordinate system into the type system also has other
benefits like being able to error-out at bind-time if users are
attempting to operate on geometries that belong to two different
coordinate systems, and making it possible to infer the coordinate
system of an arbitrary geometry expression when importing/exporting to
different geospatial data formats, without having to do a separate table
lookup.

E.g. trying to mix geometries with different coordinate systems throws a
binder exception.
```sql
INSERT INTO t1 VALUES (ST_SetCRS('POINT (0 1)', 'EPSG:3857'));
----
Binder Error:
Cannot cast GEOMETRY with CRS 'EPSG:3857' to GEOMETRY with different CRS 'EPSG:4326'
```

However, you can always implicitly cast a geometry column _with_ a CRS
both to and from a column _without_ a CRS, because at the end of the day
CRS is just metadata.
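
A minimal sketch of this, reusing the `t1` table from above (`t2` is a hypothetical CRS-less table):

```sql
-- Casts between a CRS-typed and an untyped GEOMETRY column are implicit
-- in both directions, since the CRS is only metadata.
CREATE TABLE t2 (g GEOMETRY);
INSERT INTO t2 SELECT g FROM t1; -- GEOMETRY('EPSG:4326') -> GEOMETRY
INSERT INTO t1 SELECT g FROM t2; -- GEOMETRY -> GEOMETRY('EPSG:4326')
```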

## _How_ are CRSs stored?

Here comes the difficult part. 

The modern data model of a coordinate reference system definition is
conceptually defined in `ISO 19111`, but is mostly encoded as strings in
`WKT1`, `WKT2` and `PROJJSON` format. These strings can be quite large
and unwieldy for users to pass around directly, which is why e.g.
PostGIS uses integer ids (SRIDs) to identify coordinate systems within a
Postgres installation instead. Coordinate systems are also commonly
referred to by users using a shorthand "AUTH:CODE" format, e.g.
`EPSG:4326` in the example above.

Similar to integer SRIDs, the `AUTH:CODE` shorthand is lossy in that
it's not possible to extract the actual "definition" of the coordinate
system without a separate database or table to resolve the auth code.
Since `duckdb-spatial` embeds the `PROJ` library and its accompanying
database, it can be used to resolve almost any `AUTH:CODE` (or any
arbitrary string, really) to a complete coordinate system definition.

Therefore, it makes a lot of sense for us to simply allow any sort of
string in the CRS field of a geometry type. This is also how the (new)
Parquet geometry spec, as well as Iceberg and Delta, deal with this
problem: just store a string and interpret it when needed, possibly by
referring to some external data source (e.g. the PROJ database, or a
table property). Since core DuckDB doesn't need to perform any
coordinate transformations, why would we need the actual full definition
of a coordinate system?

Well, the problem arises when external formats that we may want to
import/export to do have requirements on the format of the CRS.
GeoParquet (v1) requires that the CRS is provided in PROJJSON format,
GPKG (the SQLite geo-format) requires WKT1 (or WKT2), and PostGIS
requires WKT1. Therefore, this currently raises an error, which is
arguably bad UX:

```sql
-- Create a table with some random AUTH:CODE CRS
CREATE TABLE t1 (g GEOMETRY('DUCKDB:1337'));
...
COPY t1 TO 'test_random_crs.parquet';
----
Invalid Input Error: Cannot write GeoParquet V1 metadata for column 'g': GeoParquet only supports PROJJSON CRS definitions
```

The following works, but comes with its own poor UX (having to pass the
full PROJJSON string in the type definition):

```sql
CREATE TABLE t1 (g GEOMETRY('{ "$schema": "https://proj.org/schemas/v0.7/projjson.schema.json",
  "type": "ProjectedCRS",
  "name": "WGS 84 / Pseudo-Mercator",
  "base_crs": {
  ...
  200 more lines...
  ...
}');

-- Success!
COPY t1 TO 'test_ok_crs.parquet';
```

# Future work 

In practice `spatial` can solve almost all of the aforementioned
problems thanks to the `PROJ` library, but we can't (and don't want to)
always assume the `spatial` extension is loaded.

Therefore, in a future PR the plan is to introduce a
`CoordinateReferenceSystemUtil` class that by default can "identify" and
expand some of the most common coordinate systems from their auth code
to a full PROJJSON definition.

The implementation of the `CoordinateReferenceSystemUtil` can then be
overridden by `spatial` when it is loaded. We may also want to consider
splitting out the `PROJ` part from spatial into its own smaller
extension that can be auto-loaded on demand, much like how e.g. the
`icu` extension is auto-loaded for timezones/collations.

One open question I'm still not sure of is whether we want to "eagerly"
try to identify/expand auth codes/user input to PROJJSON on e.g. table
creation or dataset import, or whether we should fully defer any
interpretation "until needed", i.e. when exporting to an external
dataset that imposes requirements on the CRS string. Maybe we want to
make this configurable through a setting.
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Dec 31, 2025
`GEOMETRY` Rework: Part 4 - Step 2 - Adjust WKB conversion functions (duckdb/duckdb#19848)
github-actions bot added a commit to duckdb/duckdb-r that referenced this pull request Dec 31, 2025
`GEOMETRY` Rework: Part 4 - Step 2 - Adjust WKB conversion functions (duckdb/duckdb#19848)

Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>
Mytherin added a commit that referenced this pull request Jan 15, 2026
This is a followup PR that builds on top of
#19848 (although orthogonal to
#20143). Please have a look at
#19136 for the context behind this
PR.

This PR enables support for "shredding" geometry columns in our storage
format by internally decomposing the geometries into multiple separate
column segments when all the rows within a row-group are of the same
geometry sub-type. For example, if a row group only contains `POINT`
(XY) geometries, the column segment for that row-group gets rewritten
and stored internally as `STRUCT(x DOUBLE, y DOUBLE)` instead of `BLOB`
when checkpointing. This encoding is similar to how GeoArrow encodes
typed geometries.
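
As a minimal sketch of a workload that qualifies for shredding under this scheme (the table and values are hypothetical):

```sql
-- Every row in the row-group is a POINT (XY) geometry, so on checkpoint
-- the segment can be rewritten internally as STRUCT(x DOUBLE, y DOUBLE).
CREATE TABLE pts (g GEOMETRY);
INSERT INTO pts VALUES ('POINT (0 1)'), ('POINT (2 3)');
CHECKPOINT;
```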

The major benefit of "decomposing" or "shredding" geometries from blobs
into columnar lists and structs of doubles at the storage level is that
they compress _significantly_ better in the decomposed form. By
decomposing the geometries into column segments of primitive types we
automatically benefit from DuckDB's existing (and future!) best-in-class
adaptive compression algorithms for each type of component. E.g. the
coordinate vertices get ALP-compressed, and polygon ring-offsets get
RLE/delta/dictionary encoded automatically based on the distribution of
the data.

In my (limited) testing, a fully shredded column is about half the size
of the blob equivalent, i.e. shredding results in a 2x compression
ratio. This means that you get much faster reads from disk, and will be
able to keep more data in memory (as we now also compress in-memory
segments).

There are currently a few caveats:
- We don't shred segments that contain `GEOMETRYCOLLECTION`s, as they
can be recursive and don't have a fixed layout.
- We only shred if the _entire_ row-group is of a single geometry
sub-type. So if you have 10000 points, insert a linestring and
checkpoint, the column segment will be reverted back to the physical
type `BLOB` (sketched after this list).
- We don't shred segments containing EMPTY geometries, either at the
root or in sub-geometries.
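
Continuing the hypothetical `pts` sketch from above, the second caveat plays out like this:

```sql
-- Mixing in a second sub-type makes the row-group non-uniform, so the
-- segment falls back to the plain BLOB representation on checkpoint.
INSERT INTO pts VALUES ('LINESTRING (0 0, 1 1)');
CHECKPOINT;
```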

We may want to consider "partial" shredding in the future, where we keep
multiple shredded segments around, but I'm not sure if the
performance/complexity hit would be worth it.

Despite those limitations, geometry shredding still opens up some
interesting future optimization opportunities:

- Push down certain filters (like bounding-box intersections) much
deeper into the storage, as they are faster to evaluate on separate
coordinate arrays axis by axis.
- Emit "shredded" geometries directly from storage when casting or
exporting to GeoArrow.
- If the optimizer realizes that all function expressions reading from a
fully shredded column have overloads that take the shredded
representation instead (e.g. spatial's `POINT_2D`), replace the
functions (or insert casts where needed) and emit the shredded
representation from storage to "specialize" queries automatically.

As it stands, when there is no shredding there is no additional storage
overhead. In other words, a non-shredded geometry column is serialized
exactly the same way as it used to be (before this PR lands).

I've made some changes to how `ColumnData` is serialized, though. There
is now an abstract `unique_ptr<ExtraPersistentColumnData>` in the
`PersistentColumnData` struct that separates out the extra info required
by `VariantColumnData` and `GeoColumnData`. As these are the only two
column data types where the type/layout can differ between segments
within the same column, they need to store some information to
reconstruct the layout and can't just rely on the top-level column
logical type. In the case of the variant this is the "shredded" type,
but since geometries have a fixed number of layouts, we only store the
geometry and vertex sub-type (as enums) instead of the equivalent
logical type, to save space. I expect that we may want to generalize
this further in the future if we start implementing shredding for other
types as well (e.g. strings/JSON).
d-justen pushed a commit to d-justen/duckdb that referenced this pull request Jan 19, 2026