GEOMETRY Rework: Part 1 - Logical Type #19136

Merged
hannes merged 6 commits into duckdb:main from Maxxen:core-geom-step-1 on Sep 30, 2025

@Maxxen (Member) commented Sep 25, 2025

This PR adds a dedicated GEOMETRY logical type into core DuckDB. The internal representation is currently WKB-encoded BLOB, but that will likely change in a future PR. No functions are implemented for this type, except to/from VARCHAR casts.

This is the first PR in a long series of changes that will be pushed over the coming weeks, with the ultimate goal of significantly elevating DuckDB's geospatial capabilities for the DuckDB v1.5 release in early 2026.

Background

So far DuckDB has (mostly) contained all geospatial features in the spatial extension. This has worked great, as it has allowed us to independently and rapidly experiment with how to adapt geospatial processing to DuckDB's engine, while also keeping a lot of the domain-specific details and dependencies separate from DuckDB's core codebase. Besides integrating a lot of third-party geospatial libraries, spatial has also been integrating deeply into DuckDB's core execution engine and made itself dependent on a lot of fragile interfaces within DuckDB's internals (custom operators, optimizer rules, indexes, etc). Fast forward to today, and spatial is one of DuckDB's largest and most complex extensions.

While this complexity has been somewhat manageable so far, we're now reaching a point where it's no longer feasible to relegate all geospatial-related stuff into a single separate extension. There's already some awkwardness when dealing with e.g. Pandas/Postgres/SQLite through DuckDB, which have their own geospatial extensions (GeoPandas, PostGIS, GeoPackage/SpatiaLite), as core DuckDB doesn't really want to acknowledge anything spatial-specific and we don't want to introduce inter-extension dependencies either. But geospatial support is now also part of the Parquet standard itself, which is of much higher importance to DuckDB, and is supported by all up-and-coming data lake formats.

In short: Geospatial data isn't special (anymore)

What's changing

Therefore we're taking some steps toward making vanilla DuckDB spatially aware, by moving the GEOMETRY type from spatial into core DuckDB.

While almost all of the geospatial functionality will still remain in spatial (e.g. 99% of ST_ functions), this will give our (and community!) extensions and client libraries some common ground as they can all interface with the same GEOMETRY type. We will also make sure that existing databases that use the GEOMETRY type as currently defined in spatial will remain compatible.

Additionally, because GEOMETRY will now become part of both DuckDB's execution and storage engine, this opens up a lot of optimization opportunities that are currently impractical/impossible to implement solely in spatial. The two big ones are statistics propagation and compression, which will significantly improve performance when processing both external formats like (Geo)Parquet and DuckDB's own storage format.

Again, this is a pretty massive change. I have prototyped most of it on my own fork(s), but will break it up into multiple PRs to keep it manageable. The rough short-term roadmap looks something like this:

Client/other extension integrations etc, etc, is planned to get in before 1.5 as well, but we will first focus on core/parquet/spatial.

This PR also removes spatial from the CI workflow until I've had time to adapt it to these changes, but that will hopefully not take too long (I have an old branch with most of the work already).

@Mytherin (Collaborator) commented

Thanks! Could you merge with main again?

@duckdb-draftbot duckdb-draftbot marked this pull request as draft September 29, 2025 10:28
@Maxxen Maxxen marked this pull request as ready for review September 29, 2025 10:30
@duckdb-draftbot duckdb-draftbot marked this pull request as draft September 29, 2025 13:25
@Maxxen Maxxen marked this pull request as ready for review September 29, 2025 15:58
@Maxxen (Member, Author) commented Sep 29, 2025

@Mytherin Done!

@Maxxen changed the title from "Add GEOMETRY type" to "GEOMETRY Rework: Part 1 - Logical Type" on Sep 30, 2025
@hannes (Member) commented Sep 30, 2025

thanks!

@hannes hannes merged commit 90690af into duckdb:main Sep 30, 2025
54 checks passed
hannes added a commit that referenced this pull request Oct 8, 2025
This is a followup PR that builds on top of #19136. Please have a look
at #19136 for the context behind this PR.

This PR adds a new statistics type, the `GeometryStatistics` which keeps
track of the spatial extent and the list of geometry subtypes within a
column. This can be used in the future to push down certain spatial
filters into storage to greatly accelerate data retrieval by skipping
row groups that can't intersect with a spatial predicate, or won't contain
geometries of a specific type. It can also be used when writing to
storage to select specialized compression methods, or at planning time
to swap e.g. scalar functions with implementations specialized for
certain geometry layouts.
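
To make the idea concrete, here is a minimal sketch of what such a statistics object could track. The name `GeomStatsSketch` and its members are hypothetical and chosen for illustration only; this is not the actual `GeometryStatistics` class, and it only handles the XY extent. The type codes are the standard WKB ones (1 = POINT through 7 = GEOMETRYCOLLECTION).

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for per-column geometry statistics (not DuckDB's actual class).
struct GeomStatsSketch {
    // Start with an "inverted" extent so that the first vertex always widens it.
    double min_x = std::numeric_limits<double>::infinity();
    double min_y = std::numeric_limits<double>::infinity();
    double max_x = -std::numeric_limits<double>::infinity();
    double max_y = -std::numeric_limits<double>::infinity();
    // Bit i is set if a geometry with WKB type code i was seen (1 = POINT ... 7 = GEOMETRYCOLLECTION).
    uint32_t type_mask = 0;

    // Widen the extent so that it covers one more vertex.
    void AddVertex(double x, double y) {
        min_x = std::min(min_x, x);
        min_y = std::min(min_y, y);
        max_x = std::max(max_x, x);
        max_y = std::max(max_y, y);
    }

    // Record that a geometry of the given WKB type code occurs in the column.
    void AddType(uint32_t wkb_type) {
        type_mask |= (1u << wkb_type);
    }

    // Combine the statistics of two segments or row groups.
    void Merge(const GeomStatsSketch &other) {
        min_x = std::min(min_x, other.min_x);
        min_y = std::min(min_y, other.min_y);
        max_x = std::max(max_x, other.max_x);
        max_y = std::max(max_y, other.max_y);
        type_mask |= other.type_mask;
    }

    // False means the column provably contains no geometry of this type.
    bool MayContainType(uint32_t wkb_type) const {
        return (type_mask & (1u << wkb_type)) != 0;
    }
};
```

A row group whose `MayContainType(3)` is false could, for example, be skipped entirely by a filter that only matches polygons.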

The parquet reader already had its own notion of "Geometry Stats",
introduced in #18832, but this has now been unified to use the same
structs/logic as the ones in core, which significantly reduces the
spatial-specific code in the parquet extension.

Following the roadmap outlined in #19136, the next PR will actually add
support for filter pushdown of bounding-box intersection queries, by
implementing the `&&` intersection operator and using these new stats to
prune row groups in our storage. I'll also try to fix up the parquet
reader while I'm at it so that it can output the new `GEOMETRY` type
without requiring `spatial`.
@paleolimbot (Contributor) left a comment

Cool!

Comment on lines +444 to +446
if (byte_order != 1) {
    throw InvalidInputException("Unsupported byte order %d in WKB", byte_order);
}

Perhaps you rewrite WKB when reading from Parquet before it gets here, but unfortunately the default in JTS is big-endian, which resulted in the widely used Overture Maps data being published in big-endian WKB.

Comment on lines +488 to +489
;
writer.Write(flag_str);

Probably unintended?

Comment on lines +76 to +78
(4, 'LINESTRING Z (0 0 0, 1 1 1, 2 2 2)'),
(5, 'LINESTRING M (0 0 0, 1 1 1, 2 2 2)'),
(6, 'LINESTRING ZM (0 0 0 0, 1 1 1 1, 2 2 2 2)'),

These test cases would be slightly better if the Z and M values were different (I have personally been burned by mixing these up and having a test case that wasn't able to catch it).

krlmlr added a commit to krlmlr/duckdb-r that referenced this pull request Oct 21, 2025
lnkuiper added a commit that referenced this pull request Oct 27, 2025
This is a followup PR that builds on top of
#19203. Please have a look at
#19136 for the context behind this
PR.

In #19203 we added support for
storing geometry statistics. In this PR we now add a `&&`
bounding-box-intersection binary operator that when used in a filter can
be pushed down into storage and prune row-groups based on these new
geometry statistics.
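
As a rough illustration of the pruning logic (a sketch, not the code in this PR): a row group can be skipped whenever the extent recorded in its statistics does not intersect the bounding box of the filter. `Box2D`, `BoxesIntersect` and `RowGroupsToScan` are hypothetical names, and the sketch assumes closed boxes, so boxes that merely touch count as intersecting.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical axis-aligned bounding box; the name and layout are assumptions.
struct Box2D {
    double min_x, min_y, max_x, max_y;
};

// Two closed boxes intersect unless they are separated along at least one axis.
bool BoxesIntersect(const Box2D &a, const Box2D &b) {
    return a.min_x <= b.max_x && a.max_x >= b.min_x &&
           a.min_y <= b.max_y && a.max_y >= b.min_y;
}

// Given the extent of each row group, return the indexes of the row groups
// that still have to be scanned for a query box. Everything else is pruned.
std::vector<std::size_t> RowGroupsToScan(const std::vector<Box2D> &row_group_extents,
                                         const Box2D &query) {
    std::vector<std::size_t> result;
    for (std::size_t i = 0; i < row_group_extents.size(); i++) {
        if (BoxesIntersect(row_group_extents[i], query)) {
            result.push_back(i);
        }
    }
    return result;
}
```

Note that this check is conservative: a row group that passes may still contain no matching geometry, so the filter must still be evaluated on the rows that are scanned.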

Geometry-filter pushdown works for our own storage format, as well as
parquet (if the parquet file contains parquet-native geometry stats).
I've also cleaned up the parquet extension further and removed all
spatial-dependent code, so that all geoparquet related functionality now
works without requiring spatial to be loaded, which simplifies both the
read and write-path for geometries significantly. It also enables us to
run the geoparquet tests in CI when spatial isn't available.

I've also added a couple of basic geometry scalar-functions to convert
to/from WKB and WKT. These will be more fleshed out in the future (to
e.g. handle big-endian WKB and specify WKT precision).
krlmlr added a commit to krlmlr/duckdb-r that referenced this pull request Nov 1, 2025
krlmlr added a commit to krlmlr/duckdb-r that referenced this pull request Nov 2, 2025
krlmlr added a commit to krlmlr/duckdb-r that referenced this pull request Nov 2, 2025
Mytherin added a commit that referenced this pull request Nov 7, 2025
…rt (#19476)

This is a followup PR that builds on top of
#19439. Please have a look at
#19136 for the context behind this
PR.

This PR fixes up the remaining issues in the parquet extension related
to geometries. When reading geometry columns we now push an expression
column reader on top of the underlying blob column reader to perform the
WKB parsing with `ST_GeomFromWKB`. `ST_GeomFromWKB` now actually checks
that the input is valid WKB and also converts from big-endian WKB to
little-endian if required. This can be optimized further, but it's good
enough for now.
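
For readers unfamiliar with WKB byte order, the following sketch shows one way to decode a 2D `POINT` regardless of the byte-order flag in its first byte (0 = big-endian, 1 = little-endian). It is not the implementation of `ST_GeomFromWKB`; `ReadU32`, `ReadF64`, `PointXY` and `ParseWkbPoint` are hypothetical helpers, and only the plain 21-byte XY point layout is handled.

```cpp
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Assemble a 32-bit unsigned integer from 4 bytes in the given byte order.
static uint32_t ReadU32(const uint8_t *p, bool little_endian) {
    uint32_t value = 0;
    for (int i = 0; i < 4; i++) {
        int shift = little_endian ? 8 * i : 8 * (3 - i);
        value |= static_cast<uint32_t>(p[i]) << shift;
    }
    return value;
}

// Assemble an IEEE-754 double from 8 bytes in the given byte order.
static double ReadF64(const uint8_t *p, bool little_endian) {
    uint64_t bits = 0;
    for (int i = 0; i < 8; i++) {
        int shift = little_endian ? 8 * i : 8 * (7 - i);
        bits |= static_cast<uint64_t>(p[i]) << shift;
    }
    double value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}

struct PointXY {
    double x, y;
};

// Parse a WKB-encoded 2D POINT: 1 byte order flag, 4 byte type code, 2 x 8 byte doubles.
PointXY ParseWkbPoint(const std::vector<uint8_t> &wkb) {
    if (wkb.size() != 21) {
        throw std::invalid_argument("expected a 21-byte WKB POINT");
    }
    if (wkb[0] > 1) {
        throw std::invalid_argument("unknown WKB byte order flag");
    }
    const bool little_endian = wkb[0] == 1;
    if (ReadU32(&wkb[1], little_endian) != 1) {
        throw std::invalid_argument("not a 2D POINT");
    }
    return PointXY{ReadF64(&wkb[5], little_endian), ReadF64(&wkb[13], little_endian)};
}
```

Because the values are assembled byte by byte, the same code works on both little-endian and big-endian hosts.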

I've also added support for converting geometry columns to/from arrow
arrays with geoarrow extension metadata. This code is basically lifted
[straight from the spatial
extension](https://github.com/duckdb/duckdb-spatial/blob/v1.4-andium/src/spatial/spatial_geoarrow.cpp).
Mytherin added a commit that referenced this pull request Nov 20, 2025
…19848)

This is a followup PR that builds on top of
#19476. Please have a look at
#19136 for the context behind this
PR.

I realized that the `Geometry::FromBinary`/`Geometry::ToBinary` helper
functions need to be adjusted slightly so that they can be used to
implement the cast functions provided in `duckdb-spatial`. These casts
may move to core eventually, but for now this is required to integrate
the spatial extension with the new geometry type smoothly.
lnkuiper added a commit that referenced this pull request Dec 30, 2025
This is a followup PR that builds on top of
#19848. Please have a look at
#19136 for the context behind this
PR.

This PR adds initial support for parameterizing the `GEOMETRY` type by a
_coordinate reference system_, along with a pair of `ST_SetCRS` and
`ST_CRS` scalar functions to set/get this parameter at the type
level.

The coordinate reference system type parameter is an arbitrary string,
but we try to parse it to identify if it is in WKT2, PROJJSON or
AUTH:CODE format. If it is, we extract and cache the "id" and "name"
fields, and use them instead of the full string when printing the type
name.

When parsing PROJJSON or WKT2 we just verify that the PROJJSON is valid
json, and that WKT2 is syntactically valid... in whatever encoding it
uses. We don't inspect if the keywords/fields other than those required
to extract the name and id are set correctly.
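
A very rough sketch of what "identifying the format" could look like is shown below. This is a purely syntactic heuristic written for illustration, not the parser added in this PR: `CrsFormat` and `GuessCrsFormat` are hypothetical names, and the sketch does not validate the JSON or WKT content at all.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical classification of a CRS string; the names are assumptions.
enum class CrsFormat { PROJJSON, WKT, AUTH_CODE, UNKNOWN };

static bool IsWordChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Guess the format from the leading characters only:
//   '{' ...            -> PROJJSON candidate
//   KEYWORD[ ...       -> WKT candidate
//   AUTH:CODE (no more) -> authority/code shorthand
CrsFormat GuessCrsFormat(const std::string &input) {
    std::size_t pos = 0;
    while (pos < input.size() && std::isspace(static_cast<unsigned char>(input[pos]))) {
        pos++;
    }
    if (pos == input.size()) {
        return CrsFormat::UNKNOWN;
    }
    if (input[pos] == '{') {
        return CrsFormat::PROJJSON;
    }
    // Scan the leading word (authority name or WKT keyword).
    std::size_t word_end = pos;
    while (word_end < input.size() && IsWordChar(input[word_end])) {
        word_end++;
    }
    if (word_end == pos || word_end == input.size()) {
        return CrsFormat::UNKNOWN;
    }
    if (input[word_end] == '[') {
        return CrsFormat::WKT;
    }
    if (input[word_end] == ':') {
        // The code must be non-empty and run to the end of the string.
        std::size_t code_end = word_end + 1;
        while (code_end < input.size() && IsWordChar(input[code_end])) {
            code_end++;
        }
        if (code_end > word_end + 1 && code_end == input.size()) {
            return CrsFormat::AUTH_CODE;
        }
    }
    return CrsFormat::UNKNOWN;
}
```

With this heuristic, `EPSG:4326` is classified as `AUTH_CODE`, while anything starting with `{` is only a PROJJSON candidate until it has actually been parsed as JSON.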

Two parameterized geometry types are treated as equal if the "id" of
their coordinate system string is equal, otherwise we compare the
"name", and finally compare the full string character by character.
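
The comparison order can be sketched as follows. This is one possible reading of the rule above, under the assumption that a field is only compared when both sides have it; `CrsSketch` and `CrsEquals` are hypothetical names and not the types used in the PR.

```cpp
#include <string>

// Hypothetical CRS type parameter: the full string plus the cached fields
// extracted from it. An empty "id" or "name" means it could not be extracted.
struct CrsSketch {
    std::string id;
    std::string name;
    std::string definition;
};

// Compare by id first, then by name, and only then by the full definition.
bool CrsEquals(const CrsSketch &a, const CrsSketch &b) {
    if (!a.id.empty() && !b.id.empty()) {
        return a.id == b.id;
    }
    if (!a.name.empty() && !b.name.empty()) {
        return a.name == b.name;
    }
    return a.definition == b.definition;
}
```

Under this reading, a column declared with the shorthand `EPSG:4326` compares equal to one declared with a full definition whose extracted id is also `EPSG:4326`.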

This PR also updates the parquet extension so that the CRS is propagated
to/from the Parquet type and GeoParquet json metadata.

# What is a coordinate reference system (CRS)?

Because geometries are made up of arbitrary coordinates in planar space,
it's important that a dataset can be associated with a coordinate system
so that you can meaningfully interpret what the coordinates actually
represent. Coordinate reference systems are basically the spatial
equivalent of time-zones in the temporal domain. While it would be
convenient if the whole world always operated on coordinates in degrees
of [longitude, latitude] (or always used UTC for timestamps), in
practice it is very common that geospatial data is provided in a
coordinate system defined for a specific local area of use.

## _Where_ are CRS's stored?

DuckDB (in duckdb-spatial) has previously not provided a built-in way to
associate geometries with their coordinate system, leaving it up to the
user to either keep this sort of metadata in a separate column or track
it "out of band" outside of DuckDB itself. Both solutions are somewhat
cumbersome and error-prone.

Other databases that support geometries tend to do some combination of:

1. Inline a "spatial reference identifier" integer (SRID) into each
geometry _value_, which can be used to perform a look-up in a separate
(system) table that maps SRID's to coordinate system definitions.
2. Keep a separate metadata table/view (e.g. SPATIAL_REF_SYS) which
defines the coordinate reference system for each geometry _column_ in
the database.

For DuckDB it feels natural that the coordinate system would be set for
a whole column, and not per value. But neither option seems suitable as
they rely on specific system tables to be populated and present, and all
coordinate systems to be known and defined _a priori_, which would add
a lot of friction to the more ephemeral and multi-data-source workflows
that DuckDB is typically used for.

Therefore we are instead putting the coordinate system into the geometry
_type_ itself, similar to how _data frame_ libraries tend to treat
geometry columns.

E.g. you can now create a table for a geometry column in the coordinate
system defined by the _European Petroleum Survey Group_ with id `4326`
as:

```sql
CREATE TABLE t1(g GEOMETRY('EPSG:4326'));

INSERT INTO t1 VALUES ('POINT (0 1)');

SELECT g, ST_CRS(g) FROM t1;
----
┌─────────────────────┬───────────┐
│          g          │ st_crs(g) │
│ geometry(epsg:4326) │  varchar  │
├─────────────────────┼───────────┤
│ POINT (0 1)         │ EPSG:4326 │
└─────────────────────┴───────────┘
```

Moving the coordinate system into the type system also has other
benefits like being able to error-out at bind-time if users are
attempting to operate on geometries that belong to two different
coordinate systems, and making it possible to infer the coordinate
system of an arbitrary geometry expression when importing/exporting to
different geospatial data formats, without having to do a separate table
lookup.

E.g. trying to mix geometries with different coordinate systems throws a
binder exception.

```sql
INSERT INTO t1 VALUES (ST_SetCRS('POINT (0 1)', 'EPSG:3857'));
----
Binder Error:
Cannot cast GEOMETRY with CRS 'EPSG:3857' to GEOMETRY with different CRS 'EPSG:4326'
```

However, you can always implicitly cast a geometry column _with_ a CRS
both to and from a column _without_ a CRS, because at the end of the day
CRS is just metadata.

## _How_ are CRS's stored?

Here comes the difficult part. 

The modern data model of a coordinate reference system definition is
conceptually defined in `ISO 19111` but is mostly encoded as strings in
`WKT1`, `WKT2` and `PROJJSON` format. These strings can be quite large
and unwieldy for users to pass around directly, which is why e.g. PostGIS
uses integer IDs (SRIDs) to identify coordinate systems within a
Postgres installation instead. Coordinate systems are also commonly
referred to by users using a shorthand "AUTH:CODE" format, like e.g.
`EPSG:4326` in the example above.

Similar to integer SRIDs, the `AUTH:CODE` shorthand is lossy in that
it's not possible to extract the actual "definition" of the coordinate
system without having a separate database or table to identify the auth
code. Since `duckdb-spatial` embeds the `PROJ` library and accompanying
database, it can be used to resolve almost any `AUTH:CODE` (or any
arbitrary string really) to a complete coordinate system definition.

Therefore, it makes a lot of sense for us to simply allow any sort of
string in the CRS field of a geometry type. This is also how the (new)
Parquet geometry spec, as well as Iceberg and Delta, deal with this
problem. Just store a string and interpret it when needed, possibly by
referring to some external data source (e.g. PROJ database, or table
property). Since core DuckDB doesn't need to perform any coordinate
transformations, why would we need the actual full definition of a
coordinate system?

Well, the problem arises when external formats that we may want to
import/export to do have requirements on the format of the CRS.
GeoParquet (v1) requires that the CRS is provided in PROJJSON format,
GPKG (The SQLite geo-format) requires WKT1 (or WKT2), and PostGIS
requires WKT1. Therefore, this currently raises an error, and is kinda
bad UX (IMO):

```sql
-- Create a table with some random AUTH:CODE CRS
CREATE TABLE t1 (g GEOMETRY('DUCKDB:1337'));
...
COPY t1 TO 'test_random_crs.parquet';
----
Invalid Input Error: Cannot write GeoParquet V1 metadata for column 'g': GeoParquet only supports PROJJSON CRS definitions
```

While the following is ok, but also comes with its own shitty UX (having
to pass the full PROJJSON string in the type definition)

```sql
CREATE TABLE t1 (g GEOMETRY('{
  "$schema": "https://proj.org/schemas/v0.7/projjson.schema.json",
  "type": "ProjectedCRS",
  "name": "WGS 84 / Pseudo-Mercator",
  "base_crs": {
  ...
  200 more lines...
  ...
}');

-- Success!
COPY t1 TO 'test_ok_crs.parquet';
```

# Future work 

In practice `spatial` can solve almost all of the aforementioned
problems thanks to the `PROJ` library, but we can't (and don't want to)
always assume the `spatial` extension is loaded.

Therefore in a future PR the plan is to introduce a
`CoordinateReferenceSystemUtil` class that by default can "identify" and
expand some of the most common coordinate systems from their auth code
to a full projjson definition.

The implementation of the `CoordinateReferenceSystemUtil` can then be
overridden by `spatial` when it is loaded. We may also want to consider
splitting out the `PROJ` part from spatial into its own smaller
extension that can be auto-loaded on demand, much like how e.g. the
`icu` extension is auto-loaded for timezones/collations.

One open question that I'm still not sure about is whether we want to
"eagerly" try to identify/expand auth-codes/user input to PROJJSON on
e.g. table creation or dataset import, or whether we fully defer any
interpretation lazily "until needed" when exporting to an external
dataset that imposes some requirements on the CRS string. Maybe we want
to make this configurable through a setting.
Mytherin added a commit that referenced this pull request Jan 15, 2026
This is a followup PR that builds on top of
#19848 (although orthogonal to
#20143). Please have a look at
#19136 for the context behind this
PR.

This PR enables support for "shredding" geometry columns in our storage
format by internally decomposing the geometries into multiple separate
column segments when all the rows within a row-group are of the same
geometry sub-type. For example, if a row group only contains `POINT`
(XY) geometries, the column segment for that row-group gets rewritten
and stored internally as `STRUCT(x DOUBLE, y DOUBLE)` instead of `BLOB`
when checkpointing. This encoding is similar to how GeoArrow encodes
typed geometries.

The major benefit of "decomposing" or "shredding" geometries from blobs
into columnar lists and structs of doubles like this at the
storage level is that they compress _significantly_ better in the
decomposed form. By decomposing the geometries into column segments of
primitive types we automatically benefit from DuckDB's existing (and
future!) best-in-class adaptive compression algorithms for each type of
component. E.g. the coordinate vertices get ALP-compressed, and polygon
ring-offsets get RLE/delta/dictionary encoded automatically based on the
distribution of the data.

In my (limited) testing, a fully shredded column is about half the size
of the blob equivalent, i.e. shredding results in a 2x compression
ratio. This means that you get much faster reads from disk, and will be
able to keep more data in memory (as we now also compress in-memory
segments).

There are currently three caveats:
- We don't shred segments that contain `GEOMETRYCOLLECTION`s, as they
can be recursive and don't have a fixed layout.
- We only shred if the _entire_ row-group is of a single geometry
sub-type. So if you have 10000 points, insert a linestring and
checkpoint, the column segment will be reverted back to physical type
`BLOB`.
- We don't shred segments containing EMPTY geometries, either at the
root or in sub-geometries.
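
The eligibility rules above, plus the `POINT` case, can be sketched like this. All names (`GeomKind`, `GeomInfo`, `CanShred`, `PointColumns`, `ShredPoints`) are hypothetical and the sketch ignores NULLs and vertex dimensions other than XY; it is meant to illustrate the rules, not to mirror the actual `GeoColumnData` code. Treating an empty row group as non-shreddable is also an assumption.

```cpp
#include <cstdint>
#include <vector>

// WKB geometry type codes.
enum class GeomKind : uint8_t {
    POINT = 1,
    LINESTRING = 2,
    POLYGON = 3,
    MULTIPOINT = 4,
    MULTILINESTRING = 5,
    MULTIPOLYGON = 6,
    GEOMETRYCOLLECTION = 7
};

// Hypothetical per-row summary: the geometry sub-type and whether the value
// is EMPTY at the root or contains an EMPTY sub-geometry.
struct GeomInfo {
    GeomKind kind;
    bool has_empty;
};

// A row group is shreddable only if every row has the same sub-type, that
// sub-type is not GEOMETRYCOLLECTION, and no row contains an EMPTY geometry.
bool CanShred(const std::vector<GeomInfo> &row_group) {
    if (row_group.empty()) {
        return false; // assumption: nothing to gain from shredding an empty row group
    }
    const GeomKind first = row_group.front().kind;
    if (first == GeomKind::GEOMETRYCOLLECTION) {
        return false;
    }
    for (const GeomInfo &info : row_group) {
        if (info.kind != first || info.has_empty) {
            return false;
        }
    }
    return true;
}

struct PointXY {
    double x, y;
};

// Shredded layout for points: one column per axis, like STRUCT(x DOUBLE, y DOUBLE).
struct PointColumns {
    std::vector<double> x;
    std::vector<double> y;
};

PointColumns ShredPoints(const std::vector<PointXY> &points) {
    PointColumns columns;
    columns.x.reserve(points.size());
    columns.y.reserve(points.size());
    for (const PointXY &p : points) {
        columns.x.push_back(p.x);
        columns.y.push_back(p.y);
    }
    return columns;
}
```

Once the coordinates sit in plain `double` columns like this, each axis can be compressed independently by whatever method fits its value distribution.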

We may want to consider "partial" shredding in the future, where we keep
multiple shredded segments around. But I'm not sure if the
performance/complexity hit would be worth it.

Despite those limitations, geometry shredding still opens up some
interesting future optimization opportunities:

- Push down certain filters (like bounding-box-intersections) much
deeper into the storage as they are faster to evaluate on separate
coordinate-arrays axis-by-axis.
- Emit "shredded" geometries directly from storage when casting or
exporting to GeoArrow.
- Or if the optimizer realizes that all function expressions that read
from a fully shredded column have overloads that take the shredded
representation instead (e.g. spatial's `POINT_2D`), replace all
functions (or insert casts where needed) and emit the shredded
representation from storage to "specialize" queries automatically.

As it stands, in the case where there is no shredding, there is no
additional storage overhead. In other words, a non-shredded geometry
column is serialized exactly the same way as it was before this PR.

I've made some changes to how `ColumnData` is serialized though. There
is now an abstract `unique_ptr<ExtraPersistentColumnData>` in the
`PersistentColumnData` struct that separates out the extra info required
by the `VariantColumnData` and `GeoColumnData`. As they are the only two
column data types where the type/layout can differ between segments
within the same column, they need to store some information to
reconstruct the layout and can't just rely on the top-level column
logical type. In the case of the variant, this is the "shredded" type,
but since geometries have a fixed number of layouts, we only store the
geometry and vertex-sub type (enums) instead of the equivalent logical
type to save space. I expect that we may want to generalize this further
in the future if we start implementing shredding to other types as well
(e.g. strings/json).
d-justen pushed a commit to d-justen/duckdb that referenced this pull request Jan 19, 2026
Mytherin added a commit that referenced this pull request Feb 4, 2026
…20721)

This is a followup PR that builds on top of
#20143, please have a look at
#19136 for the context behind this
PR.

This PR makes additional changes to how coordinate systems are handled
for the `GEOMETRY` type.

## Shrinking, Expansion, and Identification of coordinate systems

In the initial iteration of parameterizing geometry types with
coordinate systems, we basically allowed any string to be stored as the
CRS, and then tried to parse and identify the format (projjson,
wkt2:2019, auth:code, srid) before extracting a "name" or "identifier"
which we stored separately to use when printing the type.

This has the major downside that the textual representation of a
geometry type (or SQL schema containing geometry types) no longer
round-trips. I.e. if you parse it back, you no longer get the same type.
This is primarily a problem when doing an `EXPORT DATABASE`, `SUMMARIZE`
or calling `.schema` in the shell. However, the alternative of always
printing the full definition is also... untenable as it makes the SQL
extremely unfriendly to read.

The compromise implemented in this PR is to always print what's actually
stored in the type info, _but_ also try to "shrink" the actual CRS
definition to e.g. its `auth:code` when parsing a CRS, _if_ the
definition is a CRS that we recognize (and should therefore be able to
"expand" into a full definition again later).

As an example: 
- If we read a GeoParquet file, which has the default coordinate system
"CRS84" in projjson format, we only store "OGC:CRS84" in the geometry
type, because DuckDB knows how to convert "OGC:CRS84" back into the
projjson (or WKT2) later if needed.
- But if we read a GeoParquet file with some other unrecognized CRS
definition (e.g. "SOME_ORG:1337", but in projjson) we store the full
projjson, and simply live with the fact that the type will be hideous to
display.
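
The shrink/expand round trip from the two cases above could be sketched like this. `CrsRegistrySketch` is a hypothetical class, and it matches definitions by exact string equality purely to keep the example short; the actual implementation presumably recognizes a definition by parsing it. The definition string used in any test or example here is a placeholder, not real PROJJSON.

```cpp
#include <map>
#include <string>

// Hypothetical registry of "known" coordinate systems.
class CrsRegistrySketch {
public:
    // Register a short code (e.g. "OGC:CRS84") together with its full definition.
    void Register(const std::string &code, const std::string &definition) {
        code_to_definition_[code] = definition;
        definition_to_code_[definition] = code;
    }

    // Shrink: replace a recognized full definition with its short code.
    // Unrecognized input is returned unchanged, so nothing is ever lost.
    std::string Shrink(const std::string &crs) const {
        auto it = definition_to_code_.find(crs);
        return it == definition_to_code_.end() ? crs : it->second;
    }

    // Expand: look up the full definition for a short code.
    // Returns false if the code is not known to this registry.
    bool TryExpand(const std::string &code, std::string &definition) const {
        auto it = code_to_definition_.find(code);
        if (it == code_to_definition_.end()) {
            return false;
        }
        definition = it->second;
        return true;
    }

private:
    std::map<std::string, std::string> code_to_definition_;
    std::map<std::string, std::string> definition_to_code_;
};
```

The key property is that shrinking is only applied when expansion is guaranteed to work later, so the short form stored in the type never loses information.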

We also now throw an error by default if we try to create a geometry
type with an _incomplete_ unrecognized CRS, i.e. an auth:code or opaque
identifier. We always allow PROJJSON or WKT2 definitions even if we
don't recognize them, as they are complete in the sense that they can be
interpreted on their own, but we don't shrink them if we don't know
them. This handling of unrecognized coordinate system identifiers can be
controlled with the `ignore_unknown_crs` setting.

This means that you can still just pass around complete projjson or wkt2
definitions and deal with the ugliness if you really want to use your
own custom coordinate systems, but in practice 99.9% of coordinate
systems will be recognized by `spatial`.

While you can't define your own "known" coordinate systems through SQL,
you can do it through your own extension (or application that embeds
DuckDB) by providing instances of the new `CoordinateSystemCatalogEntry`
in the system catalog.

## Coordinate System Catalog Entries

There is now a new type of catalog entry to store coordinate system
definitions, the `CoordinateSystemCatalogEntry`. These can be registered
by extensions to provide additional coordinate system definitions. For
example, the `spatial` extension now registers its list of EPSG and
OGC-defined coordinate systems by lazily pulling them from the embedded
`PROJ` library.

But this PR also adds "OGC:CRS84" and "OGC:CRS83" definitions in core.
This list of built-in definitions may or may not be extended in the
future. Or we may create a separate dedicated extension that only
supplies coordinate system definitions (similar to `icu` and
`encodings`).

## Support for CRS propagation through (Geo)Arrow import/export

This PR also adds support for propagating the CRS when
exporting/importing from (Geo)Arrow. I had to make some changes to
drill-down the client context into the arrow extension code, but we
always have it available when resolving extension types anyway so the
changes only really touch the internals.

A nice consequence of this is that `spatial`'s `GDAL` integration
automatically handles CRS propagation now too, as it's based on Arrow,
meaning that `ST_Read()` outputs `GEOMETRY` columns with the CRS
specified by the underlying file, and `COPY ... TO (FORMAT GDAL)` also
encodes the CRS properly.

## Update `spatial` to v1.5 Branch

This PR also adds back and bumps spatial to the v1.5 branch.
4 participants