`GEOMETRY` Rework: Part 1 - Logical Type #19136
Merged
hannes merged 6 commits into duckdb:main on Sep 30, 2025
Collaborator
Thanks! Could you merge with main again?
Member
Author
@Mytherin Done!
Member
Thanks!
hannes
added a commit
that referenced
this pull request
Oct 8, 2025
This is a followup PR that builds on top of #19136. Please have a look at #19136 for the context behind this PR.

This PR adds a new statistics type, `GeometryStatistics`, which keeps track of the spatial extent and the list of geometry subtypes within a column. This can be used in the future to push down certain spatial filters into storage to greatly accelerate data retrieval by skipping row groups that can't intersect with a spatial predicate, or won't contain geometries of a specific type. It can also be used when writing to storage to select specialized compression methods, or at planning time to swap e.g. scalar functions with implementations specialized for certain geometry layouts.

The parquet reader already had its own notion of "Geometry Stats", introduced in #18832, but this has now been unified to use the same structs/logic as the one in core, which significantly reduces the spatial-specific code in the parquet extension.

Following the roadmap outlined in #19136, the next PR will actually add support for filter pushdown of bounding-box intersection queries, by implementing the `&&` intersection operator and using these new stats to prune row groups in our storage. I'll also try to fix up the parquet reader while I'm at it so that it can output the new `GEOMETRY` type without requiring `spatial`.
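To make the idea of "extent plus set of subtypes" concrete, here is a minimal sketch of how such per-column statistics could be accumulated and merged. The names `SketchGeomType`, `SketchExtent` and `SketchGeomStats` are hypothetical and are not the actual `GeometryStatistics` class from this PR; the type ids and the bitmask encoding are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical, simplified stand-in for the statistics described above.
// This is NOT DuckDB's actual GeometryStatistics class.
enum class SketchGeomType : uint8_t { POINT = 1, LINESTRING = 2, POLYGON = 3 };

struct SketchExtent {
    // Starts out "inverted" so that the first vertex always tightens it.
    double min_x = std::numeric_limits<double>::infinity();
    double min_y = std::numeric_limits<double>::infinity();
    double max_x = -std::numeric_limits<double>::infinity();
    double max_y = -std::numeric_limits<double>::infinity();
};

struct SketchGeomStats {
    SketchExtent extent;    // bounding box of every vertex seen so far
    uint32_t type_mask = 0; // bit i is set if a geometry with type id i was seen

    // Record one vertex belonging to a geometry of the given type.
    void Update(SketchGeomType type, double x, double y) {
        assert(static_cast<uint8_t>(type) < 32); // must fit into the bitmask
        extent.min_x = std::min(extent.min_x, x);
        extent.min_y = std::min(extent.min_y, y);
        extent.max_x = std::max(extent.max_x, x);
        extent.max_y = std::max(extent.max_y, y);
        type_mask |= 1u << static_cast<uint8_t>(type);
    }

    // Combine with the statistics of another segment or row group.
    void Merge(const SketchGeomStats &other) {
        extent.min_x = std::min(extent.min_x, other.extent.min_x);
        extent.min_y = std::min(extent.min_y, other.extent.min_y);
        extent.max_x = std::max(extent.max_x, other.extent.max_x);
        extent.max_y = std::max(extent.max_y, other.extent.max_y);
        type_mask |= other.type_mask;
    }

    // False means "definitely absent", so the data can be skipped.
    bool MayContain(SketchGeomType type) const {
        return (type_mask & (1u << static_cast<uint8_t>(type))) != 0;
    }
};
```

Both the extent and the type mask only ever grow under `Merge`, which is what makes them safe to use for skipping: a "no" answer is always conservative.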
paleolimbot
reviewed
Oct 9, 2025
Comment on lines +444 to +446
if (byte_order != 1) {
    throw InvalidInputException("Unsupported byte order %d in WKB", byte_order);
}
Contributor
Perhaps you rewrite WKB when reading from Parquet before it gets here, but unfortunately the default in JTS is big endian, which resulted in the widely used Overture Maps data being published as big-endian WKB.
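For reference, a reader that accepts both byte orders only needs to branch on the first byte of the WKB header. The sketch below is a hypothetical standalone helper (`ReadWkbTypeCode` is not a function from this PR) and assumes the standard WKB layout: one byte-order flag (0 = big endian, 1 = little endian) followed by a 4-byte unsigned geometry type code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical helper, not DuckDB's actual WKB reader.
// Decodes the geometry type code from a WKB header in either byte order.
inline uint32_t ReadWkbTypeCode(const uint8_t *data, size_t size) {
    assert(data != nullptr);
    if (size < 5) {
        throw std::invalid_argument("WKB header is truncated");
    }
    const uint8_t byte_order = data[0];
    if (byte_order > 1) {
        throw std::invalid_argument("Invalid WKB byte order flag");
    }
    uint32_t type_code = 0;
    for (size_t i = 0; i < 4; i++) {
        // Little endian: byte i is the i-th least significant byte.
        // Big endian: byte i is the i-th most significant byte.
        const size_t shift = (byte_order == 1) ? 8 * i : 8 * (3 - i);
        type_code |= static_cast<uint32_t>(data[1 + i]) << shift;
    }
    return type_code;
}
```

Assembling the value byte by byte keeps the helper independent of the host's own endianness, at the cost of being slower than a plain `memcpy` on the fast path.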
Comment on lines +488 to +489
;
writer.Write(flag_str);
Comment on lines +76 to +78
(4, 'LINESTRING Z (0 0 0, 1 1 1, 2 2 2)'),
(5, 'LINESTRING M (0 0 0, 1 1 1, 2 2 2)'),
(6, 'LINESTRING ZM (0 0 0 0, 1 1 1 1, 2 2 2 2)'),
Contributor
These test cases would be slightly better if the Z and M values were different (I have personally been burned by mixing these up and having a test case that wasn't able to catch it).
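To illustrate the point with a toy example: the reader below is deliberately buggy and swaps Z and M. All names (`VertexZM`, `ReadSwapped`, `Matches`) are made up for this sketch and have nothing to do with the code under review.

```cpp
#include <array>
#include <cassert>

struct VertexZM {
    double x, y, z, m;
};

// Deliberately buggy reader for illustration: the third ordinate ends up
// in m and the fourth in z, i.e. Z and M are swapped.
inline VertexZM ReadSwapped(const std::array<double, 4> &ordinates) {
    return VertexZM{ordinates[0], ordinates[1], ordinates[3], ordinates[2]};
}

// Compares a parsed vertex against the expected x, y, z, m ordinates.
inline bool Matches(const VertexZM &v, const std::array<double, 4> &expected) {
    return v.x == expected[0] && v.y == expected[1] && v.z == expected[2] && v.m == expected[3];
}
```

With input `{1, 1, 1, 1}` the swapped reader still matches the expectation, so the bug goes unnoticed; with `{1, 2, 3, 4}` the comparison fails and the bug is caught.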
krlmlr
added a commit
to krlmlr/duckdb-r
that referenced
this pull request
Oct 21, 2025
`GEOMETRY` Rework: Part 1 - Logical Type (duckdb/duckdb#19136)
lnkuiper
added a commit
that referenced
this pull request
Oct 27, 2025
This is a followup PR that builds on top of #19203. Please have a look at #19136 for the context behind this PR.

In #19203 we added support for storing geometry statistics. In this PR we now add a `&&` bounding-box-intersection binary operator that, when used in a filter, can be pushed down into storage and prune row-groups based on these new geometry statistics. Geometry-filter pushdown works for our own storage format, as well as parquet (if the parquet file contains parquet-native geometry stats).

I've also cleaned up the parquet extension further and removed all spatial-dependent code, so that all geoparquet related functionality now works without requiring spatial to be loaded, which simplifies both the read and write-path for geometries significantly. It also enables us to run the geoparquet tests in CI when spatial isn't available.

I've also added a couple of basic geometry scalar functions to convert to/from WKB and WKT. These will be more fleshed out in the future (to e.g. handle big-endian WKB and specify WKT precision).
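The pruning step can be sketched as a plain bounding-box overlap test against each row group's stored extent. `SketchBox`, `BoxesIntersect` and `RowGroupsToScan` are hypothetical names, and treating boxes that merely touch as intersecting is an assumption of this sketch, not a statement about how `&&` behaves in DuckDB.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct SketchBox {
    double min_x, min_y, max_x, max_y;
};

// True if the two boxes share at least one point.
// Assumption: boxes that only touch along an edge count as intersecting.
inline bool BoxesIntersect(const SketchBox &a, const SketchBox &b) {
    return a.min_x <= b.max_x && b.min_x <= a.max_x &&
           a.min_y <= b.max_y && b.min_y <= a.max_y;
}

// Returns the indices of the row groups whose extent may intersect the query
// box. Every other row group can be skipped without reading any geometry.
inline std::vector<size_t> RowGroupsToScan(const std::vector<SketchBox> &extents,
                                           const SketchBox &query) {
    std::vector<size_t> result;
    for (size_t i = 0; i < extents.size(); i++) {
        if (BoxesIntersect(extents[i], query)) {
            result.push_back(i);
        }
    }
    return result;
}
```

The extent is only a conservative summary, so surviving row groups still have to be filtered row by row; the statistics only decide what can be skipped.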
krlmlr
added a commit
to krlmlr/duckdb-r
that referenced
this pull request
Nov 1, 2025
`GEOMETRY` Rework: Part 1 - Logical Type (duckdb/duckdb#19136)
krlmlr
added a commit
to krlmlr/duckdb-r
that referenced
this pull request
Nov 2, 2025
`GEOMETRY` Rework: Part 1 - Logical Type (duckdb/duckdb#19136)
Mytherin
added a commit
that referenced
this pull request
Nov 7, 2025
…rt (#19476)

This is a followup PR that builds on top of #19439. Please have a look at #19136 for the context behind this PR.

This PR fixes up the remaining issues in the parquet extension related to geometries. When reading geometry columns we now push an expression column reader on top of the underlying blob column reader to perform the WKB parsing with `ST_GeomFromWKB`. `ST_GeomFromWKB` now actually checks that the input is valid WKB and also converts from big-endian WKB to little-endian if required. This can be optimized further, but it's good enough for now.

I've also added support for converting geometry columns to/from arrow arrays with geoarrow extension metadata. This code is basically lifted [straight from the spatial extension](https://github.com/duckdb/duckdb-spatial/blob/v1.4-andium/src/spatial/spatial_geoarrow.cpp).
Mytherin
added a commit
that referenced
this pull request
Nov 20, 2025
…19848)

This is a followup PR that builds on top of #19476. Please have a look at #19136 for the context behind this PR.

I realized that the `Geometry::FromBinary`/`Geometry::ToBinary` helper functions need to be adjusted slightly so that they can be used to implement the cast functions provided in `duckdb-spatial`. These casts may move to core eventually, but for now this is required to integrate the spatial extension with the new geometry type smoothly.
lnkuiper
added a commit
that referenced
this pull request
Dec 30, 2025
This is a followup PR that builds on top of #19848. Please have a look at #19136 for the context behind this PR.

This PR adds initial support for parameterizing the `GEOMETRY` type by a _coordinate reference system_, along with a pair of `ST_SetCRS` and `ST_CRS` scalar functions to set/get this parameter at the type level.

The coordinate reference system type parameter is an arbitrary string, but we try to parse it to identify if it is in WKT2, PROJJSON or AUTH:CODE format. If it is, we extract and cache the "id" and "name" fields, and use them instead of the full string when printing the type name. When parsing PROJJSON or WKT2 we just verify that the PROJJSON is valid json, and that WKT2 is syntactically valid... in whatever encoding it uses. We don't inspect if the keywords/fields other than those required to extract the name and id are set correctly.

Two parameterized geometry types are treated as equal if the "id" of their coordinate system string is equal, otherwise we compare the "name", and finally compare the full string character by character.

This PR also updates the parquet extension so that the CRS is propagated to/from the Parquet type and GeoParquet json metadata.

# What is a coordinate reference system (CRS)?

Because geometries are made up of arbitrary coordinates in planar space, it's important that a dataset can be associated with a coordinate system so that you can meaningfully interpret what the coordinates actually represent. Coordinate reference systems are basically the spatial equivalent of time-zones in the temporal domain. While it would be convenient if the whole world always operated on coordinates in degrees of [longitude, latitude] (or always used UTC for timestamps), in practice it is very common that geospatial data is provided in a coordinate system defined for a specific local area of use.

## _Where_ are CRS's stored?

DuckDB (in duckdb-spatial) has previously not provided a built-in way to associate geometries with their coordinate system, leaving it up to the user to either keep this sort of metadata in a separate column or track it "out of band" outside of duckdb itself. Both solutions are somewhat cumbersome and error prone.

Other databases that support geometries tend to do some combination of:

1. Inline a "spatial reference identifier" integer (SRID) into each geometry _value_, which can be used to perform a look-up in a separate (system) table that maps SRID's to coordinate system definitions.
2. Keep a separate metadata table/view (e.g. SPATIAL_REF_SYS) which defines the coordinate reference system for each geometry _column_ in the database.

For DuckDB it feels natural that the coordinate system would be set for a whole column, and not per value. But neither option seems suitable as they rely on specific system tables to be populated and present, and all coordinate systems to be known and defined _a priori_, which would add a lot of friction to the more ephemeral and multi-data-source workflows that DuckDB is typically used for.

Therefore we are instead putting the coordinate system into the geometry _type_ itself, similar to how _data frame_ libraries tend to treat geometry columns. E.g. you can now create a table for a geometry column in the coordinate system defined by the _European Petroleum Survey Group_ with id `4326` as:

```sql
CREATE TABLE t1(g GEOMETRY('EPSG:4326'));
INSERT INTO t1 VALUES ('POINT (0 1)');
SELECT g, ST_CRS(g) FROM t1;
----
┌─────────────────────┬───────────┐
│          g          │ st_crs(g) │
│ geometry(epsg:4326) │  varchar  │
├─────────────────────┼───────────┤
│ POINT (0 1)         │ EPSG:4326 │
└─────────────────────┴───────────┘
```

Moving the coordinate system into the type system also has other benefits like being able to error-out at bind-time if users are attempting to operate on geometries that belong to two different coordinate systems, and making it possible to infer the coordinate system of an arbitrary geometry expression when importing/exporting to different geospatial data formats, without having to do a separate table lookup.

E.g. trying to mix geometries with different coordinate systems throws a binder exception.

```sql
INSERT INTO t1 VALUES (ST_SetCRS('POINT (0 1)', 'EPSG:3857'));
----
Binder Error: Cannot cast GEOMETRY with CRS 'EPSG:3857' to GEOMETRY with different CRS 'EPSG:4326'
```

However, you can always implicitly cast a geometry column _with_ a CRS both to and from a column _without_ a CRS, because at the end of the day CRS is just metadata.

## _How_ are CRS's stored?

Here comes the difficult part. The modern data model of a coordinate reference system definition is conceptually defined in `ISO 19111` but is mostly encoded as strings in `WKT1`, `WKT2` and `PROJJSON` format. These strings can be quite large and unwieldy for users to pass around directly, which is why e.g. PostGIS uses integer ids (SRID's) to identify coordinate systems within a Postgres installation instead.

Coordinate systems are also commonly referred to by users using a shorthand "AUTH:CODE" format, like e.g. `EPSG:4326` in the example above. Similar to integer SRID's, the `AUTH:CODE` shorthand is lossy in that it's not possible to extract the actual "definition" of the coordinate system without having a separate database or table to identify the auth code.

Since `duckdb-spatial` embeds the `PROJ` library and accompanying database, it can be used to resolve almost any `AUTH:CODE` (or any arbitrary string really) to a complete coordinate system definition. Therefore, it makes a lot of sense for us to simply allow any sort of string in the CRS field of a geometry type. This is also how the (new) parquet geometry spec, as well as iceberg and delta, deal with this problem. Just store a string and interpret it when needed, possibly by referring to some external data source (e.g. PROJ database, or table property).

Since core DuckDB doesn't need to perform any coordinate transformations, why would we need the actual full definition of a coordinate system? Well, the problem arises when external formats that we may want to import/export to do have requirements on the format of the CRS. GeoParquet (v1) requires that the CRS is provided in PROJJSON format, GPKG (the SQLite geo-format) requires WKT1 (or WKT2), and PostGIS requires WKT1.

Therefore, this currently raises an error, and is kinda bad UX (IMO):

```sql
-- Create a table with some random AUTH:CODE CRS
CREATE TABLE t1 (g GEOMETRY('DUCKDB:1337'));
...
COPY t1 TO 'test_random_crs.parquet';
----
Invalid Input Error: Cannot write GeoParquet V1 metadata for column 'g': GeoParquet only supports PROJJSON CRS definitions
```

While the following is ok, but also comes with its own shitty UX (having to pass the full PROJJSON string in the type definition):

```sql
CREATE TABLE t1 (g GEOMETRY('{
  "$schema": "https://proj.org/schemas/v0.7/projjson.schema.json",
  "type": "ProjectedCRS",
  "name": "WGS 84 / Pseudo-Mercator",
  "base_crs": {
    ... 200 more lines...
    ...
}'));
-- Success!
COPY t1 TO 'test_ok_crs.parquet';
```

# Future work

In practice `spatial` can solve almost all of the aforementioned problems thanks to the `PROJ` library, but we can't (and don't want to) always assume the `spatial` extension is loaded. Therefore in a future PR the plan is to introduce a `CoordinateReferenceSystemUtil` class that by default can "identify" and expand some of the most common coordinate systems from their auth code to a full projjson definition. The implementation of the `CoordinateReferenceSystemUtil` can then be overridden by `spatial` when it is loaded. We may also want to consider splitting out the `PROJ` part from spatial into its own smaller extension that can be auto-loaded on demand, much like how e.g. the `icu` extension is auto-loaded for timezones/collations.

One open question I'm still not sure about is whether we want to "eagerly" try to identify/expand auth-codes/user input to PROJJSON on e.g. table creation or dataset import, or if we fully defer any interpretation lazily "until needed" when exporting to an external dataset that imposes some requirements on the CRS string. Maybe we want to make this configurable through a setting.
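The type-equality rule described above (compare "id", then "name", then the full string) can be sketched as follows. `SketchCrs` and `SameCrs` are hypothetical names, and the interpretation that a field is only compared when both sides managed to extract it is an assumption of this sketch rather than something the PR text spells out.

```cpp
#include <cassert>
#include <string>

// Hypothetical struct. The cached fields are left empty when they could not
// be extracted from the user-provided definition string.
struct SketchCrs {
    std::string definition; // the full string stored in the type
    std::string id;         // e.g. "EPSG:4326"
    std::string name;       // e.g. "WGS 84"
};

// Assumption: a field is only used for comparison if both sides have it.
inline bool SameCrs(const SketchCrs &a, const SketchCrs &b) {
    if (!a.id.empty() && !b.id.empty()) {
        return a.id == b.id;
    }
    if (!a.name.empty() && !b.name.empty()) {
        return a.name == b.name;
    }
    // Last resort: compare the full definitions character by character.
    return a.definition == b.definition;
}
```

Under this rule a column declared with the short `EPSG:4326` form and one carrying a full PROJJSON definition with the same id compare as the same type, even though the stored strings differ.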
Mytherin
added a commit
that referenced
this pull request
Jan 15, 2026
This is a followup PR that builds on top of #19848 (although orthogonal to #20143). Please have a look at #19136 for the context behind this PR.

This PR enables support for "shredding" geometry columns in our storage format by internally decomposing the geometries into multiple separate column segments when all the rows within a row-group are of the same geometry sub-type.

For example, if a row group only contains `POINT` (XY) geometries, the column segment for that row-group gets rewritten and stored internally as `STRUCT(x DOUBLE, y DOUBLE)` instead of `BLOB` when checkpointing. This encoding is similar to how GeoArrow encodes typed geometries.

The major benefit of "decomposing" or "shredding" geometries from blobs into columnar lists and structs of doubles like this at the storage level is that they compress _significantly_ better in the decomposed form. By decomposing the geometries into column segments of primitive types we automatically benefit from DuckDB's existing (and future!) best-in-class adaptive compression algorithms for each type of component. E.g. the coordinate vertices get ALP-compressed, and polygon ring-offsets get RLE/delta/dictionary encoded automatically based on the distribution of the data.

In my (limited) testing, a fully shredded column is about half the size of the blob-equivalent, i.e. shredding results in a 2x compression ratio. This means that you get much faster reads from disk, and will be able to keep more data in memory (as we now also compress in-memory segments).

There are currently three caveats:

- We don't shred segments that contain `GEOMETRYCOLLECTION`s, as they can be recursive and don't have a fixed layout.
- We only shred if the _entire_ row-group is of a single geometry sub-type. So if you have 10000 points, insert a linestring and checkpoint, the column segment will be reverted back to physical type `BLOB`.
- We don't shred segments containing EMPTY geometries, either at the root or in sub-geometries.

We may want to consider "partial" shredding in the future, where we keep multiple shredded segments around. But I'm not sure if the performance/complexity hit would be worth it.

Despite those limitations, geometry shredding still opens up some interesting future optimization opportunities:

- Push down certain filters (like bounding-box-intersections) much deeper into the storage as they are faster to evaluate on separate coordinate-arrays axis-by-axis.
- Emit "shredded" geometries directly from storage when casting or exporting to GeoArrow.
- Or if the optimizer realizes that all function expressions that read from a fully shredded column have overloads that take the shredded representation instead (e.g. spatial's `POINT_2D`), replace all functions (or insert casts where needed) and emit the shredded representation from storage to "specialize" queries automatically.

As it stands, in the case where there is no shredding, there is no additional storage overhead. In other words, a non-shredded geometry column is serialized exactly the same way as it used to be (before this PR lands).

I've made some changes to how `ColumnData` is serialized though. There is now an abstract `unique_ptr<ExtraPersistentColumnData>` in the `PersistentColumnData` struct that separates out the extra info required by the `VariantColumnData` and `GeoColumnData`. As they are the only two column data types where the type/layout can differ between segments within the same column, they need to store some information to reconstruct the layout and can't just rely on the top-level column logical type. In the case of the variant, this is the "shredded" type, but since geometries have a fixed number of layouts, we only store the geometry and vertex-sub type (enums) instead of the equivalent logical type to save space. I expect that we may want to generalize this further in the future if we start implementing shredding to other types as well (e.g. strings/json).
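As a rough illustration of the all-or-nothing rule for the simplest case, the sketch below tries to decompose a row group of XY points into separate `x` and `y` columns and gives up as soon as it meets anything else. `SketchSubType`, `SketchGeom`, `PointColumns` and `TryShredPoints` are hypothetical names; the real implementation works on column segments and also handles the other geometry sub-types.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class SketchSubType : uint8_t { POINT, LINESTRING, POLYGON, GEOMETRYCOLLECTION };

// Hypothetical in-memory geometry, used only for this sketch.
struct SketchGeom {
    SketchSubType type;
    bool is_empty;
    std::vector<double> xy; // interleaved x0, y0, x1, y1, ...
};

// Columnar form of a row group of points, akin to STRUCT(x DOUBLE, y DOUBLE).
struct PointColumns {
    std::vector<double> x;
    std::vector<double> y;
};

// Returns true and fills `out` only if every row is a non-empty XY point.
// Otherwise the row group would stay in its blob form and `out` is untouched.
inline bool TryShredPoints(const std::vector<SketchGeom> &row_group, PointColumns &out) {
    PointColumns result;
    for (const auto &geom : row_group) {
        if (geom.type != SketchSubType::POINT || geom.is_empty || geom.xy.size() != 2) {
            return false; // mixed sub-types or EMPTY geometries: do not shred
        }
        result.x.push_back(geom.xy[0]);
        result.y.push_back(geom.xy[1]);
    }
    out = result;
    return true;
}
```

Once the coordinates sit in plain `double` columns, each column can be handed to whatever compression suits its value distribution, which is where the size reduction described above would come from.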
d-justen
pushed a commit
to d-justen/duckdb
that referenced
this pull request
Jan 19, 2026
Mytherin
added a commit
that referenced
this pull request
Feb 4, 2026
…20721)

This is a followup PR that builds on top of #20143. Please have a look at #19136 for the context behind this PR.

This PR makes additional changes to how coordinate systems are handled for the `GEOMETRY` type.

## Shrinking, Expansion, and Identification of coordinate systems

In the initial iteration of parameterizing geometry types with coordinate systems, we basically allowed any string to be stored as the CRS, and then tried to parse and identify the format (projjson, wkt2:2019, auth:code, srid) before extracting a "name" or "identifier" which we stored separately to use when printing the type.

This has the major downside that the textual representation of a geometry type (or SQL schema containing geometry types) no longer round-trips. I.e. if you parse it back, you no longer get the same type. This is primarily a problem when doing an `EXPORT DATABASE`, `SUMMARIZE` or calling `.schema` in the shell. However, the alternative of always printing the full definition is also... untenable as it makes the SQL extremely unfriendly to read.

The compromise implemented in this PR is to always print what's actually stored in the type info, _but_ also try to "shrink" the actual CRS definition to e.g. its `auth:code` when parsing a CRS, _if_ the definition is a CRS that we recognize (and should therefore be able to "expand" into a full definition again later).

As an example:

- If we read a GeoParquet file, which has the default coordinate system "CRS84" in projjson format, we only store "OGC:CRS84" in the geometry type, because DuckDB knows how to convert "OGC:CRS84" back into the projjson (or WKT2) later if needed.
- But if we read a GeoParquet file with some other unrecognized CRS definition (e.g. "SOME_ORG:1337", but in projjson) we store the full projjson, and simply live with the fact that the type will be hideous to display.

By default we now also throw an error if we try to create a geometry type with an _incomplete_ unrecognized CRS, i.e. an auth:code or opaque identifier. We always allow PROJJSON or WKT2 definitions even if we don't recognize them, as they are complete in the sense that they can be interpreted on their own, but we don't shrink them if we don't know them. This handling of unrecognized coordinate system identifiers can be controlled with the `ignore_unknown_crs` setting.

This means that you can still just pass around complete projjson or wkt2 definitions and deal with the ugliness if you really want to use your own custom coordinate systems, but in practice 99.9% of coordinate systems will be recognized by `spatial`. While you can't define your own "known" coordinate systems through SQL, you can do it through your own extension (or application that embeds DuckDB) by providing instances of the new `CoordinateSystemCatalogEntry` in the system catalog.

## Coordinate System Catalog Entries

There is now a new type of catalog entry to store coordinate system definitions, the `CoordinateSystemCatalogEntry`. These can be registered by extensions to provide additional coordinate system definitions. For example, the `spatial` extension now registers its list of EPSG and OGC-defined coordinate systems by lazily pulling them from the embedded `PROJ` library. But this PR also adds "OGC:CRS84" and "OGC:CRS83" definitions in core. This list of built-in definitions may or may not be extended in the future. Or we may create a separate dedicated extension that only supplies coordinate system definitions (similar to `icu` and `encodings`).

## Support for CRS propagation through (Geo)Arrow import/export

This PR also adds support for propagating the CRS when exporting/importing from (Geo)Arrow. I had to make some changes to drill-down the client context into the arrow extension code, but we always have it available when resolving extension types anyway so the changes only really touch the internals.

A nice consequence of this is that `spatial`'s `GDAL` integration automatically handles CRS propagation now too as it's based on arrow, meaning that `ST_Read()` outputs `GEOMETRY` columns with the CRS specified by the underlying file, and `COPY ... TO (FORMAT GDAL)` also encodes the CRS properly.

## Update `spatial` to v1.5 Branch

This PR also adds back and bumps spatial to the v1.5 branch.
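The shrink/expand compromise can be sketched with a small lookup table. `SketchCrsRegistry` is a hypothetical class, not the `CoordinateSystemCatalogEntry` machinery from this PR; it matches known definitions by exact string equality, and the `is_complete_definition` flag stands in for the real PROJJSON/WKT2 detection. The `ignore_unknown_crs` setting is not modelled here.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical registry of "known" coordinate systems, for illustration only.
class SketchCrsRegistry {
public:
    void Register(const std::string &id, const std::string &definition) {
        id_to_definition_[id] = definition;
        definition_to_id_[definition] = id;
    }

    // Returns the string that would be stored in the geometry type.
    std::string Shrink(const std::string &input, bool is_complete_definition) const {
        const auto known = definition_to_id_.find(input);
        if (known != definition_to_id_.end()) {
            return known->second; // recognized definition: store the short id
        }
        if (id_to_definition_.count(input) > 0) {
            return input; // already a recognized short id
        }
        if (is_complete_definition) {
            return input; // unrecognized but self-describing: store it in full
        }
        throw std::invalid_argument("Unknown coordinate system: " + input);
    }

    // Turns a stored string back into a full definition where possible.
    std::string Expand(const std::string &stored) const {
        const auto known = id_to_definition_.find(stored);
        return known != id_to_definition_.end() ? known->second : stored;
    }

private:
    std::map<std::string, std::string> id_to_definition_;
    std::map<std::string, std::string> definition_to_id_;
};
```

Because `Shrink` only shortens what `Expand` can restore, the stored string is always either short and recoverable or long and complete, which is what lets the printed type round-trip.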
This PR adds a dedicated `GEOMETRY` logical type into core DuckDB. The internal representation is currently WKB-encoded `BLOB`, but that will likely change in a future PR. No functions are implemented for this type, except to/from `VARCHAR` casts.

This is the first PR in a long series of changes that's going to be pushed in the coming weeks, with the ultimate goal of significantly elevating DuckDB's geospatial capabilities for the DuckDB v1.5 release in early 2026.
Background
So far DuckDB has (mostly) contained all geospatial features in the `spatial` extension. This has worked great as it has allowed us to independently and rapidly experiment with how to adapt geospatial processing to DuckDB's engine, while also keeping a lot of the domain-specific details and dependencies separate from DuckDB's core codebase. Besides integrating a lot of third-party geospatial libraries, `spatial` has also been integrating deeply into DuckDB's core execution engine and made itself dependent on a lot of fragile interfaces within DuckDB's internals (custom operators, optimizer rules, indexes, etc). Fast forward to today, and `spatial` is one of DuckDB's largest and most complex extensions.

While this complexity has been somewhat manageable so far, we're now reaching a point where it's no longer feasible to relegate all geospatial-related stuff into a single separate extension. There's already some awkwardness when dealing with e.g. Pandas/Postgres/SQLite through DuckDB, which have their own geospatial extensions (GeoPandas, PostGIS, GeoPackage/SpatiaLite), as core DuckDB doesn't really want to acknowledge anything spatial-specific and we don't want to introduce inter-extension dependencies either. But geospatial support is now also part of the parquet standard itself, which is of much higher importance to DuckDB, as well as supported by all up-and-coming data lake formats.

In short: Geospatial data isn't special (anymore)
What's changing
Therefore we're taking some steps towards making vanilla DuckDB spatial aware, by moving the `GEOMETRY` type from `spatial` into core DuckDB.

While almost all of the geospatial functionality will still remain in `spatial` (e.g. 99% of `ST_` functions), this will give our (and community!) extensions and client libraries some common ground as they can all interface with the same `GEOMETRY` type. We will also make sure that existing databases that use the `GEOMETRY` type as currently defined in `spatial` will remain compatible.

Additionally, because `GEOMETRY` will now become part of both DuckDB's execution and storage engine, this opens up a lot of optimization opportunities that are currently impractical/impossible to implement solely in `spatial`. The two big ones are statistics propagation and compression, which will significantly improve performance of processing both external formats like (Geo)Parquet and DuckDB's own storage format.

Again, this is a pretty massive change. I have prototyped most of it on my own fork(s), but will break it up into multiple PRs to keep it manageable. The rough short-term roadmap looks something like this:
- … `GEOMETRY` to core
- … (implemented in `GEOMETRY` Rework: Part 2 - Statistics #19203)
- … (implemented in `GEOMETRY` Rework: Part 3 - Filter pushdown #19439)
- … `parquet` extension (implemented in `GEOMETRY` Rework: Part 4 - Fixup Parquet Extension + Add Arrow Support #19476)
- … `spatial` extension
- … (implemented in `GEOMETRY` Rework: Part 6 - Geometry "Shredding" #20281)
- … (implemented in `GEOMETRY` Rework: Part 5 - Coordinate Reference System Support #20143)
- … `GEOGRAPHY`/"vectorized types" too

Client/other extension integrations etc. are planned to get in before 1.5 as well, but we will first focus on core/parquet/spatial.
This PR also removes spatial from the CI workflow until I've had time to adapt it to these changes, but that will hopefully not take too long (I have an old branch with most of the work already).