GEOMETRY Rework: Part 5 - Coordinate Reference System Support #20143
lnkuiper merged 11 commits into duckdb:main
lnkuiper left a comment:
Thanks for the PR! Looks great. I have a few comments regarding testing. I think you probably have more testing in the Geo extension, but since this is now in main, would it make sense to test the behaviour that is in main with tests here?
On the future work part:

A curiosity I have is: one path I could see is allowing something like:

```sql
CREATE TABLE t1 (g GEOMETRY('DUCKDB:1337'));
---- Error: 'DUCKDB:1337' not recognized as CRS, consider 'CALL enable_crs'
CALL register_crs('DUCKDB:1337');
---- registers 'DUCKDB:1337' as valid CRS
CREATE TABLE t1 (g GEOMETRY('DUCKDB:1337'));
---- works
COPY t1 TO 'test_ok_crs.parquet';
---- Error: CRS 'DUCKDB:1337' is enabled for internal usage but not valid for exporting to Parquet due to missing detail on registration
```

One doubt, if going with pre-validation, is whether validation should be encoded as part of the DB file.
@carlopi Yeah, this is definitely an open question. But there are two aspects to this:
I'm also considering adding a setting to control this, but that requires us to have the client context during type binding/parsing, which ties into my other prototype work on allowing expressions when declaring types.
I guess one lateral point of my earlier comment that I think could make sense is that the CRSUtil could even just be left un-overridden, but have an API like:

```cpp
// Check map of CRS
ResultObj CRSUtil::TryIdentify(const string& name);
// Add `name` to map of CRS
void CRSUtil::RegisterCRS(const string& name, ExtraOptions& options);
```

and have both core and spatial (and eventually other extensions) just do a bunch of RegisterCRS() calls (possibly static, like in core, or dynamic depending on the PROJ database in spatial), without actually nuking the default object and reimplementing it - but then I am not sure if this makes complete sense. Or maybe allow overriding based on a regex on the name, if that makes sense, so that TryIdentify() acts like the VirtualFileSystem: it first does a regex match and then decides who handles further work.
That's a good point, but in practice spatial would add like... thousands of CRSs. Maybe that's OK, I'm a bit worried about extension load time being impacted, but I would have to test. I'm also not sure exactly how the util interface will look - while I don't think we need to do transformations, there are probably other things we may want to put on the interface, like "convert from PROJJSON to WKT2" or whatnot. But we will see. I did experiment previously with creating something like a generic "Service" interface/dependency-injection container, which would enable patterns where you override but don't replace (i.e. extend) existing abstract services, but I didn't get far enough to push it. I think I got stuck on how to resolve auto-loading for service implementations provided by extensions (like the encryption util - we have one by default already, but always try to autoload httpfs instead), but I would like to revisit it eventually.
Thanks for the changes! There are a few failing CI runs left before this can be merged:
- Fails with verification enabled
- Not forwards compatible (Forwards compatibility tests)
@lnkuiper all green

Thanks!

👏
This is a followup PR that builds on top of #19848 (although orthogonal to #20143). Please have a look at #19136 for the context behind this PR.

This PR enables support for "shredding" geometry columns in our storage format by internally decomposing the geometries into multiple separate column segments when all the rows within a row-group are of the same geometry sub-type. For example, if a row group only contains `POINT` (XY) geometries, the column segment for that row-group gets rewritten and stored internally as `STRUCT(x DOUBLE, y DOUBLE)` instead of `BLOB` when checkpointing. This encoding is similar to how GeoArrow encodes typed geometries.

The major benefit of "decomposing" or "shredding" geometries from blobs into columnar lists and structs of doubles like this at the storage level is that they compress _significantly_ better in the decomposed form. By decomposing the geometries into column segments of primitive types, we automatically benefit from DuckDB's existing (and future!) best-in-class adaptive compression algorithms for each type of component. E.g. the coordinate vertices get ALP-compressed, and polygon ring-offsets get RLE/delta/dictionary encoded automatically based on the distribution of the data. In my (limited) testing, a fully shredded column is about half the size of the blob equivalent, i.e. shredding results in a 2x compression ratio. This means that you get much faster reads from disk, and will be able to keep more data in memory (as we now also compress in-memory segments).

There are currently a few caveats:
- We don't shred segments that contain `GEOMETRYCOLLECTION`s, as they can be recursive and don't have a fixed layout.
- We only shred if the _entire_ row-group is of a single geometry sub-type. So if you have 10000 points, insert a linestring and checkpoint, the column segment will be reverted back to physical type `BLOB`.
- We don't shred segments containing EMPTY geometries, either at the root or in sub-geometries.

We may want to consider "partial" shredding in the future, where we keep multiple shredded segments around, but I'm not sure if the performance/complexity hit would be worth it. Despite those limitations, geometry shredding still opens up some interesting future optimization opportunities:
- Push down certain filters (like bounding-box intersections) much deeper into the storage, as they are faster to evaluate on separate coordinate arrays axis-by-axis.
- Emit "shredded" geometries directly from storage when casting or exporting to GeoArrow.
- Or, if the optimizer realizes that all function expressions that read from a fully shredded column have overloads that take the shredded representation instead (e.g. spatial's `POINT_2D`), replace all functions (or insert casts where needed) and emit the shredded representation from storage to "specialize" queries automatically.

As it stands, in the case where there is no shredding, there is no additional storage overhead. In other words, a non-shredded geometry column is serialized exactly the same way as it used to be (before this PR lands). I've made some changes to how `ColumnData` is serialized though. There is now an abstract `unique_ptr<ExtraPersistentColumnData>` in the `PersistentColumnData` struct that separates out the extra info required by the `VariantColumnData` and `GeoColumnData`. As they are the only two column data types where the type/layout can differ between segments within the same column, they need to store some information to reconstruct the layout and can't just rely on the top-level column logical type. In the case of the variant, this is the "shredded" type, but since geometries have a fixed number of layouts, we only store the geometry and vertex sub-type (enums) instead of the equivalent logical type, to save space. I expect that we may want to generalize this further in the future if we start applying shredding to other types as well (e.g. strings/JSON).
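As an informal illustration of the row-group granularity described above, here is a sketch of mine (not taken from this PR); it assumes the plain WKT `VARCHAR` to `GEOMETRY` cast is available, and whether shredding actually happens is internal and not directly observable from SQL:

```sql
CREATE TABLE pts (geom GEOMETRY);

-- only XY points: on CHECKPOINT this row group is eligible to be shredded
-- into STRUCT(x DOUBLE, y DOUBLE) segments internally
INSERT INTO pts
    SELECT ('POINT (' || i::VARCHAR || ' ' || i::VARCHAR || ')')::GEOMETRY
    FROM range(10000) t(i);
CHECKPOINT;

-- mixing in another sub-type makes the row group heterogeneous, so the next
-- checkpoint falls back to storing the segment as plain BLOB again
INSERT INTO pts VALUES ('LINESTRING (0 0, 1 1)'::GEOMETRY);
CHECKPOINT;
```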
This is a followup PR (#20721) that builds on top of #20143, please have a look at #19136 for the context behind this PR. This PR makes additional changes to how coordinate systems are handled for the `GEOMETRY` type.

## Shrinking, Expansion, and Identification of coordinate systems

In the initial iteration of parameterizing geometry types with coordinate systems, we basically allowed any string to be stored as the CRS, and then tried to parse and identify the format (PROJJSON, WKT2:2019, auth:code, SRID) before extracting a "name" or "identifier" which we stored separately to use when printing the type. This has the major downside that the textual representation of a geometry type (or a SQL schema containing geometry types) no longer round-trips, i.e. if you parse it back, you no longer get the same type. This is primarily a problem when doing an `EXPORT DATABASE`, `SUMMARIZE` or calling `.schema` in the shell. However, the alternative of always printing the full definition is also untenable, as it makes the SQL extremely unfriendly to read.

The compromise implemented in this PR is to always print what's actually stored in the type info, _but_ also try to "shrink" the actual CRS definition to e.g. its `auth:code` when parsing a CRS, _if_ the definition is a CRS that we recognize (and should therefore be able to "expand" into a full definition again later). As an example:
- If we read a GeoParquet file, which has the default coordinate system "CRS84" in PROJJSON format, we only store "OGC:CRS84" in the geometry type, because DuckDB knows how to convert "OGC:CRS84" back into the PROJJSON (or WKT2) later if needed.
- But if we read a GeoParquet file with some other unrecognized CRS definition (e.g. "SOME_ORG:1337", but in PROJJSON), we store the full PROJJSON, and simply live with the fact that the type will be hideous to display.

We also now throw an error by default if we try to create a geometry type with an _incomplete_, unrecognized CRS, i.e. an auth:code or opaque identifier. We always allow PROJJSON or WKT2 definitions even if we don't recognize them, as they are complete in the sense that they can be interpreted on their own, but we don't shrink them if we don't know them. This handling of unrecognized coordinate system identifiers can be controlled with the `ignore_unknown_crs` setting. This means that you can still just pass around complete PROJJSON or WKT2 definitions and deal with the ugliness if you really want to use your own custom coordinate systems, but in practice 99.9% of coordinate systems will be recognized by `spatial`. While you can't define your own "known" coordinate systems through SQL, you can do it through your own extension (or an application that embeds DuckDB) by providing instances of the new `CoordinateSystemCatalogEntry` in the system catalog.

## Coordinate System Catalog Entries

There is now a new type of catalog entry to store coordinate system definitions, the `CoordinateSystemCatalogEntry`. These can be registered by extensions to provide additional coordinate system definitions. For example, the `spatial` extension now registers its list of EPSG- and OGC-defined coordinate systems by lazily pulling them from the embedded `PROJ` library. But this PR also adds "OGC:CRS84" and "OGC:CRS83" definitions in core. This list of built-in definitions may or may not be extended in the future. Or we may create a separate dedicated extension that only supplies coordinate system definitions (similar to `icu` and `encodings`).

## Support for CRS propagation through (Geo)Arrow import/export

This PR also adds support for propagating the CRS when exporting/importing from (Geo)Arrow. I had to make some changes to drill down the client context into the Arrow extension code, but we always have it available when resolving extension types anyway, so the changes only really touch the internals. A nice consequence of this is that `spatial`'s `GDAL` integration automatically handles CRS propagation now too, as it is based on Arrow, meaning that `ST_Read()` outputs `GEOMETRY` columns with the CRS specified by the underlying file, and `COPY ... TO (FORMAT GDAL)` also encodes the CRS properly.

## Update `spatial` to v1.5 Branch

This PR also adds back and bumps spatial to the v1.5 branch.
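A rough sketch of how the `ignore_unknown_crs` setting described above might be used; only the setting name and the "SOME_ORG:1337" example come from this PR, while the default polarity and the error wording are assumptions of mine:

```sql
CREATE TABLE t1 (g GEOMETRY('SOME_ORG:1337'));
-- Error: an incomplete, unrecognized CRS identifier is rejected by default (assumed wording)

SET ignore_unknown_crs = true;  -- assumed polarity: true = accept unknown identifiers
CREATE TABLE t1 (g GEOMETRY('SOME_ORG:1337'));
-- now accepted, but the identifier cannot be "expanded" into a full definition later
```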
This is a followup PR that builds on top of #19848. Please have a look at #19136 for the context behind this PR.
This PR adds initial support for parameterizing the `GEOMETRY` type by a coordinate reference system, along with a pair of `ST_CRS` and `ST_GetCRS` scalar functions to set/get this parameter at the type level.

The coordinate reference system type parameter is an arbitrary string, but we try to parse it to identify if it is in WKT2, PROJJSON or AUTH:CODE format. If it is, we extract and cache the "id" and "name" fields, and use them instead of the full string when printing the type name.
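A small sketch of how these two functions might be used; only the function names and the `GEOMETRY('AUTH:CODE')` syntax come from this PR, and the exact argument order of `ST_CRS` is an assumption of mine:

```sql
CREATE TABLE t (geom GEOMETRY('EPSG:4326'));
-- read the CRS parameter back from the geometry type
SELECT ST_GetCRS(geom) FROM t;

CREATE TABLE u (geom GEOMETRY);
-- set the CRS parameter at the type level (assumed signature)
SELECT ST_CRS(geom, 'EPSG:4326') FROM u;
```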
When parsing PROJJSON or WKT2 we just verify that the PROJJSON is valid JSON, and that the WKT2 is syntactically valid... in whatever encoding it uses. We don't inspect whether keywords/fields other than those required to extract the name and id are set correctly.
Two parameterized geometry types are treated as equal if the "id" of their coordinate system string is equal, otherwise we compare the "name", and finally compare the full string character by character.
This PR also updates the parquet extension so that the CRS is propagated to/from the Parquet type and GeoParquet json metadata.
## What is a coordinate reference system (CRS)?
Because geometries are made up of arbitrary coordinates in planar space, it's important that a dataset can be associated with a coordinate system so that you can meaningfully interpret what the coordinates actually represent. Coordinate reference systems are basically the spatial equivalent of time-zones in the temporal domain. While it would be convenient if the whole world always operated on coordinates in degrees of [longitude, latitude] (or always used UTC for timestamps), in practice it is very common that geospatial data is provided in a coordinate system defined for a specific local area of use.
## Where are CRSs stored?
DuckDB (in duckdb-spatial) has previously not provided a built-in way to associate geometries with their coordinate system, leaving it up to the user to either keep this sort of metadata in a separate column or track it "out of band" outside of DuckDB itself. Both solutions are somewhat cumbersome and error-prone.
Other databases that support geometries tend to do some combination of:
For DuckDB, it feels natural that the coordinate system would be set for a whole column, and not per value. But neither option seems suitable, as they rely on specific system tables being populated and present, and on all coordinate systems being known and defined a priori, which would add a lot of friction to the more ephemeral and multi-data-source workflows that DuckDB is typically used for.
Therefore we are instead putting the coordinate system into the geometry type itself, similar to how data frame libraries tend to treat geometry columns.
E.g. you can now create a table with a geometry column in the coordinate system defined by the European Petroleum Survey Group with id `4326`, as sketched below.
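A minimal sketch of such a declaration (table and column names are placeholders), using the `GEOMETRY('AUTH:CODE')` form that also appears in the review discussion above:

```sql
CREATE TABLE points (geom GEOMETRY('EPSG:4326'));
```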
Moving the coordinate system into the type system also has other benefits, like being able to error out at bind time if users attempt to operate on geometries that belong to two different coordinate systems, and making it possible to infer the coordinate system of an arbitrary geometry expression when importing/exporting to different geospatial data formats, without having to do a separate table lookup.
E.g. trying to mix geometries with different coordinate systems throws a binder exception.
However, you can always implicitly cast a geometry column with a CRS both to and from a column without a CRS, because at the end of the day CRS is just metadata.
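A small sketch of the cast behaviour described above (a sketch of mine, not taken from the PR's tests):

```sql
CREATE TABLE wgs84 (geom GEOMETRY('EPSG:4326'));
CREATE TABLE plain (geom GEOMETRY);

-- a geometry without a CRS can be implicitly cast into a column with one...
INSERT INTO wgs84 SELECT geom FROM plain;
-- ...and a geometry with a CRS can be stored back into a plain GEOMETRY column
INSERT INTO plain SELECT geom FROM wgs84;
```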
## How are CRSs stored?
Here comes the difficult part.
The modern data model of a coordinate reference system definition is conceptually defined in `ISO 19111`, but is mostly encoded as strings in `WKT1`, `WKT2` and `PROJJSON` format. These strings can be quite large and unwieldy for users to pass around directly, which is why e.g. PostGIS uses integer ids (SRIDs) to identify coordinate systems within a Postgres installation instead. Coordinate systems are also commonly referred to by users using a shorthand "AUTH:CODE" format, like e.g. `EPSG:4326` in the example above.

Similar to integer SRIDs, the `AUTH:CODE` shorthand is lossy in that it's not possible to extract the actual "definition" of the coordinate system without having a separate database or table to resolve the auth code. Since `duckdb-spatial` embeds the `PROJ` library and accompanying database, it can be used to resolve almost any `AUTH:CODE` (or any arbitrary string really) to a complete coordinate system definition.

Therefore, it makes a lot of sense for us to simply allow any sort of string in the CRS field of a geometry type. This is also how the (new) Parquet geometry spec, as well as Iceberg and Delta, deal with this problem: just store a string and interpret it when needed, possibly by referring to some external data source (e.g. the PROJ database, or a table property). Since core DuckDB doesn't need to perform any coordinate transformations, why would we need the actual full definition of a coordinate system?
Well, the problem arises when external formats that we may want to import/export to _do_ have requirements on the format of the CRS. GeoParquet (v1) requires that the CRS is provided in PROJJSON format, GPKG (the SQLite geo-format) requires WKT1 (or WKT2), and PostGIS requires WKT1. Therefore, something like the sketch below currently raises an error, which is kinda bad UX (IMO):
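A sketch of the kind of statement this refers to; the file name and the error wording are placeholders of mine, not from the PR:

```sql
CREATE TABLE points (geom GEOMETRY('EPSG:4326'));
COPY points TO 'points.parquet' (FORMAT parquet);
-- Error: GeoParquet requires the CRS as PROJJSON, and core DuckDB cannot
-- expand the bare 'EPSG:4326' code into a full definition (wording approximate)
```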
The following, on the other hand, is OK, but comes with its own shitty UX (having to pass the full PROJJSON string in the type definition):
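Roughly along these lines; the PROJJSON body below is heavily truncated and simplified for readability, and a real definition is considerably longer:

```sql
CREATE TABLE points (
    geom GEOMETRY('{
        "type": "GeographicCRS",
        "name": "WGS 84",
        "id": { "authority": "EPSG", "code": 4326 }
    }')
);
COPY points TO 'points.parquet' (FORMAT parquet);
```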
## Future work
In practice `spatial` can solve almost all of the aforementioned problems thanks to the `PROJ` library, but we can't (and don't want to) always assume the `spatial` extension is loaded.

Therefore, in a future PR the plan is to introduce a `CoordinateReferenceSystemUtil` class that by default can "identify" and expand some of the most common coordinate systems from their auth code to a full PROJJSON definition. The implementation of the `CoordinateReferenceSystemUtil` can then be overridden by `spatial` when it is loaded. We may also want to consider splitting out the `PROJ` part from spatial into its own smaller extension that can be auto-loaded on demand, much like how e.g. the `icu` extension is auto-loaded for timezones/collations.

One open question that I'm still not sure of is whether we want to "eagerly" try to identify/expand auth codes/user input to PROJJSON on e.g. table creation or dataset import, or whether we fully defer any interpretation lazily "until needed" when exporting to an external dataset that imposes some requirements on the CRS string. Maybe we want to make this configurable through a setting.