GEOMETRY Rework: Part 3 - Filter pushdown#19439
Conversation
GEOMETRY Rework: Part 3 - Filter pushdown
| stats_bbox.x_min = bbox.xmin; | ||
| stats_bbox.y_min = bbox.ymin; | ||
| stats_bbox.x_max = bbox.xmax; | ||
| stats_bbox.y_max = bbox.ymax; |
There was a problem hiding this comment.
Unless your GeometryStats has a validation step and/or wraparound support, this may need to handle the xmin > xmax case (which would indicate the interval -Inf...xmax || xmin ...Inf). I think in Arrow C++ I also check for NaNs and min > max for non X and ignore the stats if those cases are present.
There was a problem hiding this comment.
Thats a good point, my plan was to postpone all anti-meridian special case stuff until we handle GEOGRAPHY, but I guess the xmin>xmax clause still holds for GEOMETRY. I moved away from NaN because std::min(NaN, x) != std::min(x, NaN) and I think you need to explicitly check if the bounds have been initialized or not before you can expand them. But maybe thats unavoidable. Ill have a look at the arrow C++ implementation!
There was a problem hiding this comment.
Yeah definitely don't use NaN! We found the Java implementation was accidentally producing them in some cases during prototyping which hypothetically would have led to incorrect results in Arrow C++. I think the idea is just to sanity check statistics before ingesting them (which many implementations do for other statistics, too).
| if (allow_geometry) { // Don't set this if we write GeoParquet V1 | ||
| schema_ele.__isset.logicalType = true; | ||
| schema_ele.logicalType.__isset.GEOMETRY = true; | ||
| // TODO: Set CRS in the future |
There was a problem hiding this comment.
Thanks for reviewing this work! Really appreciate having someone else with special geo knowledge having a quick look as well.
1aed7ab to
d2a3441
Compare
d2a3441 to
3ba0e54
Compare
lnkuiper
left a comment
There was a problem hiding this comment.
Hey, this looks great! I just have two small comments:
3a7b486 to
296b4d9
Compare
lnkuiper
left a comment
There was a problem hiding this comment.
Thanks for the changes! LGTM once CI passes.
|
The regression test is just broken, this is the 3rd time I restart it and it always seem to get stuck/fail on the CSV reading |
`GEOMETRY` Rework: Part 3 - Filter pushdown (duckdb/duckdb#19439)
`GEOMETRY` Rework: Part 3 - Filter pushdown (duckdb/duckdb#19439)
…rt (#19476) This is a followup PR that builds on top of #19439. Please have a look at #19136 for the context behind this PR. This PR fixes up the remaining issues in the parquet extension related to geometries. When reading geometry columns we now push an expression column reader on top of the underlying blob column reader to perform the WKB parsing with `ST_GeomFromWKB`. `ST_GeomFromWKB` now actually checks that the input is valid WKB and also converts from big-endian WKB to little-endian If required. This can be optimized further, but It's good enough for now. I've also added support for converting geometry columns to/from arrow arrays with geoarrow extension metadata. This code is basically lifted [straight from the spatial extension](https://github.com/duckdb/duckdb-spatial/blob/v1.4-andium/src/spatial/spatial_geoarrow.cpp).
This is a followup PR that builds on top of #19203. Please have a look at #19136 for the context behind this PR.
In #19203 we added support for storing geometry statistics. In this PR we now add a
&&bounding-box-intersection binary operator that when used in a filter can be pushed down into storage and prune row-groups based on these new geometry statistics.Geometry-filter pushdown works for our own storage format, as well as parquet (if the parquet file contains parquet-native geometry stats). I've also cleaned up the parquet extension further and removed all spatial-dependent code, so that all geoparquet related functionality now works without requiring spatial to be loaded, which simplifies both the read and write-path for geometries significantly. It also enables us to run the geoparquet tests in CI when spatial isn't available.
I've also added a couple of basic geometry scalar-functions to convert to/from WKB and WKT. These will be more fleshed out in the future (to e.g. handle big-endian WKB and specify WKT precision).