Skip to content

GEOMETRY Rework: Part 3 - Filter pushdown#19439

Merged
lnkuiper merged 16 commits intoduckdb:mainfrom
Maxxen:core-geom-step-3-filter-pushdown
Oct 27, 2025
Merged

GEOMETRY Rework: Part 3 - Filter pushdown#19439
lnkuiper merged 16 commits intoduckdb:mainfrom
Maxxen:core-geom-step-3-filter-pushdown

Conversation

@Maxxen
Copy link
Member

@Maxxen Maxxen commented Oct 17, 2025

This is a followup PR that builds on top of #19203. Please have a look at #19136 for the context behind this PR.

In #19203 we added support for storing geometry statistics. In this PR we now add a && bounding-box-intersection binary operator that when used in a filter can be pushed down into storage and prune row-groups based on these new geometry statistics.

Geometry-filter pushdown works for our own storage format, as well as parquet (if the parquet file contains parquet-native geometry stats). I've also cleaned up the parquet extension further and removed all spatial-dependent code, so that all geoparquet related functionality now works without requiring spatial to be loaded, which simplifies both the read and write-path for geometries significantly. It also enables us to run the geoparquet tests in CI when spatial isn't available.

I've also added a couple of basic geometry scalar-functions to convert to/from WKB and WKT. These will be more fleshed out in the future (to e.g. handle big-endian WKB and specify WKT precision).

@Maxxen Maxxen changed the title GEOMETRY Rework: Part 3 - Filter pushdown GEOMETRY Rework: Part 3 - Filter pushdown Oct 17, 2025
Comment on lines +423 to +426
stats_bbox.x_min = bbox.xmin;
stats_bbox.y_min = bbox.ymin;
stats_bbox.x_max = bbox.xmax;
stats_bbox.y_max = bbox.ymax;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless your GeometryStats has a validation step and/or wraparound support, this may need to handle the xmin > xmax case (which would indicate the interval -Inf...xmax || xmin ...Inf). I think in Arrow C++ I also check for NaNs and min > max for non X and ignore the stats if those cases are present.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thats a good point, my plan was to postpone all anti-meridian special case stuff until we handle GEOGRAPHY, but I guess the xmin>xmax clause still holds for GEOMETRY. I moved away from NaN because std::min(NaN, x) != std::min(x, NaN) and I think you need to explicitly check if the bounds have been initialized or not before you can expand them. But maybe thats unavoidable. Ill have a look at the arrow C++ implementation!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah definitely don't use NaN! We found the Java implementation was accidentally producing them in some cases during prototyping which hypothetically would have led to incorrect results in Arrow C++. I think the idea is just to sanity check statistics before ingesting them (which many implementations do for other statistics, too).

if (allow_geometry) { // Don't set this if we write GeoParquet V1
schema_ele.__isset.logicalType = true;
schema_ele.logicalType.__isset.GEOMETRY = true;
// TODO: Set CRS in the future
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reviewing this work! Really appreciate having someone else with special geo knowledge having a quick look as well.

@Maxxen Maxxen force-pushed the core-geom-step-3-filter-pushdown branch from 1aed7ab to d2a3441 Compare October 20, 2025 13:45
@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 20, 2025 13:45
@Maxxen Maxxen marked this pull request as ready for review October 20, 2025 16:07
@Maxxen Maxxen force-pushed the core-geom-step-3-filter-pushdown branch from d2a3441 to 3ba0e54 Compare October 22, 2025 07:13
@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 22, 2025 07:14
@Maxxen Maxxen marked this pull request as ready for review October 23, 2025 17:02
@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 26, 2025 20:41
@Maxxen Maxxen marked this pull request as ready for review October 26, 2025 20:42
Copy link
Collaborator

@lnkuiper lnkuiper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, this looks great! I just have two small comments:

@Maxxen Maxxen force-pushed the core-geom-step-3-filter-pushdown branch from 3a7b486 to 296b4d9 Compare October 27, 2025 08:57
@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 27, 2025 08:58
@Maxxen Maxxen marked this pull request as ready for review October 27, 2025 08:58
@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 27, 2025 09:05
@Maxxen Maxxen marked this pull request as ready for review October 27, 2025 09:21
Copy link
Collaborator

@lnkuiper lnkuiper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes! LGTM once CI passes.

@Maxxen
Copy link
Member Author

Maxxen commented Oct 27, 2025

The regression test is just broken, this is the 3rd time I restart it and it always seem to get stuck/fail on the CSV reading

@lnkuiper lnkuiper merged commit f1f9fc1 into duckdb:main Oct 27, 2025
97 of 98 checks passed
krlmlr added a commit to krlmlr/duckdb-r that referenced this pull request Nov 2, 2025
krlmlr added a commit to krlmlr/duckdb-r that referenced this pull request Nov 2, 2025
Mytherin added a commit that referenced this pull request Nov 7, 2025
…rt (#19476)

This is a followup PR that builds on top of
#19439. Please have a look at
#19136 for the context behind this
PR.

This PR fixes up the remaining issues in the parquet extension related
to geometries. When reading geometry columns we now push an expression
column reader on top of the underlying blob column reader to perform the
WKB parsing with `ST_GeomFromWKB`. `ST_GeomFromWKB` now actually checks
that the input is valid WKB and also converts from big-endian WKB to
little-endian If required. This can be optimized further, but It's good
enough for now.

I've also added support for converting geometry columns to/from arrow
arrays with geoarrow extension metadata. This code is basically lifted
[straight from the spatial
extension](https://github.com/duckdb/duckdb-spatial/blob/v1.4-andium/src/spatial/spatial_geoarrow.cpp).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants