Support shapes in ST_CENTROID_AGG#141657
Conversation
This first commit was written entirely by Cursor. I will do cleanup commits following this based on my review.
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
|
Hi @craigtaverner, I've created a changelog YAML for you. |
🔍 Preview links for changed docs |
ℹ️ Important: Docs version tagging👋 Thanks for updating the docs! Just a friendly reminder that our docs are now cumulative. This means all 9.x versions are documented on the same page and published off of the main branch, instead of creating separate pages for each minor version. We use applies_to tags to mark version-specific features and changes. Expand for a quick overviewWhen to use applies_to tags:✅ At the page level to indicate which products/deployments the content applies to (mandatory) What NOT to do:❌ Don't remove or replace information that applies to an older version 🤔 Need help?
|
|
Would it be possible to push the function to the data loaders? If I understand correctly we are copying the full doc values into pages which feels very inefficient. |
…sticsearch into st_centroid_agg_shapes
Certainly, that is the plan. I wanted to do that as a followup PR, since it is a performance optimization, and relies on newer infrastructure I did not want to involve in this PR. |
ncordon
left a comment
There was a problem hiding this comment.
It looks good to me. I left a few questions for my own understanding
| /** | ||
| * This aggregator calculates the centroid of a set of cartesian shapes. | ||
| * It is assumed that the cartesian shapes are encoded as WKB BytesRef. | ||
| * This requires that the planner has NOT planned that shapes are loaded from the index as doc-values, but from source instead. |
There was a problem hiding this comment.
This is because we don't have support for geo / cartesian shapes in doc values right?
There was a problem hiding this comment.
We do have support actually, and the doc-values for shapes includes the actual centroid, so it would be much, much better to load that, but does require more complex code, so I wanted to write that as a followup PR.
|
|
||
| @Override | ||
| public String describe() { | ||
| return "spatial_centroid_cartesian_shape_source of valuess"; |
There was a problem hiding this comment.
There's a typo valuess here that is also present in other ST aggregators, so I would correct them also in this pr if possible
There was a problem hiding this comment.
This is in generated code. I assume the code generator is pluralizing the word by adding an s, but the original variable already is plural. Tricky!
There was a problem hiding this comment.
Looks like this has been there since Jan 2024! Very old code.
| * When mixing shapes of different dimensional types, only the highest dimensional type contributes. | ||
| */ | ||
| public void add(double x, double dx, double y, double dy, double weight, DimensionalShapeType shapeType) { | ||
| if (Double.isFinite(x) == false || Double.isFinite(y) == false || Double.isFinite(weight) == false || weight <= 0) { |
There was a problem hiding this comment.
In what cases could this happen?
There was a problem hiding this comment.
I think this is defensive coding, but I will double-check that it is not required for null values, which we needed some effort to support for ST_EXTENT_AGG.
| FROM countries_bbox | ||
| | WHERE id IN ("NOR", "SWE", "FIN") | ||
| | STATS centroid=ST_CENTROID_AGG(shape) | ||
| ; | ||
|
|
||
| centroid:geo_point | ||
| POINT (19.306472392149942 63.98699665426565) | ||
| ; |
There was a problem hiding this comment.
Do we have any way to check these results are correct?
There was a problem hiding this comment.
Remind me: we treat this as a multi polygon?
There was a problem hiding this comment.
Yes, the centroid over a single geometry-collection or multi-polygon is the same as the centroid over many documents, each containing only a single polygon.
| ; | ||
|
|
||
| centroidFromMixedCartesianPointsAndShapes | ||
| required_capability: st_centroid_agg |
There was a problem hiding this comment.
I think we can drop this capability right? If st_centroid_agg_shapes is there it means this other one is also supported.
But up to you if you want to be more explicit, which this is achieving 😄
There was a problem hiding this comment.
It seems historically we just keep adding new required_capability lines, but you are correct we can compact to the most restrictive one. There is in fact a bug where if there are many capabilities and one fails, the error/warning is reporting the first listed capability as the failing one, which is wrong.
|
|
||
| @Override | ||
| public void testStCentroidAggregationWithShapes() { | ||
| assertStCentroidFromIndex("index_geo_shape", 1e-9); |
There was a problem hiding this comment.
How do we set up this tolerance? Why not lower or higher?
There was a problem hiding this comment.
I've found from many other test cases (including in the _search/agg framework) that 1e-9 is usually the threshold of accuracy we achieve in our tests. Actually, if we aggregate over much larger datasets, we accumulate larger errors, so this number kind-of matches the dataset sizes that our randomized testing tends to generate.
| Stream.of( | ||
| MultiRowTestCaseSupplier.geoShapeCasesWithoutCircle(1, 1000, IncludingAltitude.NO), | ||
| MultiRowTestCaseSupplier.cartesianShapeCasesWithoutCircle(1, 1000, IncludingAltitude.NO) |
There was a problem hiding this comment.
Why are we avoiding shapes with circles?
There was a problem hiding this comment.
Circles are only partially supported by the whole ES/Lucene stack, so we mostly don't include them in tests. It might be nicer to try get full support, but there is not really any priority for that.
| case DataType.GEO_SHAPE -> new SpatialCentroidGeoShapeSourceValuesAggregatorFunctionSupplier(); | ||
| case DataType.CARTESIAN_SHAPE -> new SpatialCentroidCartesianShapeSourceValuesAggregatorFunctionSupplier(); |
There was a problem hiding this comment.
Shouldn't we throw an exception just in case we have doc values with geo / cartesian shapes? Which in principle should never happen because of the planner but just as a safeguard
There was a problem hiding this comment.
I want to add support for EXTRACT_SPATIAL_CENTROID (the optimization mentioned in a few other comments) and in the meantime I just did not care about the fieldExtractPreference. But you are right it would be better to write the current check now, and throw exceptions on everything but NONE and STORED.
…nd cartesian_shape (#142528) The the PR at #141657 we added support for ST_CENTROID_AGG over geo_shape and cartesian_shape. However, this loads from source, which is inefficient since these centroids are actually saved in the doc-values representations. This PR switches to load from doc-values. One thing to keep in mind is that we already load bounds from doc-values for ST_EXTENT_AGG, and it is completely reasonable for the same query to do both ST_EXTENT_AGG and ST_CENTROID_AGG, in which case we want to load both centroid and bounds at the same time. The solution expands on the existing code for loading bounds into a multi-value like block. For bounds we were loading either 4 or 6 integers (for cartesian or geo extents). Now we have the following combinations: * Geo-Bounds as 6 integers: top, bottom, negLeft, negRight, posLeft, posRight * Cartesian-Bounds as 4 integers: min-x, max-x, max-y, min-y * Centroid as 4 doubles: x, y, weight, shape-type * Geo-Bounds and centroid as 10 doubles: top, bottom, negLeft, negRight, posLeft, posRight, x, y, weight, shape-type * Cartesian-Bounds and centroid as 8 doubles: min-x, max-x, max-y, min-y, x, y, weight, shape-type So we moved from 2 combinations to 6, and since centroid weights were stored in doc-values as double, while centroid coordinately and extents were stored as quantized ints, we also moved the conversion from int to double to the block-loader rather than the aggregator. Since the doc-values header does not include more useful data, it is unlikely we will expand from this combination to anything more complex in future. Fixes #142640
…nd cartesian_shape (elastic#142528) The the PR at elastic#141657 we added support for ST_CENTROID_AGG over geo_shape and cartesian_shape. However, this loads from source, which is inefficient since these centroids are actually saved in the doc-values representations. This PR switches to load from doc-values. One thing to keep in mind is that we already load bounds from doc-values for ST_EXTENT_AGG, and it is completely reasonable for the same query to do both ST_EXTENT_AGG and ST_CENTROID_AGG, in which case we want to load both centroid and bounds at the same time. The solution expands on the existing code for loading bounds into a multi-value like block. For bounds we were loading either 4 or 6 integers (for cartesian or geo extents). Now we have the following combinations: * Geo-Bounds as 6 integers: top, bottom, negLeft, negRight, posLeft, posRight * Cartesian-Bounds as 4 integers: min-x, max-x, max-y, min-y * Centroid as 4 doubles: x, y, weight, shape-type * Geo-Bounds and centroid as 10 doubles: top, bottom, negLeft, negRight, posLeft, posRight, x, y, weight, shape-type * Cartesian-Bounds and centroid as 8 doubles: min-x, max-x, max-y, min-y, x, y, weight, shape-type So we moved from 2 combinations to 6, and since centroid weights were stored in doc-values as double, while centroid coordinately and extents were stored as quantized ints, we also moved the conversion from int to double to the block-loader rather than the aggregator. Since the doc-values header does not include more useful data, it is unlikely we will expand from this combination to anything more complex in future. Fixes elastic#142640
ES|QL has an aggregation ST_CENTROID_AGG, which currently only works on points (geo and cartesian), but should work on shapes (geo and cartesian). The _search API supports aggs over shapes, and this PR expands that capability to ES|QL, making use of common code in the CentroidCalculator.
Key Implementation Details
Fixes #104658
AI-Assisted: This PR was assisted by Cursor using Claude Opus 4.5.
Usage: Used to generate the initial solution by building on existing code (aggs/centroid and esql/centroid)