Skip to content

Support shapes in ST_CENTROID_AGG#141657

Merged
craigtaverner merged 22 commits intoelastic:mainfrom
craigtaverner:st_centroid_agg_shapes
Feb 9, 2026
Merged

Support shapes in ST_CENTROID_AGG#141657
craigtaverner merged 22 commits intoelastic:mainfrom
craigtaverner:st_centroid_agg_shapes

Conversation

@craigtaverner
Copy link
Copy Markdown
Contributor

@craigtaverner craigtaverner commented Feb 2, 2026

ES|QL has an aggregation ST_CENTROID_AGG, which currently only works on points (geo and cartesian), but should work on shapes (geo and cartesian). The _search API supports aggs over shapes, and this PR expands that capability to ES|QL, making use of common code in the CentroidCalculator.

Key Implementation Details

  • Uses CentroidCalculator from the server module to compute weighted centroids
  • When shapes of different dimensions are mixed, only the highest dimension contributes (polygons > lines > points)
  • Currently only implements source values aggregators (doc-values aggregators cancelled as they require additional infrastructure changes)

Fixes #104658

AI-Assisted: This PR was assisted by Cursor using Claude Opus 4.5.
Usage: Used to generate the initial solution by building on existing code (aggs/centroid and esql/centroid)

This first commit was written entirely by Cursor. I will do cleanup commits following this based on my review.
@craigtaverner craigtaverner added >enhancement :Analytics/Geo Indexing, search aggregations of geo points and shapes Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) :Analytics/ES|QL AKA ESQL labels Feb 2, 2026
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Hi @craigtaverner, I've created a changelog YAML for you.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Feb 2, 2026

🔍 Preview links for changed docs

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Feb 2, 2026

ℹ️ Important: Docs version tagging

👋 Thanks for updating the docs! Just a friendly reminder that our docs are now cumulative. This means all 9.x versions are documented on the same page and published off of the main branch, instead of creating separate pages for each minor version.

We use applies_to tags to mark version-specific features and changes.

Expand for a quick overview

When to use applies_to tags:

✅ At the page level to indicate which products/deployments the content applies to (mandatory)
✅ When features change state (e.g. preview, ga) in a specific version
✅ When availability differs across deployments and environments

What NOT to do:

❌ Don't remove or replace information that applies to an older version
❌ Don't add new information that applies to a specific version without an applies_to tag
❌ Don't forget that applies_to tags can be used at the page, section, and inline level

🤔 Need help?

@iverase
Copy link
Copy Markdown
Contributor

iverase commented Feb 6, 2026

Would it be possible to push the function to the data loaders? If I understand correctly we are copying the full doc values into pages which feels very inefficient.

@craigtaverner
Copy link
Copy Markdown
Contributor Author

Would it be possible to push the function to the data loaders? If I understand correctly we are copying the full doc values into pages which feels very inefficient.

Certainly, that is the plan. I wanted to do that as a followup PR, since it is a performance optimization, and relies on newer infrastructure I did not want to involve in this PR.

Copy link
Copy Markdown
Member

@ncordon ncordon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me. I left a few questions for my own understanding

/**
* This aggregator calculates the centroid of a set of cartesian shapes.
* It is assumed that the cartesian shapes are encoded as WKB BytesRef.
* This requires that the planner has NOT planned that shapes are loaded from the index as doc-values, but from source instead.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because we don't have support for geo / cartesian shapes in doc values right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do have support actually, and the doc-values for shapes includes the actual centroid, so it would be much, much better to load that, but does require more complex code, so I wanted to write that as a followup PR.


@Override
public String describe() {
return "spatial_centroid_cartesian_shape_source of valuess";
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a typo valuess here that is also present in other ST aggregators, so I would correct them also in this pr if possible

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in generated code. I assume the code generator is pluralizing the word by adding an s, but the original variable already is plural. Tricky!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this has been there since Jan 2024! Very old code.

* When mixing shapes of different dimensional types, only the highest dimensional type contributes.
*/
public void add(double x, double dx, double y, double dy, double weight, DimensionalShapeType shapeType) {
if (Double.isFinite(x) == false || Double.isFinite(y) == false || Double.isFinite(weight) == false || weight <= 0) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what cases could this happen?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is defensive coding, but I will double-check that it is not required for null values, which we needed some effort to support for ST_EXTENT_AGG.

Comment on lines +409 to +416
FROM countries_bbox
| WHERE id IN ("NOR", "SWE", "FIN")
| STATS centroid=ST_CENTROID_AGG(shape)
;

centroid:geo_point
POINT (19.306472392149942 63.98699665426565)
;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any way to check these results are correct?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remind me: we treat this as a multi polygon?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the centroid over a single geometry-collection or multi-polygon is the same as the centroid over many documents, each containing only a single polygon.

;

centroidFromMixedCartesianPointsAndShapes
required_capability: st_centroid_agg
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can drop this capability right? If st_centroid_agg_shapes is there it means this other one is also supported.

But up to you if you want to be more explicit, which this is achieving 😄

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems historically we just keep adding new required_capability lines, but you are correct we can compact to the most restrictive one. There is in fact a bug where if there are many capabilities and one fails, the error/warning is reporting the first listed capability as the failing one, which is wrong.


@Override
public void testStCentroidAggregationWithShapes() {
assertStCentroidFromIndex("index_geo_shape", 1e-9);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we set up this tolerance? Why not lower or higher?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've found from many other test cases (including in the _search/agg framework) that 1e-9 is usually the threshold of accuracy we achieve in our tests. Actually, if we aggregate over much larger datasets, we accumulate larger errors, so this number kind-of matches the dataset sizes that our randomized testing tends to generate.

Comment on lines +63 to +65
Stream.of(
MultiRowTestCaseSupplier.geoShapeCasesWithoutCircle(1, 1000, IncludingAltitude.NO),
MultiRowTestCaseSupplier.cartesianShapeCasesWithoutCircle(1, 1000, IncludingAltitude.NO)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we avoiding shapes with circles?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Circles are only partially supported by the whole ES/Lucene stack, so we mostly don't include them in tests. It might be nicer to try get full support, but there is not really any priority for that.

Comment on lines +124 to +125
case DataType.GEO_SHAPE -> new SpatialCentroidGeoShapeSourceValuesAggregatorFunctionSupplier();
case DataType.CARTESIAN_SHAPE -> new SpatialCentroidCartesianShapeSourceValuesAggregatorFunctionSupplier();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we throw an exception just in case we have doc values with geo / cartesian shapes? Which in principle should never happen because of the planner but just as a safeguard

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to add support for EXTRACT_SPATIAL_CENTROID (the optimization mentioned in a few other comments) and in the meantime I just did not care about the fieldExtractPreference. But you are right it would be better to write the current check now, and throw exceptions on everything but NONE and STORED.

@craigtaverner craigtaverner merged commit 43ed3f7 into elastic:main Feb 9, 2026
35 checks passed
elasticsearchmachine pushed a commit that referenced this pull request Mar 17, 2026
…nd cartesian_shape (#142528)

The the PR at #141657 we
added support for ST_CENTROID_AGG over geo_shape and cartesian_shape.
However, this loads from source, which is inefficient since these
centroids are actually saved in the doc-values representations. This PR
switches to load from doc-values.

One thing to keep in mind is that we already load bounds from doc-values
for ST_EXTENT_AGG, and it is completely reasonable for the same query to
do both ST_EXTENT_AGG and ST_CENTROID_AGG, in which case we want to load
both centroid and bounds at the same time.

The solution expands on the existing code for loading bounds into a
multi-value like block. For bounds we were loading either 4 or 6
integers (for cartesian or geo extents). Now we have the following
combinations: * Geo-Bounds as 6 integers: top, bottom, negLeft,
negRight, posLeft, posRight * Cartesian-Bounds as 4 integers: min-x,
max-x, max-y, min-y * Centroid as 4 doubles: x, y, weight, shape-type *
Geo-Bounds and centroid as 10 doubles: top, bottom, negLeft, negRight,
posLeft, posRight, x, y, weight, shape-type * Cartesian-Bounds and
centroid as 8 doubles: min-x, max-x, max-y, min-y, x, y, weight,
shape-type

So we moved from 2 combinations to 6, and since centroid weights were
stored in doc-values as double, while centroid coordinately and extents
were stored as quantized ints, we also moved the conversion from int to
double to the block-loader rather than the aggregator.

Since the doc-values header does not include more useful data, it is
unlikely we will expand from this combination to anything more complex
in future.

Fixes #142640
michalborek pushed a commit to michalborek/elasticsearch that referenced this pull request Mar 23, 2026
…nd cartesian_shape (elastic#142528)

The the PR at elastic#141657 we
added support for ST_CENTROID_AGG over geo_shape and cartesian_shape.
However, this loads from source, which is inefficient since these
centroids are actually saved in the doc-values representations. This PR
switches to load from doc-values.

One thing to keep in mind is that we already load bounds from doc-values
for ST_EXTENT_AGG, and it is completely reasonable for the same query to
do both ST_EXTENT_AGG and ST_CENTROID_AGG, in which case we want to load
both centroid and bounds at the same time.

The solution expands on the existing code for loading bounds into a
multi-value like block. For bounds we were loading either 4 or 6
integers (for cartesian or geo extents). Now we have the following
combinations: * Geo-Bounds as 6 integers: top, bottom, negLeft,
negRight, posLeft, posRight * Cartesian-Bounds as 4 integers: min-x,
max-x, max-y, min-y * Centroid as 4 doubles: x, y, weight, shape-type *
Geo-Bounds and centroid as 10 doubles: top, bottom, negLeft, negRight,
posLeft, posRight, x, y, weight, shape-type * Cartesian-Bounds and
centroid as 8 doubles: min-x, max-x, max-y, min-y, x, y, weight,
shape-type

So we moved from 2 combinations to 6, and since centroid weights were
stored in doc-values as double, while centroid coordinately and extents
were stored as quantized ints, we also moved the conversion from int to
double to the block-loader rather than the aggregator.

Since the doc-values header does not include more useful data, it is
unlikely we will expand from this combination to anything more complex
in future.

Fixes elastic#142640
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL :Analytics/Geo Indexing, search aggregations of geo points and shapes >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support ST_CENTROID_AGG over shapes in ESQL

4 participants