Skip to content

Conversation

@jiayuasu
Copy link
Member

@jiayuasu jiayuasu commented Jun 3, 2025

Background

Several engines including Spark/Sedona and Presto explicitly mention the calculations on geometries are done via Cartesian mathematics. It would be good if we add clarification to the Iceberg spec

Proposed change

Added the following sentence

and geometric calculations are performed using Cartesian mathematics along straight lines

@github-actions github-actions bot added the Specification Issues that may introduce spec changes. label Jun 3, 2025
@jiayuasu
Copy link
Member Author

jiayuasu commented Jun 3, 2025

@rdblue @szehon-ho please review :-)

format/spec.md Outdated
| | **`fixed(L)`** | Fixed-length byte array of length L | |
| | **`binary`** | Arbitrary-length byte array | |
| [v3](#version-3) | **`geometry(C)`** | Geospatial features from [OGC – Simple feature access][1001]. Edge-interpolation is always linear/planar. See [Appendix G](#appendix-g-geospatial-notes). Parameterized by CRS C. If not specified, C is `OGC:CRS84`. | |
| [v3](#version-3) | **`geometry(C)`** | Geospatial features from [OGC – Simple feature access][1001]. Edge-interpolation is always linear/planar, and geometric calculations are performed using Cartesian mathematics along straight lines. See [Appendix G](#appendix-g-geospatial-notes). Parameterized by CRS C. If not specified, C is `OGC:CRS84`. | |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion:

Edge-interpolation is always linear/planar, ie geometric calculations are performed using Cartesian mathematics along straight lines.

Then its more clear its a clarification rather than another fact

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Fixed.

| | **`fixed(L)`** | Fixed-length byte array of length L | |
| | **`binary`** | Arbitrary-length byte array | |
| [v3](#version-3) | **`geometry(C)`** | Geospatial features from [OGC – Simple feature access][1001]. Edge-interpolation is always linear/planar. See [Appendix G](#appendix-g-geospatial-notes). Parameterized by CRS C. If not specified, C is `OGC:CRS84`. | |
| [v3](#version-3) | **`geometry(C)`** | Geospatial features from [OGC – Simple feature access][1001]. Edge-interpolation is always linear/planar, i.e., geometric calculations are performed using Cartesian mathematics along straight lines. See [Appendix G](#appendix-g-geospatial-notes). Parameterized by CRS C. If not specified, C is `OGC:CRS84`. | |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we mention that the purpose of Cartesian mathematics is to avoid the need to interpret the CRS?

Something like:

The CRS may be used to interpret points as geographic locations, but does not affect calculations, which are always Cartesian.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Fixed. Please review.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, but I think @mkaravel suggests some good changes to make to clarify the bounding boxes for geometry.

@mkaravel
Copy link

mkaravel commented Jun 5, 2025

@jiayuasu Thank you for driving this!

I am not able to comment directly at appropriate place since there are no changes there, which is why I writing this as a high level comment in the PR.

In line 684 of the existing spec (prior to this PR) we state:

For geometry and geography types, lower_bounds and upper_bounds are both points of the following coordinates X, Y, Z, and M (see Appendix G) which are the lower / upper bound of all objects in the file. For the X values only, xmin may be greater than xmax, in which case an object in this bounding box may match if it contains an X such that x >= xmin ORx <= xmax. In geographic terminology, the concepts of xmin, xmax, ymin, and ymax are also known as westernmost, easternmost, southernmost and northernmost, respectively. For geography types, these points are further restricted to the canonical ranges of [-180 180] for X and [-90 90] for Y.

I believe that in order to make the spec self-consistent it would make sense to rephrase this and specify that xmin can be greater than xmax for geographies only.

How about the following slight modification?

For geometry and geography types, lower_bounds and upper_bounds are both points of the following coordinates X, Y, Z, and M (see Appendix G) which are the lower / upper bound of all objects in the file.

For geography, for the X values only, xmin may be greater than xmax, in which case an object in this bounding box may match if it contains an X such that x >= xmin OR x <= xmax. In geographic terminology, the concepts of xmin, xmax, ymin, and ymax are also known as westernmost, easternmost, southernmost and northernmost, respectively. These points are further restricted to the canonical ranges of [-180..180] for X and [-90..90] for Y.

@jiayuasu
Copy link
Member Author

jiayuasu commented Jun 5, 2025

@mkaravel I think this is relevant to our discussion in Slack. After giving some thoughts, I think it makes sense to consider that together with the linear option in Geography type. I guess we can make another PR to discuss that specific issue? This PR is just to add 1 sentence clarification about the geometric calculation

@rdblue
Copy link
Contributor

rdblue commented Jun 6, 2025

@jiayuasu, can you help me understand why we wouldn't make the clarifications @mkaravel suggests in this PR?

If we don't expect the CRS to an effect on calculations for geometry and only geography is restricted to [-180, 180] then it makes sense to me that geometry bboxes should never have xmin > xmax. Clarifying that here where we are already noting that the calculations are purely Cartesian makes sense to me!

@jiayuasu
Copy link
Member Author

jiayuasu commented Jun 6, 2025

@rdblue I think it makes sense to drop xmin > xmax wraparound in geometry if we don't expect CRS to have an effect.

On the other hand, I'd like to also add linear interpolation to geography if we want to drop this. This is an idea proposed by @mkaravel before. Because there will be use cases that needs linear interpolation (cartesian mathematics) on antimeridian crossing geometries on geographic CRS.

I am fine adding the sentence to disallow wraparound here and create another PR to add linear interpolation although I think the 2 issues are interconnected. Or, I can put everything in this PR instead so we can solve these issues together

Please let me know what you think

@mkaravel
Copy link

mkaravel commented Jun 6, 2025

@rdblue I think it makes sense to drop xmin > xmax wraparound in geometry if we don't expect CRS to have an effect.

On the other hand, I'd like to also add linear interpolation to geography if we want to drop this. This is an idea proposed by @mkaravel before. Because there will be use cases that needs linear interpolation (cartesian mathematics) on antimeridian crossing geometries on geographic CRS.

@jiayuasu I am obviously (since I have suggested it in the past) totally on board with this plan. This will make the semantics of the types very clean (geometry always Cartesian, calculations independent of the CRS) and allow geography to have an easy, practical, and meaningful way to deal with the antimeridian geometries (also very clean IMHO).

I am fine adding the sentence to disallow wraparound here and create another PR to add linear interpolation although I think the 2 issues are interconnected. Or, I can put everything in this PR instead so we can solve these issues together

My thinking when I proposed the changes above in this PR was to make it self-consistent and self-contained regarding the geometry type. We can follow up with a separate PR for geography or add the changes here. Either way is fine with me. Maybe a single PR is just simpler.

@rdblue Are you okay with this? Thoughts?

@szehon-ho
Copy link
Member

szehon-ho commented Jun 10, 2025

That's great there's agreement from @jiayuasu on adding the linear algorithm to Geography . Im also ok either way to add the change here or in another pr. If we have agreement we can do it here to minimize disruptions, as the changes seem related.

The only thing to check I had is, is it safe to add a new Edge Interpolation algorithm after V3 is adopted. I am thinking yes, as implementations would just not be able to read if its a CRS or Edge algorithm they dont know.

The other thing is, I guess adding the new algorithm seems its large enough to requires a vote. cc @rdblue for thoughts?

@rdblue
Copy link
Contributor

rdblue commented Jun 11, 2025

I agree that we can move forward here and then discuss the change to add linear separately. I think clarifying how geometry works (that it is always independent of the CRS and cartesian) is independently valuable. Then we can follow up to add linear if it is needed next.

@mkaravel
Copy link

Let's finalize this PR then by completing Cartesian semantics for GEOMETRY and do GEOGRAPHY as a follow up.

@mkaravel
Copy link

Any chance linear interpolation for GEOGRAPHY can make it to Iceberg v3?

@jiayuasu
Copy link
Member Author

jiayuasu commented Jun 16, 2025

Ping geo folks and see if they have comments

@jorisvandenbossche
@paleolimbot
@kylebarron
@cholmes
@migurski
@pramsey

@paleolimbot
Copy link
Member

Thank you for the ping!

Just a note that I don't think this clarification is needed...a Cartesian bounding box like the one described here is perfectly legal in for the geometry type in GeoArrow's box type, GeoParquet's file metadata, GeoJSON, Parquet GeoStatistics, and the current Iceberg spec (even though all of these allow for the wraparound case as well). I personally found it to be a nice simplification that evaluating a filter expression against a Parquet GeoStatistics is the same whether the type is GEOMETRY or GEOGRAPHY.

I'll save my comments regarding the possibility of a "linear" edge algorithm for the GEOGRAPHY type for the PR implementing that change (the short version is: call it something other than GEOGRAPHY because that's not what that type means in any other database, file format, or spec of which I am aware).

@rdblue
Copy link
Contributor

rdblue commented Jul 7, 2025

Just a note that I don't think this clarification is needed...a Cartesian bounding box like the one described here is perfectly legal in for the geometry type in GeoArrow's box type, GeoParquet's file metadata, GeoJSON, Parquet GeoStatistics, and the current Iceberg spec (even though all of these allow for the wraparound case as well).

How does a writer know what bounding box to produce? In all of those specs, is it the writer's decision whether to use a purely cartesian bounding box or one that wraps around?

What I want to avoid is needing to interpret the CRS to know how to compute the bounding box. If writers are allowed to produce a box that is cartesian and ignore whether longitude is periodic, then maybe that is fine because it is the norm. But my expectation is that this would be determined by the CRS and if we wanted to produce a simpler bounding box it would be incorrect without this change.

@github-actions
Copy link

github-actions bot commented Aug 7, 2025

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Aug 7, 2025
@github-actions
Copy link

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Specification Issues that may introduce spec changes. stale

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants