Example files for GEOMETRY and GEOGRAPHY logical type #70

paleolimbot · 2025-02-07T22:26:57Z

As discussed on the mailing list, it's best to get example files early!

The code used to generate the files is in the details below (it requires apache/arrow@main...paleolimbot:arrow:parquet-geo-write-files-from-geoarrow , which is a slightly more functional but less appropriate initial version of apache/arrow#45459 ). I've also added the full suite of geoarrow-data files (even the big ones) to the forthcoming release of https://github.com/geoarrow/geoarrow-data .

Details

import urllib.request
import json

import pyarrow as pa
from pyarrow import parquet
import geoarrow.pyarrow as ga

manifest_url = (
    "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0-rc4/manifest.json"
)
files = {}
with urllib.request.urlopen(manifest_url) as f:
    manifest = json.load(f)
    for group in manifest["groups"]:
        for file in group["files"]:
            if file["format"] == "arrows/wkb":
                files[group["name"] + "_" + file["name"]] = file["url"]

out_dir = "/Users/dewey/gh/parquet-testing/data/geospatial"
ones_that_didnt_work = []
for name, url in files.items():
    # Skip big files + one CRS example that includes a non-PROJJSON value
    # on purpose (allowed in GeoArrow), which is rightly rejected
    # by Parquet
    if (
        "microsoft-buildings" in name
        or ("ns-water" in name and name != "ns-water_water-point")
        or "wkt2" in name
    ):
        print(f"Skipping {name}")
        continue

    # Maintain chunking from IPC into Parquet
    out = f"{out_dir}/{name}.parquet"
    with (
        urllib.request.urlopen(url) as f,
        pa.ipc.open_stream(f) as reader,
        parquet.ParquetWriter(
            out,
            reader.schema,
            store_schema=False,
            compression="none",
            write_geospatial_logical_types=True,
        ) as writer,
    ):
        original_schema = reader.schema
        print(f"Reading {url}")
        for batch in reader:
            writer.write_batch(batch)
        print(f"Wrote {out}")
    
    # Read in original table for comparison
    with (
        urllib.request.urlopen(url) as f,
        pa.ipc.open_stream(f) as reader
    ):
        original_table = reader.read_all()

    print(f"Checking {out}")
    with parquet.ParquetFile(out, arrow_extensions_enabled=True) as f:
        if f.schema_arrow != original_table.schema:
            print(f"Schema mismatch:\n{f.schema_arrow}\nvs\n{original_schema}")
            continue

        reread = f.read()
        if reread != original_table:
            print("Table mismatch")
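
The file-selection step of the script above can be checked without any network access. The following is a minimal sketch that applies the same manifest filter and skip rules to an in-memory manifest; the helper names `should_skip` and `select_files`, the sample manifest entries, and the `example.invalid` URLs are hypothetical and exist only for illustration.

```python
def should_skip(name: str) -> bool:
    """Same skip rules as the script: big files and the non-PROJJSON CRS example."""
    return (
        "microsoft-buildings" in name
        or ("ns-water" in name and name != "ns-water_water-point")
        or "wkt2" in name
    )


def select_files(manifest: dict, fmt: str = "arrows/wkb") -> dict:
    """Map '<group>_<file>' names to URLs for entries in the requested format."""
    selected = {}
    for group in manifest["groups"]:
        for file in group["files"]:
            if file["format"] != fmt:
                continue
            name = group["name"] + "_" + file["name"]
            if not should_skip(name):
                selected[name] = file["url"]
    return selected


# Hypothetical manifest with the same shape the script reads
# (groups -> files -> name/format/url); not the real manifest contents.
sample_manifest = {
    "groups": [
        {
            "name": "ns-water",
            "files": [
                {"name": "water-point", "format": "arrows/wkb", "url": "https://example.invalid/a"},
                {"name": "water-line", "format": "arrows/wkb", "url": "https://example.invalid/b"},
            ],
        },
        {
            "name": "example",
            "files": [
                {"name": "point", "format": "arrows/wkb", "url": "https://example.invalid/c"},
                {"name": "point", "format": "parquet", "url": "https://example.invalid/d"},
                {"name": "crs-wkt2", "format": "arrows/wkb", "url": "https://example.invalid/e"},
            ],
        },
    ]
}

selected = select_files(sample_manifest)
print(sorted(selected))
```

Only the `arrows/wkb` entries that pass the skip rules remain, so a dry run like this shows which Parquet files the conversion loop would attempt to write.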

paleolimbot · 2025-02-21T10:59:34Z

@Kontinuation @zhangfengcdt Can you give these a try from Java when you're ready? I'm fairly confident that they are correct, including the "crs" examples that dump the actual payload of the PROJJSON to the file metadata.

paleolimbot · 2025-02-27T22:55:08Z

I pushed an update to three files here: the original fields that PROJJSON CRSes were written to were very likely to collide with each other if you did things like read a Parquet file, filter it, then write it again 😬 . The new files add a hash of the value to the end of the key (e.g., projjson_crs_value_0ffad8372). Totally up for discussion whether that's a good idea or not 🙂 .
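
To illustrate the idea of a hash-suffixed key, here is a minimal sketch. The thread does not say which hash function or serialization produced the suffix in the actual files, so the SHA-256 digest, the canonical JSON serialization, the nine-character truncation, and the helper name `crs_metadata_key` are all assumptions made for illustration; only the `projjson_crs_value_` prefix comes from the example key above.

```python
import hashlib
import json


def crs_metadata_key(
    projjson: dict,
    prefix: str = "projjson_crs_value_",
    digest_chars: int = 9,
) -> str:
    """Build a metadata key whose suffix depends on the CRS value itself.

    Assumption: SHA-256 over a canonical JSON serialization. The hash used
    for the real example files is not stated in this thread.
    """
    # Sort keys and drop whitespace so equal CRS values always hash the same.
    payload = json.dumps(projjson, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return prefix + digest[:digest_chars]


# Two abbreviated, illustrative CRS objects (not complete PROJJSON documents).
crs_a = {"type": "GeographicCRS", "name": "WGS 84"}
crs_b = {"type": "ProjectedCRS", "name": "NAD83 / UTM zone 20N"}

key_a = crs_metadata_key(crs_a)
key_b = crs_metadata_key(crs_b)
print(key_a)
print(key_b)
```

Because the key is derived from the value, two different CRSes get different keys, and rewriting a file that carries the same CRS reproduces the same key instead of clobbering an unrelated entry.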

paleolimbot · 2025-04-04T05:28:25Z

I updated these to be a bit more intentional about the corner cases we collectively ran into in apache/parquet-java#2971 and apache/arrow#45459. I'm not sure the Python files to generate them belong in this repo, but they do make it easier to see what the example files contain. I also included CRS examples because that was another thing that required some thought in the C++ PR...happy to remove or tweak any of these if I didn't get the spirit of the format change right 🙂 .
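
The exact corner cases are not enumerated in this comment, but the commit messages in this PR mention empty multi-geometry ("multiX") types and a NaN case. As a rough sketch of what such values look like at the byte level, the snippet below builds little-endian WKB for an empty MULTIPOINT and for a point with NaN coordinates (a common convention for POINT EMPTY). The helper names are hypothetical, and this is not a claim about the exact bytes stored in the example files.

```python
import math
import struct

# ISO/OGC WKB geometry type codes for 2D geometries.
WKB_POINT = 1
WKB_MULTIPOINT = 4
WKB_MULTILINESTRING = 5
WKB_MULTIPOLYGON = 6

LITTLE_ENDIAN = 1  # WKB byte-order flag


def wkb_empty_multi(geometry_type: int) -> bytes:
    """Empty multi-geometry: byte order, type code, then a zero element count."""
    return struct.pack("<BII", LITTLE_ENDIAN, geometry_type, 0)


def wkb_nan_point() -> bytes:
    """Point whose x and y are both NaN, a common encoding of POINT EMPTY."""
    return struct.pack("<BIdd", LITTLE_ENDIAN, WKB_POINT, math.nan, math.nan)


empty_multipoint = wkb_empty_multi(WKB_MULTIPOINT)
nan_point = wkb_nan_point()

print(empty_multipoint.hex())  # 010400000000000000
print(len(nan_point))          # 21
```

A reader that computes bounding-box statistics has to decide what to do with both values: the empty multi-geometry contributes no coordinates at all, while the NaN point contributes coordinates that cannot be meaningfully compared with a running minimum or maximum.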

alamb · 2025-04-30T17:35:50Z

Today at the Parquet sync, @emkornfield said he might have some time to review this PR.

emkornfield · 2025-04-30T18:25:45Z

This all seems reasonable, going to merge.

alamb · 2025-04-30T19:47:09Z

Thank you @emkornfield and @paleolimbot 🙏

paleolimbot · 2025-04-30T19:53:23Z

Thank you both!

paleolimbot added 3 commits February 7, 2025 16:24

add some files

708f972

slightly better crs representations for projjson

7c0efaa

maybe actually fix

45de3bd

paleolimbot mentioned this pull request Feb 21, 2025

Add Parquet files with built-in GEOMETRY type geoarrow/geoarrow-data#5

Merged

remove redundant files

7f95759

paleolimbot changed the title ~~[WIP] Draft example files for GEOMETRY and GEOGRAPHY logical type~~ Draft example files for GEOMETRY and GEOGRAPHY logical type Feb 21, 2025

paleolimbot marked this pull request as ready for review February 21, 2025 10:57

paleolimbot changed the title ~~Draft example files for GEOMETRY and GEOGRAPHY logical type~~ Example files for GEOMETRY and GEOGRAPHY logical type Feb 21, 2025

rewrite with crs keys that are less likely to collide

9e00f7d

paleolimbot mentioned this pull request Mar 5, 2025

Add support for GEOMETRY and GEOGRAPHY types in Parquet read and/or write apache/arrow-rs#7240

Open

paleolimbot added 5 commits April 2, 2025 12:52

remove previous files

828d56e

simpler example file

a6f0d99

fix WKB in file for emtpy multiX types

cd4f70d

add crs files

7cdff4d

add nan case

39a370b

paleolimbot mentioned this pull request Apr 4, 2025

PARQUET-2417: Add statistics support to geometry logical type apache/parquet-java#2971

Merged

4 tasks

paleolimbot added 2 commits April 11, 2025 21:59

rewrite geospatial.parquet with new empty/all null logic

f205e85

update Shapely stats calculator for new files

453fbe0

This was referenced Apr 24, 2025

GH-45522: [Parquet][C++] Parquet GEOMETRY and GEOGRAPHY logical type implementations apache/arrow#45459

Merged

[Python][Parquet][C++] Test GEOMETRY and GEOGRAPHY types with parquet-testing files when available apache/arrow#46266

Open

emkornfield merged commit d1f14a0 into apache:master Apr 30, 2025
