@paleolimbot paleolimbot commented Feb 7, 2025

As discussed on the mailing list, it's best to get example files early!

The code to generate these files is in the details below (it requires apache/arrow@main...paleolimbot:arrow:parquet-geo-write-files-from-geoarrow, which is a slightly more functional but less appropriate initial version of apache/arrow#45459). I've also added the full suite of geoarrow-data files (even the big ones) to the forthcoming geoarrow-data release: https://github.com/geoarrow/geoarrow-data.

Details
import urllib.request
import json

import pyarrow as pa
from pyarrow import parquet
import geoarrow.pyarrow as ga

manifest_url = (
    "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0-rc4/manifest.json"
)
files = {}
with urllib.request.urlopen(manifest_url) as f:
    manifest = json.load(f)
    for group in manifest["groups"]:
        for file in group["files"]:
            if file["format"] == "arrows/wkb":
                files[group["name"] + "_" + file["name"]] = file["url"]

out_dir = "/Users/dewey/gh/parquet-testing/data/geospatial"
ones_that_didnt_work = []
for name, url in files.items():
    # Skip big files + one CRS example that includes a non-PROJJSON value
    # on purpose (allowed in GeoArrow), which is rightly rejected
    # by Parquet
    if (
        "microsoft-buildings" in name
        or ("ns-water" in name and name != "ns-water_water-point")
        or "wkt2" in name
    ):
        print(f"Skipping {name}")
        continue

    # Maintain chunking from IPC into Parquet
    out = f"{out_dir}/{name}.parquet"
    with (
        urllib.request.urlopen(url) as f,
        pa.ipc.open_stream(f) as reader,
        parquet.ParquetWriter(
            out,
            reader.schema,
            store_schema=False,
            compression="none",
            write_geospatial_logical_types=True,
        ) as writer,
    ):
        original_schema = reader.schema
        print(f"Reading {url}")
        for batch in reader:
            writer.write_batch(batch)
        print(f"Wrote {out}")
    
    # Read in original table for comparison
    with (
        urllib.request.urlopen(url) as f,
        pa.ipc.open_stream(f) as reader
    ):
        original_table = reader.read_all()

    print(f"Checking {out}")
    with parquet.ParquetFile(out, arrow_extensions_enabled=True) as f:
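        if f.schema_arrow != original_table.schema:
            print(f"Schema mismatch:\n{f.schema_arrow}\nvs\n{original_table.schema}")
            # Record the failure so it can be reviewed after the loop
            ones_that_didnt_work.append(name)
            continue

        reread = f.read()
        if reread != original_table:
            print("Table mismatch")
            ones_that_didnt_work.append(name)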
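
The file selection and skip rules in the script above can also be factored into two small functions that are easy to check without network access. This is only a sketch: `select_wkb_files`, `should_skip`, and the toy manifest (including its group names, file names, and example.com URLs) are made up for illustration, and the manifest layout is assumed to match what the script reads (`groups`, `files`, `format`, `name`, `url`).

```python
def select_wkb_files(manifest: dict) -> dict:
    """Map "<group>_<file>" names to URLs for every arrows/wkb entry."""
    files = {}
    for group in manifest["groups"]:
        for file in group["files"]:
            if file["format"] == "arrows/wkb":
                files[f'{group["name"]}_{file["name"]}'] = file["url"]
    return files


def should_skip(name: str) -> bool:
    """Same skip rule as the script: big files and the non-PROJJSON CRS example."""
    return (
        "microsoft-buildings" in name
        or ("ns-water" in name and name != "ns-water_water-point")
        or "wkt2" in name
    )


# Hypothetical manifest with made-up entries, used only to exercise the rules
toy_manifest = {
    "groups": [
        {
            "name": "ns-water",
            "files": [
                {"name": "water-point", "format": "arrows/wkb",
                 "url": "https://example.com/water-point.arrows"},
                {"name": "water-line", "format": "arrows/wkb",
                 "url": "https://example.com/water-line.arrows"},
                {"name": "water-point", "format": "parquet",
                 "url": "https://example.com/water-point.parquet"},
            ],
        },
        {
            "name": "example-crs",
            "files": [
                {"name": "vermont-wkt2", "format": "arrows/wkb",
                 "url": "https://example.com/vermont-wkt2.arrows"},
            ],
        },
    ]
}

selected = select_wkb_files(toy_manifest)
to_convert = [name for name in selected if not should_skip(name)]
print(to_convert)
```

With this toy input, only the small `ns-water_water-point` entry survives both the format filter and the skip rule.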

@paleolimbot paleolimbot changed the title [WIP] Draft example files for GEOMETRY and GEOGRAPHY logical type Draft example files for GEOMETRY and GEOGRAPHY logical type Feb 21, 2025
@paleolimbot paleolimbot marked this pull request as ready for review February 21, 2025 10:57
@paleolimbot
Member Author

@Kontinuation @zhangfengcdt Can you give these a try from Java when you're ready? I'm fairly confident that they are correct, including the "crs" examples that dump the actual payload of the PROJJSON to the file metadata.

@paleolimbot paleolimbot changed the title Draft example files for GEOMETRY and GEOGRAPHY logical type Example files for GEOMETRY and GEOGRAPHY logical type Feb 21, 2025
@paleolimbot
Member Author

I pushed an update to three files here: the original fields that PROJJSON CRSes were written to were very likely to collide with each other if you did things like read a Parquet file, filter it, then write it again 😬. The new files add a hash of the value to the end of the key (e.g., projjson_crs_value_0ffad8372). Totally up for discussion whether that's a good idea or not 🙂.
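
For readers who want to see the idea concretely, here is a minimal sketch of deriving a metadata key from a hash of the CRS payload. The hash function (SHA-256), the canonical JSON serialization, the nine-character truncation, and the helper name `crs_metadata_key` are all assumptions made for illustration. The comment above does not say how the hash in the example files was computed, so this is not expected to reproduce the actual keys.

```python
import hashlib
import json


def crs_metadata_key(
    projjson: dict,
    prefix: str = "projjson_crs_value_",
    digest_chars: int = 9,
) -> str:
    """Build a key whose suffix depends only on the CRS content.

    Assumption: SHA-256 over a canonical JSON dump; the real files may
    use a different hash or serialization.
    """
    # Sort keys so logically equal PROJJSON objects serialize identically
    payload = json.dumps(projjson, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"{prefix}{digest[:digest_chars]}"


# Two spellings of the same (abbreviated, illustrative) CRS, plus a different one
key_a = crs_metadata_key({"type": "GeographicCRS", "name": "WGS 84"})
key_b = crs_metadata_key({"name": "WGS 84", "type": "GeographicCRS"})
key_c = crs_metadata_key({"type": "ProjectedCRS", "name": "NAD83 / UTM zone 20N"})
print(key_a, key_b, key_c)
```

Because the key is derived from the value, rewriting a file with the same CRS reuses the same key, while a different CRS gets a different key instead of overwriting an existing entry.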

@paleolimbot
Member Author

I updated these to be a bit more intentional about the corner cases we collectively ran into in apache/parquet-java#2971 and apache/arrow#45459. I'm not sure the Python files that generate them belong in this repo, but they do make it easier to see what the files contain. I also included CRS examples because those also required some thought in the C++ PR. Happy to remove or tweak any of these if I didn't get the spirit of the format change right 🙂.

@alamb
Contributor

alamb commented Apr 30, 2025

Today at the Parquet sync, @emkornfield said he might have some time to review this PR.

@emkornfield

This all seems reasonable, going to merge.

@emkornfield emkornfield merged commit d1f14a0 into apache:master Apr 30, 2025
@alamb
Contributor

alamb commented Apr 30, 2025

Thank you @emkornfield and @paleolimbot 🙏

@paleolimbot
Member Author

Thank you both!
