@Morgan279

Which issue does this PR close?

Closes #6213.

Rationale for this change

MapBuilder uses non-standard default field names, which leads to #6213. Switching to the names from the Parquet spec reduces confusion and gives users a more standardized naming guide.

According to the parquet-format spec, the outermost level of a Map type should be a group that contains a single field named key_value:

The outer-most level must be a group annotated with MAP that contains a single field named key_value. The repetition of this level must be either optional or required and determines whether the map is nullable.
The middle level, named key_value, must be a repeated group with a key field for map keys and, optionally, a value field for map values. It must not contain any other values.
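The two quoted rules can be illustrated with a minimal sketch that renders the resulting layout as text. The helper name `parquet_map_schema` and its parameters are hypothetical and are not part of arrow-rs, pyarrow, or parquet-format. The sketch assumes a required key and, purely for illustration, an optional value.

```python
def parquet_map_schema(name, key_type, value_type=None, nullable=True):
    """Render the three-level Parquet layout for a MAP column as text.

    Hypothetical helper for illustration only; it builds a string and
    does not touch any Parquet library.
    """
    # The outer group's repetition decides whether the map itself is nullable.
    outer_repetition = "optional" if nullable else "required"
    lines = [
        f"{outer_repetition} group {name} (MAP) {{",
        # The middle level is always a repeated group named key_value.
        "  repeated group key_value {",
        f"    required {key_type} key;",
    ]
    # The value field may be omitted; "optional" here is an assumption.
    if value_type is not None:
        lines.append(f"    optional {value_type} value;")
    lines += ["  }", "}"]
    return "\n".join(lines)


print(parquet_map_schema("map_type", "binary", "binary"))
```

Whatever the in-memory field names are, a spec-compliant writer is expected to produce the `key_value` middle level shown here.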

Changing the default map field names to match not only complies with the Parquet spec but also aligns with pyarrow.

What changes are included in this PR?

Default value of the MapFieldNames

Are there any user-facing changes?

No (I think)

@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 29, 2024
@tustvold
Contributor

tustvold commented Nov 29, 2024

Can you confirm that pyarrow does follow this convention? The Arrow spec has different guidance:

https://github.com/apache/arrow-rs/blob/main/format%2FSchema.fbs#L133

I'm also rather wary of making this change, as it will be highly disruptive for relatively limited benefit.

Edit: in fact, the linked issue shows pyarrow coercing when writing to Parquet:

import pyarrow as pa
import pyarrow.parquet as pq

pylist = [{"map_type":{'1':b"M"}}]
schema = pa.schema(
    [
        pa.field("map_type", pa.map_(pa.large_string(), pa.large_binary())),
    ]
)
table = pa.Table.from_pylist(pylist, schema=schema)

# table.schema
#
# map_type: map<large_string, large_binary>
#   child 0, entries: struct<key: large_string not null, value: large_binary> not null
#       child 0, key: large_string not null
#       child 1, value: large_binary

This boils down to a similar issue to #6733

Development

Successfully merging this pull request may close these issues.

Add Option To Coerce Map Type on Parquet Write
