Skip to content

Conversation

@milesgranger
Copy link
Contributor

@milesgranger milesgranger commented Oct 5, 2022

Part of ARROW-14596, and ARROW-13798

Does not propose to solve selecting from lists in this PR; only supporting dotted paths into arbitrarily nested structs using existing FromDotPath. Selecting from lists will require further discussion and support for kernels to select struct subsets from lists.

Until then, a user can select subsets of a struct from a list element via:

table = pa.table(
    {"user_id": ["abc123", "qrs456"],
     "interaction": [
        {
            "type": "click",
            "element": {'a': "button"},
            "values": [1, 2],
            "structs":[{"foo": "bar"}]},
        {
            "type": "scroll",
            "element": {'a': "window"},
            "values": [3, 4],
            "structs":[{"fizz": "buzz"}]}
     ]
    })

pq.write_table(table, "test_nested_data.parquet")
dataset = ds.dataset("test_nested_data.parquet")

# Select only single field from structs in a list
dataset.to_table(columns={
        "result_name": pc.struct_field(
                pc.list_element(pc.struct_field(ds.field("interaction"), [1]), 
                                ds.scalar(0)), 
                [0])
        }
)
# pyarrow.Table
# result_name: string
# ----
# result_name: [[null,"buzz"]]

# Or keeping struct shape...

struct_subset_type = pa.struct([("structs", pa.list_(pa.struct([("fizz", pa.string())])))])
dataset.to_table(columns={"interaction": ds.field("interaction").cast(struct_subset_type)})
# pyarrow.Table
# interaction: struct<structs: list<item: struct<fizz: string>>>
#   child 0, structs: list<item: struct<fizz: string>>
#       child 0, item: struct<fizz: string>
#           child 0, fizz: string
# ----
# interaction: [
#   -- is_valid: all not null
#   -- child 0 type: list<item: struct<fizz: string>>
# [      -- is_valid: all not null
#       -- child 0 type: string
# [null],      -- is_valid: all not null
#       -- child 0 type: string
# ["buzz"]]]

@github-actions
Copy link

github-actions bot commented Oct 5, 2022

@github-actions
Copy link

github-actions bot commented Oct 5, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@milesgranger milesgranger changed the title ARROW-14596: [C++][Python] Read table nested fields in columns ARROW-14596: [C++][Python] Read table nested struct fields in columns Oct 7, 2022
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update!

Some questions:

  • Right now you always use the FromDotPath if the path starts with a . But to be safe, should we first check that the name is not present in the dataset_schema? (since in theory one can have a normal column name with dots)
  • Do we want to require that users specify this leading dot, like ".column.subfield"? (at least for parquet.read_table, I think we should also enable this without the dot for backwards compatibility?) That seems to be the pattern from jq, but for example in SQL you don't need that (in SQL it's of course not a string but an actual operation) and you could also use "column.subfield"
  • What should be the resulting column name? At the moment this PR uses the dotted string the user specified, but another option would be to use the last part (so the name of the final field that is being selected), which would avoid creating top-level dotted names (which are not good practice I would say), and also seems to be consistent with eg BigQuery's SQL dialect

@milesgranger
Copy link
Contributor Author

milesgranger commented Oct 7, 2022

I generally agree with the first two points, although I do like the explicitness of a leading dot.

The third point, it could be a potentially buggy convenience add on. ie.

pa.table({"a.dotted.field": [1], "nested": [{"field": "value"]})
# implicitly will create two columns called 'field', is that okay?
(..., columns=["a.dotted.field", "nested.field"])

vs explicitly mapping maybe?

(..., columns=[{"my_dotted_field": "a.dotted.field", "nested_field": "nested.field"}])

or convert dots to underscores?

@milesgranger
Copy link
Contributor Author

milesgranger commented Oct 10, 2022

Updated in 1b4914f to take the last delimited dot path as the column name. But also thought to leave any actual column with a dot alone; if the user has dotted columns, is that up to us to automatically rename? Seems a dotted path gives more leniency in this regard.

@milesgranger milesgranger marked this pull request as ready for review October 13, 2022 08:47
Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I have anything to add here.

Parquet is still materializing the entire column, correct?

It may be worth considering (as a follow up) some Scanner interface so that we don't have to rely solely on dotted paths (e.g. by allowing passing FieldPaths directly)

@jorisvandenbossche
Copy link
Member

Parquet is still materializing the entire column, correct?

Yes, for that I opened https://issues.apache.org/jira/browse/ARROW-17959

@jorisvandenbossche jorisvandenbossche merged commit f49f8ed into apache:master Oct 20, 2022
@jorisvandenbossche
Copy link
Member

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants