-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-14596: [C++][Python] Read table nested struct fields in columns #14326
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-14596: [C++][Python] Read table nested struct fields in columns #14326
Conversation
|
|
jorisvandenbossche
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update!
Some questions:
- Right now you always use the FromDotPath if the path starts with a
.But to be safe, should we first check that the name is not present in the dataset_schema? (since in theory one can have a normal column name with dots) - Do we want to require that users specify this leading dot, like
".column.subfield"? (at least for parquet.read_table, I think we should also enable this without the dot for backwards compatibility?) That seems to be the pattern from jq, but for example in SQL you don't need that (in SQL it's of course not a string but an actual operation) and you could also use"column.subfield" - What should be the resulting column name? At the moment this PR uses the dotted string the user specified, but another option would be to use the last part (so the name of the final field that is being selected), which would avoid creating top-level dotted names (which are not good practice I would say), and also seems to be consistent with eg BigQuery's SQL dialect
|
I generally agree with the first two points, although I do like the explicitness of a leading dot. The third point, it could be a potentially buggy convenience add on. ie. pa.table({"a.dotted.field": [1], "nested": [{"field": "value"]})
# implicitly will create two columns called 'field', is that okay?
(..., columns=["a.dotted.field", "nested.field"])vs explicitly mapping maybe? (..., columns=[{"my_dotted_field": "a.dotted.field", "nested_field": "nested.field"}])or convert dots to underscores? |
|
Updated in 1b4914f to take the last delimited dot path as the column name. But also thought to leave any actual column with a dot alone; if the user has dotted columns, is that up to us to automatically rename? Seems a dotted path gives more leniency in this regard. |
lidavidm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I have anything to add here.
Parquet is still materializing the entire column, correct?
It may be worth considering (as a follow up) some Scanner interface so that we don't have to rely solely on dotted paths (e.g. by allowing passing FieldPaths directly)
Yes, for that I opened https://issues.apache.org/jira/browse/ARROW-17959 |
|
Thanks! |
Part of ARROW-14596, and ARROW-13798
Does not propose to solve selecting from lists in this PR; only supporting dotted paths into arbitrarily nested structs using existing
FromDotPath. Selecting from lists will require further discussion and support for kernels to select struct subsets from lists.Until then, a user can select subsets of a struct from a list element via: