ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns #9130
Conversation
non apache CI: https://github.com/bkietz/arrow/runs/1664709650
cpp/src/arrow/dataset/partition.cc
Can you comment on this? It's not obvious why we're limited by the size of an int16_t.
I picked it arbitrarily, to be honest. A huge number of groups seemed likely to be an error, so I chose a small limit to see who would ask about it. Should we instead allow the full range of int32? @jorisvandenbossche
For now I'll remove the constant kMaxGroups and allow any int32 group count.
You never know if someone has a strange use case requiring a lot of groups, so if there is not a technical reason, I think it's good to just allow it
I don't know, is a separate file created for each group? If so, it makes sense to put a configurable limit.
Yes, at least one file for each group
Then it's definitely worth having a reasonably small configurable limit (such as 100). I suspect it's easy to end up with Arrow creating a million files if you make a mistake in choosing your partition columns.
As long as it is configurable, that is fine with me.
But I think something like 100 is too small. For example, the NYC taxi dataset partitioned by year + month for 8 years of data already has 8*12 = 96 groups. And partitioning by day is not that uncommon in practice for big data (although in those cases you will probably not write it all at once).
I'll add max groups as a member of WriteOptions
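A minimal sketch of how that limit might look from Python, assuming it is exposed as a max_partitions keyword on pyarrow.dataset.write_dataset (the keyword name here just mirrors the proposed WriteOptions member and should be treated as an assumption):

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({
    "year": pa.array([2018, 2018, 2019, 2019], type=pa.int32()),
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Partition on "year"; the (assumed) max_partitions keyword caps how many
# distinct partition directories a single write is allowed to create.
ds.write_dataset(
    table,
    "/tmp/by_year",
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("year", pa.int32())]), flavor="hive"),
    max_partitions=2048,  # assumed keyword, mirroring the WriteOptions member
)
```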
non apache CI: https://github.com/bkietz/arrow/runs/1682804367
jorisvandenbossche left a comment:
Thanks! A few questions about how to interact with this as a user (the dictionaries in the Partitioning API)
cpp/src/arrow/dataset/partition.h
Nice docstrings! Those examples help a lot
python/pyarrow/tests/test_dataset.py
Are you required to pass those dictionaries for writing?
Based on a quick test locally, it seems to work without as well?
Even more, it doesn't seem to affect the result if I set different values there than the actual values in the partition column.
That's surprising; you should see errors: "No dictionary provided for dictionary field part: dictionary<values=string, indices=int32, ordered=0>" if you don't specify a dictionary, and "Dictionary supplied for field part: dictionary<values=string, indices=int32, ordered=0> does not contain 'a'" if you specify a dictionary which doesn't include all the column's values.
For now I've updated the docstring and put a comment in the test to indicate that dictionaries are required for parsing dict fields.
@jorisvandenbossche @pitrou addressed comments, PTAL
@bkietz Thanks for the updates! I added one more test with the example from the JIRA, so we also have a test ensuring that writing a table works without specifying the dictionaries. Is it possible to specify the
python/pyarrow/_dataset.pyx
dictionary is ignored here? That doesn't sound right.
I'll adopt the Dict[str, Array] pattern, which will remove this discrepancy from the python interface.
The entry in dictionaries is still ignored when the field is not of dictionary type. Which is a user error of course, but we could maybe raise an exception in that case instead of silently ignoring it.
python/pyarrow/_dataset.pyx
This seems a bit weird and inconvenient as an API. Why not accept a Dict[str, Array] mapping field names to dictionaries?
While it's not a blocker, I admit I don't understand why it's necessary to pass dictionary values, and in which case (only for writing?).
@pitrou The dictionaries are a feature inspired by ParquetDataset: it's useful for each partition expression to contain the dictionary of all unique values that field could take. They are only required when parsing paths. When constructing a Partitioning from a factory (inferring fields from a vector of paths), the dictionaries are assembled automatically. However, if the Partitioning is being directly constructed, then the dictionaries must be explicitly specified. @jorisvandenbossche I'll add a binding for
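A sketch of the directly constructed case in Python; the dictionaries keyword and its dict-of-arrays shape follow the Dict[str, Array] pattern discussed above, and the exact spellings are assumptions rather than confirmed bindings:

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Directly constructed Partitioning: each dictionary-typed field needs its
# full dictionary up front, since there are no paths to collect values from.
# (A factory-built Partitioning would assemble these automatically instead.)
part = ds.partitioning(
    pa.schema([("part", pa.dictionary(pa.int32(), pa.string()))]),
    dictionaries={"part": pa.array(["a", "b"])},
    flavor="hive",
)

# With the dictionaries supplied, paths such as part=a/... can be parsed.
dataset = ds.dataset("/tmp/by_part", format="parquet", partitioning=part)
```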
Force-pushed from c116ed2 to 2cdb375.
Since both @pitrou and I got a bit confused about this, here is my understanding now:
This behaviour of requiring explicit dictionaries when reading a dataset with a Partitioning object whose schema includes dictionary fields already exists in 1.0 and 2.0 (with no way around the error "No dictionary provided for dictionary field" except letting the partitioning be discovered instead of specifying a schema). So that's certainly fine for 3.0.0 as well.
But I am personally still wondering whether, for reading, we can't allow those dictionaries to be left unspecified and discovered even when specifying an explicit schema (e.g. that would allow mixing dictionary and non-dictionary partition fields). This actually worked in pyarrow 0.17.0 (I added a test about that in the PR fixing it (#6641 (comment)), but it apparently got lost in a rebase ;)), but I suppose it was changed in the course of ensuring that the dictionary-typed partition fields "know" the full dictionary of all possible values in the dataset (#7536 (comment)).
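For reference, a sketch of the existing discovery-based route mentioned above, assuming infer_dictionary on HivePartitioning.discover is the Python spelling of it:

```python
import pyarrow.dataset as ds

# Let the partitioning be discovered instead of passing a schema: the
# partition field comes back dictionary-typed, and its dictionary holds
# every value seen in the directory names.
part = ds.HivePartitioning.discover(infer_dictionary=True)
dataset = ds.dataset("/tmp/by_part", format="parquet", partitioning=part)
print(dataset.schema.field("part").type)  # dictionary<values=string, ...>
```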
I opened https://issues.apache.org/jira/browse/ARROW-11260 for the "require dictionaries or not" question.
Enables usage of dictionary columns as partition columns on write.
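A minimal Python sketch of what this enables (paths are illustrative):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# A table whose partition column is dictionary-encoded.
table = pa.table({
    "part": pa.array(["a", "a", "b"]).dictionary_encode(),
    "value": [1, 2, 3],
})

# The dictionary column can now be used directly as the partition key,
# producing directories such as /tmp/by_part/part=a/.
part = ds.partitioning(
    pa.schema([("part", table.column("part").type)]), flavor="hive")
ds.write_dataset(table, "/tmp/by_part", format="parquet", partitioning=part)
```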
Additionally resolves some partition-related follow-ups from #8894 (@pitrou):
At some point, we'll probably want to support null grouping criteria. (For now, this PR adds a test asserting that nulls in any grouping column raise an error.) This will require adding an option/overload/... of dictionary_encode which places nulls in the dictionary instead of the indices, and ensuring Partitionings can format nulls appropriately. This would allow users to write a partitioned dataset which preserves nulls sensibly:
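As an added illustration of the current dictionary_encode behaviour this refers to (not the elided example the sentence above introduces): nulls end up in the indices rather than in the dictionary, so a Partitioning has no dictionary entry to format into a path segment.

```python
import pyarrow as pa

arr = pa.array(["a", None, "a"]).dictionary_encode()
print(arr.dictionary)  # ["a"]        -- the null is not in the dictionary
print(arr.indices)     # [0, null, 0] -- it lives in the indices instead
```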