
Conversation


@bkietz bkietz commented Jan 7, 2021

Enables usage of dictionary columns as partition columns on write.
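A minimal sketch of what this enables, using the Python bindings (the table contents, paths, and column names here are illustrative, not taken from the PR):

    import pyarrow as pa
    import pyarrow.dataset as ds

    # A table whose partition column is dictionary-encoded
    table = pa.table({
        "n": [1, 2, 3],
        "col": pa.array(["a", "a", "b"]).dictionary_encode(),
    })

    # Partition on the dictionary column; each dictionary value becomes a directory
    part = ds.partitioning(pa.schema([("col", table["col"].type)]), flavor="hive")
    ds.write_dataset(table, "data/", format="parquet", partitioning=part)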

Additionally resolves some partition-related follow-ups from #8894 (@pitrou):

At some point, we'll probably want to support null grouping criteria. (For now, this PR adds a test asserting that nulls in any grouping column raise an error.) This will require adding an option/overload/... of dictionary_encode which places nulls in the dictionary instead of the indices, and ensuring Partitionings can format nulls appropriately. This would allow users to write a partitioned dataset which preserves nulls sensibly:

data/
    col=a/
        part-0.parquet # col is "a" throughout
    col=b/
        part-1.parquet # col is "b" throughout
    part-2.parquet # col is null throughout



Member

Can you comment on this? It's not obvious why we're limited by the size of an int16_t.

Member Author

I picked it arbitrarily, to be honest. A huge number of groups seemed likely to be an error, so I set a small limit to see who would ask about it. Should we instead allow the full range of int32? @jorisvandenbossche

Member Author

For now I'll remove the constant kMaxGroups and allow any int32 group count.

Member

You never know if someone has a strange use case requiring a lot of groups, so if there is no technical reason to limit it, I think it's good to just allow it.

Member

I don't know, is a separate file created for each group? If so, it makes sense to add a configurable limit.

Member

Yes, at least one file for each group

Member

Then it's definitely worth having a reasonably small configurable limit (such as 100). I suspect it's easy to end up with Arrow creating a million files if you make a mistake in choosing your partition columns.

Member

As long as it is configurable, that is fine for me.
But I think something like 100 is too small. For example, the NYC taxi dataset partitioned by year + month for 8 years of data already has 8*12 = 96 groups. And I think partitioning by day is not that uncommon in practice for big data (although in those cases you will probably not write it all at once).

Member Author

I'll add max groups as a member of WriteOptions


@jorisvandenbossche jorisvandenbossche left a comment

Thanks! A few questions about how to interact with this as a user (the dictionaries in the Partitioning API)

Member

Nice docstrings! Those examples help a lot

Member

Are you required to pass those dictionaries for writing?
Based on a quick test locally, it seems to work without them as well?

Member

Moreover, it doesn't seem to affect the result if I set different values there than the actual values in the partition column.

Member Author

That's surprising; you should see errors: "No dictionary provided for dictionary field part: dictionary<values=string, indices=int32, ordered=0>" if you don't provide a dictionary, and "Dictionary supplied for field part: dictionary<values=string, indices=int32, ordered=0> does not contain 'a'" if you specify a dictionary which doesn't include all the column's values.

@bkietz bkietz Jan 12, 2021

For now I've updated the docstring and put a comment in the test to indicate that dictionaries are required for parsing dict fields.


bkietz commented Jan 13, 2021

@jorisvandenbossche @pitrou addressed comments, PTAL

@jorisvandenbossche

@bkietz Thanks for the updates! I added one more test with the example from the JIRA, so we also have a test that ensures writing a table works without specifying the dictionaries.

Is it possible to specify the max_partitions option from python?

Member

dictionary is ignored here? That doesn't sound right.

@bkietz bkietz Jan 14, 2021

I'll adopt the Dict[str, Array] pattern, which will remove this discrepancy from the python interface.
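For illustration, a sketch of that pattern as it could look from Python (the field name "part" and its values are placeholders):

    import pyarrow as pa
    import pyarrow.dataset as ds

    # Map each dictionary-typed partition field by name to its dictionary values
    part = ds.partitioning(
        pa.schema([("part", pa.dictionary(pa.int32(), pa.string()))]),
        dictionaries={"part": pa.array(["a", "b"])},
        flavor="hive",
    )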

Member

The entry in dictionaries is still ignored when the field is not of dictionary type. Which is a user error of course, but we could maybe raise an exception in that case instead of silently ignoring it.

Member

This seems a bit weird and inconvenient as an API. Why not accept a Dict[str, Array] mapping field names to dictionaries?


pitrou commented Jan 14, 2021

While it's not a blocker, I admit I don't understand why it's necessary to pass dictionary values, and in which case (only for writing?).


bkietz commented Jan 14, 2021

@pitrou The dictionaries are a feature inspired by ParquetDataset: it's useful for each partition expression to contain the dictionary of all unique values that field could take. They are only required when parsing paths. When constructing a Partitioning from a factory (inferring fields from a vector of paths), the dictionaries are assembled automatically. However, if the Partitioning is constructed directly, the dictionaries must be specified explicitly.
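To illustrate the distinction, a hedged sketch using the Python bindings (paths and field names are illustrative):

    import pyarrow as pa
    import pyarrow.dataset as ds

    # Factory: fields and dictionaries are inferred from the discovered paths
    dataset = ds.dataset("data/",
                         partitioning=ds.HivePartitioning.discover(infer_dictionary=True))

    # Direct construction: dictionaries must be supplied explicitly,
    # or parsing paths with dictionary fields will raise an error
    schema = pa.schema([("part", pa.dictionary(pa.int32(), pa.string()))])
    part = ds.partitioning(schema, dictionaries={"part": pa.array(["a", "b"])}, flavor="hive")
    dataset = ds.dataset("data/", partitioning=part)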

@jorisvandenbossche I'll add a binding for max_partitions to python
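A sketch of how that binding could look in use, assuming a max_partitions keyword on ds.write_dataset mirroring the C++ option (the keyword name and the value shown are assumptions here):

    import pyarrow as pa
    import pyarrow.dataset as ds

    table = pa.table({"n": [1, 2], "part": ["a", "b"]})
    part = ds.partitioning(pa.schema([("part", pa.string())]), flavor="hive")

    # Fail the write instead of fanning out into an unbounded number of files
    ds.write_dataset(table, "out/", format="parquet", partitioning=part,
                     max_partitions=1024)  # assumed keyword; caps distinct partition groups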

@bkietz bkietz force-pushed the 10247-Cannot-write-dataset-with branch from c116ed2 to 2cdb375 on January 15, 2021
@jorisvandenbossche

Since both @pitrou and I got a bit confused about these dictionaries, we should maybe try to further clarify the documentation around them. A more explicit test could maybe also help (although I didn't check the C++ tests).
(This can be a follow-up, since I think this needs to get in for the release?)

As I understand it now:

  • It's only needed in ds.dataset(..) when passing a schema, i.e. when an actual Partitioning object is created rather than a PartitioningFactory (which will infer the schema, including the dictionary values, from the file paths).
  • In addition, it only needs to be specified when reading, not when writing with a Partitioning (so you can create and use a schema-based Partitioning object without specifying the dictionaries).
    This is the "only required when parsing paths" behaviour (the docstring says "or an error will be raised in parsing"), since we don't need to parse paths when writing. But for a user it might not necessarily be clear that "parsing paths" == "reading" in practice.

This behaviour of requiring explicit dictionaries when reading a dataset with a Partitioning object with a schema including dictionary fields already exists in 1.0 and 2.0 (only without any way to get around the error "No dictionary provided for dictionary field", except by letting the partitioning be discovered instead of specifying a schema). So that's certainly fine for 3.0.0 as well.

But I am personally still wondering if we can't allow those dictionaries to be unspecified but discovered for reading as well, even when specifying an explicit schema (e.g. it allows having mixed dictionary / non-dictionary partition fields). This actually also worked in pyarrow 0.17.0 (and I added a test about that in the PR fixing it (#6641 (comment)), but that apparently got lost in a rebase ;)), but I suppose this was changed to ensure that the dictionary-typed partition fields "know" the full dictionary of all possible values in the dataset (#7536 (comment)).
I can open a JIRA about this to discuss further.

@jorisvandenbossche

I opened https://issues.apache.org/jira/browse/ARROW-11260 for the "require dictionaries or not" question

@bkietz bkietz closed this in eaa7b7a Jan 15, 2021
@bkietz bkietz deleted the 10247-Cannot-write-dataset-with branch February 25, 2021 16:11