ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns #9130
Conversation
non apache CI: https://github.com/bkietz/arrow/runs/1664709650
cpp/src/arrow/dataset/partition.cc
Can you comment on this? It's not obvious why we're limited by the size of an int16_t.
I picked it arbitrarily, to be honest. A huge number of groups seemed likely to be an error, so I chose a small limit to see who would ask about it. Should we instead allow the full range of int32? @jorisvandenbossche
For now I'll remove the constant kMaxGroups and allow any int32 group count.
You never know if someone has a strange use case requiring a lot of groups, so if there is not a technical reason, I think it's good to just allow it
I don't know, is a separate file created for each group? If so, it makes sense to put a configurable limit.
Yes, at least one file for each group
Then it's definitely worth having a reasonably small configurable limit (such as 100). I suspect it's easy to end up with Arrow creating a million files if you make a mistake in choosing your partition columns.
As long as it is configurable, that is fine with me.
But I think something like 100 is too small. For example, the NYC taxi dataset partitioned by year + month for 8 years of data already has 8*12 = 96 groups. And partitioning by day is not that uncommon in practice for big data (although in those cases you will probably not write it all at once).
I'll add max groups as a member of WriteOptions
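A minimal sketch of how that limit might look from Python, assuming it is exposed as a max_partitions keyword on pyarrow.dataset.write_dataset (the keyword name here just mirrors the proposed WriteOptions member and should be treated as an assumption):

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({
    "year": pa.array([2018, 2018, 2019, 2019], type=pa.int32()),
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Partition on "year"; the (assumed) max_partitions keyword caps how many
# distinct partition directories a single write is allowed to create.
ds.write_dataset(
    table,
    "/tmp/by_year",
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("year", pa.int32())]), flavor="hive"),
    max_partitions=2048,  # assumed keyword, mirroring the WriteOptions member
)
```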
non apache CI: https://github.com/bkietz/arrow/runs/1682804367
jorisvandenbossche left a comment:
Thanks! A few questions about how to interact with this as a user (the dictionaries in the Partitioning API)
cpp/src/arrow/dataset/partition.h
Nice docstrings! Those examples help a lot
python/pyarrow/tests/test_dataset.py
Are you required to pass those dictionaries for writing?
Based on a quick test locally, it seems to work without as well?
Even more, it doesn't seem to affect the result if I set different values there than the actual values in the partition column.
That's surprising; you should see errors: "No dictionary provided for dictionary field part: dictionary<values=string, indices=int32, ordered=0>" if you don't specify a dictionary, and "Dictionary supplied for field part: dictionary<values=string, indices=int32, ordered=0> does not contain 'a'" if you specify a dictionary which doesn't include all the column's values.
For now I've updated the docstring and put a comment in the test to indicate that dictionaries are required for parsing dict fields.
@jorisvandenbossche @pitrou addressed comments, PTAL
@bkietz Thanks for the updates! I added one more test with the example from the JIRA, so we also have a test ensuring that writing a table works without specifying the dictionaries. Is it possible to specify the
python/pyarrow/_dataset.pyx
dictionary is ignored here? That doesn't sound right.
I'll adopt the Dict[str, Array] pattern, which will remove this discrepancy from the python interface.
The entry in dictionaries is still ignored when the field is not of dictionary type. Which is a user error of course, but we could maybe raise an exception in that case instead of silently ignoring it.
python/pyarrow/_dataset.pyx
This seems a bit weird and inconvenient as an API. Why not accept a Dict[str, Array] mapping field names to dictionaries?
While it's not a blocker, I admit I don't understand why it's necessary to pass dictionary values, and in which case (only for writing?).
@pitrou The dictionaries are a feature inspired by ParquetDataset: it's useful for each partition expression to contain the dictionary of all unique values that field could take. They are only required when parsing paths. When constructing a Partitioning from a factory (inferring fields from a vector of paths), the dictionaries are assembled automatically. However, if the Partitioning is being directly constructed, then the dictionaries must be explicitly specified. @jorisvandenbossche I'll add a binding for
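A sketch of the directly constructed case in Python; the dictionaries keyword and its dict-of-arrays shape follow the Dict[str, Array] pattern discussed above, and the exact spellings are assumptions rather than confirmed bindings:

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Directly constructed Partitioning: each dictionary-typed field needs its
# full dictionary up front, since there are no paths to collect values from.
# (A factory-built Partitioning would assemble these automatically instead.)
part = ds.partitioning(
    pa.schema([("part", pa.dictionary(pa.int32(), pa.string()))]),
    dictionaries={"part": pa.array(["a", "b"])},
    flavor="hive",
)

# With the dictionaries supplied, paths such as part=a/... can be parsed.
dataset = ds.dataset("/tmp/by_part", format="parquet", partitioning=part)
```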
Force-pushed from c116ed2 to 2cdb375.
Since both @pitrou and I got a bit confused about this, here is my understanding now:
This behaviour of requiring explicit dictionaries when reading a dataset with a Partitioning object whose schema includes dictionary fields already exists in 1.0 and 2.0 (with no way around the error "No dictionary provided for dictionary field" except letting the partitioning be discovered instead of specifying a schema). So that's certainly fine for 3.0.0 as well.
But I am personally still wondering whether, for reading, we can't allow those dictionaries to be left unspecified and discovered even when specifying an explicit schema (e.g. that would allow mixing dictionary and non-dictionary partition fields). This actually worked in pyarrow 0.17.0 (I added a test about that in the PR fixing it (#6641 (comment)), but it apparently got lost in a rebase ;)), but I suppose it was changed in the course of ensuring that the dictionary-typed partition fields "know" the full dictionary of all possible values in the dataset (#7536 (comment)).
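For reference, a sketch of the existing discovery-based route mentioned above, assuming infer_dictionary on HivePartitioning.discover is the Python spelling of it:

```python
import pyarrow.dataset as ds

# Let the partitioning be discovered instead of passing a schema: the
# partition field comes back dictionary-typed, and its dictionary holds
# every value seen in the directory names.
part = ds.HivePartitioning.discover(infer_dictionary=True)
dataset = ds.dataset("/tmp/by_part", format="parquet", partitioning=part)
print(dataset.schema.field("part").type)  # dictionary<values=string, ...>
```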
I opened https://issues.apache.org/jira/browse/ARROW-11260 for the "require dictionaries or not" question.
Enables usage of dictionary columns as partition columns on write.
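A minimal Python sketch of what this enables (paths are illustrative):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# A table whose partition column is dictionary-encoded.
table = pa.table({
    "part": pa.array(["a", "a", "b"]).dictionary_encode(),
    "value": [1, 2, 3],
})

# The dictionary column can now be used directly as the partition key,
# producing directories such as /tmp/by_part/part=a/.
part = ds.partitioning(
    pa.schema([("part", table.column("part").type)]), flavor="hive")
ds.write_dataset(table, "/tmp/by_part", format="parquet", partitioning=part)
```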
Additionally resolves some partition-related follow-ups from #8894 (@pitrou):
At some point, we'll probably want to support null grouping criteria. (For now, this PR adds a test asserting that nulls in any grouping column raise an error.) This will require adding an option/overload/... of dictionary_encode which places nulls in the dictionary instead of the indices, and ensuring Partitionings can format nulls appropriately. This would allow users to write a partitioned dataset which preserves nulls sensibly:
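As an added illustration of the current dictionary_encode behaviour this refers to (not the elided example the sentence above introduces): nulls end up in the indices rather than in the dictionary, so a Partitioning has no dictionary entry to format into a path segment.

```python
import pyarrow as pa

arr = pa.array(["a", None, "a"]).dictionary_encode()
print(arr.dictionary)  # ["a"]        -- the null is not in the dictionary
print(arr.indices)     # [0, null, 0] -- it lives in the indices instead
```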