-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-3144: [C++/Python] Move "dictionary" member from DictionaryType to ArrayData to allow for variable dictionaries #4316
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
cpp/src/arrow/array-dict-test.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be moved to type-test.cc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, will do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
cpp/src/arrow/array.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems equivalent to Cast(array=Take(indices=transpose_map, values=data_), to=out_index_type). Should we add an explicit output type to TakeOptions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would we want to use that here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly, but out of scope for this patch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened https://issues.apache.org/jira/browse/ARROW-5343 which would be a pre-requisite for this
cpp/src/arrow/array.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use child_data[0]?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was discussed on the mailing list. I agree with the others (Antoine / Micah) that having an explicit dictionary field is more clear. I added a benchmark to assess if it causes meaningful overhead which it does not seem to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use TypeError here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to leave the semantics here as unchanged as possible from master
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
went ahead and changed
cpp/src/arrow/type_traits.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're renaming this one, should we do the same for is_signed_integer etc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably. There's some other cleanup to do but not here, I will open a JIRA
cpp/src/arrow/util/concatenate.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
cpp/src/arrow/type.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After this refactor DictionaryType looks more like a nested type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a synthetic construct in C++ since there is no Dictionary type in the protocol metadata...
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have refrained from looking at the IPC implementation details for now.
The one thing I'm worried about is that DictionaryMemo now must be handled by all users of the IPC layer (or their system will be incompatible with dict arrays).
cpp/src/arrow/builder-benchmark.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this actually useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really. I'll remove it. I found that the roundtrip cost of an empty ArrayData is about 60ns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The UnpackHelper implementations could use ArrayDataVisitor from visitor_inline.h.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, they could. This patch removes some prior code duplication but otherwise roughly maintains the status quo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should use checked_cast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
cpp/src/arrow/flight/server.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIR, our make_unique implementation is buggy on MSVC. @bkietz
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
accurate; MSVC will occasionally get confused between internal::make_unique and std::make_unique when it is usinged like this. Referring to it with internal::make_unique prevents the issue https://issues.apache.org/jira/browse/ARROW-5121
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed using statement
cpp/src/arrow/flight/server.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm worried about the details of the IPC stream format leaking into all IPC users.
Can we implement an IpcPayloadWriter instead and rely on OpenRecordBatchWriter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Let's address shortly after this is merged
cpp/src/arrow/flight/server.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the user shouldn't have to implement this and deal with the specifics of dictionary transmission over the wire.
cc @lidavidm for opinions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(or is it @lihalite ?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either will ping me!
I think in this case, it's OK, in that Flight explicitly states we transmit the schema first, then data; also, if we have a set of reasonable implementations of this interface, users should hopefully not feel a need to implement it themselves unless they did actually want control over the specifics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, but if you look at the RecordBatchStream implementation, it's doing non-trivial stuff with dictionaries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'll have to develop some utility code to assist with these details, but it doesn't seem urgent for the moment. If you are creating a custom stream I am not sure right now how to protect the developer from the details of the stream protocol
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did the necessary to hide the details in IpcPayloadWriter, and the converse is available for reading in ipc::MessageReader together with RecordBatchStreamReader. There shouldn't be a need to expose this at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, how would you like to address this, sequence-wise -- is only the implementation here a problem, or is the public API also a problem? I think it would be much easier to fix both after getting this patch merged since we aren't releasing anytime soon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let's fix this afterwards.
cpp/src/arrow/ipc/read-write-test.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we test something about in_memo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added an assertion that in_memo and out_memo agree about the number of dictionaries
cpp/src/arrow/python/flight.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand. This seems to populate a DictionaryMemo but it's not used afterwards?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have a test for dictionary arrays in test_flight.py...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last I tried, cross-language Flight with dictionaries still didn't work, or even from C++ to C++, so it wouldn't have worked before in Python. https://issues.apache.org/jira/browse/ARROW-5143
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah... They work with DoGet but perhaps not with DoPut then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lihalite I'm fixing dict transfer with DoPut as part of ARROW-5113. It may produce conflicts for both you and @wesm :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, you may want to abort the dict transfer work until this is merged
cpp/src/arrow/flight/server.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #4319 for a clean, dictionary-compatible, re-implementation of FlightMessageReader based on ipc::MessageReader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's excellent, thank you
|
I spoke with @romainfrancois and he may not be able to help fix the R bindings until next week, so if it doesn't offend anyone greatly I would add R to allowed failures in Travis CI until R can be fixed |
|
Looks like there are still some doxygen issues, and the integration tests are broken (I was hoping I would get lucky there...) so I will address those issues tomorrow (Thursday) |
|
I'll work on this in a few days. |
|
OK. |
|
This may be a out of scope topic of this pull request but I want to share. Can we use With this change, can we use If this is out of scope of this pull request, I'll open a JIRA issue. |
|
@wesm I'm currently trying to rebase this, and also fixing CUDA compile failures. Edit: done. |
|
I think the integration failure is because the dictionary integration test reuses the same dictionary array for two different fields (with different index types): {
"schema": {
"fields": [
{
"name": "dict1_0",
"type": {
"name": "utf8"
},
"nullable": true,
"children": [],
"dictionary": {
"id": 0,
"indexType": {
"name": "int",
"isSigned": true,
"bitWidth": 8
},
"isOrdered": false
}
},
{
"name": "dict1_1",
"type": {
"name": "utf8"
},
"nullable": true,
"children": [],
"dictionary": {
"id": 0,
"indexType": {
"name": "int",
"isSigned": true,
"bitWidth": 32
},
"isOrdered": false
}
},
{
"name": "dict2_0",
"type": {
"name": "int",
"isSigned": true,
"bitWidth": 64
},
"nullable": true,
"children": [],
"dictionary": {
"id": 1,
"indexType": {
"name": "int",
"isSigned": true,
"bitWidth": 16
},
"isOrdered": false
}
}
]
},
"dictionaries": [
{
...A complication is with how the dictionary types are serialized into JSON. The "dictionaries" key doesn't allow unserializing the dictionary arrays by themselves, as you have to parse the "schema" key to get the value type... |
|
@kou the troublesome thing with that is the |
|
@pitrou I can look at the integration test failure today -- thank you for rebasing! |
|
So the problem is that Either we change the JSON integration tests, or we need to change the IPC layer to accomodate this non-bijection. |
|
Yes, indeed. I'm on it, I will fix. |
|
OK, I have the integration tests fixed locally. Now multiple fields can reference the same dictionary with no problem. I'll add a unit test for the change I made to |
|
Done. I also added rudimentary arguments to toggle the JS and Java integration testers off, it might be worth looking a bit more holistically at integration test CLI options per ARROW-5066 |
|
Integration tests are failing with this error: With these changes it is now more difficult to refer to dictionaries multiple times in IPC streams because ids are assigned to fields prior to becoming aware of the dictionaries themselves. I opened ARROW-5340 to spend some time on this -- I'm inclined to remove the multiply-referenced dictionary from the integration tests and leave it for follow up work Another option is to change Java to not perform assertions on the dictionary id's when comparing schemas |
|
@wesm I've created a new issue for #4316 (comment) : https://issues.apache.org/jira/browse/ARROW-5355 |
|
I removed the multiply-referenced dictionary from the integration tests. I think the dictionary-encoding stuff in Java will need a little bit of work -- it isn't clear to me, for example, why |
|
I'll work on this. |
|
Done. |
More refactoring Continued refactoring Begin removing fixed dictionaries from codebase Fix up Unify implementation and tests More refactoring, consolidation Revert changes to builder_dict.*
…low on work around this can be done
|
@xhochy if you are available to peek at this or approve, that would be helpful. I'm happy to address post-merge feedback as well |
Hi @wesm, looks like an issue with a dependency. I'll investigate (https://travis-ci.org/apache/arrow/jobs/533676609#L536) Corresponding issue in the rustyline repo: kkawakam/rustyline#217. I'm checking what's changed with the latest nightly. CC @andygrove |
|
I'm inclined to merge this with the Rust build broken since there are a lot of PRs that need to be rebased... if anyone has any objection or wants to look more at the changes please let me know in the next hour or two |
|
I've logged https://issues.apache.org/jira/browse/ARROW-5360
@wesm I'm happy that we don't stop the train for other languages because of the Rust issue. It's only occurring on nightly, and we at least know what the issue is. |
|
Merging so we can begin rebasing other PRs |
…ROW-3144 At the moment however, all the `DictionaryMemo` use is internal, it should probably be promoted to arguments (with defaults) to the R functions. I'll do this here or on another PR if this one is merged first so that `r/` builds again on travis. This now needs the C++ lib up to date, e.g. on my setup I get it through `brew install apache-arrow --HEAD`, and there is no conditional compiling so that it still works with previous versions. Let me know if that's ok. follow up from #4316 Author: Romain Francois <romain@rstudio.com> Closes #4413 from romainfrancois/ARROW-5361/dictionary and squashes the following commits: b0de1a8 <Romain Francois> R should pass now 2556c16 <Romain Francois> document() fa0440f <Romain Francois> update R to changes from ARROW-3144 #4316
…DictionaryBuilder Adds support for building and writing delta dictionaries. Moves the `dictionary` Vector pointer to the Data class, similar to #4316. Forked from #4476 since this adds support for delta dictionaries to the DictionaryBuilder. Will rebase this PR after that's merged. All the work is in the last commit, here: b12d842 Author: ptaylor <paul.e.taylor@me.com> Closes #4502 from trxcllnt/js/delta-dictionaries and squashes the following commits: 6a70a25 <ptaylor> make dictionarybuilder and recordbatchwriter support delta dictionaries
…DictionaryBuilder Adds support for building and writing delta dictionaries. Moves the `dictionary` Vector pointer to the Data class, similar to apache/arrow#4316. Forked from apache/arrow#4476 since this adds support for delta dictionaries to the DictionaryBuilder. Will rebase this PR after that's merged. All the work is in the last commit, here: apache/arrow@b12d842 Author: ptaylor <paul.e.taylor@me.com> Closes #4502 from trxcllnt/js/delta-dictionaries and squashes the following commits: 6a70a25bd <ptaylor> make dictionarybuilder and recordbatchwriter support delta dictionaries
…DictionaryBuilder Adds support for building and writing delta dictionaries. Moves the `dictionary` Vector pointer to the Data class, similar to apache/arrow#4316. Forked from apache/arrow#4476 since this adds support for delta dictionaries to the DictionaryBuilder. Will rebase this PR after that's merged. All the work is in the last commit, here: apache/arrow@b12d842 Author: ptaylor <paul.e.taylor@me.com> Closes #4502 from trxcllnt/js/delta-dictionaries and squashes the following commits: 6a70a25bd <ptaylor> make dictionarybuilder and recordbatchwriter support delta dictionaries
This patch moves the dictionary member out of DictionaryType to a new
member on the internal ArrayData structure. As a result, serializing
and deserializing schemas requires only a single IPC message, and
schemas have no knowledge of what the dictionary values are.
The objective of this change is to correct a long-standing Arrow C++
design problem with dictionary-encoded arrays where the dictionary
values must be known at schema construction time. This has plagued us
all over the codebase:
simple because each row group may have a different dictionary
invalidate the pre-existing schema, causing subsequent RecordBatch
objects to be incompatible
sent, having possibly unbounded size.
require an expensive type unification
The summary of what can be learned from this is: do not put data in
type objects, only metadata. Dictionaries are data, not metadata.
There are a number of unavoidable API changes (straightforward for
library users to fix) but otherwise no functional difference in the
library.
As you can see the change is quite complex as significant parts of IPC
read/write, JSON integration testing, and Flight needed to be reworked
to alter the control flow around schema resolution and handling the
first record batch.
Key APIs changed
DictionaryTypeconstructor requires aDataTypefor thedictionary value type instead of the dictionary itself. The
dictionaryfactory method is correspondingly changed. Thedictionaryaccessor method onDictionaryTypeis replaced withvalue_type.DictionaryArrayconstructor andDictionaryArray::FromArraysmustbe passed the dictionary values as an additional argument.
DictionaryMemois exposed in the public API as it is now requiredfor granular interactions with IPC messages with such functions as
ipc::ReadSchemaandipc::ReadRecordBatchDictionaryMemo*argument is added to several low-level publicfunctions in
ipc/writer.handipc/reader.hSome other incidental changes:
Because DictionaryType objects could be reused previous in Schemas, such dictionaries would be "deduplicated" in IPC messages in passing. This is no longer possible by the same trick, so dictionary reuse will have to be handled in a different way (I opened ARROW-5340 to investigate)
As a result of this, an integration test that featured dictionary reuse has been changed to not reuse dictionaries. Technically this is a regression, but I didn't want to block the patch over it
R is added to allow_failures in Travis CI for now