-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-8919: [C++][Compute][Dataset] Add Function::DispatchBest to accomodate implicit casts #9294
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
d848fd0 to
a5cd434
Compare
r/tests/testthat/test-dataset.R
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a JIRA for this?
I'm not sure this is about scalars per se (though that may be an issue/part of the "best" solution). This seems to be about timestamp-string comparison/ops.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could additionally decide that it's acceptable to compare strings to timestamps, for example by attempting to implicitly parse a string column as timestamps. I'm not sure we want to allow this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this change needed? (it feels it shouldn't be)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche Previously we allowed the implicit cast from boolean to string, but after this PR comparing a column of type string to a boolean scalar is no longer supported. Column "boolean" has type string because no explicit schema is provided and it isn't inferred as integral, which leaves string.
Making matters worse: we can't provide an explicit schema because _ParquetDatasetV2 doesn't support that kwarg and ParquetDataset doesn't support a custom partitioning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes, that makes sense.
In the old ParquetDataset code, this works because we simply always try to convert the value of the partition to the type of the value in the expression (although more logically would be the other way around, though ...). So the string "True" gets cast to a bool and then compared (which actually might also mean this never worked with False, as bool("False") is True ...)
arrow/python/pyarrow/parquet.py
Lines 951 to 952 in 4086409
| p_value = f_type(self.levels[level] | |
| .dictionary[p_value_index].as_py()) |
input value of the "expression" to the value in the partition dictionary. So even if the partitioning has a string "True", we will convert the True value to
|
@bkietz Could you rebase this and fix conflicts? |
a5cd434 to
cca9146
Compare
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are some comments. Is this PR finished?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure?
>>> a = 1<<28
>>> b = a + 1
>>> b > a
True
>>> np.float32(b) > np.float32(a)
FalseThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the comment isn't strictly valid for all possible values. This seems like a rather extreme edge case--I suggest we ticket this for followup. On balance this PR add lots of value and fixes many other (much more common) issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've created https://issues.apache.org/jira/browse/ARROW-11562 to make this logic more robust
bead1db to
8ec575f
Compare
|
@pitrou PTAL |
This reverts commit 2d26bf199d36b807d378c7ecc321b86f9922867b.
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! It's a bit weird that this happens for "add" rather than "add_checked", but we'll live with it :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, on balance I think that for now at least it's fine to mark uint64 with "I hope you know what you're doing"
b2a0ff9 to
62a6b5e
Compare
|
Hmm, it seems there's an exception under Windows on AppVeyor... |
…on temporary for msvc
|
MinGW failure is unrelated (S3 flake). Merging |
|
@bkietz It seems that this causes nightly test-conda-cpp-valgrind job failure. Could you confirm it? https://github.com/ursacomputing/crossbow/runs/1877315066#step:6:2817 |
…omodate implicit casts
`Function::DispatchBest` looks up a kernel with approximate type match; it may indicate that casts are required before execution can proceed:
```c++
const Function* add = ...
Int64Array lhs = ...
FloatArray rhs = ...
std::vector<ValueDescr> descrs{lhs.type(), rhs.type()};
ARROW_ASSIGN_OR_RAISE(auto kernel, add->DispatchBest(&descrs));
assert(descrs[0].type->Equals(float32()) && "lhs must be cast to float32 before the arrays may be added");
```
CallFunction now uses DispatchBest and performs implicit casts if necessary, so any consumer of CallFunction inherits support for autocasting:
```python
import pyarrow as pa
import pyarrow.compute as pc
assert pc.add(pa.array([1]), pa.array([1.5])).equals(pa.array([2.5])) # cast lhs to float32
```
The DispatchBest API is also consumed by `dataset::Expression::Bind`, replacing the implicit cast special cases there.
Closes apache#9294 from bkietz/8919-Add-DispatchBest-APIs-to-
Lead-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
|
Thanks! |
Function::DispatchBestlooks up a kernel with approximate type match; it may indicate that casts are required before execution can proceed:CallFunction now uses DispatchBest and performs implicit casts if necessary, so any consumer of CallFunction inherits support for autocasting:
The DispatchBest API is also consumed by
dataset::Expression::Bind, replacing the implicit cast special cases there.