
Conversation


@bkietz bkietz commented Jan 22, 2021

`Function::DispatchBest` looks up a kernel with an approximate type match; it may indicate that casts are required before execution can proceed:

```c++
const Function* add = ...
Int64Array lhs = ...
FloatArray rhs = ...
std::vector<ValueDescr> descrs{lhs.type(), rhs.type()};
ARROW_ASSIGN_OR_RAISE(auto kernel, add->DispatchBest(&descrs));
assert(descrs[0].type->Equals(float32()) && "lhs must be cast to float32 before the arrays may be added");
```

CallFunction now uses DispatchBest and performs implicit casts if necessary, so any consumer of CallFunction inherits support for autocasting:

```python
import pyarrow as pa
import pyarrow.compute as pc
assert pc.add(pa.array([1]), pa.array([1.5])).equals(pa.array([2.5]))  # lhs is implicitly cast to float64
```

The DispatchBest API is also consumed by `dataset::Expression::Bind`, replacing the implicit cast special cases there.
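
For instance (a quick sketch beyond the examples above, not taken from the PR itself), comparison kernels also go through CallFunction, so a mixed int/float comparison now dispatches with an implicit cast:

```python
import pyarrow as pa
import pyarrow.compute as pc

# The int64 array is implicitly cast before being compared against the float64 scalar.
print(pc.greater(pa.array([1, 2, 3]), 1.5))  # -> [false, true, true]
```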

Member

Is there a JIRA for this?

I'm not sure this is about scalars per se (though that may be an issue/part of the "best" solution). This seems to be about timestamp-string comparison/ops.

Member Author

We could additionally decide that it's acceptable to compare strings to timestamps, for example by attempting to implicitly parse a string column as timestamps. I'm not sure we want to allow this
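
For context, here is a sketch of the explicit route that stays available even if implicit string-to-timestamp parsing is never allowed (the sample values, format string, and unit are illustrative only):

```python
import datetime

import pyarrow as pa
import pyarrow.compute as pc

strings = pa.array(["2021-01-22", "2021-12-31"])
# Parse explicitly with strptime instead of relying on an implicit cast...
ts = pc.strptime(strings, format="%Y-%m-%d", unit="us")
# ...then the comparison is between two timestamps of the same unit.
print(pc.greater(ts, pa.scalar(datetime.datetime(2021, 6, 1))))  # -> [false, true]
```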

Member

Was this change needed? (it feels like it shouldn't be)

Member Author

@jorisvandenbossche Previously we allowed the implicit cast from boolean to string, but after this PR comparing a column of type string to a boolean scalar is no longer supported. The "boolean" column has type string because no explicit schema is provided and its values aren't inferred as integral, which leaves string.

Making matters worse: we can't provide an explicit schema because _ParquetDatasetV2 doesn't support that kwarg and ParquetDataset doesn't support a custom partitioning.
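
To make the inference concrete, a minimal, hypothetical reproduction (the temporary directory, file name, and column values are illustrative):

```python
import pathlib
import tempfile

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Hive-style layout with a partition directory "boolean=True".
base = pathlib.Path(tempfile.mkdtemp())
(base / "boolean=True").mkdir()
pq.write_table(pa.table({"x": [1, 2]}), base / "boolean=True" / "part-0.parquet")

dataset = ds.dataset(base, partitioning="hive")
print(dataset.schema)
# "True" is not parseable as an integer, so the "boolean" partition field is
# inferred as string, and (as described above) comparing it against a boolean
# scalar no longer binds without an explicit cast.
```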

Member

Ah, yes, that makes sense.
In the old ParquetDataset code, this works because we simply always try to convert the value of the partition to the type of the value in the expression (although the other way around would arguably be more logical ...). So the string "True" gets cast to a bool and then compared (which might also mean this never worked with False, since bool("False") is True ...):

```python
p_value = f_type(self.levels[level]
                 .dictionary[p_value_index].as_py())
```

… input value of the "expression" to the value in the partition dictionary. So even if the partitioning has a string "True", we will convert the True value to …
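
A quick illustration of the bool("False") caveat above, since it is easy to miss: Python's bool() constructor treats any non-empty string as truthy.

```python
# The old round-trip coerced the partition string with the filter value's type,
# so a partition value "False" coerced via bool() still came out True.
assert bool("True") is True
assert bool("False") is True  # non-empty string, hence truthy
```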


pitrou commented Jan 26, 2021

@bkietz Could you rebase this and fix conflicts?

@bkietz bkietz force-pushed the 8919-Add-DispatchBest-APIs-to- branch from a5cd434 to cca9146 on January 27, 2021 18:03
@pitrou pitrou left a comment

Here are some comments. Is this PR finished?

Member

Are you sure?

```python
>>> import numpy as np
>>> a = 1 << 28
>>> b = a + 1
>>> b > a
True
>>> np.float32(b) > np.float32(a)
False
```

Member

Maybe the comment isn't strictly valid for all possible values. This seems like a rather extreme edge case; I suggest we file a follow-up ticket for it. On balance this PR adds lots of value and fixes many other (much more common) issues.

Member Author

I've created https://issues.apache.org/jira/browse/ARROW-11562 to make this logic more robust

@bkietz bkietz force-pushed the 8919-Add-DispatchBest-APIs-to- branch from bead1db to 8ec575f on February 5, 2021 19:38

bkietz commented Feb 9, 2021

@pitrou PTAL

@pitrou pitrou left a comment

+1.

Member

Nice! It's a bit weird that this happens for "add" rather than "add_checked", but we'll live with it :-)
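
For readers unfamiliar with the two variants, the difference is only in overflow handling; a general pyarrow illustration (not specific to the behaviour discussed in this PR):

```python
import pyarrow as pa
import pyarrow.compute as pc

i64_max = pa.scalar(2**63 - 1, type=pa.int64())
one = pa.scalar(1, type=pa.int64())

print(pc.add(i64_max, one))  # "add" wraps around silently on overflow

try:
    pc.add_checked(i64_max, one)
except pa.ArrowInvalid as exc:
    print(exc)  # "add_checked" raises instead of wrapping
```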

Member Author

Yes, on balance I think that for now at least it's fine to mark uint64 with "I hope you know what you're doing"

@pitrou pitrou force-pushed the 8919-Add-DispatchBest-APIs-to- branch from b2a0ff9 to 62a6b5e on February 10, 2021 14:19

pitrou commented Feb 10, 2021

Hmm, it seems there's an exception under Windows on AppVeyor...


bkietz commented Feb 10, 2021

MinGW failure is unrelated (S3 flake). Merging

@bkietz bkietz closed this in a554ff7 Feb 10, 2021
@bkietz bkietz deleted the 8919-Add-DispatchBest-APIs-to- branch February 10, 2021 21:48

kou commented Feb 14, 2021

@bkietz It seems that this causes nightly test-conda-cpp-valgrind job failure. Could you confirm it?

https://github.com/ursacomputing/crossbow/runs/1877315066#step:6:2817

```
[ RUN      ] TestCompareKernel.GreaterWithImplicitCasts
==6235== Conditional jump or move depends on uninitialised value(s)
==6235==    at 0x5163851: void arrow::internal::GenerateBitsUnrolled<arrow::compute::internal::applicator::ScalarBinary<arrow::BooleanType, arrow::Int32Type, arrow::Int32Type, arrow::compute::internal::(anonymous namespace)::Greater>::ArrayArray(arrow::compute::KernelContext*, arrow::ArrayData const&, arrow::ArrayData const&, arrow::Datum*)::{lambda()#1}>(unsigned char*, long, long, arrow::compute::internal::applicator::ScalarBinary<arrow::BooleanType, arrow::Int32Type, arrow::Int32Type, arrow::compute::internal::(anonymous namespace)::Greater>::ArrayArray(arrow::compute::KernelContext*, arrow::ArrayData const&, arrow::ArrayData const&, arrow::Datum*)::{lambda()#1}&&) (bitmap_generate.h:103)
==6235==    by 0x517445D: Write<arrow::compute::internal::applicator::ScalarBinary<OutType, Arg0Type, Arg1Type, Op>::ArrayArray(arrow::compute::KernelContext*, const arrow::ArrayData&, const arrow::ArrayData&, arrow::Datum*) [with OutType = arrow::BooleanType; Arg0Type = arrow::Int32Type; Arg1Type = arrow::Int32Type; Op = arrow::compute::internal::(anonymous namespace)::Greater]::<lambda()> > (codegen_internal.h:498)
==6235==    by 0x517445D: ArrayArray (codegen_internal.h:735)
==6235==    by 0x517445D: arrow::compute::internal::applicator::ScalarBinary<arrow::BooleanType, arrow::Int32Type, arrow::Int32Type, arrow::compute::internal::(anonymous namespace)::Greater>::Exec(arrow::compute::KernelContext*, arrow::compute::ExecBatch const&, arrow::Datum*) (codegen_internal.h:771)
==6235==    by 0x5047B59: std::_Function_handler<void (arrow::compute::KernelContext*, arrow::compute::ExecBatch const&, arrow::Datum*), void (*)(arrow::compute::KernelContext*, arrow::compute::ExecBatch const&, arrow::Datum*)>::_M_invoke(std::_Any_data const&, arrow::compute::KernelContext*&&, arrow::compute::ExecBatch const&, arrow::Datum*&&) (std_function.h:300)
==6235==    by 0x4FF1CAC: std::function<void (arrow::compute::KernelContext*, arrow::compute::ExecBatch const&, arrow::Datum*)>::operator()(arrow::compute::KernelContext*, arrow::compute::ExecBatch const&, arrow::Datum*) const (std_function.h:688)
==6235==    by 0x4FF16DC: ExecuteBatch (exec.cc:577)
==6235==    by 0x4FF16DC: arrow::compute::detail::(anonymous namespace)::ScalarExecutor::Execute(std::vector<arrow::Datum, std::allocator<arrow::Datum> > const&, arrow::compute::detail::ExecListener*) (exec.cc:515)
==6235==    by 0x4FF5AC7: arrow::compute::Function::Execute(std::vector<arrow::Datum, std::allocator<arrow::Datum> > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) const (function.cc:193)
==6235==    by 0x4FE9B8D: arrow::compute::CallFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<arrow::Datum, std::allocator<arrow::Datum> > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) (exec.cc:944)
==6235==    by 0x4FE9BC2: arrow::compute::CallFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<arrow::Datum, std::allocator<arrow::Datum> > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) (exec.cc:940)
==6235==    by 0x3FDA3F: arrow::compute::(anonymous namespace)::CheckScalarNonRecursive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::shared_ptr<arrow::Array>, std::allocator<std::shared_ptr<arrow::Array> > > const&, std::shared_ptr<arrow::Array> const&, arrow::compute::FunctionOptions const*) (test_util.cc:50)
==6235==    by 0x40361C: arrow::compute::(anonymous namespace)::CheckScalar(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<std::shared_ptr<arrow::Array>, std::allocator<std::shared_ptr<arrow::Array> > > const&, std::shared_ptr<arrow::Array>, arrow::compute::FunctionOptions const*) (test_util.cc:96)
==6235==    by 0x406060: arrow::compute::CheckScalarBinary(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<arrow::Array>, std::shared_ptr<arrow::Array>, std::shared_ptr<arrow::Array>, arrow::compute::FunctionOptions const*) (test_util.cc:175)
==6235==    by 0x314A76: arrow::compute::TestCompareKernel_GreaterWithImplicitCasts_Test::TestBody() (scalar_compare_test.cc:512)
==6235==    by 0x56E998D: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /opt/conda/envs/arrow/lib/libgtest.so)
==6235==    by 0x56E9BE0: testing::Test::Run() (in /opt/conda/envs/arrow/lib/libgtest.so)
==6235==    by 0x56E9F0E: testing::TestInfo::Run() (in /opt/conda/envs/arrow/lib/libgtest.so)
==6235==    by 0x56EA035: testing::TestSuite::Run() (in /opt/conda/envs/arrow/lib/libgtest.so)
==6235==    by 0x56EA5EB: testing::internal::UnitTestImpl::RunAllTests() (in /opt/conda/envs/arrow/lib/libgtest.so)
==6235==    by 0x56EA858: testing::UnitTest::Run() (in /opt/conda/envs/arrow/lib/libgtest.so)
==6235==    by 0x420107E: main (in /opt/conda/envs/arrow/lib/libgtest_main.so)
==6235== 
{
   <insert_a_suppression_name_here>
   Memcheck:Cond
   fun:_ZN5arrow8internal20GenerateBitsUnrolledIZNS_7compute8internal10applicator12ScalarBinaryINS_11BooleanTypeENS_9Int32TypeES7_NS3_12_GLOBAL__N_17GreaterEE10ArrayArrayEPNS2_13KernelContextERKNS_9ArrayDataESF_PNS_5DatumEEUlvE_EEvPhllOT_
   fun:Write<arrow::compute::internal::applicator::ScalarBinary<OutType, Arg0Type, Arg1Type, Op>::ArrayArray(arrow::compute::KernelContext*, const arrow::ArrayData&, const arrow::ArrayData&, arrow::Datum*) [with OutType = arrow::BooleanType; Arg0Type = arrow::Int32Type; Arg1Type = arrow::Int32Type; Op = arrow::compute::internal::(anonymous namespace)::Greater]::<lambda()> >
   fun:ArrayArray
   fun:_ZN5arrow7compute8internal10applicator12ScalarBinaryINS_11BooleanTypeENS_9Int32TypeES5_NS1_12_GLOBAL__N_17GreaterEE4ExecEPNS0_13KernelContextERKNS0_9ExecBatchEPNS_5DatumE
   fun:_ZNSt17_Function_handlerIFvPN5arrow7compute13KernelContextERKNS1_9ExecBatchEPNS0_5DatumEEPS9_E9_M_invokeERKSt9_Any_dataOS3_S6_OS8_
   fun:_ZNKSt8functionIFvPN5arrow7compute13KernelContextERKNS1_9ExecBatchEPNS0_5DatumEEEclES3_S6_S8_
   fun:ExecuteBatch
   fun:_ZN5arrow7compute6detail12_GLOBAL__N_114ScalarExecutor7ExecuteERKSt6vectorINS_5DatumESaIS5_EEPNS1_12ExecListenerE
   fun:_ZNK5arrow7compute8Function7ExecuteERKSt6vectorINS_5DatumESaIS3_EEPKNS0_15FunctionOptionsEPNS0_11ExecContextE
   fun:_ZN5arrow7compute12CallFunctionERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorINS_5DatumESaISA_EEPKNS0_15FunctionOptionsEPNS0_11ExecContextE
   fun:_ZN5arrow7compute12CallFunctionERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorINS_5DatumESaISA_EEPKNS0_15FunctionOptionsEPNS0_11ExecContextE
   fun:_ZN5arrow7compute12_GLOBAL__N_123CheckScalarNonRecursiveERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorISt10shared_ptrINS_5ArrayEESaISD_EERKSD_PKNS0_15FunctionOptionsE
   fun:_ZN5arrow7compute12_GLOBAL__N_111CheckScalarENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorISt10shared_ptrINS_5ArrayEESaISB_EESB_PKNS0_15FunctionOptionsE
   fun:_ZN5arrow7compute17CheckScalarBinaryENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt10shared_ptrINS_5ArrayEES9_S9_PKNS0_15FunctionOptionsE
   fun:_ZN5arrow7compute47TestCompareKernel_GreaterWithImplicitCasts_Test8TestBodyEv
   fun:_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc
   fun:_ZN7testing4Test3RunEv
   fun:_ZN7testing8TestInfo3RunEv
   fun:_ZN7testing9TestSuite3RunEv
   fun:_ZN7testing8internal12UnitTestImpl11RunAllTestsEv
   fun:_ZN7testing8UnitTest3RunEv
   fun:main
}
...
```

sgnkc pushed a commit to sgnkc/arrow that referenced this pull request Feb 17, 2021
…omodate implicit casts


Closes apache#9294 from bkietz/8919-Add-DispatchBest-APIs-to-

Lead-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>

bkietz commented Feb 19, 2021

@kou indeed, and I have an open PR to fix that #9471


kou commented Feb 19, 2021

Thanks!

