GH-39231: [C++][Compute] Add binary_slice kernel for fixed size binary#39245
GH-39231: [C++][Compute] Add binary_slice kernel for fixed size binary#39245pitrou merged 12 commits intoapache:mainfrom
Conversation
3f89475 to
097324d
Compare
|
|
097324d to
f9fd9f5
Compare
There was a problem hiding this comment.
Hmm, can you rephrase the code such that step > 0 and step < 0 have separate code paths? It would make things clearer.
There was a problem hiding this comment.
Nit
| TEST_F(TestFixedSizeBinaryKernels, SliceBytesBasic) { | |
| TEST_F(TestFixedSizeBinaryKernels, BinarySliceBasic) { |
There was a problem hiding this comment.
Can we use less repeated values to make sure the substring selection is right?
| CheckUnary("binary_slice", R"(["abcabc", "defdef"])", fixed_size_binary(2), | |
| CheckUnary("binary_slice", R"(["abcdef", "ghijkl"])", fixed_size_binary(2), |
There was a problem hiding this comment.
Also, can you add some null values at some point?
There was a problem hiding this comment.
The current FixedSizeBinaryTransformExecBase deals with offsets wrongly. I fixed the bug and added a test for binary_replace_slice as well.
|
I think this PR produces incorrect results sometimes: >>> pc.binary_slice(pa.array([b"ab\xf2de"], pa.binary(5)), -5, -6, -3)
Traceback (most recent call last):
Cell In[16], line 1
pc.binary_slice(pa.array([b"ab\xf2de"], pa.binary(5)), -5, -6, -3)
File ~/arrow/dev/python/pyarrow/compute.py:263 in wrapper
return func.call(args, options, memory_pool)
File pyarrow/_compute.pyx:385 in pyarrow._compute.Function.call
result = GetResultValue(
File pyarrow/error.pxi:154 in pyarrow.lib.pyarrow_internal_check_status
return check_status(status)
File pyarrow/error.pxi:91 in pyarrow.lib.check_status
raise convert_status(status)
ArrowInvalid: Invalid UTF8 sequence in inputThis is when adding this validation test in Python: diff --git a/python/pyarrow/tests/test_compute.py b/python/pyarrow/tests/test_compute.py
index 7c5a134d33..d1eb605c71 100644
--- a/python/pyarrow/tests/test_compute.py
+++ b/python/pyarrow/tests/test_compute.py
@@ -561,7 +561,8 @@ def test_slice_compatibility():
def test_binary_slice_compatibility():
- arr = pa.array([b"", b"a", b"a\xff", b"ab\x00", b"abc\xfb", b"ab\xf2de"])
+ data = [b"", b"a", b"a\xff", b"ab\x00", b"abc\xfb", b"ab\xf2de"]
+ arr = pa.array(data)
for start, stop, step in itertools.product(range(-6, 6),
range(-6, 6),
range(-3, 4)):
@@ -574,6 +575,13 @@ def test_binary_slice_compatibility():
assert expected.equals(result)
# Positional options
assert pc.binary_slice(arr, start, stop, step) == result
+ # Fixed size binary input / output
+ for item in data:
+ fsb_scalar = pa.scalar(item, type=pa.binary(len(item)))
+ expected = item[start:stop:step]
+ actual = pc.binary_slice(fsb_scalar, start, stop, step)
+ assert actual.type == pa.binary(len(expected))
+ assert actual.as_py() == expected
def test_split_pattern(): |
cf72c91 to
bec3747
Compare
|
There was a bug in my implementation. I refactored and reused the range computing logic in the kernel exec function. And I also added tests like the one you gave above, in both C++ and Python. |
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
|
Updated. Thanks for pointing out! |
|
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 2b4a703. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 5 possible false positives for unstable benchmarks that are known to sometimes produce them. |
… binary (apache#39245) ### Rationale for this change Add binary_slice kernel for fixed size binary ### What changes are included in this PR? Add binary_slice kernel for fixed size binary ### Are these changes tested? Yes ### Are there any user-facing changes? No * Closes: apache#39231 Lead-authored-by: Jin Shang <shangjin1997@gmail.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: Antoine Pitrou <antoine@python.org>
Rationale for this change
Add binary_slice kernel for fixed size binary
What changes are included in this PR?
Add binary_slice kernel for fixed size binary
Are these changes tested?
Yes
Are there any user-facing changes?
No