Skip to content

GH-39231: [C++][Compute] Add binary_slice kernel for fixed size binary#39245

Merged
pitrou merged 12 commits intoapache:mainfrom
js8544:jinshang/binary_slice_fsb
Jan 11, 2024
Merged

GH-39231: [C++][Compute] Add binary_slice kernel for fixed size binary#39245
pitrou merged 12 commits intoapache:mainfrom
js8544:jinshang/binary_slice_fsb

Conversation

@js8544
Copy link
Copy Markdown
Contributor

@js8544 js8544 commented Dec 16, 2023

Rationale for this change

Add binary_slice kernel for fixed size binary

What changes are included in this PR?

Add binary_slice kernel for fixed size binary

Are these changes tested?

Yes

Are there any user-facing changes?

No

@js8544 js8544 force-pushed the jinshang/binary_slice_fsb branch from 3f89475 to 097324d Compare December 16, 2023 03:55
@github-actions
Copy link
Copy Markdown

⚠️ GitHub issue #39231 has been automatically assigned in GitHub to PR creator.

@js8544 js8544 force-pushed the jinshang/binary_slice_fsb branch from 097324d to f9fd9f5 Compare December 16, 2023 03:55
@pitrou pitrou changed the title GH-39231: [Compute] Add binary_slice kernel for fixed size binary GH-39231: [C++][Compute] Add binary_slice kernel for fixed size binary Dec 21, 2023
Copy link
Copy Markdown
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this @js8544 ! Here are some comments.

Comment on lines 2593 to 2605
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, can you rephrase the code such that step > 0 and step < 0 have separate code paths? It would make things clearer.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit

Suggested change
TEST_F(TestFixedSizeBinaryKernels, SliceBytesBasic) {
TEST_F(TestFixedSizeBinaryKernels, BinarySliceBasic) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use less repeated values to make sure the substring selection is right?

Suggested change
CheckUnary("binary_slice", R"(["abcabc", "defdef"])", fixed_size_binary(2),
CheckUnary("binary_slice", R"(["abcdef", "ghijkl"])", fixed_size_binary(2),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, can you add some null values at some point?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Dec 21, 2023
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current FixedSizeBinaryTransformExecBase deals with offsets wrongly. I fixed the bug and added a test for binary_replace_slice as well.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, thank you!

@js8544 js8544 requested a review from pitrou December 22, 2023 07:08
@pitrou
Copy link
Copy Markdown
Member

pitrou commented Jan 9, 2024

I think this PR produces incorrect results sometimes:

>>> pc.binary_slice(pa.array([b"ab\xf2de"], pa.binary(5)), -5, -6, -3)
Traceback (most recent call last):
  Cell In[16], line 1
    pc.binary_slice(pa.array([b"ab\xf2de"], pa.binary(5)), -5, -6, -3)
  File ~/arrow/dev/python/pyarrow/compute.py:263 in wrapper
    return func.call(args, options, memory_pool)
  File pyarrow/_compute.pyx:385 in pyarrow._compute.Function.call
    result = GetResultValue(
  File pyarrow/error.pxi:154 in pyarrow.lib.pyarrow_internal_check_status
    return check_status(status)
  File pyarrow/error.pxi:91 in pyarrow.lib.check_status
    raise convert_status(status)
ArrowInvalid: Invalid UTF8 sequence in input

This is when adding this validation test in Python:

diff --git a/python/pyarrow/tests/test_compute.py b/python/pyarrow/tests/test_compute.py
index 7c5a134d33..d1eb605c71 100644
--- a/python/pyarrow/tests/test_compute.py
+++ b/python/pyarrow/tests/test_compute.py
@@ -561,7 +561,8 @@ def test_slice_compatibility():
 
 
 def test_binary_slice_compatibility():
-    arr = pa.array([b"", b"a", b"a\xff", b"ab\x00", b"abc\xfb", b"ab\xf2de"])
+    data = [b"", b"a", b"a\xff", b"ab\x00", b"abc\xfb", b"ab\xf2de"]
+    arr = pa.array(data)
     for start, stop, step in itertools.product(range(-6, 6),
                                                range(-6, 6),
                                                range(-3, 4)):
@@ -574,6 +575,13 @@ def test_binary_slice_compatibility():
         assert expected.equals(result)
         # Positional options
         assert pc.binary_slice(arr, start, stop, step) == result
+        # Fixed size binary input / output
+        for item in data:
+            fsb_scalar = pa.scalar(item, type=pa.binary(len(item)))
+            expected = item[start:stop:step]
+            actual = pc.binary_slice(fsb_scalar, start, stop, step)
+            assert actual.type == pa.binary(len(expected))
+            assert actual.as_py() == expected
 
 
 def test_split_pattern():

@js8544
Copy link
Copy Markdown
Contributor Author

js8544 commented Jan 11, 2024

There was a bug in my implementation. I refactored and reused the range computing logic in the kernel exec function. And I also added tests like the one you gave above, in both C++ and Python.

js8544 and others added 2 commits January 12, 2024 01:17
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
@js8544
Copy link
Copy Markdown
Contributor Author

js8544 commented Jan 11, 2024

Updated. Thanks for pointing out!

@pitrou pitrou merged commit 2b4a703 into apache:main Jan 11, 2024
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Jan 11, 2024
@conbench-apache-arrow
Copy link
Copy Markdown

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 2b4a703.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 5 possible false positives for unstable benchmarks that are known to sometimes produce them.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
… binary (apache#39245)

### Rationale for this change
Add binary_slice kernel for fixed size binary

### What changes are included in this PR?
Add binary_slice kernel for fixed size binary

### Are these changes tested?
Yes

### Are there any user-facing changes?
No

* Closes: apache#39231

Lead-authored-by: Jin Shang <shangjin1997@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[C++][Compute] binary_slice function should support fixed-size binary arrays

2 participants