-
Notifications
You must be signed in to change notification settings - Fork 4k
Open
Description
Take() currently concatenates ChunkedArrays first. However, this breaks down when calling Take() from a ChunkedArray or Table where concatenating the arrays would result in an array that's too large. While inconvenient to implement, it would be useful if this case were handled.
This could be done as a higher-level wrapper around Take(), perhaps.
Example in Python:
>>> import pyarrow as pa
>>> pa.__version__
'1.0.0'
>>> rb1 = pa.RecordBatch.from_arrays([["a" * 2**30]], names=["a"])
>>> rb2 = pa.RecordBatch.from_arrays([["b" * 2**30]], names=["a"])
>>> table = pa.Table.from_batches([rb1, rb2], schema=rb1.schema)
>>> table.take([1, 0])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow/table.pxi", line 1145, in pyarrow.lib.Table.take
File "/home/lidavidm/Code/twosigma/arrow/venv/lib/python3.8/site-packages/pyarrow/compute.py", line 268, in take
return call_function('take', [data, indices], options)
File "pyarrow/_compute.pyx", line 298, in pyarrow._compute.call_function
File "pyarrow/_compute.pyx", line 192, in pyarrow._compute.Function.call
File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arraysIn this example, it would be useful if Take() or a higher-level wrapper could generate multiple record batches as output.
Reporter: Will Jones / @wjones127
Assignee: Will Jones / @wjones127
Related issues:
- [C++] Query engine umbrella issue #28385 (is a child of)
- [C++] Take on string chunked arrays slow and fails #26738 (duplicate)
- [Python] take function doesn't work when table has large row counts #31249 (duplicate)
- [C++] Take on string chunked arrays slow and fails #26738 (duplicate)
- [Python] Pyarrow DictionaryArray.dictionary_decode mangling strings #34583 (duplicate)
- [Python] Too much RAM consumption when using
takeon a memory-mapped table #37766 (relates to) - [C++] Allow automatic String -> LargeString promotions when concatenating tables #23539 (relates to)
- [C++][Python] Large strings cause ArrowInvalid: offset overflow while concatenating arrays #33049 (relates to)
- [C++] TakeCC is doing indices.num_chunks() Concatenate() calls when it could be doing only one #40207 (relates to)
PRs and other links:
Note: This issue was originally created as ARROW-9773. Please see the migration documentation for further details.
kdkavanagh, pcmoritz, dmazin and adams-brian