GH-40059: [C++][Python] Basic conversion of RecordBatch to Arrow Tensor#40064
GH-40059: [C++][Python] Basic conversion of RecordBatch to Arrow Tensor#40064jorisvandenbossche merged 12 commits intoapache:mainfrom
Conversation
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
|
@pitrou would you have time to look at this PR please? =) |
paleolimbot
left a comment
There was a problem hiding this comment.
I am not too familiar with this part of the code base; however, I found this clean, clear, and easy to read!
I might suggest slightly fewer uses of auto (e.g., I don't remember if arr.data() is an ArrayData or a std::shared_ptr<ArrayData>); however, the usage here is definitely consistent with the rest of the Arrow code base.
Thank you for looking at the PR!
True, will try to do better in this regard 👍 |
| auto data = Buffer::Wrap(f_values); | ||
|
|
||
| std::shared_ptr<Tensor> tensor_expected; | ||
| ASSERT_OK_AND_ASSIGN(tensor_expected, Tensor::Make(float32(), data, shape, f_strides)); |
There was a problem hiding this comment.
Would you mind writing or filing an issue for TensorFromJSON()? I think it could make these tests easier to read and write
There was a problem hiding this comment.
Great idea, will open an issue and try to work on it! 👍
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
zeroshade
left a comment
There was a problem hiding this comment.
Just a few nitpicks and questions, but otherwise looks great to me
|
|
||
| def to_tensor(self): | ||
| """ | ||
| Convert to a :class:`~pyarrow.Tensor`. |
There was a problem hiding this comment.
For a next PR, you can document a bit more in detail the constraints of when this will work
|
Given the approvals, going to merge this to unblock follow-up PRs |
|
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 3ba6d28. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 6 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…or - add support for different data types (#40359) ### What changes are included in this PR? - Added support for `RecordBatches` with fields of different type in the conversion `RecordBatch` → `Tensor`. - Added detail of the constraints to the `RecordBatch.to_tensor()` docstrings, see #40064 (comment). ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * GitHub Issue: #40060 Lead-authored-by: AlenkaF <frim.alenka@gmail.com> Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com> Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Rationale for this change
There is no method currently in Arrow C++ to convert
TableorRecordBatchto aTensor. In #40058 we are proposing to add the conversion and this PR starts with the basic implementation forRecordBatch.What changes are included in this PR?
Basic conversion
RecordBatch→Tensoris added together with Python bindings. The implementation details are:int32data type) support.NaN).Tensor.Are these changes tested?
Yes.
Are there any user-facing changes?
No.