Skip to content

GH-40059: [C++][Python] Basic conversion of RecordBatch to Arrow Tensor#40064

Merged
jorisvandenbossche merged 12 commits intoapache:mainfrom
AlenkaF:record-batch-to-tensor
Mar 5, 2024
Merged

GH-40059: [C++][Python] Basic conversion of RecordBatch to Arrow Tensor#40064
jorisvandenbossche merged 12 commits intoapache:mainfrom
AlenkaF:record-batch-to-tensor

Conversation

@AlenkaF
Copy link
Copy Markdown
Member

@AlenkaF AlenkaF commented Feb 13, 2024

Rationale for this change

There is no method currently in Arrow C++ to convert Table or RecordBatch to a Tensor. In #40058 we are proposing to add the conversion and this PR starts with the basic implementation for RecordBatch.

What changes are included in this PR?

Basic conversion RecordBatchTensor is added together with Python bindings. The implementation details are:

  • One data type (all columns having for example an int32 data type) support.
  • No missing values support (only NaN).
  • Column-major layout of the resulting Tensor.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@AlenkaF AlenkaF marked this pull request as ready for review February 13, 2024 17:24
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Feb 13, 2024
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Feb 14, 2024
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Feb 14, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Feb 14, 2024
@AlenkaF AlenkaF requested a review from bkietz February 15, 2024 18:44
@AlenkaF
Copy link
Copy Markdown
Member Author

AlenkaF commented Feb 26, 2024

@pitrou would you have time to look at this PR please? =)

Copy link
Copy Markdown
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not too familiar with this part of the code base; however, I found this clean, clear, and easy to read!

I might suggest slightly fewer uses of auto (e.g., I don't remember if arr.data() is an ArrayData or a std::shared_ptr<ArrayData>); however, the usage here is definitely consistent with the rest of the Arrow code base.

@github-actions github-actions bot removed the awaiting change review Awaiting change review label Feb 28, 2024
@github-actions github-actions bot added the awaiting changes Awaiting changes label Feb 28, 2024
@AlenkaF
Copy link
Copy Markdown
Member Author

AlenkaF commented Feb 28, 2024

I am not too familiar with this part of the code base; however, I found this clean, clear, and easy to read!

Thank you for looking at the PR!

I might suggest slightly fewer uses of auto (e.g., I don't remember if arr.data() is an ArrayData or a std::shared_ptr<ArrayData>); however, the usage here is definitely consistent with the rest of the Arrow code base.

True, will try to do better in this regard 👍

Copy link
Copy Markdown
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a few nits

auto data = Buffer::Wrap(f_values);

std::shared_ptr<Tensor> tensor_expected;
ASSERT_OK_AND_ASSIGN(tensor_expected, Tensor::Make(float32(), data, shape, f_strides));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind writing or filing an issue for TensorFromJSON()? I think it could make these tests easier to read and write

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea, will open an issue and try to work on it! 👍

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting changes Awaiting changes labels Feb 29, 2024
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Copy link
Copy Markdown
Member

@zeroshade zeroshade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few nitpicks and questions, but otherwise looks great to me

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting merge Awaiting merge labels Feb 29, 2024
@github-actions github-actions bot added awaiting merge Awaiting merge awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting merge Awaiting merge labels Mar 1, 2024

def to_tensor(self):
"""
Convert to a :class:`~pyarrow.Tensor`.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a next PR, you can document a bit more in detail the constraints of when this will work

@jorisvandenbossche
Copy link
Copy Markdown
Member

Given the approvals, going to merge this to unblock follow-up PRs

@conbench-apache-arrow
Copy link
Copy Markdown

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 3ba6d28.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 6 possible false positives for unstable benchmarks that are known to sometimes produce them.

jorisvandenbossche added a commit that referenced this pull request Mar 26, 2024
…or - add support for different data types (#40359)

### What changes are included in this PR?

- Added support for `RecordBatches` with fields of different type in the conversion `RecordBatch` → `Tensor`.
- Added detail of the constraints to the `RecordBatch.to_tensor()` docstrings, see #40064 (comment).

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #40060

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[C++][Python] Basic conversion of RecordBatch to Arrow Tensor

6 participants