fix: compute fixes for extension types by zeroshade · Pull Request #171 · apache/arrow-go

zeroshade · 2024-10-25T00:03:14Z

While working on github.com/apache/iceberg-go to enable reading data, I came across a few issues when dealing with UUID / extension types and some other small compute things. This fixes those issues and adds tests to account for them.

kou

I'm not sure what are the problems to be fixed...

It may be better that:

We use one pull request per problem
We add how to reproduce the problem, actual result and expected result in issue/PR description

kou · 2024-10-25T00:25:31Z

arrow/compute/exprs/exec_test.go

+		uidStorage, _ = scalar.MakeScalarParam(exampleUUID[:],
+			&arrow.FixedSizeBinaryType{ByteWidth: 16})
+		uid = scalar.NewExtensionScalar(uidStorage, extensions.NewUUIDType())


uid -> uuid?

uuid is used as the name of the uuid package, so I'd end up with a conflict / issue if I use the name for the variable here. I can rename it as something else though if we prefer.

Ah! uuidStorage and uuidScalar may be better. (I feel that uid is "user id"...)

kou · 2024-10-25T00:30:21Z

arrow/compute/exprs/exec.go

@@ -613,7 +632,7 @@ func executeScalarBatch(ctx context.Context, input compute.ExecBatch, exp expr.E
 			result.Release()


Do we need result = nil here? (Can we return Release()-ed result?)

hmm, probably safer to make it nil you're right. just in case.

zeroshade · 2024-10-25T01:06:12Z

I'm not sure what are the problems to be fixed...

It may be better that:

We use one pull request per problem

We add how to reproduce the problem, actual result and expected result in issue/PR description

I can separate the change in the compute side from the change on the parquet side. While the changes on the compute side are fixing 2 different issues, they are intertwined and would end up having duplicate code between them if i separate them into different PRs.

I'll split the PR into those two separate ones tomorrow. Essentially it's related to just reading UUID fields from parquet files and utilizing UUID extension types in the compute package to perform comparisons, along with a small error propagation issue.

joellubi

Looks good overall, thanks for doing this. We added Arrow Extension types -> Parquet Logical types but I didn't get around to implementing the mapping the other way. Given that, the same may need to be done for the JSON type as well.

joellubi · 2024-10-25T13:19:24Z

parquet/pqarrow/schema.go

+	case schema.NoLogicalType, schema.IntervalLogicalType:
 		return &arrow.FixedSizeBinaryType{ByteWidth: int(length)}, nil
+	case schema.UUIDLogicalType:
+		return extensions.NewUUIDType(), nil


This extension type is registered by default, but we currently have some tests failing when we try to explicitly unregister this type. If we want to respect the unregistration and fallback to FixedBinary then perhaps we could use arrow.GetExtensionType() to get the type instead of using the constructor directly.

I Split this out to a separate PR as per @kou's suggestion, and I've made this change in it. It's a good point!

zeroshade · 2024-10-25T14:15:02Z

Closing in favor of #173 and #174

Split from #171 to be a more focused PR. Currently we will properly write arrow data with the canonical UUID extension type as a parquet UUID column via `pqarrow`. This PR enables us to read back that data using the `extensions.UUID` data type correctly even when we don't have a stored schema. Added a test to the `ArrowExtensionTypeRoundTrip` to ensure proper round trip without a stored schema. --------- Co-authored-by: Sutou Kouhei <kou@cozmixng.org>

Split from #171 Enable using the comparison kernels (equal, less, less_equal, greater, greater_equal) with UUID columns and extension types in general. Tests are added to check the kernel dispatch and to ensure compute via substrait works for UUID type scalars.

zeroshade added 2 commits October 24, 2024 19:34

fix: compute fixes for extension types

c779c38

update tests

2189067

zeroshade requested review from joellubi, kou and lidavidm October 25, 2024 00:03

kou reviewed Oct 25, 2024

View reviewed changes

lidavidm approved these changes Oct 25, 2024

View reviewed changes

joellubi reviewed Oct 25, 2024

View reviewed changes

updates from feedback

30ebdfe

This was referenced Oct 25, 2024

fix(parquet): Reading UUID columns #173

Merged

fix(arrow/compute): compare kernels with UUID #174

Merged

zeroshade closed this Oct 25, 2024

zeroshade deleted the extraneous-compute-fixes branch October 25, 2024 14:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: compute fixes for extension types#171

fix: compute fixes for extension types#171
zeroshade wants to merge 3 commits intoapache:mainfrom
zeroshade:extraneous-compute-fixes

zeroshade commented Oct 25, 2024

Uh oh!

kou left a comment

Uh oh!

kou Oct 25, 2024

Uh oh!

zeroshade Oct 25, 2024

Uh oh!

kou Oct 25, 2024

Uh oh!

zeroshade Oct 25, 2024

Uh oh!

kou Oct 25, 2024

Uh oh!

zeroshade Oct 25, 2024

Uh oh!

zeroshade commented Oct 25, 2024

Uh oh!

joellubi left a comment

Uh oh!

joellubi Oct 25, 2024

Uh oh!

zeroshade Oct 25, 2024

Uh oh!

zeroshade commented Oct 25, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

		@@ -613,7 +632,7 @@ func executeScalarBatch(ctx context.Context, input compute.ExecBatch, exp expr.E
		result.Release()

Conversation

zeroshade commented Oct 25, 2024

Uh oh!

kou left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zeroshade commented Oct 25, 2024

Uh oh!

joellubi left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zeroshade commented Oct 25, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants