Skip to content

fix: compute fixes for extension types#171

Closed
zeroshade wants to merge 3 commits intoapache:mainfrom
zeroshade:extraneous-compute-fixes
Closed

fix: compute fixes for extension types#171
zeroshade wants to merge 3 commits intoapache:mainfrom
zeroshade:extraneous-compute-fixes

Conversation

@zeroshade
Copy link
Copy Markdown
Member

While working on github.com/apache/iceberg-go to enable reading data, I came across a few issues when dealing with UUID / extension types and some other small compute things. This fixes those issues and adds tests to account for them.

Copy link
Copy Markdown
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what are the problems to be fixed...

It may be better that:

  • We use one pull request per problem
  • We add how to reproduce the problem, actual result and expected result in issue/PR description

Comment on lines +143 to +145
uidStorage, _ = scalar.MakeScalarParam(exampleUUID[:],
&arrow.FixedSizeBinaryType{ByteWidth: 16})
uid = scalar.NewExtensionScalar(uidStorage, extensions.NewUUIDType())
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uid -> uuid?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uuid is used as the name of the uuid package, so I'd end up with a conflict / issue if I use the name for the variable here. I can rename it as something else though if we prefer.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! uuidStorage and uuidScalar may be better. (I feel that uid is "user id"...)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do!

@@ -613,7 +632,7 @@ func executeScalarBatch(ctx context.Context, input compute.ExecBatch, exp expr.E
result.Release()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need result = nil here? (Can we return Release()-ed result?)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, probably safer to make it nil you're right. just in case.

@zeroshade
Copy link
Copy Markdown
Member Author

I'm not sure what are the problems to be fixed...

It may be better that:

  • We use one pull request per problem
  • We add how to reproduce the problem, actual result and expected result in issue/PR description

I can separate the change in the compute side from the change on the parquet side. While the changes on the compute side are fixing 2 different issues, they are intertwined and would end up having duplicate code between them if i separate them into different PRs.

I'll split the PR into those two separate ones tomorrow. Essentially it's related to just reading UUID fields from parquet files and utilizing UUID extension types in the compute package to perform comparisons, along with a small error propagation issue.

Copy link
Copy Markdown
Member

@joellubi joellubi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, thanks for doing this. We added Arrow Extension types -> Parquet Logical types but I didn't get around to implementing the mapping the other way. Given that, the same may need to be done for the JSON type as well.

case schema.NoLogicalType, schema.IntervalLogicalType:
return &arrow.FixedSizeBinaryType{ByteWidth: int(length)}, nil
case schema.UUIDLogicalType:
return extensions.NewUUIDType(), nil
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This extension type is registered by default, but we currently have some tests failing when we try to explicitly unregister this type. If we want to respect the unregistration and fallback to FixedBinary then perhaps we could use arrow.GetExtensionType() to get the type instead of using the constructor directly.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I Split this out to a separate PR as per @kou's suggestion, and I've made this change in it. It's a good point!

@zeroshade
Copy link
Copy Markdown
Member Author

Closing in favor of #173 and #174

@zeroshade zeroshade closed this Oct 25, 2024
@zeroshade zeroshade deleted the extraneous-compute-fixes branch October 25, 2024 14:15
zeroshade added a commit that referenced this pull request Oct 26, 2024
Split from #171 to be a more focused PR.

Currently we will properly write arrow data with the canonical UUID
extension type as a parquet UUID column via `pqarrow`. This PR enables
us to read back that data using the `extensions.UUID` data type
correctly even when we don't have a stored schema.

Added a test to the `ArrowExtensionTypeRoundTrip` to ensure proper round
trip without a stored schema.

---------

Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
zeroshade added a commit that referenced this pull request Oct 26, 2024
Split from #171 

Enable using the comparison kernels (equal, less, less_equal, greater,
greater_equal) with UUID columns and extension types in general.

Tests are added to check the kernel dispatch and to ensure compute via
substrait works for UUID type scalars.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants