GH-34453: [Go] Support Builders for user defined extensions#34454
GH-34453: [Go] Support Builders for user defined extensions#34454zeroshade merged 8 commits intoapache:mainfrom
Conversation
|
|
zeroshade
left a comment
There was a problem hiding this comment.
Took a look through, thoughts?
go/arrow/datatype_extension.go
Outdated
There was a problem hiding this comment.
should we follow the same pattern as ArrayType and have a BuilderType method which returns a reflect.Type that we use to wrap the ExtensionBuilder with? This also avoids the import cycle.
Another thing to consider is that this is going to break any and all existing Extension types in other consumers' codebases. We should probably make a second interface type which contains the BuilderType method so that we can just use a type assertion test in NewBuilder rather than break existing consumers?
There was a problem hiding this comment.
On the first part, we can keep the same pattern, sure. But actually I think this should 1) provide better performance, no reflection. 2) much better type safety, and better developer experience as it is clear what should be the type of this function (I also wanted to return Builder but had to fallback to interface due to some cyclic import that didn't want to fix part of this PR).
Re 2nd thing, not sure I followed, can you share an example of what this going to break? The only place that we use this function now is here and I fallback to the previous builder.
There was a problem hiding this comment.
For the second point: Note that you had to add the NewBuilder method to the existing extension types. By adding a new method to the interface, anyone who has their own extension types defined that doesn't already have this method defined will have a compile error when they upgrade to this version of Arrow which adds the method because their structs will no longer meet the interface for ExtensionType.
But actually I think this should 1) provide better performance, no reflection. 2) much better type safety, and better developer experience as it is clear what should be the type of this function (I also wanted to return Builder but had to fallback to interface due to some cyclic import that didn't want to fix part of this PR).
In general, I wouldn't expect creating a builder to be a performance bottleneck as consumers shouldn't create builders repeatedly. That said, a couple ideas I've had so far:
- Like was previously done for Arrays, we could shift the definition of the
Builderinterface to thearrowpackage directly and then add an alias in thearraypackage to ensure we don't break any consumers (with a deprecated message telling people to point atarrow.Builderinstead. - If we're going in this direction, rather than passing the allocator and the extension type, we should pass an
ExtensionBuilderto the method and have this just wrap it and return the wrapped builder. The consumer can also retrieve the extension type and the allocator from the builder directly if they need to. So perhaps something likeWrapBuilder(ExtensionBuilder) Builder.
There was a problem hiding this comment.
Re breaking, I don't think this will breaking anything as there is a default in the ExtensionType that just returns nil to keep this backward compatible so if this is not implement I call the old NewExtensionBuilder in the NewBuilder function
There was a problem hiding this comment.
@zeroshade can you please take a look at this comment ^ ? (This is why I didn't re-requested a re-review yet as need your guidance here)
There was a problem hiding this comment.
Ah I missed that! Sorry! and you do check for the nil in NewBuilder, okay that's fair. So that avoids the breaking, you're completely correct.
So just need to address the other point in possibly changing the signature / solving the import cycle.
There was a problem hiding this comment.
re using Builder in the signature I think it might be more complex than it seems (I tried it initially). For example newData *Data function which means we need also to take Data to arrow pacakge. In that case it might be even easier to move datatype_extension.go to array because we can import from array the arrow package but not the otherway around (Also, there are even more complications because the interface has private functions which would cause compilation error if we move it to a different package). Can't think of a much easier way without a big refactor which I think better to avoid atm and just do the runtime check. WDYT ?
There was a problem hiding this comment.
Darn, that's really annoying..... I agree it's better to avoid a big refactor atm.
I still want to change the signature though instead of having consumers have to create both the underlying storage builder & wrap it.
Instead of adding this method to the ExtensionType interface, we could just go with an interface defined in the array package which would let us use Builder in the method definition. something like:
type ExtensionTypeCustomBuilder interface {
NewBuilder(ExtensionBuilder) Builder
}Then in builder.go you can do:
case arrow.EXTENSION:
typ := dtype.(arrow.ExtensionType)
bldr := NewExtensionBuilder(mem, typ)
if custom, ok := typ.(ExtensionTypeCustomBuilder); ok {
return custom.NewBuilder(bldr)
}
return bldrWhat do you think?
There was a problem hiding this comment.
I like that! Updated.
Blocked by cloudquery/plugin-sdk#724 but ready for initial review and discussion. I can do a short walkthrough for anyone up for review. Apache arrow fork is here (We use it until [this](apache/arrow#34454) is merged): https://github.com/cloudquery/arrow/tree/feat_extension_builder. Some more notes: - Right now we are just migrating the json writer/reader to use apache arrow so we can roll format by format and see if there are any real world issues before we roll this to everywhere instead of our own type system.
@zeroshade Thanks for the initial review! Commented and also linked to a few example on how we use it. |
b6560ef to
2e0ae3d
Compare
zeroshade
left a comment
There was a problem hiding this comment.
Just request a re-review once this is ready for a new review, thanks.
4a8a202 to
1d38d96
Compare
1d38d96 to
8f8429a
Compare
|
Ready now for another review, though looks like some workflows are failing. I think this is not connected though to this PR |
|
|
||
| type UUIDBuilder struct { | ||
| *array.ExtensionBuilder | ||
| dtype *UUIDType |
There was a problem hiding this comment.
the failures are because this field is unused here
There was a problem hiding this comment.
Oh ok. Fixed. I wasn't able to find it from the action. Where should I look for it next time? (Can I run those linters locally as well?)
There was a problem hiding this comment.
So that was actually found by the Go build failing, and I saw it by looking at the log file for the failing Go workflows.
You can also run the linters (or most of the workflows) by following the instructions here: https://arrow.apache.org/docs/developers/continuous_integration/archery.html to install the archery utility in the repo and then running archery docker run <job> as described here.
There was a problem hiding this comment.
Sounds good. Thanks, I'll try those for next PR! Tests passing now.
zeroshade
left a comment
There was a problem hiding this comment.
A few more nitpicks. We're almost there! Thanks again for all this!
| data := make([][]byte, len(v)) | ||
| for i, v := range v { | ||
| data[i] = v[:] | ||
| } | ||
| b.ExtensionBuilder.Builder.(*array.FixedSizeBinaryBuilder).AppendValues(data, valid) |
There was a problem hiding this comment.
this could be done more efficiently if desired via resize + UnsafeAppend or other methods. Not necessary for this PR but just something to think about.
zeroshade
left a comment
There was a problem hiding this comment.
just a single whitespace nitpick haha. Fix that and i'll merge this 😄
|
Benchmark runs are scheduled for baseline = 71f3c56 and contender = 5219de3. 5219de3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
This should serve as discussion as it's a medium change but this should Close #34453 and give users the ability to define custom Builder for their extensions just like they define ExtensionTypes and ExtensionArrays