-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Add Enum type support to arrow-avro and Minor Decimal type fix #7852
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
3612e07 to
9628f12
Compare
…num type support
9628f12 to
8df19e0
Compare
scovich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change looks reasonable to me, but I don't feel qualified to stamp this PR
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jecsand838 for this PR and @scovich for the review
I think it looks good to me -- I had a small comment, but I don't think it is necessary
| #[test] | ||
| fn test_decimal() { | ||
| let files = [ | ||
| ("avro/fixed_length_decimal.avro", 25, 2), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
| #[test] | ||
| fn test_simple() { | ||
| let tests = [ | ||
| ("avro/simple_enum.avro", 4, build_expected_enum(), 2), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is great -- thank you for these tests
| ("avro/simple_fixed.avro", 2, build_expected_fixed(), 1), | ||
| ]; | ||
|
|
||
| fn build_expected_enum() -> RecordBatch { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than building up the expected values manually, what do you think about reading the corresponding expected results instead?
https://github.com/apache/arrow-testing/blob/master/data/avro/README.md suggests that the corresponding files are here: https://github.com/apache/parquet-testing/tree/master/data
Doing this would result in adding a parquet (dev) dependency on this crate which might not be ideal 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good call out.
I was attempting to keep the dependencies (dev ones included) as minimal as possible, however the tests would absolutely be cleaner. I'd also have some hesitation about keeping test files in two repos in sync.
I'll think on it some more, the upcoming PR to add the remaining integration tests maybe a good time to make this change as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was attempting to keep the dependencies (dev ones included) as minimal as possible, however the tests would absolutely be cleaner. I'd also have some hesitation about keeping test files in two repos in sync.
The two repos are already kept in sync (with a git submodule) so you probably don't have to do anything
You could use this function to find the relevant location: https://github.com/apache/arrow-rs/blob/main/arrow/src/util/test_util.rs#L100
|
Thanks again @scovich and @jecsand838 |
Which issue does this PR close?
Part of Add Avro Support #4886
Related to Avro codec enhancements #6965
Rationale for this change
The
arrow-avrocrate currently lacks support for the Avroenumtype, which is a standard and commonly used type in Avro schemas. This omission prevents users from reading Avro files containing enums, limiting the crate's utility.This change introduces support for decoding Avro enums by mapping them to the Arrow
DictionaryArraytype. This is a logical and efficient representation. Implementing this feature brings thearrow-avrocrate closer to full Avro specification compliance and makes it more robust for real-world use cases.What changes are included in this PR?
This PR introduces comprehensive support for Avro enum decoding along with a minor Avro decimal decoding fix. The key changes are:
Schema Parsing (
codec.rs):Codec::Enum(Arc<[String]>)variant was added to represent a parsed enum and its associated symbols.make_data_typefunction now parsesComplexType::Enumschemas. It also stores the original symbols as a JSON string in theField's metadata under the key"avro.enum.symbols"to ensure schema fidelity and enable lossless round-trip conversions.Codec::data_typemethod was updated to map the internalCodec::Enumto the corresponding ArrowDataType::Dictionary(Box<Int32>, Box<Utf8>).Decoding Logic (
reader/record.rs):Decoder::Enum(Vec<i32>, Arc<[String]>)variant was added to manage the state of decoding enum values.Decoderwas enhanced to create, decode, and flushEnumtypes:try_newcreates the decoder.decodereads the Avrointindex from the byte buffer.flushconstructs the finalDictionaryArray<Int32Type>using the collected indices as keys and the stored symbols as the dictionary values.append_nullwas extended to handle nullable enums.Minor Decimal Type Decoding Fix (
codec.rs)make_data_typedue to the(Some("decimal"), c @ Codec::Fixed(sz))branch ofmatch (t.attributes.logical_type, &mut field.codec)not being reachable. This issue was caught by the new decimal integration tests inarrow-avro/src/reader/mod.rs.Are these changes tested?
Enumtype:record.rsto specifically validate both non-nullable and nullable enum decoding logic.arrow-avro/src/reader/mod.rswas used to validate the end-to-end functionality with a newavro/simple_enum.avrotest case, ensuring compatibility with the overall reader infrastructure.DecimalandFixedtypes:avro/simple_fixed.avro,avro/fixed_length_decimal.avro,avro/fixed_length_decimal_legacy.avro,avro/int32_decimal.avro,avro/int64_decimal.avroAre there any user-facing changes?
N/A