Skip to content

Conversation

@jecsand838
Copy link
Contributor

@jecsand838 jecsand838 commented Jul 2, 2025

Which issue does this PR close?

Rationale for this change

The arrow-avro crate currently lacks support for the Avro enum type, which is a standard and commonly used type in Avro schemas. This omission prevents users from reading Avro files containing enums, limiting the crate's utility.

This change introduces support for decoding Avro enums by mapping them to the Arrow DictionaryArray type. This is a logical and efficient representation. Implementing this feature brings the arrow-avro crate closer to full Avro specification compliance and makes it more robust for real-world use cases.

What changes are included in this PR?

This PR introduces comprehensive support for Avro enum decoding along with a minor Avro decimal decoding fix. The key changes are:

  1. Schema Parsing (codec.rs):

    • A new Codec::Enum(Arc<[String]>) variant was added to represent a parsed enum and its associated symbols.
    • The make_data_type function now parses ComplexType::Enum schemas. It also stores the original symbols as a JSON string in the Field's metadata under the key "avro.enum.symbols" to ensure schema fidelity and enable lossless round-trip conversions.
    • The Codec::data_type method was updated to map the internal Codec::Enum to the corresponding Arrow DataType::Dictionary(Box<Int32>, Box<Utf8>).
  2. Decoding Logic (reader/record.rs):

    • A new Decoder::Enum(Vec<i32>, Arc<[String]>) variant was added to manage the state of decoding enum values.
    • The Decoder was enhanced to create, decode, and flush Enum types:
      • try_new creates the decoder.
      • decode reads the Avro int index from the byte buffer.
      • flush constructs the final DictionaryArray<Int32Type> using the collected indices as keys and the stored symbols as the dictionary values.
      • append_null was extended to handle nullable enums.
  3. Minor Decimal Type Decoding Fix (codec.rs)

    • A minor decimal decoding fix was implemented in make_data_type due to the (Some("decimal"), c @ Codec::Fixed(sz)) branch of match (t.attributes.logical_type, &mut field.codec) not being reachable. This issue was caught by the new decimal integration tests in arrow-avro/src/reader/mod.rs.

Are these changes tested?

  • Yes, test coverage was provided for the new Enum type:
    • New unit tests were added to record.rs to specifically validate both non-nullable and nullable enum decoding logic.
    • The existing integration test suite in arrow-avro/src/reader/mod.rs was used to validate the end-to-end functionality with a new avro/simple_enum.avro test case, ensuring compatibility with the overall reader infrastructure.
  • New tests were also included for the Decimal and Fixed types:
    • This integration test suite was also extended to include tests for avro/simple_fixed.avro, avro/fixed_length_decimal.avro, avro/fixed_length_decimal_legacy.avro, avro/int32_decimal.avro, avro/int64_decimal.avro

Are there any user-facing changes?

N/A

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jul 2, 2025
@jecsand838 jecsand838 force-pushed the avro-codec-enum branch 2 times, most recently from 3612e07 to 9628f12 Compare July 2, 2025 21:59
Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change looks reasonable to me, but I don't feel qualified to stamp this PR

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jecsand838 for this PR and @scovich for the review

I think it looks good to me -- I had a small comment, but I don't think it is necessary

#[test]
fn test_decimal() {
let files = [
("avro/fixed_length_decimal.avro", 25, 2),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

#[test]
fn test_simple() {
let tests = [
("avro/simple_enum.avro", 4, build_expected_enum(), 2),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is great -- thank you for these tests

("avro/simple_fixed.avro", 2, build_expected_fixed(), 1),
];

fn build_expected_enum() -> RecordBatch {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than building up the expected values manually, what do you think about reading the corresponding expected results instead?

https://github.com/apache/arrow-testing/blob/master/data/avro/README.md suggests that the corresponding files are here: https://github.com/apache/parquet-testing/tree/master/data

Doing this would result in adding a parquet (dev) dependency on this crate which might not be ideal 🤔

Copy link
Contributor Author

@jecsand838 jecsand838 Jul 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good call out.

I was attempting to keep the dependencies (dev ones included) as minimal as possible, however the tests would absolutely be cleaner. I'd also have some hesitation about keeping test files in two repos in sync.

I'll think on it some more, the upcoming PR to add the remaining integration tests maybe a good time to make this change as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was attempting to keep the dependencies (dev ones included) as minimal as possible, however the tests would absolutely be cleaner. I'd also have some hesitation about keeping test files in two repos in sync.

The two repos are already kept in sync (with a git submodule) so you probably don't have to do anything

You could use this function to find the relevant location: https://github.com/apache/arrow-rs/blob/main/arrow/src/util/test_util.rs#L100

@alamb alamb merged commit a3584e5 into apache:main Jul 5, 2025
24 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 5, 2025

Thanks again @scovich and @jecsand838

@jecsand838 jecsand838 deleted the avro-codec-enum branch August 9, 2025 20:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants