Add array/map/fixed schema resolution and default value support to arrow-avro codec #8292
Conversation
…rrow-avro codec.rs

Implements handling of default values, including validation during schema resolution. Adds support for resolving differences in `array`, `map`, and `fixed` schemas between writer and reader. Updates codec logic to handle nested default values and enhances resolution for record fields.
Force-pushed from 57db02d to 1f6cf02
Force-pushed from 2b58c87 to 2b4cd3d
Force-pushed from 2b4cd3d to 5e4744c
@alamb Would you be able to review this PR whenever you get a chance? About 60% of the new code is tests.
// Handle JSON nulls per-spec: allowed only for `null` type or unions with null FIRST
I recall we're supporting first and second positions elsewhere in this crate re: impala - can you clarify why it's not an issue that we're diverging from that here?
That's a really good catch. I was planning to bring in NullSecond support here as part of the full Dense Union type decoding support I'm finishing up right now, which is why I left the comment calling it out.
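For context, here is a minimal self-contained sketch of the rule referenced by that comment; `AvroType` and `null_default_allowed` are hypothetical illustrations, not arrow-avro code. Per the Avro spec, a JSON `null` default is only valid for a `null` type or a union whose first branch is `null`.

```rust
// Hypothetical sketch of the "null default only when null is FIRST" rule.
#[derive(Debug, PartialEq)]
enum AvroType {
    Null,
    Int,
    Union(Vec<AvroType>),
}

/// Returns true when a JSON `null` default is legal for `ty` under the Avro spec.
fn null_default_allowed(ty: &AvroType) -> bool {
    match ty {
        AvroType::Null => true,
        // Only the FIRST union branch governs whether the default is valid.
        AvroType::Union(branches) => branches.first() == Some(&AvroType::Null),
        _ => false,
    }
}

fn main() {
    // ["null", "int"] accepts a null default; ["int", "null"] does not.
    assert!(null_default_allowed(&AvroType::Union(vec![
        AvroType::Null,
        AvroType::Int
    ])));
    assert!(!null_default_allowed(&AvroType::Union(vec![
        AvroType::Int,
        AvroType::Null
    ])));
}
```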
arrow-avro/src/codec.rs
})?;
let lit = f.data_type().parse_default_literal(&v)?;
out.insert(name, lit);
} else if f.data_type().nullability() == Some(Nullability::default()) {
This is probably a cheap check - worth it to move it up in the chain before the others?
@nathaniel-d-ef Good callout. I see what you're getting at. I went ahead and cleaned this code up a bit more.
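To illustrate the reordering being discussed, here is a self-contained sketch in which the inexpensive nullability comparison runs before any JSON default parsing; all names below are hypothetical stand-ins rather than the actual codec.rs code.

```rust
// Hypothetical sketch: the cheap check is evaluated ahead of the costly parse.
#[derive(Debug, Clone, Copy, PartialEq, Default)]
enum Nullability {
    #[default]
    NullFirst,
    NullSecond,
}

#[derive(Debug, PartialEq)]
enum Literal {
    Null,
    Int(i64),
}

fn resolve_default(
    nullability: Option<Nullability>,
    json_default: Option<&str>,
) -> Result<Literal, String> {
    // Cheap check first: a field with default nullability and no explicit
    // default short-circuits to a null literal without touching any JSON.
    if json_default.is_none() && nullability == Some(Nullability::default()) {
        return Ok(Literal::Null);
    }
    // Otherwise fall through to the comparatively expensive literal parsing.
    match json_default {
        Some(raw) => raw
            .parse::<i64>()
            .map(Literal::Int)
            .map_err(|e| format!("invalid default: {e}")),
        None => Err("field has no default and is not nullable".to_string()),
    }
}

fn main() {
    assert_eq!(resolve_default(Some(Nullability::NullFirst), None), Ok(Literal::Null));
    assert_eq!(resolve_default(None, Some("7")), Ok(Literal::Int(7)));
    assert!(resolve_default(Some(Nullability::NullSecond), None).is_err());
}
```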
alamb left a comment
Thanks @jecsand838 and @nathaniel-d-ef
As always, I don't really have a huge amount of insight into the Avro spec, so I wouldn't be able to catch any Avro corner cases here, but the code seems clear to me and well tested. Thank you.
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Force-pushed from 692167d to ea56641
Thank you @jecsand838 and @nathaniel-d-ef
# Which issue does this PR close?

This work continues arrow-avro schema resolution support and aligns behavior with the Avro spec.

- **Related to**: #4886 (“Add Avro Support”): ongoing work to round out the reader/decoder, including schema resolution and type promotion.
- **Follow-ups/Context**: #8292 (Add array/map/fixed schema resolution and default value support to arrow-avro codec), #8124 (schema resolution & type promotion for the decoder), #8223 (enum mapping for schema resolution). These previous efforts established the foundations that this PR extends to default values and additional resolvable types.

# Rationale for this change

Avro’s specification requires readers to materialize default values when a field exists in the **reader** schema but not in the **writer** schema, and to validate defaults (i.e., union defaults must match the first branch; bytes/fixed defaults must be JSON strings; enums may specify a default symbol for unknown writer symbols). Implementing this behavior makes `arrow-avro` more standards-compliant and improves interoperability with evolving schemas.

# What changes are included in this PR?

**High-level summary**

* **Refactor `RecordDecoder`** around a simpler **`Projector`**-style abstraction that consumes `ResolvedRecord` to: (a) skip writer-only fields, and (b) materialize reader-only defaulted fields, reducing branching in the hot path. (See commit subject and record decoder changes.)

**Touched files (2):**

* `arrow-avro/src/reader/record.rs` - refactor decoder to use precomputed mappings and defaults.
* `arrow-avro/src/reader/mod.rs` - add comprehensive tests for defaults and error cases (see below).

# Are these changes tested?

Yes, new integration tests cover both the **happy path** and **validation errors**:

* `test_schema_resolution_defaults_all_supported_types`: verifies that defaults for boolean/int/long/float/double/bytes/string/date/time/timestamp/decimal/fixed/enum/duration/uuid/array/map/nested record and unions are materialized correctly for all rows.
* `test_schema_resolution_default_enum_invalid_symbol_errors`: invalid enum default symbol is rejected.
* `test_schema_resolution_default_fixed_size_mismatch_errors`: mismatched fixed/bytes default lengths are rejected.

These tests assert the Avro-spec behavior (i.e., union defaults must match the first branch; bytes/fixed defaults use JSON strings).

# Are there any user-facing changes?

N/A
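As a rough illustration of the `Projector`-style plan described above, here is a hedged, self-contained sketch; `FieldPlan`, `Projector`, and `project_row` are hypothetical stand-ins rather than the actual `RecordDecoder`/`ResolvedRecord` code. The point is that each reader field is either decoded from a writer field or filled from a precomputed default, while writer-only fields are skipped, so the per-row loop carries no schema logic.

```rust
// Hypothetical per-field plan computed once during schema resolution.
#[derive(Debug, Clone)]
enum FieldPlan {
    /// Decode this reader field from the writer field at the given index.
    FromWriter(usize),
    /// The writer never wrote this field; emit the precomputed default.
    Default(String), // stand-in for a parsed default literal
}

#[derive(Debug)]
struct Projector {
    /// One entry per reader field.
    plans: Vec<FieldPlan>,
    /// Writer field indices with no reader counterpart; they must be skipped.
    skip_writer_fields: Vec<usize>,
}

fn project_row(projector: &Projector, writer_row: &[String]) -> Vec<String> {
    projector
        .plans
        .iter()
        .map(|plan| match plan {
            FieldPlan::FromWriter(idx) => writer_row[*idx].clone(),
            FieldPlan::Default(value) => value.clone(),
        })
        .collect()
}

fn main() {
    let projector = Projector {
        plans: vec![FieldPlan::FromWriter(0), FieldPlan::Default("[]".to_string())],
        skip_writer_fields: vec![1usize],
    };
    // The writer row carries two fields; field 1 is writer-only and ignored here.
    let row = project_row(&projector, &["42".to_string(), "ignored".to_string()]);
    assert_eq!(row, vec!["42".to_string(), "[]".to_string()]);
    assert_eq!(projector.skip_writer_fields, vec![1usize]);
}
```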
Which issue does this PR close?
This work continues arrow-avro schema resolution support and aligns behavior with the Avro spec.
Rationale for this change
Avro’s schema resolution requires readers to reconcile differences between the writer and reader schemas, including:

- materializing default values for fields that exist in the reader schema but not in the writer schema
- resolving `array`, `map`, and `fixed` schemas whose writer and reader definitions differ
Prior to this change, arrow-avro’s resolution handled some cases but lacked full Codec support for default values and for resolving array/map/fixed shapes between writer and reader. This led to gaps when reading evolved data or datasets produced by heterogeneous systems. This PR implements these missing pieces so the Arrow reader behaves per the spec in common evolution scenarios.
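To make the evolution scenario concrete, here is an illustrative writer/reader schema pair (not taken from this PR's tests); decoding data written with the writer schema against the reader schema should materialize the declared defaults for the fields the writer never wrote.

```rust
// Writer schema: records were produced with only an `id` field.
const WRITER_SCHEMA: &str = r#"
{
  "type": "record",
  "name": "User",
  "fields": [
    { "name": "id", "type": "long" }
  ]
}"#;

// Reader schema: adds `tags` (an array with an empty default) and a nullable
// `nickname` whose union lists "null" first, so a null default is valid.
const READER_SCHEMA: &str = r#"
{
  "type": "record",
  "name": "User",
  "fields": [
    { "name": "id", "type": "long" },
    { "name": "tags", "type": { "type": "array", "items": "string" }, "default": [] },
    { "name": "nickname", "type": ["null", "string"], "default": null }
  ]
}"#;

fn main() {
    // Per the Avro spec, decoding writer data against READER_SCHEMA yields
    // `tags = []` and `nickname = null` for rows that predate those fields.
    println!("{}\n{}", WRITER_SCHEMA, READER_SCHEMA);
}
```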
What changes are included in this PR?
This PR modifies `arrow-avro/src/codec.rs` to extend the schema-resolution path:

- Default value handling for record fields
- Array / Map / Fixed schema resolution (see the resolution sketch after this list)
- Codec updates
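As a rough illustration of the array resolution listed above, the following self-contained sketch uses a hypothetical `Schema` enum and `resolve` function, not the actual `Codec` types, to show writer-to-reader item promotion such as `int` to `long`.

```rust
// Hypothetical, simplified schema model for the sketch.
#[derive(Debug, Clone, PartialEq)]
enum Schema {
    Int,
    Long,
    Array(Box<Schema>),
}

/// Returns the resolved schema, or an error when writer and reader are incompatible.
fn resolve(writer: &Schema, reader: &Schema) -> Result<Schema, String> {
    match (writer, reader) {
        // Identical schemas resolve to themselves.
        (w, r) if w == r => Ok(r.clone()),
        // One of the spec's promotions: a writer int may be read as a long.
        (Schema::Int, Schema::Long) => Ok(reader.clone()),
        // Arrays resolve element-wise on their item schemas.
        (Schema::Array(w_items), Schema::Array(r_items)) => {
            Ok(Schema::Array(Box::new(resolve(w_items, r_items)?)))
        }
        (w, r) => Err(format!("cannot resolve writer {w:?} against reader {r:?}")),
    }
}

fn main() {
    let writer = Schema::Array(Box::new(Schema::Int));
    let reader = Schema::Array(Box::new(Schema::Long));
    assert_eq!(resolve(&writer, &reader), Ok(Schema::Array(Box::new(Schema::Long))));
}
```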
Are these changes tested?
Yes. This PR includes new unit tests in `arrow-avro/src/codec.rs` covering:

- Null/union nullability rules and metadata persistence of defaults (`AVRO_FIELD_DEFAULT_METADATA_KEY`).
- `AvroLiteral` parsing: `i32`/`f32`; correct literals for `i64`/`f64`; `Utf8`/`Utf8View`; `uuid` strings (RFC 4122).
- `bytes`/`fixed` defaults; `Fixed(n)` length enforcement (see the sketch below); `decimal` on `fixed` vs `bytes`; `duration`/interval fixed 12-byte enforcement.
- Promotion (`int` to `long`) for arrays; reader metadata precedence for colliding attributes; `fixed` name/size matching, including aliases.
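The `Fixed(n)` length enforcement exercised by these tests follows the Avro rule that `bytes`/`fixed` defaults are JSON strings whose characters encode byte values 0-255. Below is a hedged sketch of such a check; `parse_fixed_default` is a hypothetical helper, not the codec.rs implementation.

```rust
// Hypothetical validation of a fixed-type default against its declared size.
fn parse_fixed_default(json_string: &str, size: usize) -> Result<Vec<u8>, String> {
    let mut bytes = Vec::with_capacity(size);
    for ch in json_string.chars() {
        let cp = ch as u32;
        // Each character must be a code point in 0..=255 to represent one byte.
        if cp > 0xFF {
            return Err(format!("invalid byte value {cp} in fixed default"));
        }
        bytes.push(cp as u8);
    }
    if bytes.len() != size {
        return Err(format!(
            "fixed default has {} bytes but the schema declares size {size}",
            bytes.len()
        ));
    }
    Ok(bytes)
}

fn main() {
    // "\u0000\u0001"-style JSON defaults decode to the raw bytes [0, 1].
    assert_eq!(parse_fixed_default("\u{00}\u{01}", 2), Ok(vec![0u8, 1]));
    // A length mismatch is rejected, mirroring
    // `test_schema_resolution_default_fixed_size_mismatch_errors`.
    assert!(parse_fixed_default("\u{00}", 2).is_err());
}
```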
Are there any user-facing changes?
N/A