@jecsand838
Contributor

Which issue does this PR close?

This work continues arrow-avro schema resolution support and aligns behavior with the Avro spec.

Rationale for this change

Avro’s schema resolution requires readers to reconcile differences between the writer and reader schemas, including:

  • Using record-field default values when the writer lacks a field present in the reader; defaults must be type-correct (e.g., union defaults match the first union member; bytes/fixed defaults are JSON strings).
  • Recursively resolving arrays (by item schema) and maps (by value schema).
  • Resolving fixed types (size and unqualified name must match) and erroring when they do not.

Prior to this change, arrow-avro’s resolution handled some cases but lacked full Codec support for default values and for resolving array/map/fixed shapes between writer and reader. This led to gaps when reading evolved data or datasets produced by heterogeneous systems. This PR implements these missing pieces so the Arrow reader behaves per the spec in common evolution scenarios.

What changes are included in this PR?

This PR modifies arrow-avro/src/codec.rs to extend the schema-resolution path with:

  • Default value handling for record fields

    • Reads and applies default values when the reader expects a field absent from the writer, including nested defaults.
    • Validates defaults per the Avro spec (e.g., union defaults match the first schema; bytes/fixed defaults are JSON strings).
  • Array / Map / Fixed schema resolution

    • Array: recursively resolves item schemas (writer↔reader).
    • Map: recursively resolves value schemas.
    • Fixed: enforces matching size and (unqualified) name; otherwise signals an error, consistent with the spec.
  • Codec updates

    • Refactors internal codec logic to support the above during decoding, including resolution for record fields and nested defaults. (See commit message for the high-level summary.)
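
The array/map/fixed rules above can be illustrated with a minimal sketch. The `Schema` enum and the `resolve` and `unqualified` functions below are hypothetical, simplified stand-ins invented for this example; they are not the types or functions in arrow-avro/src/codec.rs, and they ignore aliases and most primitive types.

```rust
/// Hypothetical, simplified schema model (NOT arrow-avro's actual types).
#[derive(Debug, Clone, PartialEq)]
enum Schema {
    Int,
    Long,
    String,
    Array(Box<Schema>),
    Map(Box<Schema>),
    Fixed { name: String, size: usize },
}

/// Strip any namespace so that only the unqualified name is compared.
fn unqualified(name: &str) -> &str {
    name.rsplit('.').next().unwrap_or(name)
}

/// Resolve a writer schema against a reader schema. Returns the schema the
/// reader should produce, or an error describing the mismatch.
fn resolve(writer: &Schema, reader: &Schema) -> Result<Schema, String> {
    match (writer, reader) {
        (Schema::Int, Schema::Int) => Ok(Schema::Int),
        (Schema::Long, Schema::Long) => Ok(Schema::Long),
        (Schema::String, Schema::String) => Ok(Schema::String),
        // Promotion: a writer `int` may be read as a reader `long`.
        (Schema::Int, Schema::Long) => Ok(Schema::Long),
        // Arrays resolve recursively by item schema.
        (Schema::Array(w), Schema::Array(r)) => Ok(Schema::Array(Box::new(resolve(w, r)?))),
        // Maps resolve recursively by value schema.
        (Schema::Map(w), Schema::Map(r)) => Ok(Schema::Map(Box::new(resolve(w, r)?))),
        // Fixed types must agree on size and unqualified name.
        (Schema::Fixed { name: wn, size: ws }, Schema::Fixed { name: rn, size: rs }) => {
            if ws != rs {
                return Err(format!("fixed size mismatch: writer {} vs reader {}", ws, rs));
            }
            if unqualified(wn) != unqualified(rn) {
                return Err(format!("fixed name mismatch: `{}` vs `{}`", wn, rn));
            }
            Ok(reader.clone())
        }
        (w, r) => Err(format!("cannot resolve writer {:?} against reader {:?}", w, r)),
    }
}
```

In this sketch, resolving a writer `array<int>` against a reader `array<long>` yields `array<long>`, while two fixed types with different sizes produce an error.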

Are these changes tested?

Yes. This PR includes new unit tests in arrow-avro/src/codec.rs covering:

  1. Default validation & persistence
    • Null/union‑nullability rules; metadata persistence of defaults (AVRO_FIELD_DEFAULT_METADATA_KEY).
  2. AvroLiteral parsing
    • Range checks for i32/f32; correct literals for i64/f64; Utf8/Utf8View; uuid strings (RFC‑4122).
    • Byte‑range mapping for bytes/fixed defaults; Fixed(n) length enforcement; decimal on fixed vs bytes; duration/interval fixed 12‑byte enforcement.
  3. Collections & records
    • Array/map defaults shape; enum symbol validity; record defaults for missing fields, required‑field errors, and honoring field‑level defaults; skip‑fields retained for writer‑only fields.
  4. Resolution mechanics
    • Element promotion (int to long) for arrays; reader metadata precedence for colliding attributes; fixed name/size match including alias.
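
As an illustration of the byte-range mapping and Fixed(n) length checks listed above: the Avro spec encodes bytes/fixed defaults as JSON strings in which code points 0 to 255 map to byte values 0 to 255. The function below is a hypothetical sketch of that rule only; its name and signature are assumptions made for this example and do not reflect the actual helpers in arrow-avro.

```rust
/// Decode an Avro JSON default for `bytes` or `fixed`: each character's code
/// point (0..=255) becomes one byte. Pass `Some(n)` to enforce a fixed(n) length.
/// Hypothetical helper, not part of arrow-avro.
fn decode_bytes_default(s: &str, fixed_size: Option<usize>) -> Result<Vec<u8>, String> {
    let mut out = Vec::with_capacity(s.len());
    for ch in s.chars() {
        let cp = ch as u32;
        if cp > 0xFF {
            return Err(format!("code point U+{:04X} is outside the byte range", cp));
        }
        out.push(cp as u8);
    }
    // For fixed(n), the decoded length must match the declared size exactly.
    if let Some(n) = fixed_size {
        if out.len() != n {
            return Err(format!("fixed({}) default has {} bytes", n, out.len()));
        }
    }
    Ok(out)
}
```

Under these assumptions, the default `"\u00FFA"` decodes to the two bytes `0xFF, 0x41`, and a two-character default for `fixed(3)` is rejected.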

Are there any user-facing changes?

N/A

…rrow-avro codec.rs

Implements handling of default values, including validation during schema resolution. Adds support for resolving differences in `array`, `map`, and `fixed` schemas between writer and reader. Updates codec logic to handle nested default values and enhances resolution for record fields.
@github-actions github-actions bot added arrow Changes to the arrow crate arrow-avro arrow-avro crate labels Sep 8, 2025
@jecsand838 jecsand838 force-pushed the avro-schema-resolution-resolvable-types branch from 57db02d to 1f6cf02 Compare September 8, 2025 17:00
@jecsand838 jecsand838 force-pushed the avro-schema-resolution-resolvable-types branch 2 times, most recently from 2b58c87 to 2b4cd3d Compare September 8, 2025 22:46
@jecsand838 jecsand838 force-pushed the avro-schema-resolution-resolvable-types branch from 2b4cd3d to 5e4744c Compare September 9, 2025 02:39
@jecsand838
Contributor Author

@alamb Would you be able to review this PR whenever you get a chance? About 60% of the new code is tests.

}
}

// Handle JSON nulls per-spec: allowed only for `null` type or unions with null FIRST
Contributor

I recall we're supporting first and second positions elsewhere in this crate re: impala - can you clarify why it's not an issue that we're diverging from that here?

Contributor Author

That's a really good catch. I was planning to bring in NullSecond support here as part of the full Dense Union type decoding support I'm finishing up right now, which is why I left the comment calling it out.
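
For readers following this thread, the null-first rule under discussion can be sketched as below. `NullPosition`, `FieldType`, and `check_null_default` are hypothetical names made up for this illustration; they are simplified stand-ins and not the crate's actual `Nullability` handling. The sketch rejects null-second unions, which is the behavior the reviewer is asking about, not necessarily where the code will end up.

```rust
/// Hypothetical stand-in for where `null` sits in a two-branch union.
#[derive(Debug, Clone, Copy, PartialEq)]
enum NullPosition {
    First,
    Second,
}

/// Hypothetical, simplified view of a field's type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FieldType {
    Null,
    NonNullable,
    Nullable(NullPosition),
}

/// Accept a JSON `null` default only for the `null` type or for a union whose
/// FIRST branch is null, since a union default must match the first branch.
fn check_null_default(ty: FieldType) -> Result<(), String> {
    match ty {
        FieldType::Null | FieldType::Nullable(NullPosition::First) => Ok(()),
        FieldType::Nullable(NullPosition::Second) => {
            Err("null default requires null to be the first union branch".to_string())
        }
        FieldType::NonNullable => Err("null default on a non-nullable type".to_string()),
    }
}
```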

})?;
let lit = f.data_type().parse_default_literal(&v)?;
out.insert(name, lit);
} else if f.data_type().nullability() == Some(Nullability::default()) {
Contributor

This is probably a cheap check - worth it to move it up in the chain before the others?

Contributor Author

@nathaniel-d-ef Good callout. I see what you're getting at. I went ahead and cleaned this code up a bit more.

Contributor

@alamb alamb left a comment

Thanks @jecsand838 and @nathaniel-d-ef

As always, I don't really have a huge amount of insight about the avro spec, so I wouldn't be able to catch any avro corner cases here, but the code seems clear to me and well tested. Thank you

@jecsand838 jecsand838 force-pushed the avro-schema-resolution-resolvable-types branch from 692167d to ea56641 Compare September 10, 2025 22:00
@alamb alamb merged commit 567f441 into apache:main Sep 11, 2025
23 checks passed
@alamb
Contributor

alamb commented Sep 11, 2025

Thank you @jecsand838 and @nathaniel-d-ef

mbrobbel pushed a commit that referenced this pull request Sep 16, 2025
# Which issue does this PR close?

This work continues arrow-avro schema resolution support and aligns
behavior with the Avro spec.

- **Related to**: #4886 (“Add Avro Support”): ongoing work to round out
the reader/decoder, including schema resolution and type promotion.
- **Follow-ups/Context**: #8292 (Add array/map/fixed schema resolution
and default value support to arrow-avro codec), #8124 (schema resolution
& type promotion for the decoder), #8223 (enum mapping for schema
resolution). These previous efforts established the foundations that
this PR extends to default values and additional resolvable types.

# Rationale for this change

Avro’s specification requires readers to materialize default values when
a field exists in the **reader** schema but not in the **writer**
schema, and to validate defaults (e.g., union defaults must match the
first branch; bytes/fixed defaults must be JSON strings; enums may
specify a default symbol for unknown writer symbols). Implementing this
behavior makes `arrow-avro` more standards‑compliant and improves
interoperability with evolving schemas.

# What changes are included in this PR?

**High‑level summary**

* **Refactor `RecordDecoder`** around a simpler **`Projector`**‑style
abstraction that consumes `ResolvedRecord` to: (a) skip writer‑only
fields, and (b) materialize reader‑only defaulted fields, reducing
branching in the hot path. (See commit subject and record decoder
changes.)

**Touched files (2):**

* `arrow-avro/src/reader/record.rs` - refactor decoder to use
precomputed mappings and defaults.
* `arrow-avro/src/reader/mod.rs` - add comprehensive tests for defaults
and error cases (see below).
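
The idea of precomputing a projection once and then reusing it per row can be sketched roughly as follows. This is an illustration under stated assumptions only: `Step`, `Projector::new`, and `Projector::project` are hypothetical names invented here, every value is an `i64` for brevity, and the real `RecordDecoder` works on Avro-encoded bytes and Arrow builders rather than on slices.

```rust
/// Hypothetical plan entry for one writer field.
#[derive(Debug, Clone, PartialEq)]
enum Step {
    /// Writer field maps to the reader column at this index.
    Read(usize),
    /// Writer-only field: consume the value and discard it.
    Skip,
}

/// Hypothetical projector built once per (writer, reader) schema pair.
struct Projector {
    writer_steps: Vec<Step>,
    /// (reader column index, default value) for reader-only fields.
    defaults: Vec<(usize, i64)>,
    reader_width: usize,
}

impl Projector {
    /// `writer` lists writer field names; `reader` lists reader field names
    /// together with an optional default.
    fn new(writer: &[&str], reader: &[(&str, Option<i64>)]) -> Result<Self, String> {
        let writer_steps: Vec<Step> = writer
            .iter()
            .map(|w| match reader.iter().position(|(r, _)| r == w) {
                Some(i) => Step::Read(i),
                None => Step::Skip,
            })
            .collect();
        let mut defaults = Vec::new();
        for (i, (name, default)) in reader.iter().enumerate() {
            if !writer.contains(name) {
                // A reader-only field without a default cannot be resolved.
                match default {
                    Some(d) => defaults.push((i, *d)),
                    None => return Err(format!("reader field `{}` has no default", name)),
                }
            }
        }
        Ok(Self { writer_steps, defaults, reader_width: reader.len() })
    }

    /// Project one writer row into reader column order.
    fn project(&self, writer_row: &[i64]) -> Vec<i64> {
        let mut out = vec![0; self.reader_width];
        for (step, value) in self.writer_steps.iter().zip(writer_row) {
            if let Step::Read(i) = step {
                out[*i] = *value;
            }
        }
        for (i, d) in &self.defaults {
            out[*i] = *d;
        }
        out
    }
}
```

Because the name matching and default lookup happen in `new`, the per-row `project` call only walks two precomputed lists, which is the kind of reduced branching in the hot path that the summary describes.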

# Are these changes tested?

Yes, new integration tests cover both the **happy path** and
**validation errors**:
* `test_schema_resolution_defaults_all_supported_types`: verifies that
defaults for
boolean/int/long/float/double/bytes/string/date/time/timestamp/decimal/fixed/enum/duration/uuid/array/map/nested
record and unions are materialized correctly for all rows.
* `test_schema_resolution_default_enum_invalid_symbol_errors`: invalid
enum default symbol is rejected.
* `test_schema_resolution_default_fixed_size_mismatch_errors`:
mismatched fixed/bytes default lengths are rejected.

These tests assert the Avro‑spec behavior (e.g., union defaults must
match the first branch; bytes/fixed defaults use JSON strings).

# Are there any user-facing changes?

N/A
@jecsand838 jecsand838 deleted the avro-schema-resolution-resolvable-types branch September 22, 2025 04:31
3 participants