fix: Resolve Avro RecordEncoder bugs related to nullable Struct fields and Union type ids #8935
@nathaniel-d-ef @alamb @mbrobbel Let me know if any of you have time to review this bug fix. I was hoping to get it out for the next release.
alamb left a comment:
Thanks @jecsand838 -- this looks good to me
arrow-avro/src/writer/encoder.rs (outdated)

```rust
let null_state = match nullability {
    None => NullState::NonNullable,
    Some(order) => {
        if let Some(nulls) = array.nulls().cloned() {
```
The prior code actually checked if there were zero nulls via `array.null_count()`; there can be a null buffer that has no nulls in it. I think it is worth checking here still. Something like:

```rust
if array.null_count() > 0 && let Some(nulls) = array.nulls().cloned() {
```
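The distinction being pointed at here, that a null buffer can be present while containing zero nulls, can be illustrated with plain-Rust stand-ins. `MockArray` and `Validity` below are hypothetical types for illustration only, not the actual arrow-rs encoder:

```rust
// Hypothetical stand-in; the real code operates on arrow-rs arrays.
struct MockArray {
    // Some(validity) models a present null buffer; `true` = valid slot.
    nulls: Option<Vec<bool>>,
}

impl MockArray {
    // Mirrors `Array::null_count()`: number of invalid slots.
    fn null_count(&self) -> usize {
        self.nulls
            .as_ref()
            .map_or(0, |v| v.iter().filter(|&&valid| !valid).count())
    }
}

// Borrowing the mask instead of cloning it avoids the `.cloned()` call.
enum Validity<'a> {
    AllValid,          // no per-row null checks needed
    Mixed(&'a [bool]), // per-row validity mask
}

// The suggested guard: only treat the array as containing nulls when the
// buffer exists AND `null_count()` is actually non-zero.
fn validity(array: &MockArray) -> Validity<'_> {
    match array.nulls.as_deref() {
        Some(nulls) if array.null_count() > 0 => Validity::Mixed(nulls),
        _ => Validity::AllValid, // buffer absent, or present but all-valid
    }
}

fn main() {
    // A null buffer exists, but every slot is valid.
    let all_valid = MockArray { nulls: Some(vec![true; 3]) };
    assert!(matches!(validity(&all_valid), Validity::AllValid));

    let with_null = MockArray { nulls: Some(vec![true, false]) };
    assert!(matches!(validity(&with_null), Validity::Mixed(_)));
    println!("zero-null buffer treated as all-valid");
}
```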
@alamb That's a great catch! Thank you for calling that out.
I just pushed up the fixes for this, which also include a new unit test covering this behavior. Additionally, there's an extremely small, internal-facing micro-optimization that removes the `.cloned()` call.
…at locks in the desired zero-null check behavior. Also implemented a small and non-breaking/non-public-facing micro-optimization that removed `cloned` from the updated zero nulls checking logic by adding the `'a` lifetime to `NullState`.
Thanks @jecsand838
Which issue does this PR close?

- Closes #8934.
Rationale for this change

The `arrow-avro` writer currently fails on two classes of valid Arrow inputs:

1. **Nullable `Struct` with non-nullable children + row-wise sliced encoding.** When encoding a `RecordBatch` row-by-row, a nullable `Struct` field whose child is non-nullable can cause the writer to error with `Invalid argument error: Avro site '{field}' is non-nullable, but array contains nulls`, even when the parent `Struct` is null at that row and the child value should be ignored.

2. **Dense `UnionArray` with non-zero, non-consecutive type ids.** A dense `UnionArray` whose `UnionFields` use type ids such as `2` and `5` will currently fail with `SchemaError("Binding and field mismatch")`, even though this layout is valid per Arrow's union semantics.

This PR updates the `RecordEncoder` to resolve both of these issues and better respect Arrow's struct/union semantics.

What changes are included in this PR?
This PR touches only the `arrow-avro` writer implementation, specifically `arrow-avro/src/writer/encoder.rs` and `arrow-avro/src/writer/mod.rs`.

1. Fix nullable struct + non-nullable child handling
- Adjusts the `RecordEncoder`/`StructEncoder` path so that child-field null validation is masked by the parent `Struct`'s null bitmap.
- When the parent `Struct` value is null, the encoder now skips encoding the non-nullable children for that row, instead of treating any child-side nulls as a violation of the Avro site's nullability.
- Row-wise sliced encoding of such a `RecordBatch`, like the one in the issue's reproducing test, now succeeds without triggering `Invalid argument error: Avro site '{field}' is non-nullable, but array contains nulls`.
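The masking idea can be sketched with plain slices standing in for the real Arrow null buffers. The function and the error text below are illustrative only; the actual encoder in `encoder.rs` is structured differently:

```rust
// Hypothetical sketch: child validation is masked by the parent Struct's
// validity, so child slots under a null parent are never checked against
// the non-nullable Avro site.
fn encode_struct_rows(
    parent_valid: &[bool], // parent Struct null bitmap
    child_valid: &[bool],  // child slot validity (undefined under null parents)
    child_vals: &[i64],    // non-nullable child column
) -> Result<Vec<Option<i64>>, String> {
    parent_valid
        .iter()
        .enumerate()
        .map(|(row, &pv)| {
            if !pv {
                // Parent is null: emit the null branch and skip the child
                // entirely instead of validating its (ignored) slot.
                return Ok(None);
            }
            if !child_valid[row] {
                // Only reachable when the parent is valid: a real violation.
                return Err(format!("non-nullable child is null at row {row}"));
            }
            Ok(Some(child_vals[row]))
        })
        .collect()
}

fn main() {
    // Row 1's parent is null and its child slot is also "null": the old
    // encoder rejected this, while the masked encoder ignores the child.
    let encoded =
        encode_struct_rows(&[true, false, true], &[true, false, true], &[10, 0, 30]).unwrap();
    assert_eq!(encoded, vec![Some(10), None, Some(30)]);

    // A child null under a *valid* parent is still an error.
    assert!(encode_struct_rows(&[true, true], &[true, false], &[1, 2]).is_err());
    println!("{encoded:?}");
}
```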
2. Support dense unions with non-zero, non-consecutive type ids

- Reworks the union encoding path (`UnionEncoder`) so that it no longer assumes Arrow dense union type ids are `0..N-1`.
- Builds an explicit mapping from `type_ids` (as declared in `UnionFields`) to Avro union branch indices, and uses this mapping when encoding union values.
- Dense unions with type ids such as `2` and `5` now encode successfully, matching Arrow's semantics, which only require type ids to be consistent with `UnionFields`, not that they be contiguous and/or zero-based.
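The type-id-to-branch mapping can be sketched as follows. `branch_index_map` is a hypothetical std-only helper; the real encoder derives the mapping from `UnionFields`:

```rust
use std::collections::HashMap;

// Map Arrow dense-union type ids (which need not be contiguous or
// zero-based, e.g. [2, 5]) to Avro union branch indices, in the order
// the fields are declared.
fn branch_index_map(type_ids: &[i8]) -> HashMap<i8, usize> {
    type_ids
        .iter()
        .enumerate()
        .map(|(branch, &tid)| (tid, branch))
        .collect()
}

fn main() {
    // Declared field order determines the Avro branch order.
    let map = branch_index_map(&[2, 5]);
    assert_eq!(map[&2], 0);
    assert_eq!(map[&5], 1);

    // Per-row lookup: a value row tagged with type id 5 encodes branch 1,
    // where the old code would have assumed the type id IS the branch.
    let row_type_ids = [2i8, 5, 2];
    let branches: Vec<usize> = row_type_ids.iter().map(|t| map[t]).collect();
    assert_eq!(branches, vec![0, 1, 0]);
    println!("{branches:?}");
}
```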
3. Regression tests for both bugs

Adds targeted regression tests under `arrow-avro/src/writer/mod.rs`'s test module to validate the fixes:

- `test_nullable_struct_with_nonnullable_field_sliced_encoding`
  - Reproduces the nullable `Struct` + non-nullable child scenario from the issue.
  - Encodes the `RecordBatch` one row at a time via `WriterBuilder::new(schema).with_fingerprint_strategy(FingerprintStrategy::Id(1)).build::<_, AvroSoeFormat>(...)` and asserts all rows encode successfully.
- `test_nullable_struct_with_decimal_and_timestamp_sliced`
  - Builds a `RecordBatch` containing nullable `Struct` fields populated with `Decimal128` and `TimestampMicrosecond` types to verify encoding of complex nested data.
  - Encodes the `RecordBatch` one row at a time using `AvroSoeFormat` and `FingerprintStrategy::Id(1)`, asserting that each sliced row encodes successfully.
- `non_nullable_child_in_nullable_struct_should_encode_per_row`
  - Covers a nullable `Struct` column containing a non-nullable child field, alongside a timestamp column.
  - Encodes row-by-row with `AvroSoeFormat`, asserting that `writer.write` returns `Ok` to confirm the fix for sliced encoding constraints.
- `test_union_nonzero_type_ids`
  - Builds a dense `UnionArray` whose `UnionFields` use type ids `[2, 5]` and a mix of string/int values.
  - Writes it with `AvroWriter` and asserts that writing and finishing the writer both succeed without error.

Together these tests reproduce the failures described in #8934 and confirm that the new encoder behavior handles them correctly.
Are these changes tested?
Yes.
Are there any user-facing changes?
The change is strictly a non-breaking, backwards-compatible bug fix that makes the `arrow-avro` writer function as expected.