-
Notifications
You must be signed in to change notification settings - Fork 1.9k
fix: Ignore nullability of list elements when consuming Substrait #10874
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
DataFusion (= Arrow) is quite strict about nullability, specifically, when using e.g. LogicalPlan::Values, the given schema must match the given literals exactly - including nullability. This is non-trivial to do when converting schema and literals separately. The existing implementation for from_substrait_literal already creates lists that are always nullable (see ScalarValue::new_list => array_into_list_array). This reverts part of apache#10640 to align from_substrait_type with that behavior. This is the error I was hitting: ``` ArrowError(InvalidArgumentError("column types must match schema types, expected List(Field { name: \"item\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }) but found List(Field { name: \"item\", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) at column index 0"), None) ```
just for consistency, to reduce the places where "item" is written out
| let offsets = OffsetBuffer::from_lengths([arr.len()]); | ||
| ListArray::new( | ||
| Arc::new(Field::new("item", arr.data_type().to_owned(), true)), | ||
| Arc::new(Field::new_list_field(arr.data_type().to_owned(), true)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is no-op change, just seems nicer to use the new_list_field given it exists (it sets the name as "item" anyways)
|
The CI failure appears to be unrelated to this PR, so I restarted the tests |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Blizzara -- substrait 🚀
| let offsets = OffsetBuffer::from_lengths([arr.len()]); | ||
| LargeListArray::new( | ||
| Arc::new(Field::new("item", arr.data_type().to_owned(), true)), | ||
| Arc::new(Field::new_list_field(arr.data_type().to_owned(), true)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| let field = Arc::new(Field::new_list_field( | ||
| from_substrait_type(inner_type, dfs_names, name_idx)?, | ||
| is_substrait_type_nullable(inner_type)?, | ||
| // We ignore Substrait's nullability here to match to_substrait_literal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| Field::new_list_field(DataType::Int32, nullable).into(), | ||
| ))?; | ||
| } | ||
| round_trip_type(DataType::List( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also add a test here showing that the List becomes nullable after roundtrip? That might additionally document that this is an intended behavior rather than a bug.
I do see you have added a comment which I think is probably enought
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to test the null coercion specifically, since that's not necessarily the ultimate desired behavior, just what makes sense at this time. But I added a test case to confirm that we can now read plans with non-nullable lists: a25abd8
|
Thanks again @Blizzara |
…ache#10874) * Ignore nullability of list elements when consuming Substrait DataFusion (= Arrow) is quite strict about nullability, specifically, when using e.g. LogicalPlan::Values, the given schema must match the given literals exactly - including nullability. This is non-trivial to do when converting schema and literals separately. The existing implementation for from_substrait_literal already creates lists that are always nullable (see ScalarValue::new_list => array_into_list_array). This reverts part of apache#10640 to align from_substrait_type with that behavior. This is the error I was hitting: ``` ArrowError(InvalidArgumentError("column types must match schema types, expected List(Field { name: \"item\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }) but found List(Field { name: \"item\", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) at column index 0"), None) ``` * use `Field::new_list_field` in `array_into_(large_)list_array` just for consistency, to reduce the places where "item" is written out * add a test for non-nullable lists
DataFusion (or really Arrow) is quite strict about nullability, specifically, when using e.g.
LogicalPlan::Values, the given schema must match the given literals exactly - including nullability.This is non-trivial to do when converting schema and literals separately like we do.
The existing implementation for from_substrait_literal already creates lists that are always nullable
(see ScalarValue::new_list => array_into_list_array). This reverts part of #10640 to align from_substrait_type with that behavior.
This is the error I was hitting:
Which issue does this PR close?
Closes #.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Tested through existing unit tests + manually the failing case I had.
Are there any user-facing changes?
No