fix: Ignore nullability of list elements when consuming Substrait #10874

Blizzara · 2024-06-11T17:34:02Z

DataFusion (or really Arrow) is quite strict about nullability. Specifically, when using e.g. LogicalPlan::Values, the given schema must match the given literals exactly, including nullability.
This is non-trivial to guarantee when the schema and the literals are converted separately, as we do.

The existing implementation for from_substrait_literal already creates lists that are always nullable
(see ScalarValue::new_list => array_into_list_array). This reverts part of #10640 to align from_substrait_type with that behavior.

This is the error I was hitting:

ArrowError(InvalidArgumentError("column types must match schema types, expected
List(Field { name: \"item\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }) but found
List(Field { name: \"item\", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) at column index 0"), None)
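
To make the mismatch concrete, here is a minimal std-only sketch of the kind of strict comparison that produces an error of this shape. The `DataType`, `Field`, and `check_column` names are simplified stand-ins written for illustration; they are not Arrow's actual types or validation code. The only point is that two list types that differ solely in the nullability of their "item" field do not compare equal.

```rust
// Simplified stand-ins for illustration only; these are NOT Arrow's real types.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    List(Box<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

impl Field {
    fn new(name: &str, data_type: DataType, nullable: bool) -> Self {
        Field { name: name.to_string(), data_type, nullable }
    }
}

/// Hypothetical strict check: the column type must equal the schema type,
/// including the nullability of nested fields.
fn check_column(index: usize, expected: &DataType, found: &DataType) -> Result<(), String> {
    if expected == found {
        Ok(())
    } else {
        Err(format!(
            "column types must match schema types, expected {:?} but found {:?} at column index {}",
            expected, found, index
        ))
    }
}

fn main() {
    // Schema side: element declared non-nullable (as converted from the Substrait type).
    let schema_type = DataType::List(Box::new(Field::new("item", DataType::Int32, false)));
    // Literal side: element always nullable (as produced when building the list literal).
    let literal_type = DataType::List(Box::new(Field::new("item", DataType::Int32, true)));

    // The two types differ only in element nullability, yet the check fails.
    assert!(check_column(0, &schema_type, &literal_type).is_err());
    // Identical nullability on both sides passes.
    assert!(check_column(0, &literal_type, &literal_type).is_ok());
    println!("{:?}", check_column(0, &schema_type, &literal_type));
}
```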

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Tested through existing unit tests + manually the failing case I had.

Are there any user-facing changes?

No

Blizzara · 2024-06-11T17:36:28Z

datafusion/common/src/utils/mod.rs

    let offsets = OffsetBuffer::from_lengths([arr.len()]);
    ListArray::new(
-        Arc::new(Field::new("item", arr.data_type().to_owned(), true)),
+        Arc::new(Field::new_list_field(arr.data_type().to_owned(), true)),


This is a no-op change; it just seems nicer to use new_list_field given that it exists (it sets the name to "item" anyway).

alamb · 2024-06-11T18:41:26Z

The CI failure appears to be unrelated to this PR, so I restarted the tests

alamb

Thanks @Blizzara -- substrait 🚀

alamb · 2024-06-11T18:41:47Z

datafusion/common/src/utils/mod.rs

    let offsets = OffsetBuffer::from_lengths([arr.len()]);
    LargeListArray::new(
-        Arc::new(Field::new("item", arr.data_type().to_owned(), true)),
+        Arc::new(Field::new_list_field(arr.data_type().to_owned(), true)),


alamb · 2024-06-11T18:41:58Z

datafusion/substrait/src/logical_plan/consumer.rs

                let field = Arc::new(Field::new_list_field(
                    from_substrait_type(inner_type, dfs_names, name_idx)?,
-                    is_substrait_type_nullable(inner_type)?,
+                    // We ignore Substrait's nullability here to match to_substrait_literal 
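
The effect of this change can be sketched with a small std-only model. `SubstraitListType`, `Field`, `list_field_from_literal`, and `list_field_from_type` are hypothetical names invented for this illustration, not the consumer's real types or functions. The sketch only shows the idea: the declared element nullability is discarded so that the type side always agrees with the always-nullable lists built on the literal side.

```rust
// Simplified stand-ins for illustration only; NOT the real Arrow or Substrait types.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    nullable: bool,
}

/// Hypothetical model of a Substrait list type: only the element's declared nullability.
struct SubstraitListType {
    element_nullable: bool,
}

/// Element field as built on the literal side: always nullable.
fn list_field_from_literal() -> Field {
    Field { name: "item".to_string(), nullable: true }
}

/// Element field as built on the type side. The declared nullability is
/// intentionally ignored so that the result always agrees with the literal side.
fn list_field_from_type(_ty: &SubstraitListType) -> Field {
    Field { name: "item".to_string(), nullable: true }
}

fn main() {
    for declared in [false, true] {
        let ty = SubstraitListType { element_nullable: declared };
        let field = list_field_from_type(&ty);
        // Whatever the plan declared, schema and literal now agree.
        assert_eq!(field, list_field_from_literal());
        println!("declared nullable = {}, resulting field = {:?}", ty.element_nullable, field);
    }
}
```

The trade-off is that a non-nullable element declaration in the incoming plan is not preserved in this model; it is widened to nullable.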


alamb · 2024-06-11T18:43:10Z

datafusion/substrait/src/logical_plan/producer.rs

-                Field::new_list_field(DataType::Int32, nullable).into(),
-            ))?;
-        }
+        round_trip_type(DataType::List(


Should we also add a test here showing that the List becomes nullable after roundtrip? That might additionally document that this is an intended behavior rather than a bug.

I do see you have added a comment, which I think is probably enough.

I don't think we need to test the null coercion specifically, since that's not necessarily the ultimate desired behavior, just what makes sense at this time. But I added a test case to confirm that we can now read plans with non-nullable lists: a25abd8

alamb · 2024-06-12T16:40:46Z

Thanks again @Blizzara

github-actions bot added the substrait label (Changes to the substrait crate) Jun 11, 2024

use Field::new_list_field in array_into_(large_)list_array

e873b5e

just for consistency, to reduce the places where "item" is written out

Blizzara marked this pull request as ready for review June 11, 2024 17:36

alamb approved these changes Jun 11, 2024

add a test for non-nullable lists

a25abd8

alamb merged commit dfdda7c into apache:main Jun 12, 2024

Blizzara deleted the avo/ignore-list-nullability branch June 19, 2024 10:02

waynexia mentioned this pull request Jul 1, 2024

fix: Ignore nullability in Substrait structs #11130

Merged
