Skip to content

Conversation

@Blizzara
Copy link
Contributor

DataFusion (or really Arrow) is quite strict about nullability, specifically, when using e.g. LogicalPlan::Values, the given schema must match the given literals exactly - including nullability.
This is non-trivial to do when converting schema and literals separately like we do.

The existing implementation for from_substrait_literal already creates lists that are always nullable
(see ScalarValue::new_list => array_into_list_array). This reverts part of #10640 to align from_substrait_type with that behavior.

This is the error I was hitting:

ArrowError(InvalidArgumentError("column types must match schema types, expected
List(Field { name: \"item\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }) but found
List(Field { name: \"item\", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) at column index 0"), None)

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Tested through existing unit tests + manually the failing case I had.

Are there any user-facing changes?

No

DataFusion (= Arrow) is quite strict about nullability, specifically,
when using e.g. LogicalPlan::Values, the given schema must match the
given literals exactly - including nullability.
This is non-trivial to do when converting schema and literals separately.

The existing implementation for from_substrait_literal already creates
lists that are always nullable
(see ScalarValue::new_list => array_into_list_array).
This reverts part of apache#10640 to
align from_substrait_type with that behavior.

This is the error I was hitting:
```
ArrowError(InvalidArgumentError("column types must match schema types, expected
List(Field { name: \"item\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }) but found
List(Field { name: \"item\", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) at column index 0"), None)
```
@github-actions github-actions bot added the substrait Changes to the substrait crate label Jun 11, 2024
just for consistency, to reduce the places where "item" is written out
let offsets = OffsetBuffer::from_lengths([arr.len()]);
ListArray::new(
Arc::new(Field::new("item", arr.data_type().to_owned(), true)),
Arc::new(Field::new_list_field(arr.data_type().to_owned(), true)),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is no-op change, just seems nicer to use the new_list_field given it exists (it sets the name as "item" anyways)

@Blizzara Blizzara marked this pull request as ready for review June 11, 2024 17:36
@alamb
Copy link
Contributor

alamb commented Jun 11, 2024

The CI failure appears to be unrelated to this PR, so I restarted the tests

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Blizzara -- substrait 🚀

let offsets = OffsetBuffer::from_lengths([arr.len()]);
LargeListArray::new(
Arc::new(Field::new("item", arr.data_type().to_owned(), true)),
Arc::new(Field::new_list_field(arr.data_type().to_owned(), true)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

let field = Arc::new(Field::new_list_field(
from_substrait_type(inner_type, dfs_names, name_idx)?,
is_substrait_type_nullable(inner_type)?,
// We ignore Substrait's nullability here to match to_substrait_literal
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Field::new_list_field(DataType::Int32, nullable).into(),
))?;
}
round_trip_type(DataType::List(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also add a test here showing that the List becomes nullable after roundtrip? That might additionally document that this is an intended behavior rather than a bug.

I do see you have added a comment which I think is probably enought

Copy link
Contributor Author

@Blizzara Blizzara Jun 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to test the null coercion specifically, since that's not necessarily the ultimate desired behavior, just what makes sense at this time. But I added a test case to confirm that we can now read plans with non-nullable lists: a25abd8

@alamb alamb merged commit dfdda7c into apache:main Jun 12, 2024
@alamb
Copy link
Contributor

alamb commented Jun 12, 2024

Thanks again @Blizzara

@Blizzara Blizzara deleted the avo/ignore-list-nullability branch June 19, 2024 10:02
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…ache#10874)

* Ignore nullability of list elements when consuming Substrait

DataFusion (= Arrow) is quite strict about nullability, specifically,
when using e.g. LogicalPlan::Values, the given schema must match the
given literals exactly - including nullability.
This is non-trivial to do when converting schema and literals separately.

The existing implementation for from_substrait_literal already creates
lists that are always nullable
(see ScalarValue::new_list => array_into_list_array).
This reverts part of apache#10640 to
align from_substrait_type with that behavior.

This is the error I was hitting:
```
ArrowError(InvalidArgumentError("column types must match schema types, expected
List(Field { name: \"item\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }) but found
List(Field { name: \"item\", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) at column index 0"), None)
```

* use `Field::new_list_field` in `array_into_(large_)list_array`

just for consistency, to reduce the places where "item" is written out

* add a test for non-nullable lists
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

substrait Changes to the substrait crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants