Skip to content

Conversation

@tobixdev
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

This causes the values buffer to have the expected length after creating the null array.

What changes are included in this PR?

Use MutableBuffer::new_null instead of MutableBuffer::new

Are these changes tested?

Yes, additional constructor test

Are there any user-facing changes?

Yes, the buffer will now be correctly initialized when calling FixedSizeBinaryArray::new_null

@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 21, 2025

let null_array = FixedSizeBinaryArray::new_null(4, 3);
assert_eq!(null_array.len(), 3);
assert_eq!(null_array.values().len(), 12);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reported 0 before this change.

Self {
data_type: DataType::FixedSizeBinary(size),
value_data: MutableBuffer::new(capacity).into(),
value_data: MutableBuffer::new_null(capacity_in_bytes * 8).into(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does the 8 come from?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MutableBuffer::new_null is in bits.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tobixdev and @etseidl

Self {
data_type: DataType::FixedSizeBinary(size),
value_data: MutableBuffer::new(capacity).into(),
value_data: MutableBuffer::new_null(capacity_in_bytes * 8).into(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
value_data: MutableBuffer::new_null(capacity_in_bytes * 8).into(),
// MutableBuffer::new_null is in bits.
value_data: MutableBuffer::new_null(capacity_in_bytes * 8).into(),

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not so familiar with the arrow side of the house, but this seems like correct behavior. It matches what a primitive array would do, and the value_data bytes are already allocated. The only downside I can think of is the extra memset, but that would be done for primitive arrays as well IIUC.

@alamb
Copy link
Contributor

alamb commented Nov 21, 2025

I also think we need to prioritize correctness over performance (obviously 😆 ) -- there isn't any obvious way to make this faster that I could see 🤔

@alamb alamb merged commit 73dbd55 into apache:main Nov 22, 2025
26 checks passed
@alamb
Copy link
Contributor

alamb commented Nov 22, 2025

🚀

@tobixdev
Copy link
Contributor Author

The only downside I can think of is the extra memset, but that would be done for primitive arrays as well IIUC.

I think zeroing is not required for null values based on the arrow specification. But I guess not doing it has some foot guns and UB issues that we want to avoid. Curious if there is a take on that from arrow-rs. 🤔

Thanks for merging that! I'll make a follow-up PR that adds the small comment.

@alamb
Copy link
Contributor

alamb commented Nov 24, 2025

I think zeroing is not required for null values based on the arrow specification. But I guess not doing it has some foot guns and UB issues that we want to avoid. Curious if there is a take on that from arrow-rs. 🤔

Allocating anything in rust is going to zero out the bytes anyways. Unless there is a compelling performance benchmark that shows avoiding the (potentially) extra zeroing makes a non trivial difference I think we should opt for safe

alamb pushed a commit that referenced this pull request Nov 25, 2025
)

# Which issue does this PR close?

- Follow-up on #8901 

# Rationale for this change

It's non-obvious why the number "8" appears here.

# What changes are included in this PR?

Name the number such that it's more obvious that this is a conversion
from bytes to bits.

@alamb I can also include the [suggested
comment](#8901 (comment))
if you prefer it. I thought the constant may have a lesser risk of
becoming outdated without being noticed when changes to
`MutableBuffer::new_null` happen.

# Are these changes tested?

- No behavior changes

# Are there any user-facing changes?

- No
github-merge-queue bot pushed a commit to apache/datafusion that referenced this pull request Nov 30, 2025
…ary in ScalarValue::to_array_of_size (#18903)

## Which issue does this PR close?

- Closes #18870.

We should open a follow-up issue that tracks re-use
`FixedSizeBinaryArray::new_null` after upgrading arrow-rs.

With this PR, we could backport it to #18843 if we wish.

Alternatively, we could also wait for a fix by arrow-rs.

## Rationale for this change

As we will have to wait for another arrow-rs update to get the fix for
the issue apache/arrow-rs#8901 , we have a
temporaray fix that basically calls the same operations from
DataFusion's code. Once the fix is in the arrow-rs codebase, we can
re-use `FixedSizeBinaryArray::new_null`.

## What changes are included in this PR?

- A test. (`assert_eq!(result.as_fixed_size_binary().values().len(),
10);` returned 0 before)
- Directly allocate the values buffer in DataFusion.

## Are these changes tested?

Yes

## Are there any user-facing changes?

Yes, the buffer will now be allocated, zeroed, and set to the expected
length. Same as in the arrow-rs fix.
shashidhar-bm pushed a commit to shashidhar-bm/datafusion that referenced this pull request Dec 1, 2025
…ary in ScalarValue::to_array_of_size (apache#18903)

## Which issue does this PR close?

- Closes apache#18870.

We should open a follow-up issue that tracks re-use
`FixedSizeBinaryArray::new_null` after upgrading arrow-rs.

With this PR, we could backport it to apache#18843 if we wish.

Alternatively, we could also wait for a fix by arrow-rs.

## Rationale for this change

As we will have to wait for another arrow-rs update to get the fix for
the issue apache/arrow-rs#8901 , we have a
temporaray fix that basically calls the same operations from
DataFusion's code. Once the fix is in the arrow-rs codebase, we can
re-use `FixedSizeBinaryArray::new_null`.

## What changes are included in this PR?

- A test. (`assert_eq!(result.as_fixed_size_binary().values().len(),
10);` returned 0 before)
- Directly allocate the values buffer in DataFusion.

## Are these changes tested?

Yes

## Are there any user-facing changes?

Yes, the buffer will now be allocated, zeroed, and set to the expected
length. Same as in the arrow-rs fix.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

FixedSizeBinaryArray::new_null Does Not Properly Set the Length of the Values Buffer

4 participants