Skip to content

Conversation

@joosthooz
Copy link
Contributor

@joosthooz joosthooz commented Jan 6, 2022

Targets https://issues.apache.org/jira/browse/ARROW-9285
If a kernel uses preallocated memory allocation (MemAllocation::PREALLOCATE), it should not create new buffers. This code aims to check for this.
Questions/TODO;

  • If the check triggers an abort by using DCHECK_*, it is not possible to test whether a misbehaving kernel is detected. I added an example test with a misbehaving kernel, but for that to work the check should throw an Error that the test can then check for using ASSERT_RAISES_WITH_MESSAGE
  • Should the check only happen for output buffers? Now it also checks the input batches
  • Scalar kernels have a field preallocate_contiguous_, but there might be additional cases where this check could be useful? For example if validity_preallocated_ is true, we could check if there aren't any newly created validity buffers. But the current AddBuffersToSet() code does not support this.
  • Vector kernels do not have such a field, but may still pre-allocate their memory. Now I just use the condition (kernel_->mem_allocation == MemAllocation::PREALLOCATE) but I don't know if that is the right way to go.

@joosthooz joosthooz marked this pull request as draft January 6, 2022 10:43
@github-actions
Copy link

github-actions bot commented Jan 6, 2022

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@joosthooz joosthooz changed the title WIP: Arrow-9285: Detect unauthorized memory allocations in function kernels Arrow-9285: Detect unauthorized memory allocations in function kernels Jan 6, 2022
@edponce
Copy link
Contributor

edponce commented Jan 6, 2022

@joosthooz I have following initial comments:

  1. ScalarExecutor has two possible preallocation policies (contiguous and per-batch) but they both are assigned to out->value which is an ArrayData (out is a Datum wrapping either an Array or a Scalar. I think we need to validate out only for the Array case. Use out.array() to access ArrayData and its buffers.
    • Note: buffer[0] is the null bitmap, which does not need this validation bc it has its own nullity policies (see NullPropagatorandNullHandling`).
  2. The VectorExecutor can also preallocate so it also requires this validation. I think here we would also validate out.
  3. Kernels establish preallocation policy via kernel.mem_allocation = MemAllocation::PREALLOCATE during registration. For example, here.

cc @bkietz @pitrou

@edponce
Copy link
Contributor

edponce commented Jan 6, 2022

@joosthooz Need to change this PR title to: ARROW-9285: [C++] Detect unauthorized memory allocations in function kernels

@joosthooz joosthooz changed the title Arrow-9285: Detect unauthorized memory allocations in function kernels ARROW-9285: [C++] Detect unauthorized memory allocations in function kernels Jan 6, 2022
@github-actions
Copy link

github-actions bot commented Jan 6, 2022

Copy link
Contributor

@edponce edponce Jan 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious if hashing all the values-buffers pointers would suffice instead of storing them in a set. We would still need to traverse nested data structures to capture all the child buffers. This could be solved using a Visitor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hashing would be nicer, but it would also trigger the error if a kernel deletes a buffer it doesn't need anymore. And I don't think we care about that, just newly allocated ones. What do you think?

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for not posting a more detailed review (for now!), these are only two high-level comments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how we want to handle this feature at all, but we definitely don't want to run those checks unconditionally, unless they're extremely trivial. Perhaps this can be enabled only on debug builds? (#ifndef _NDEBUG, for example)

cc @lidavidm

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think debug-only is fine as this will be most useful for developers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The checks are now inside #ifndef NDEBUG.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should not be added in the public APIs. Instead, these functions should only exist in the corresponding test file, IMHO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed them again, also from the function registry. I'm now using a different way of calling the function through an executor, instead of going through CallFunction

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: to reduce the performance overhead, std::unordered_set<Buffer*> would be better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ExecutionError a Gandiva-specific error. Perhaps use Status::Invalid?

@pitrou
Copy link
Member

pitrou commented Jan 17, 2022

I think there's a problem with this approach: MemAllocation::PREALLOCATE preallocates only fixed-width buffers, not variable-width buffers. So for example, for a string array, indices will be preallocated but not data.

Besides, child buffers are never preallocated. So what happens is that you must check that preallocated buffers are not changed, but this doesn't prevent other buffers from being allocated as well.

@github-actions
Copy link

Thank you for your contribution. Unfortunately, this pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you do not have repository permissions to reopen the PR, please tag a maintainer.

@github-actions github-actions bot added the Status: stale-warning Issues and PRs flagged as stale which are due to be closed if no indication otherwise label Nov 18, 2025
@github-actions github-actions bot closed this Dec 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component: C++ Status: stale-warning Issues and PRs flagged as stale which are due to be closed if no indication otherwise

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants