Skip to content

Conversation

@aocsa
Copy link
Contributor

@aocsa aocsa commented Jul 26, 2021

Implement "drop null" kernels that return array without nulls

Note: This can be implemented as a arrow::compute::VectorFunction because the size of the array is changed, so this function is not valid in a SQL-like context

@aocsa aocsa changed the title [C++] Implement "drop null" kernels that return array without nulls ARROW-1568: [ComputeKernels] Add Drop Null for Arrays Kernel Jul 26, 2021
@github-actions
Copy link

@aocsa aocsa changed the title ARROW-1568: [ComputeKernels] Add Drop Null for Arrays Kernel ARROW-1568: [C++] Add Drop Null for Arrays Kernel Jul 26, 2021
@aocsa aocsa marked this pull request as draft July 26, 2021 15:34
@aocsa aocsa changed the title ARROW-1568: [C++] Add Drop Null for Arrays Kernel ARROW-1568: [C++] Implement Drop Null Kernel for Arrays Jul 26, 2021
@apache apache deleted a comment from github-actions bot Jul 27, 2021
@aocsa aocsa marked this pull request as ready for review July 30, 2021 03:33
@aocsa
Copy link
Contributor Author

aocsa commented Jul 30, 2021

This PR is ready to review :)

Copy link
Contributor

@nirandaperera nirandaperera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made some comments on the general approach. Great work @aocsa 👏

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Niranda that the record batch implementation probably won't work but I think this is a great start and looks like a very good addition. Thanks for making this!

@aocsa
Copy link
Contributor Author

aocsa commented Aug 2, 2021

I updated this PR base on last feedback comments. Main change, the way algorithm drop nulls now are based on filter operation and boolean filter array is obtained directly using the bitmap from input array.

@aocsa aocsa force-pushed the aocsa/ARROW-1568 branch from bc5fd70 to f0e3318 Compare August 2, 2021 18:13
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, added a few comments!

How does "drop_null" work for nested types? I suppose it only takes into account the top-level validity? (if so, might be worth to test (if not yet done) and document explicitily)

@aocsa
Copy link
Contributor Author

aocsa commented Aug 5, 2021

I updated this PR addressing latest feedback comments.
I think this PR is ready for merging. Let me know if anything needs to be taken care of from my end. cc @nirandaperera, @jorisvandenbossche, @westonpace

Copy link
Contributor

@nirandaperera nirandaperera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @aocsa 👍 I left some minor comments :-)

@aocsa aocsa force-pushed the aocsa/ARROW-1568 branch 2 times, most recently from 5b1eedb to eb8f878 Compare August 6, 2021 03:18
@aocsa
Copy link
Contributor Author

aocsa commented Aug 6, 2021

Thanks @nirandaperera. I updated the PR based on latest feedback. Nice to know more about the Arrow BitUtils :)
IMO this PR is ready to be merged :) +1 cc @westonpace

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only a few very minor things. Most are fine for a follow up, remove the prints and I'll be good.

@aocsa aocsa force-pushed the aocsa/ARROW-1568 branch from eb8f878 to fff75c5 Compare August 9, 2021 16:33
@aocsa aocsa force-pushed the aocsa/ARROW-1568 branch from 1313d61 to 96b3742 Compare August 10, 2021 06:34
@aocsa aocsa force-pushed the aocsa/ARROW-1568 branch from b4d9285 to c98ac6d Compare August 16, 2021 17:37
@aocsa aocsa force-pushed the aocsa/ARROW-1568 branch from c98ac6d to 27c9cb4 Compare August 17, 2021 04:22
@aocsa
Copy link
Contributor Author

aocsa commented Aug 18, 2021

Thanks again @pitrou, just a reminder this PR was updated. Main change related to DropNullTable

... is to get TableBatchReader to get a vector of record batches spanning the table, then call drop_null on each record batch and then Table::FromRecordBatches to get the resulting table.

#10802 (comment)

return call_function('take', [data, indices], options, memory_pool)


def drop_null(data, memory_pool=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove this as the proper definition is auto-generated from the C++ metadata.

}
auto drop_null_filter =
std::make_shared<BooleanArray>(batch->num_rows(), dst, nullptr, 0, 0);
if (drop_null_filter->null_count() == batch->num_rows()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition is never satisfied, will fix.

auto schm = schema(fields);
std::vector<std::shared_ptr<ChunkedArray>> empty_table(fields.size());
auto col_a = std::make_shared<NullArray>(size);
auto col_b = std::make_shared<NullArray>(size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the wrong column types. I will fix and add a validation.

@pitrou
Copy link
Member

pitrou commented Aug 18, 2021

Note the auto-generated Python docstring isn't entirely accurate (see the input argument description):

Signature: pc.drop_null(input, *, memory_pool=None)
Docstring:
Drop nulls from the input.

The output is populated with values from the input (Array, ChunkedArray,
RecordBatch, or Table) without the null values.
For the RecordBatch and Table cases, `drop_null` drops the full row if
there is any null.

Parameters
----------
input : Array-like or scalar-like
    Argument to compute function
memory_pool : pyarrow.MemoryPool, optional
    If not passed, will allocate memory from the default memory pool.
File:      ~/arrow/dev/python/pyarrow/compute.py
Type:      function

I'll open a separate JIRA.

@aocsa
Copy link
Contributor Author

aocsa commented Aug 18, 2021

Note the auto-generated Python docstring isn't entirely accurate (see the input argument description):

Signature: pc.drop_null(input, *, memory_pool=None)
Docstring:
Drop nulls from the input.

The output is populated with values from the input (Array, ChunkedArray,
RecordBatch, or Table) without the null values.
For the RecordBatch and Table cases, `drop_null` drops the full row if
there is any null.

Parameters
----------
input : Array-like or scalar-like
    Argument to compute function
memory_pool : pyarrow.MemoryPool, optional
    If not passed, will allocate memory from the default memory pool.
File:      ~/arrow/dev/python/pyarrow/compute.py
Type:      function

I'll open a separate JIRA.

Ok, thanks @pitrou let me know in which jira ticket will be this to follow up the updates.

@pitrou
Copy link
Member

pitrou commented Aug 18, 2021

It's in https://issues.apache.org/jira/browse/ARROW-13665

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants