Skip to content

Conversation

@bu2
Copy link
Contributor

@bu2 bu2 commented Dec 28, 2020

Purpose a "replace" compute kernel which could fulfil ARROW-10641 - [C++] A "replace" or "map" kernel to replace values in array based on mapping (@jorisvandenbossche) ARROW-9430 - [C++/Python] Kernel for SetItem(BooleanArray, values) (@xhochy). The implementation started on the basis of "fill_null" kernel. But it takes an additional BooleanArray parameter which is used as a mask to trigger value replacement.

With following "null handling" logic:

Values		Mask		Replacement	Output
1 v		1 false		1 r		1 v
1 v		1 true		1 r		1 r
1 v		0 null		1 r		1 v
1 v		1 false		0 null		1 v
1 v		1 true		0 null		1 v
1 v		0 null		0 null		1 v
0 null		1 false		1 r		0 null
0 null		1 true		1 r		1 r
0 null		0 null		1 r		0 null
0 null		1 false		0 null		0 null
0 null		1 true		0 null		0 null
0 null		0 null		0 null		0 null

(each column indicates the validity bit and the corresponding value. "v" for current value, "r" for replacement value).

@github-actions
Copy link

@bu2 bu2 marked this pull request as draft December 29, 2020 00:07
@bu2
Copy link
Contributor Author

bu2 commented Dec 29, 2020

@jorisvandenbossche: Would you have you look at this in-progress PR and share your thought ? Replace kernel implemented for most (if not all) types but missing null handling.

@bu2
Copy link
Contributor Author

bu2 commented Dec 31, 2020

Note on ArrayData::GetValues(i) in the context of commit ab2d3d2:

  • ArrayData::GetValues(i) use "byte offset" even for buffers[2] which stores variable width values...
    => Should it rather use buffers[1] value offset ?
    => Or force absolute_offset to 0 for buffers[2] ?

On my case, I fixed the bug in my code by forcing absolute_offset to 0.

@bu2 bu2 marked this pull request as ready for review January 4, 2021 17:30
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use full imports and put them inside the anonymous namespace.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @emkornfield, not sure to get your point... Do you mean:

namespace arrow {
namespace compute {
namespace internal {

namespace {

using ::arrow::internal::BitBlockCount;
using ::arrow::internal::BitBlockCounter;

...
...
}

}
}
}

@pitrou
Copy link
Member

pitrou commented Jan 5, 2021

I wouldn't expect a "replace" operation to do this. Instead, this looks more like a "select" operation. @wesm What do you think?

@jorisvandenbossche
Copy link
Member

This might actually rather be the "setitem" kernel as proposed in https://issues.apache.org/jira/browse/ARROW-9430 ? (which is something we want as well)

@bu2
Copy link
Contributor Author

bu2 commented Jan 5, 2021

@jorisvandenbossche: Thank you for bringing up ARROW-9430.

Then @ALL please tell me what would be a good name.

I may have to rework the code a bit. As mentionned in the top description, I started based on fill_null kernel code and the "null handling" logic came as an afterthought, so the code might look a bit convoluted right now...

@pitrou
Copy link
Member

pitrou commented Jan 5, 2021

"setitem" is confusing to me (I would expect something where you pass indices and values to set at those indices). "select" is used in Numpy. Another possible is "cond".

@jorisvandenbossche
Copy link
Member

"setitem" is confusing to me (I would expect something where you pass indices and values to set at those indices).

That's https://issues.apache.org/jira/browse/ARROW-9431.
There could be a variant of "setitem" that takes integer indices, and one that takes a boolean mask?

But it's true that for a "setitem" kernel, you would expect the "values to set" (Replacement in the example in the top post) to be of the length of the number of values that get set, not of the same length as the original array.

So what this PR proposes, is actually more a "where" kernel (using numpy's terminology), for which we have https://issues.apache.org/jira/browse/ARROW-10640.
There is some discussion on that PR that this can be generalized to something like numpy's np.choose. And the np.select mentioned by @pitrou is quite related to that as well (the exact semantics are a bit different between both: select uses boolean conditions to choose between the arrays, choose a single array of indices into the choice arrays)

@pitrou
Copy link
Member

pitrou commented Jan 5, 2021

I'd say that personally, I find "select" more intuitively guessable than "where" (which is a weird name for a function). But I could also live with it if it's Numpy's choice. @michalursa

@jorisvandenbossche
Copy link
Member

Agreed that "where" is an unintuitive name. For me it's fine to directly use the more general name from numpy, which would actually be both "choose" or "select" (I think the single boolean mask can be seen as a special case for both generalized functions: assuming that the single boolean mask from "where" is interpreted as 0/1 integer indices) for "choose", or is interpreted as a boolean mask + its inverse mask for "select").

Another option mentioned in the JIRA is "if_else" (term used in the R world) or "case_when" (inspired on SQL).

@pitrou
Copy link
Member

pitrou commented Jan 5, 2021

"if_else", "choose", "select" are all fine with me.

@bu2
Copy link
Contributor Author

bu2 commented Jan 5, 2021

Wow, I was not expecting such brainstorming on the name of this kernel. My initial intent was to mimic pandas.Series.replace(to_replace, value), where in C++ to_replace could be an BooleanArray mask (same length as input array) which trigger value replacement => So this kernel could be used after any combination of Compare and Boolean operations to implement specific replacement logic.

This could be extended to another implementation where to_replace could be an IntegerArray (length <= as input array) of indexes to replace.

Here is how I use "replace" in combination with "is_nan" (ARROW-11043) to implement fillna():

template <typename value_type>
std::shared_ptr<DataFrame> DataFrame::fillna(value_type value) {
    auto outdf = std::make_shared<DataFrame>();
    ...
    ...
    for (int i = 0 ; i < this->table_->num_columns() ; ++i) {
        auto field = this->table_->schema()->field(i);
        auto chunked_array = this->table_->column(i);
        auto value_datum = arrow::compute::Cast(arrow::Datum(value), chunked_array->type()).ValueOrDie();
        auto nulls = arrow::compute::IsNull(chunked_array).ValueOrDie();
        auto nans = arrow::compute::IsNan(chunked_array).ValueOrDie().chunked_array();
        auto to_replace = arrow::compute::Or(nulls, nans).ValueOrDie();
        auto clean_chunked_array = arrow::compute::Replace(chunked_array, to_replace, value_datum).ValueOrDie().chunked_array();
        outdf->table_ = outdf->table_->AddColumn(i, field, clean_chunked_array).ValueOrDie();
    }

    return outdf;
}

bu2 added 8 commits January 28, 2021 00:23
Run clang-format to fix indentation and pass CI lint check.
Fix newline characters at end of file for CI lint check.
Add null handling to "replace" kernel:

* Add some more tests with nulls (start with the boolean kernel)
Add null handling to "replace" kernel:

* Add null handling to the boolean kernel implementation.
* Test pass but still need more tests.
Generalize null handling to "replace" kernel:

* Add null handling to the Number, Timestamps, and String kernel implementations.
* Add more tests regarding null handling logic (with nulls either in input values and mask).
* Still need to fix a tricky bug in test of the String kernel implementation.
Fix String kernel implementation:

* offsets were used inconsistently hence a bug on sliced arrays.
* ArrayData::GetValues<T>(i) use "byte offset" even for buffer[2] which store variable width values...
=> Should it rather use buffer[1] value offset ?
All type kernels implemented and tested, polish PR:

* apply clang-format to pass CI lint check.
@pitrou
Copy link
Member

pitrou commented Feb 8, 2021

Wow, I was not expecting such brainstorming on the name of this kernel.

Sorry for the distraction :-) That said, for clarity it seems that "if_else" would be both explicit and non-misleading.

The signature could be if_else(cond, when_true, when_false) with the semantics cond ? when_true : when_false. Nulls would probably just pass through.

Do you want to update this PR?

@jorisvandenbossche
Copy link
Member

@bu2 would you have time to update this based on the above comments? (#9024 (comment))

out_values += block.length;
in_values += block.length;
}
output->buffers[1] = out_buf;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comment, you could use std::move here. (applies to several places, where buffer shared_ptr's are assigned)

@pitrou
Copy link
Member

pitrou commented Jun 10, 2021

Closing as superceded by #10410.

@pitrou pitrou closed this Jun 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants