Skip to content

Conversation

@dhruv9vats
Copy link
Contributor

@dhruv9vats dhruv9vats commented Jan 16, 2022

Implement a Scalar Kernel to lookup a value for a given key in a MapArray, whose type is an alias for List(Struct(<key>, <item>))

Todo:

  • Add more tests and combinations (Do suggest more)

    • null maps
    • empty maps
    • null items
    • null non-empty items
      Types tested:
      • PhysicalIntegralArrowTypes
      • DecimalArrowTypes
      • BaseBinaryArrowTypes
      • Boolean
      • FixedSizeString
      • MonthDayNanoInterval
  • Template Kernel

Things to consider:

  • Looking up null keys. (Null keys are not allowed)

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@dhruv9vats dhruv9vats requested a review from lidavidm January 18, 2022 12:49
Comment on lines 582 to 584
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per your comment below, map_scalar.is_valid needs to be checked before casting, but in that case, how could a NullScalar of an appropriate type be made (which requires item_type)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, the cast from Scalar to MapScalar is fine - it's the cast from MapScalar.value to StructArray that is failing, because value is nullptr when is_valid == false.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest looking at this in a debugger if you haven't already, just to see what this case looks like.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing something like:

auto descrs = batch.GetDescriptors();
std::shared_ptr<DataType> type = descrs.front().type;
std::shared_ptr<DataType> item_type = checked_cast<const MapType&>(*type).item_type();

Like in ResolveMapArrayLookupType okay?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checked_cast<const MapType&>(*batch[0].type()) is enough, a Datum has its own type.

std::shared_ptr<DataType> item_type = checked_cast<const MapType&>(*type).item_type();
std::shared_ptr<DataType> key_type = checked_cast<const MapType&>(*type).key_type();

if (!options.query_key || !options.query_key->type ||
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does the query_key null checking part, right? @lidavidm
The Errors test also tests for this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this tests if the shared_ptr is null, but a valid Scalar may contain a null value (Scalar.is_valid) - that needs to be checked as well

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Also, we should break out the error separately for !query_key || !query_key->is_valid so that we give a relevant error message. This will crash if we give a nullptr query_key since the error message tries to read query_key->type.)

IndexOptions
JoinOptions
MakeStructOptions
MapArrayLookupOptions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a reminder to revisit and implement this.

To do so, you'll need to:

  1. Add the options class definition to the Cython pxd:
    cdef cppclass CIndexOptions \
    "arrow::compute::IndexOptions"(CFunctionOptions):
    CIndexOptions(shared_ptr[CScalar] value)
    shared_ptr[CScalar] value
  2. Add a Python options class wrapper to the Cython code:
    cdef class _IndexOptions(FunctionOptions):
    def _set_options(self, scalar):
    self.wrapped.reset(new CIndexOptions(pyarrow_unwrap_scalar(scalar)))
    class IndexOptions(_IndexOptions):
    """
    Options for the `index` function.
    Parameters
    ----------
    value : Scalar
    The value to search for.
    """
    def __init__(self, value):
    self._set_options(value)
  3. Add an import so it's in the public API:
    from pyarrow._compute import ( # noqa
  4. Add the options class to the serialization test:
    pc.IndexOptions(pa.scalar(1)),
    (just add it to the list there, no need to add additional assertions to the test)
  5. Add a brief test to make sure everything works (no need to add detailed tests, we just want to make sure the bindings were set up):
    def test_index():
    arr = pa.array([0, 1, None, 3, 4], type=pa.int64())
    assert pc.index(arr, pa.scalar(0)).as_py() == 0
    assert pc.index(arr, pa.scalar(2, type=pa.int8())).as_py() == -1
    assert pc.index(arr, 4).as_py() == 4
    assert arr.index(3, start=2).as_py() == 3
    assert arr.index(None).as_py() == -1
    arr = pa.chunked_array([[1, 2], [1, 3]], type=pa.int64())
    assert arr.index(1).as_py() == 0
    assert arr.index(1, start=2).as_py() == 2
    assert arr.index(1, start=1, end=2).as_py() == -1

@dhruv9vats dhruv9vats requested a review from lidavidm January 31, 2022 18:19
+---------------------+------------+-------------------------------------+------------------+---------------------------------+--------+
| list_parent_indices | Unary | List-like | Int64 | | \(3) |
+---------------------+------------+-------------------------------------+------------------+---------------------------------+--------+
| map_array_lookup | Unary | Map | Computed | :struct:`MapArrayLookupOptions` | \(4) |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really sorry to bikeshed here, but now that I look at it…would "map_lookup" be a better kernel name? I think we only use "array" in a kernel name when the kernel only works on arrays, but this works on arrays and scalars.

Copy link
Contributor Author

@dhruv9vats dhruv9vats Feb 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a problem, better to be rigorous here than have an inconsistent codebase. We just will have to be on the lookout for anything that I've missed renaming.

@lidavidm
Copy link
Member

lidavidm commented Feb 1, 2022

Also just note a few lints from CI:

pyarrow.compute.map_array_lookup
-> pyarrow.compute.Find the items corresponding to a given key in a MapArray.
PR01: Parameters {'scalar'} not documented

pyarrow._compute.MapArrayLookupOptions
PR01: Parameters {'scalar'} not documented

@dhruv9vats dhruv9vats marked this pull request as ready for review February 2, 2022 09:56
@dhruv9vats dhruv9vats requested a review from lidavidm February 2, 2022 14:17
Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, my last two comments, promise :)

@lidavidm lidavidm closed this in 76decf6 Feb 3, 2022
@lidavidm
Copy link
Member

lidavidm commented Feb 3, 2022

Thanks @dhruv9vats!

@ursabot
Copy link

ursabot commented Feb 3, 2022

Benchmark runs are scheduled for baseline = 5ab4112 and contender = 76decf6. 76decf6 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.36% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.48% ⬆️0.78%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants