Skip to content

Recursive topk #3395

@crusaderky

Description

@crusaderky

I have an array with shape (500000, ) and chunks=2500.
I need to pick the top 5000 elements out of it.

The current topk implementation is very bad for it, as:

  1. it applies one kernel per chunk of the original input
  2. concatenates all the outputs of the first pass together into a single chunk
  3. performs a single, final sort

As my k is greater than the chunk size, effectively step 2 would create a single-chunk array or 500000 points.
The workaround would be to rechunk the array ahead of topk to larger chunks - which is often not the easiest thing to do when you're implementing a general-purpose library call.

As a solution I suggest to add a recursive=False option to both the method and the function; see prototype below:

https://gist.github.com/crusaderky/35d355d8124141aeea82213e0c7bccde

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions