[FEA] Require null counts on column construction #11968

@vyasr

Description

Is your feature request related to a problem? Please describe.
Currently the null count may be provided when constructing a column, or it may be left unknown to be computed lazily when it is needed. This approach has performance benefits in some cases, but it is problematic for exposing a stream-ordered API because null counts may be used in contexts where a stream is not available and requesting the null count instead triggers a kernel launch on cudf's default stream. Additionally, the current approach sometimes leads to calling code not providing a null count even though it is known, resulting in an unnecessary extra computation.
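To make the problem concrete, here is a minimal host-only sketch of the lazy pattern described above. It is not the libcudf implementation: `lazy_column`, `unknown_null_count`, and the `default_stream_launches` counter are hypothetical names, and a plain loop stands in for the kernel that would run on the default stream.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical host-only model of a lazily computed null count.
// This is an illustration, not the libcudf column class.
constexpr int unknown_null_count = -1;  // sentinel: "not computed yet"

struct lazy_column {
  std::vector<std::uint32_t> mask;          // one bit per row, set bit = valid row
  int size;                                 // number of rows
  mutable int cached_null_count;            // may hold unknown_null_count
  mutable int default_stream_launches = 0;  // stands in for kernel launches

  lazy_column(std::vector<std::uint32_t> m, int n, int null_count = unknown_null_count)
    : mask(std::move(m)), size(n), cached_null_count(null_count)
  {
    assert(n >= 0);
  }

  // The accessor takes no stream, so any work it triggers cannot be
  // ordered on the caller's stream.
  int null_count() const
  {
    if (cached_null_count == unknown_null_count) {
      ++default_stream_launches;  // a real implementation would launch a kernel here
      int nulls = 0;
      for (int i = 0; i < size; ++i) {
        bool const valid = ((mask[i / 32] >> (i % 32)) & 1u) != 0;
        if (!valid) { ++nulls; }
      }
      cached_null_count = nulls;
    }
    return cached_null_count;
  }
};
```

In this model, a column built without a null count pays for one hidden computation on first access, while a column built with a known count never does. That hidden computation is the one that a stream-ordered API has no way to schedule.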

Describe the solution you'd like
We should change the column constructor to require the null count on construction. This change will guarantee that requesting the null count does not result in a kernel launch, which removes the most problematic known blocker for implementing a stream-ordered API in libcudf. We will also need to make the same change to column_view. When constructed from a column, the column_view's null count will be trivially available, but when constructed from an external data pointer it will again be the caller's responsibility to provide the null count.
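A rough host-only sketch of what the proposed contract could look like follows. The class names `column_sketch` and `column_view_sketch` and their signatures are assumptions made for illustration; they are not the actual libcudf constructors.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical owning column: the null count is a required constructor argument.
class column_sketch {
 public:
  column_sketch(std::vector<int> data, std::vector<std::uint32_t> mask, int null_count)
    : data_(std::move(data)), mask_(std::move(mask)), null_count_(null_count)
  {
    assert(null_count_ >= 0 && null_count_ <= size());
  }

  int size() const { return static_cast<int>(data_.size()); }
  int null_count() const { return null_count_; }  // plain read, never computes
  int const* data() const { return data_.data(); }
  std::uint32_t const* mask() const { return mask_.data(); }

 private:
  std::vector<int> data_;
  std::vector<std::uint32_t> mask_;
  int null_count_;
};

// Hypothetical non-owning view with the same requirement.
class column_view_sketch {
 public:
  // From an owning column: the null count is simply copied.
  explicit column_view_sketch(column_sketch const& col)
    : data_(col.data()), mask_(col.mask()), size_(col.size()), null_count_(col.null_count())
  {
  }

  // From external pointers: the caller must supply the null count.
  column_view_sketch(int const* data, std::uint32_t const* mask, int size, int null_count)
    : data_(data), mask_(mask), size_(size), null_count_(null_count)
  {
    assert(null_count >= 0 && null_count <= size);
  }

  int size() const { return size_; }
  int null_count() const { return null_count_; }

 private:
  int const* data_;
  std::uint32_t const* mask_;
  int size_;
  int null_count_;
};
```

Because neither accessor can trigger work, neither needs a stream parameter. The cost moves to the call site, which either already knows the count or computes it explicitly on a stream of its choosing.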

To support these new requirements, we should expose public utility functions for computing the null count given a null mask.
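As an illustration of what such a utility computes, here is a host-only sketch. The name `count_nulls_host` is hypothetical, and a device version would additionally take a stream. The sketch assumes the usual bitmask convention: 32-bit words, least significant bit first, with a set bit marking a valid row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side helper: counts null rows in [start, stop).
// Assumed convention: bit i is set when row i is valid, stored
// least-significant-bit first within each 32-bit word.
int count_nulls_host(std::vector<std::uint32_t> const& mask, int start, int stop)
{
  assert(0 <= start && start <= stop);
  if (mask.empty()) { return 0; }  // no mask at all: every row is valid
  assert(static_cast<std::size_t>(stop) <= mask.size() * 32);

  int nulls = 0;
  for (int i = start; i < stop; ++i) {
    bool const valid = ((mask[i / 32] >> (i % 32)) & 1u) != 0;
    if (!valid) { ++nulls; }
  }
  return nulls;
}
```

For a single mask word `0x0F`, rows 0 through 3 are valid and rows 4 through 7 are null, so the range `[2, 6)` contains two nulls.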

Describe alternatives you've considered
We could instead have the column/column_view constructors accept the null count optionally and compute it when one is not provided. However, that would also require the constructors to accept a stream that may or may not be used. The resulting API is more confusing and requires more logic, whereas the calling code already has all the information available to make the best decision here.

Additional context
There are certain cases that will be made slower by this requirement. Two in particular come to mind whose costs we will need to be cognizant of and attempt to mitigate:

  1. Places where null counts are not needed. In such cases, we will now be requiring a computation that could previously be omitted entirely.
  2. The creation of slices. Slices will also now need to always know their null counts, which could add significant costs to a number of APIs relying on slices. These functions may need to be rewritten to more intelligently precompute the null counts, or be otherwise modified to reduce and/or hide the costs of this extra work.
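One possible way to precompute null counts for many slices of the same column is a prefix sum over the null mask, so that each slice's count becomes a subtraction. This is only an illustrative technique and not something the issue prescribes. The sketch below is host-only, and the names `null_prefix_sums` and `slice_null_count` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: prefix[i] holds the number of null rows in [0, i).
// Assumed convention: a set bit marks a valid row, least significant bit first.
std::vector<int> null_prefix_sums(std::vector<std::uint32_t> const& mask, int size)
{
  assert(size >= 0);
  assert(static_cast<std::size_t>(size) <= mask.size() * 32);

  std::vector<int> prefix(static_cast<std::size_t>(size) + 1, 0);
  for (int i = 0; i < size; ++i) {
    bool const valid = ((mask[i / 32] >> (i % 32)) & 1u) != 0;
    prefix[i + 1] = prefix[i] + (valid ? 0 : 1);
  }
  return prefix;
}

// Null count of the slice [begin, end) using the precomputed sums.
int slice_null_count(std::vector<int> const& prefix, int begin, int end)
{
  assert(0 <= begin && begin <= end);
  assert(static_cast<std::size_t>(end) < prefix.size());
  return prefix[end] - prefix[begin];
}
```

The single pass over the mask is paid once, after which any number of slices can be given their null counts without further scans. Whether the extra memory and the up-front pass pay off depends on how many slices an API creates.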

Metadata


Labels: 1 - On Deck (To be worked on next); feature request (New feature or request); libcudf (Affects libcudf (C++/CUDA) code)
