Skip to content

Ensure columns have valid null counts in CUDF JNI.#13355

Merged
rapids-bot[bot] merged 13 commits intorapidsai:branch-23.06from
mythrocks:jni-nullcount
May 18, 2023
Merged

Ensure columns have valid null counts in CUDF JNI.#13355
rapids-bot[bot] merged 13 commits intorapidsai:branch-23.06from
mythrocks:jni-nullcount

Conversation

@mythrocks
Copy link
Copy Markdown
Contributor

@mythrocks mythrocks commented May 16, 2023

Description

Fixes #13353.
Depends on #13345.

In preparation for #11968, this change ensures that columns constructed from CUDF JNI do not have their null counts set to UNKNOWN_NULL_COUNT (i.e. -1). In cases where the caller invokes JNI functions with UNKNOWN_NULL_COUNT, the JNI layer computes the concrete null count from the validity mask, and sets this value in the column.

The current Java API remains unchanged; there should be no impact to user code.

The option to specify an optional null count through the Java API will likely be removed at a later date.

Signed-off-by: MithunR mythrocks@gmail.com

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

vyasr and others added 7 commits May 11, 2023 17:45
Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com>
Fixes rapidsai#13353.

In preparation for rapidsai#11968, this change ensures that columns constructed from CUDF JNI do not have their null counts set to `UNKNOWN_NULL_COUNT` (i.e. `-1`). In cases where the caller invokes JNI functions with `UNKNOWN_NULL_COUNT`, the JNI layer computes the concrete null count from the validity mask, and sets this value in the column.

The option to specify an optional null count through the Java API will likely be removed at a later date.

Signed-off-by: MithunR <mythrocks@gmail.com>
@mythrocks mythrocks added 2 - In Progress Currently a work in progress improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels May 16, 2023
@mythrocks mythrocks requested review from a team as code owners May 16, 2023 04:48
@mythrocks mythrocks self-assigned this May 16, 2023
@mythrocks mythrocks requested review from bdice, charlesbluca and vuule May 16, 2023 04:48
@github-actions github-actions bot added Java Affects Java cuDF API. Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels May 16, 2023
@mythrocks
Copy link
Copy Markdown
Contributor Author

This diff includes the changes in #13345, but it applies only to the Java API.

@mythrocks mythrocks added Spark Functionality that helps Spark RAPIDS and removed libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. labels May 16, 2023
@github-actions github-actions bot added Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels May 16, 2023
@vyasr
Copy link
Copy Markdown
Contributor

vyasr commented May 16, 2023

#13345 is merged, so @mythrocks you should be able to rebase on the latest 23.06 now.

@jlowe Mithun mentioned that for some partitioning applications Spark really doesn't need to know the null count of partitioned columns because the columns are ultimately reassembled and therefore only the total count/mask matters. As a future optimization, I noted that in those cases it may make sense to eventually switch to passing around the column's device buffers instead of the columns themselves if avoiding the extra intermediate null counts has a significant performance benefit.

@vyasr vyasr mentioned this pull request May 17, 2023
3 tasks
mythrocks added 2 commits May 17, 2023 13:50
1. Use column::null_count() instead of computing explicitly.
2. Remove reference to UNKNOWN_NULL_COUNT.
@mythrocks mythrocks requested a review from ttnghia May 17, 2023 21:40
@vyasr vyasr removed request for a team, bdice, charlesbluca and vuule May 17, 2023 22:03
mythrocks added a commit to mythrocks/spark-rapids-jni that referenced this pull request May 17, 2023
This is in prep for rapidsai/cudf#11968 and
rapidsai/cudf#13372.

`libcudf` will soon require that all CUDF columns are created with a
known null-count. `UNKNOWN_NULL_COUNT` will no longer be supported,
or even available as a code constant.

This change replicates part of rapidsai/cudf#13355,
as it applies to `row_conversion.cu`. The (single) reference to
the unknown-null-count is replaced with a pre-calculated value.

Signed-off-by: MithunR <mythrocks@gmail.com>
@mythrocks
Copy link
Copy Markdown
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 12060af into rapidsai:branch-23.06 May 18, 2023
@mythrocks
Copy link
Copy Markdown
Contributor Author

Thanks for the reviews, all. I've merged this change.

mythrocks added a commit to NVIDIA/spark-rapids-jni that referenced this pull request May 18, 2023
This is in prep for rapidsai/cudf#11968 and
rapidsai/cudf#13372.

`libcudf` will soon require that all CUDF columns are created with a
known null-count. `UNKNOWN_NULL_COUNT` will no longer be supported,
or even available as a code constant.

This change replicates part of rapidsai/cudf#13355,
as it applies to `row_conversion.cu`. The (single) reference to
the unknown-null-count is replaced with a pre-calculated value.

Signed-off-by: MithunR <mythrocks@gmail.com>
rapids-bot bot pushed a commit that referenced this pull request May 24, 2023
This is the final PR for removing `UNKNOWN_NULL_COUNT` and the implicit kernel launch in the `null_count` methods of `column` and `column_view`. 

Depends on #13355 and #13341.

Closes #11968

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - MithunR (https://github.com/mythrocks)
  - Nghia Truong (https://github.com/ttnghia)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #13372
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

2 - In Progress Currently a work in progress improvement Improvement / enhancement to an existing function Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API. Spark Functionality that helps Spark RAPIDS

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEA] JNI should calculate null count when it is unknown

5 participants