Important
This repository has been archived and is no longer maintained by NVIDIA. It
was the result of R&D for ray tracing optimized clusters. Some core algorithm
concepts have been adopted and further optimized in meshoptimizer's
meshopt_buildMeshletsSpatial(), to
which we refer interested developers.
nv_cluster_builder is a small generic spatial clustering C++ library, created to cluster triangle meshes for ray tracing. It is very similar to a recursive node splitting algorithm to create a bounding volume hierarchy (BVH). It is limited to axis aligned splits but also produces clusters with desirable attributes for raytracing.
Input
- Spatial locality
${\color{red}\text{Bounding\ boxes}}$ ${\color{blue}\text{Centroids}}$
-
${\color{green}\text{Connectivity}}$ (Optional)- Adjacency lists
- Weights
Output
Cluster items (membership)
- Ranges: { {
${\color{blue}0,4}$ } , {${\color{red}4,4}$ } } - Items: {
${\color{blue}3}$ ,${\color{blue}4}$ ,${\color{blue}6}$ ,${\color{blue}1}$ ,${\color{red}2}$ ,${\color{red}7}$ ,${\color{red}0}$ ,${\color{red}1}$ }
Notable features:
- Primarily spatial, making clusters from bounding boxes
- Optional user-defined weighted adjacency
- Generic, not just triangles
- Customizable [minβmax] cluster sizes
- Parallel, using std::execution
- Segmented API for clustering multiple subsets at once
- Knobs to balance optimizations
For a complete usage example, see https://github.com/nvpro-samples/vk_animated_clusters.
For more details, refer to nvcluster.h (and
optionally nvcluster_storage.hpp).
The tests may also be useful to look through.
#include <nvcluster/nvcluster.h>
#include <nvcluster/nvcluster_storage.hpp>
...
// Create bounding boxes for each item to be clustered
std::vector<nvcluster_AABB> boundingBoxes{
nvcluster_AABB{{0, 0, 0}, {1, 1, 1}}, // for example
...
};
// Generate centroids
std::vector<glm::vec3> centroids(boundingBoxes.size());
for(size_t i = 0; i < boundingBoxes.size(); i++)
{
centroids[i] = 0.5f * (glm::vec3(boundingBoxes[i].bboxMin) + glm::vec3(boundingBoxes[i].bboxMax));
}
// Input
nvcluster_Input input{.itemBoundingBoxes = reinterpret_cast<nvcluster_AABB*>(boundingBoxes.data()),
.itemCentroids = reinterpret_cast<nvcluster_Vec3f*>(centroids.data()),
.itemCount = static_cast<uint32_t>(boundingBoxes.size())};
nvcluster_Config config{
.minClusterSize = 128,
.maxClusterSize = 128,
.costUnderfill = 0.0f, // zero to one (exclusive)
.costOverlap = 0.0f, // zero to one (exclusive)
.preSplitThreshold = 0, // median-split bigger nodes (0=disable)
};
// Create context (there's also ScopedContext in test_util.hpp)
nvcluster_ContextCreateInfo info{};
nvcluster_Context context;
nvclusterCreateContext(&info, &context); // Add error checking
// Create clusters
// This is a thin wrapper with std::vector storage for nvclusterBuild(...)
nvcluster::ClusterStorage clustering;
nvcluster::generateClusters(context, config, input, clustering); // Add error checking, don't leak context etc.
// Do something with the result
for(size_t clusterIndex = 0; clusterIndex < clustering.clusterItemRanges.size(); ++clusterIndex)
{
const nvcluster_Range& range = clustering.clusterItemRanges[clusterIndex];
for(uint32_t clusterItemIndex = 0; clusterItemIndex < range.count; ++clusterItemIndex)
{
uint32_t clusterItem = clustering.items[range.offset + clusterItemIndex];
...
}
}
// If not wrapping the C API,
nvclusterDestroyContext(context);
This library uses CMake and requires C++20. It compiles as a static library by
default. Use -DNVCLUSTER_BUILDER_SHARED=ON to compile a shared library. Data
is passed as structures of arrays and the output must be allocated by the user.
Integration has been verified by directly including it with add_subdirectory:
add_subdirectory(nv_cluster_builder)
...
target_link_libraries(my_target PUBLIC nv_cluster_builder)
If there is interest, please reach out for CMake config files (for
find_package()) or any other features. GitHub issues are welcome.
Just a C++20 compiler.
Parallel execution on linux uses tbb if available. For ubuntu, sudo apt install libtbb-dev.
If tests are enabled (set the CMake BUILD_TESTING variable to ON),
nv_cluster_builder will use FetchContent
to download GoogleTest.
Authors and contact:
- Pyarelal Knowles (pknowles 'at' nvidia.com), NVIDIA
- Karthik Vaidyanathan, NVIDIA
Cluster goals:
- Consistent size for batch processing
- Spatially adjacent
- Well connected
- Small bounding box (low SAH cost)
- Low overlap
- Useful for ray tracing
The algorithm is basic recursive bisection:
- Sorts inputs by centroids on each axis
- Initialize with a root node containing everything
- Recursively split until the desired leaf size is reached
- Compute candidate split costs for all positions in all axes
- Split at the lowest cost, maintaining sorted centroids by partitioning
- Leaves become clusters
Novel additions:
- Limit split candidates to guarantee fixed cluster sizes
- Optimize for full clusters
- Optimize for less bounding box overlap
- Optimize for minimum ratio cut cost if adjacency exists
The optimizations are implemented by converting and summing additional costs with the surface area heuristic (SAH) cost and choosing a split position on any axis with minimum cost.
Only split at
Relax the fixed
For small inputs it is possible that there is no overlap in valid ranges, in
which case the algorithm falls back to choosing just one. Similarly to the fixed
A cluster "underfill" cost is introdued to encourage bigger clusters. For
example, in the figure below a split position is being considered for clusters
in the range [1, 4]. The split candidate would produce a node of 2.75 clusters
on the left and 1.25 on the right. This results in costUnderfill constant, but a transfer function to
model the true cost, of e.g. perf or memory, would be ideal.
Bounding box overlap is bad for ray tracing because rays must enter both while
in the overlap volume. A cost is added for overlapping bounding boxes, ver much
like SAH it is just costOverlap constant.
If provided, adjacency is integrated by adding the cut cost - the sum of weights of all item connections broken by the split - to each candidate split position. Ratio cut [Wei and Cheng, 1989] is used to avoid degenerate solutions. The cut cost is arbitrarily scaled by the number of items in the node to be SAH relative and added to the other costs above.
To compute the cut cost, the adjacency data is rebuilt to reference node items before each iteration of recursive node splitting. This allows cut costs to be computed with a prefix sum scan of summed starting and ending connection weights.
To explain, there are initially three arrays, sorted by item centroids in X, Y and Z respectively. After splitting, these arrays are partitioned, maintaining sorted order within nodes. These hold original item indices and in fact trivially hold the clustering result after splitting. The input adjacency arrays index original items, but we instead need the index in those initially sorted arrays. This is done by duplicating the adjacency arrays and scatter writing their node-sorted indices. The image below shows this for one axis.
When computing cut costs for a node, an array of summed weights is created. The image below shows an example with unit weights. The array is initialized with the sum of connecting item weights - positive for connections to the right and negative for connections to the left. Connections to other nodes are ignored. The reindexed adjacency arrays trivially give this information, comparing the connection index with the current item's index and the node boundaries. The weights array is then prefix summed to obtain the cut cost for each position in the node.
The BibTex entry to cite nv_cluster_builder is
@online{nv_cluster_builder,
title = {{{NVIDIA}}\textregistered{} {nv_cluster_builder}},
author = {{NVIDIA}},
year = 2025,
url = {https://github.com/nvpro-samples/nv_cluster_builder},
urldate = {2025-01-30},
}Clusters are created by making recursive axis aligned splits. This is useful as it greatly reduces the search space and improves performance when clusters are used in ray tracing. However, more general clustering solutions than axis aligned splits are not considered.
Recursively splitting is done greedily, picking the lowest cost split which may not be a global optimum.
The algorithm is primarily spatial due to splitting in order of centroids, but
solutions can be skewed by adjusting the costs in nvcluster::Config and
adjacency weights in nvcluster::Graph::connectionWeights. For example, choosing
adjacency weights to represent connected triangles or number of shared vertices
can result in more vertex reuse within clusters. Weights may also represent face
normal similarity or a balance of multiple attributes.
Badly chosen weights can result in degenerate solutions where recursive bisection splits off single leaves. This is both slow and rarely desirable.
Parallel execution is only supported with libstdc++ and MSVC STL, not libc++.