
feat: Make ak.combinations faster on GPU by using cp.searchsorted to compute output list indexes#3795

Merged
ianna merged 1 commit into scikit-hep:main from shwina:fix-list-index-to-use-searchsorted
Jan 11, 2026

Conversation

@shwina (Contributor) commented Jan 10, 2026

By far the biggest bottleneck in ak.combinations on the GPU is the Python loop that computes the output list indexes. As written, that loop issues a separate memset-style fill per list, which is slow; the same computation can be done in a single vectorized cp.searchsorted call.
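The core idea can be sketched as follows. This is an illustration, not the PR's actual implementation: given the offsets of the output lists (the cumulative combination counts), the list index owning each flattened output slot falls out of one searchsorted call over an arange, instead of a per-list fill loop. np.searchsorted is used here as a CPU stand-in with the same semantics as cp.searchsorted; the counts array is made up for the example.

```python
import numpy as np

# Hypothetical per-list combination counts (C(n, 2) for each input list).
counts = np.array([3, 0, 1, 2], dtype=np.int64)
offsets = np.concatenate([[0], np.cumsum(counts)])  # [0, 3, 3, 4, 6]

total = int(offsets[-1])
# side="right" maps flat position i to the list k whose half-open range
# [offsets[k], offsets[k+1]) contains i; empty lists are skipped naturally.
list_index = np.searchsorted(offsets, np.arange(total), side="right") - 1
print(list_index.tolist())  # [0, 0, 0, 2, 3, 3]
```

On the GPU, replacing np.searchsorted with cp.searchsorted turns this into one kernel launch rather than a Python-driven loop of small device operations.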

The improvement is massive.

Before:

lists=100,000 total_values=1,000,668 mean_len≈10
estimated total_pairs=5,009,024
Time taken for ak.combinations: 2.16 seconds

After:

lists=100,000 total_values=1,000,668 mean_len≈10
estimated total_pairs=5,009,024
Time taken for ak.combinations: 0.0015 seconds

Benchmarking script used:

import numpy as np
import cupy as cp
import awkward as ak
import timeit


# Create random jagged data
np.random.seed(42)
num_lists = 100_000
mean_len = 10
max_len = 40

# Generate random lengths and build offsets
lengths = np.random.poisson(lam=mean_len, size=num_lists).astype(np.int64)
lengths = np.clip(lengths, 0, max_len)
offsets = np.empty(num_lists + 1, dtype=np.int64)
offsets[0] = 0
np.cumsum(lengths, out=offsets[1:])

total_values = int(offsets[-1])
values = np.arange(total_values, dtype=np.int64)

# Build awkward array
layout = ak.contents.ListOffsetArray(
    ak.index.Index64(offsets),
    ak.contents.NumpyArray(values)
)
arr = ak.Array(layout)

# Estimate total pairs
total_pairs = sum(n * (n - 1) // 2 for n in lengths)
print(f"lists={num_lists:,} total_values={total_values:,} mean_len≈{mean_len}")
print(f"estimated total_pairs={total_pairs:,}")

# Move to CUDA
gpu_arr = ak.to_backend(arr, "cuda")
cp.cuda.get_current_stream().synchronize()

# Benchmark using timeit
# warmup:
ak.combinations(gpu_arr, n=2)
result = timeit.timeit(lambda: ak.combinations(gpu_arr, n=2), number=10)
print(f"Time taken for ak.combinations: {result / 10:.4f} seconds")
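As an aside on the script above: the total_pairs estimate sums n*(n-1)//2 over the list lengths in a Python loop; the same count can also be computed in a vectorized way. A minimal sketch, with an illustrative lengths array (not the benchmark's data):

```python
import numpy as np

# Each list of length n yields C(n, 2) = n*(n-1)//2 unordered pairs.
lengths = np.array([4, 0, 2, 10], dtype=np.int64)
total_pairs = int((lengths * (lengths - 1) // 2).sum())
print(total_pairs)  # 6 + 0 + 1 + 45 = 52
```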

codecov bot commented Jan 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.65%. Comparing base (43e2403) to head (415db5b).
⚠️ Report is 1 commit behind head on main.


see 2 files with indirect coverage changes


@shwina shwina changed the title Make ak.combinations faster on GPU by using cp.searchsorted to compute output list indexes feat: Make ak.combinations faster on GPU by using cp.searchsorted to compute output list indexes Jan 10, 2026
@github-actions

The documentation preview is ready to be viewed at http://preview.awkward-array.org.s3-website.us-east-1.amazonaws.com/PR3795

@ianna (Member) left a comment


@shwina - Thanks for this contribution — the performance improvement is genuinely impressive. Moving the list‑index computation from a Python loop to a vectorized cp.searchsorted call is exactly the kind of backend‑aware optimization that pays off at scale. The benchmark numbers you included make the impact very clear.

What looks great

  • Massive speedup: reducing runtime from ~2.16s → ~0.0015s for the provided workload is a transformative improvement for GPU users.

Thank you! 🙏

@ianna ianna merged commit c6fc079 into scikit-hep:main Jan 11, 2026
39 of 40 checks passed

2 participants