There are many strong libraries for numerical computing. Most of them are written in C, C++, and Fortran, with excellent Rust wrappers and Python bindings on top.
Rust is especially convenient for dependency management and reproducible benchmarking, which makes it a good place to line up apples-to-apples comparisons across native crates and their Python bindings. NumWars exists for the same reason StringWars exists for StringZilla: to compare NumKong against mainstream CPU stacks on the workloads it was built for, including:
- `ndarray` and `nalgebra` for dense tensor and linear-algebra kernels.
- `faer` and `matrixmultiply` for GEMM-like Rust baselines.
- `geo` for geographic distances.
- `polars` for reduction-heavy analytics workloads.
- NumPy, SciPy, and scikit-learn on Python.
Of course, the APIs and internal kernels of those projects are different.
So this repository focuses on the workload families NumKong was designed for and compares their effective throughput using the native unit for each operation family, instead of forcing everything into a single synthetic ops/s figure.
> [!IMPORTANT]
> The numbers below are reference measurements collected single-threaded on the P-cores of an Apple M5 Pro, with an otherwise idle system. They will vary with CPU model, compiler flags, BLAS backend, and problem size. Rebuild and rerun on your own hardware before treating them as absolute.
NumKong packed dots are mixed-precision by design: i8 inputs produce i32 outputs, bf16 and f16 inputs produce f32 outputs, and f32 inputs produce f64 outputs. The mainstream baselines shown here keep f32 → f32. Compared to Rust projects:
NumKong:
numkong::Tensor::dots_packed i8 → i32 ███████████████████████████████████ 2,783.00 GSO/s
numkong::Tensor::dots_packed bf16 → f32 ████████████████ 1,250.80 GSO/s
numkong::Tensor::dots_packed f16 → f32 ████████████████ 1,249.70 GSO/s
numkong::Tensor::dots_packed f32 → f64 ███ 197.79 GSO/s
Alternatives:
nalgebra::DMatrix × DMatrixᵀ f32 → f32 ██ 118.72 GSO/s
ndarray::ArrayBase::dot f32 → f32 ██ 117.51 GSO/s
faer::linalg::matmul::matmul f32 → f32 ██ 117.50 GSO/s
matrixmultiply::sgemm f32 → f32 ██ 116.77 GSO/s
Compared to Python:
NumKong:
numkong.dots_packed i8 → i32 ████████████████████████████████████████████ 2,621.97 GSO/s
numkong.dots_packed bf16 → f32 ███████████████████ 1,142.19 GSO/s
numkong.dots_packed f16 → f32 ███████████████████ 1,134.69 GSO/s
numkong.dots_packed f32 → f64 ███ 194.15 GSO/s
Alternatives:
numpy.matmul f32 → f32 ███████████████████████████████ 1,854.27 GSO/s
See dots/README.md for details.
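The promotion that the packed-dot charts highlight can be reproduced with plain NumPy. The sketch below (using small illustrative shapes, not the benchmark's configuration) compares an f32 × f32 matmul accumulated in f32 against the same product accumulated in f64, which is the kind of result an f32 → f64 packed dot returns.

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.standard_normal((256, 2048)).astype(np.float32)
b = rng.standard_normal((2048, 256)).astype(np.float32)

# Baseline behavior: f32 inputs, f32 accumulation, f32 output.
c32 = a @ b

# Promoted accumulation, mimicking an f32 -> f64 packed dot:
# upcast once, then multiply-accumulate in f64.
c64 = a.astype(np.float64) @ b.astype(np.float64)

# The f32 result drifts from the f64 reference because 2048
# partial products were rounded to 24-bit mantissas along the way.
err = np.max(np.abs(c32.astype(np.float64) - c64))
print(c32.dtype, c64.dtype, err)
```

The drift is small per element but grows with the contraction dimension, which is why the mixed-precision variants above are not strictly comparable to the f32 → f32 baselines.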
Single-pair vector kernels at 2048 dimensions. This merges dot-product and true Euclidean-distance measurements into one throughput-sorted view. NumKong keeps its mixed-precision promotions, while the baseline libraries mostly stay in their input type.
Compared to Rust projects:
NumKong:
numkong::Dot::dot i8 → i32 █████████████████████████████████████ 37.41 GSO/s
numkong::Dot::dot u8 → u32 █████████████████████████████████████ 37.34 GSO/s
numkong::Euclidean::euclidean i8 → f32 ███████████████████████████████████ 34.91 GSO/s
numkong::Euclidean::euclidean u8 → f32 ███████████████████████████████████ 34.85 GSO/s
numkong::Dot::dot bf16 → f32 ███████████████ 14.83 GSO/s
numkong::Euclidean::euclidean bf16 → f32 ██████ 5.54 GSO/s
numkong::Dot::dot f32 → f64 ███ 2.50 GSO/s
numkong::Euclidean::euclidean f32 → f64 ███ 2.50 GSO/s
Alternatives:
nalgebra::Matrix::dot f32 → f32 █████████████ 12.95 GSO/s
ndarray::ArrayBase::dot f32 → f32 █████████████ 12.77 GSO/s
nalgebra (a - b).norm() f32 → f32 ████████ 7.74 GSO/s
ndarray sqrt((a - b)·(a - b)) f32 → f32 ████████ 7.61 GSO/s
Compared to Python:
NumKong:
numkong.euclidean i8 → f32 ██████████████████████████████████████████████████████████ 10.32 GSO/s
numkong.euclidean u8 → f32 █████████████████████████████████████████████████████████ 10.24 GSO/s
numkong.angular u8 → f32 ████████████████████████████████████████████████████████ 10.07 GSO/s
numkong.angular i8 → f32 ████████████████████████████████████████████████████████ 10.04 GSO/s
numkong.dot i8 → f32 ███████████████████████████████████████████████████████ 9.81 GSO/s
numkong.dot u8 → f32 ████████████████████████████████████████████████████ 9.37 GSO/s
numkong.dot f64 → f32 ██████████████ 2.48 GSO/s
numkong.euclidean f32 → f32 ████████████ 2.15 GSO/s
Alternatives:
numpy.dot f32 → f32 █████████████████████ 3.81 GSO/s
numpy.dot f64 → f64 ████████████████████ 3.68 GSO/s
See similarity/README.md for details.
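Why the i8 → i32 promotion in the charts above matters is easy to demonstrate: a dot product accumulated in the narrow input type silently wraps around. A minimal NumPy sketch, independent of NumKong's API:

```python
import numpy as np

a = np.array([100, 100, 100, 100], dtype=np.int8)
b = np.array([100, 100, 100, 100], dtype=np.int8)

# Accumulating in the input type overflows: the true dot is 40,000,
# far outside int8's [-128, 127] range, so the result wraps around.
narrow = np.dot(a, b)

# Promoting to a wider accumulator first, as an i8 -> i32 kernel
# does internally, preserves the exact result.
wide = np.dot(a.astype(np.int32), b.astype(np.int32))

print(int(narrow), int(wide))  # narrow has wrapped; wide is 40000
```

Integer kernels that promote internally get both the speed of narrow loads and the correctness of a wide accumulator, which is why they top the sorted lists.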
Matrix-vs-matrix comparisons at 2048 rows by 2048 dimensions. These are the packed many-to-many siblings of the pairwise spatial kernels above. The merged lists below include angular and euclidean metrics, and the headline unit is GSO/s.
Compared to Rust projects:
NumKong:
numkong::Tensor::euclideans_packed u8 → f32 ███████████████████████████████ 888.74 GSO/s
numkong::Tensor::euclideans_packed i8 → f32 ███████████████████████████████ 887.85 GSO/s
numkong::Tensor::angulars_packed u8 → f32 █████████████████████████████ 830.14 GSO/s
numkong::Tensor::angulars_packed i8 → f32 █████████████████████████████ 830.13 GSO/s
numkong::Tensor::euclideans_packed bf16 → f32 ██████████████████ 524.00 GSO/s
numkong::Tensor::angulars_packed bf16 → f32 █████████████████ 502.45 GSO/s
numkong::Tensor::euclideans_packed f32 → f64 ████ 92.93 GSO/s
numkong::Tensor::angulars_packed f32 → f64 ████ 92.52 GSO/s
Alternatives:
ndarray euclidean matrix f32 → f32 ██ 57.64 GSO/s
ndarray angular matrix f32 → f32 ██ 56.98 GSO/s
nalgebra angular matrix f32 → f32 ██ 49.95 GSO/s
nalgebra euclidean matrix f32 → f32 ██ 49.79 GSO/s
Compared to Python through SciPy cdist:
NumKong:
numkong.euclideans_packed u8 → f32 █████████████████████████████████████████ 425.91 GSO/s
numkong.euclideans_packed i8 → f32 ████████████████████████████████████████ 408.64 GSO/s
numkong.angulars_packed i8 → f32 ██████████████████████████████████████ 386.96 GSO/s
numkong.angulars_packed u8 → f32 ███████████████████████████████████ 364.01 GSO/s
numkong.angulars_packed f32 → f64 ████████ 79.26 GSO/s
numkong.euclideans_packed f32 → f64 █████ 52.95 GSO/s
Alternatives:
scipy.cdist euclidean f32 → f64 █ 5.09 GSO/s
scipy.cdist cosine f32 → f64 █ 1.30 GSO/s
See similarities/README.md for details.
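For reference, the SciPy baseline in the chart above is the standard all-pairs distance call. A small sketch (with illustrative shapes) showing what `cdist` computes and how one entry checks out against the Euclidean definition:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8))   # 4 query vectors
b = rng.standard_normal((5, 8))   # 5 database vectors

# All-pairs distance matrices, shape (4, 5): one row per query vector.
d_euc = cdist(a, b, metric="euclidean")
d_cos = cdist(a, b, metric="cosine")

# Spot-check one entry against the definition of Euclidean distance.
ref = np.sqrt(np.sum((a[0] - b[0]) ** 2))
print(d_euc.shape, np.isclose(d_euc[0, 0], ref))
```

Note that SciPy's `cosine` metric is a distance (1 minus cosine similarity), which is what the "angular" rows are compared against here.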
Bandwidth-sensitive elementwise kernels (add and scale) over 1,000,000 elements, with the sum kernel shown as a representative sample. In Rust:
NumKong:
numkong::EachSum i8 → i8 ███████████████████████████████████████████████████████████████████████████ 111.47 GB/s
numkong::EachSum f32 → f32 ██████████████████████████████████████████████████████████████████ 97.55 GB/s
numkong::EachSum f16 → f16 █████████████████████████████████████████████████████████████████ 96.56 GB/s
Alternatives:
nalgebra::add f32 → f32 ████████████████████████████████████████████████████████████████ 95.31 GB/s
ndarray::add f32 → f32 ████████████████████████████████████████████████████████████████ 94.84 GB/s
serial code f32 → f32 ███████████████████████████████████████████████████████████████ 94.06 GB/s
In Python:
numpy.add i8 → i8 ████████████████████████████████████████████████████████████████████████ 143.56 GB/s
numkong.add i8 → i8 ██████████████████████████████████████████████████████████████ 123.77 GB/s
numkong.add f32 → f32 ███████████████████████████████████████████████████████████ 118.39 GB/s
numpy.add f32 → f32 ██████████████████████████████████████████████████████████ 115.32 GB/s
numpy.add f64 → f64 █████████████████████████████████████████████████████████ 114.37 GB/s
numkong.add f16 → f16 ██████████████████████████████████████████████████████ 107.29 GB/s
numkong.add f64 → f64 ██████████████████████████████████████████████████ 100.01 GB/s
numkong.add bf16 → bf16 █████████████████████████████████████ 73.27 GB/s
numpy.add f16 → f16 ██ 4.08 GB/s
See each/README.md for details.
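The GB/s figures above follow the usual streaming-traffic accounting: an elementwise add reads two arrays and writes one, so each call moves `3 * n * itemsize` bytes. A minimal NumPy timing sketch under that assumption (not the harness this repo actually uses):

```python
import time
import numpy as np

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.empty_like(a)

reps = 100
start = time.perf_counter()
for _ in range(reps):
    np.add(a, b, out=out)  # out= avoids a fresh allocation per call
elapsed = time.perf_counter() - start

# Two input streams plus one output stream per call.
bytes_moved = 3 * n * a.itemsize * reps
print(f"{bytes_moved / elapsed / 1e9:.2f} GB/s")
```

Writing into a preallocated `out=` buffer matters: letting NumPy allocate a new result array each iteration measures the allocator as much as the kernel.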
Horizontal reductions over 1,000,000 elements. The suite covers sum and row-wise L2 norms. In Rust:
polars::ChunkedArray::sum f64 → f64 ██████████████████████████████████████ 113.57 GB/s
polars::ChunkedArray::sum f32 → f32 █████████████████████████████████████ 110.70 GB/s
ndarray::ArrayBase::sum f64 → f64 █████████████████████████████████ 99.49 GB/s
ndarray::ArrayBase::sum f32 → f32 █████████████████ 49.83 GB/s
numkong::reduce_moments().sum f32 → f64 ████ 10.31 GB/s
serial sum loop f32 → f32 ███ 8.50 GB/s
Row-wise L2 norms over a 2048×2048 matrix:
ndarray row norms f64 → f64 ███████████████████████████████████████ 89.72 GB/s
ndarray row norms f32 → f32 ███████████████████████ 53.24 GB/s
numkong::Dot self-dot + sqrt bf16 → f32 █████████████ 30.64 GB/s
numkong::Dot self-dot + sqrt f64 ██████████ 23.44 GB/s
serial row norms loop f64 → f64 ████████ 17.95 GB/s
numkong::Dot self-dot + sqrt f32 █████ 10.60 GB/s
serial row norms loop f32 → f32 ████ 9.20 GB/s
In Python over 1,000,000 elements:
numpy.sum f64 → f64 ███████████████████████████████████████████████████████████ 61.26 GB/s
numpy.sum f32 → f32 █████████████████████████████████ 33.92 GB/s
numpy.linalg.norm f64 → f64 █████████████████████████████ 30.26 GB/s
numkong.sum u8 → u8 █████████████████████ 21.78 GB/s
numkong.sum i8 → i8 █████████████████████ 21.40 GB/s
numpy.linalg.norm f32 → f64 ███████████████████ 20.15 GB/s
numkong.norm f64 → f64 █████████████████ 17.44 GB/s
numkong.sum f64 → f64 ████████████████ 16.34 GB/s
numkong.norm f32 → f64 ███████████████ 15.10 GB/s
numkong.sum f32 → f32 █████████ 9.49 GB/s
numpy.sum i8 → i8 ██████ 6.73 GB/s
See reduce/README.md for details.
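The "self-dot + sqrt" rows above refer to the standard way of getting row-wise L2 norms from a dot kernel. A NumPy sketch of both formulations, with the reference accumulated in f64 the way an f32 → f64 reduction would:

```python
import numpy as np

rng = np.random.default_rng(1)
m = rng.standard_normal((2048, 2048)).astype(np.float32)

# Row-wise L2 norms: one norm per row, 2048 outputs.
norms = np.linalg.norm(m, axis=1)

# Equivalent "self-dot + sqrt" formulation: dot each row with itself,
# accumulated in f64, then take the square root.
ref = np.sqrt(np.einsum("ij,ij->i", m.astype(np.float64), m))

print(norms.shape, np.allclose(norms, ref, rtol=1e-5))
```

Both formulations read the matrix exactly once, so the GB/s gaps in the chart come down to vectorization and accumulator width, not memory traffic.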
ColBERT-style late interaction with 2048 query vectors, 2048 document vectors, and 2048 dimensions. NumKong promotes f32 → f64 here as well, while ndarray stays in f32. In Rust:
NumKong:
numkong::MaxSimPackedMatrix::score f32 → f64 ██████████████████████████████ 1,483.41 GSO/s
numkong::MaxSimPackedMatrix::score bf16 → f32 ████████████████████ 983.57 GSO/s
numkong::MaxSimPackedMatrix::score f16 → f32 ████████████████████ 980.33 GSO/s
Alternatives:
ndarray Q @ Dᵀ max-reduce f32 → f32 ██ 58.37 GSO/s
Compared to Python:
NumKong:
numkong.maxsim_packed f32 → f64 ████████████████████████████████████████████ 2,425.72 GSO/s
numkong.maxsim_packed bf16 → f32 ██████████████████████ 1,236.30 GSO/s
numkong.maxsim_packed f16 → f32 █████████████ 696.78 GSO/s
Alternatives:
numpy matmul f32 → f32 ████████████████████████████ 1,525.56 GSO/s
See maxsim/README.md for details.
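The "Q @ Dᵀ max-reduce" baseline is the textbook MaxSim definition: for every query vector, take its best-matching document vector and sum those maxima into one score. A NumPy reference sketch with small illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(7)
q = rng.standard_normal((16, 64)).astype(np.float32)  # query vectors
d = rng.standard_normal((32, 64)).astype(np.float32)  # document vectors

# Late interaction: all pairwise dot products, then a max over the
# document axis, then a sum over the query axis.
sim = q @ d.T                    # shape (16, 32)
score = sim.max(axis=1).sum()    # one scalar per (query, document) pair

print(sim.shape, float(score))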
Throughput over 2048 coordinate pairs. The unit is MP/s, or million coordinate pairs per second. The merged lists below include both Haversine and Vincenty distances.
Compared to Rust projects:
NumKong:
numkong::haversine f32 → f32 ████████████████████████████████████████████ 491.98 MP/s
numkong::haversine f64 → f64 ██████████████ 149.72 MP/s
numkong::vincenty f32 → f32 ███████ 71.64 MP/s
numkong::vincenty f64 → f64 ██ 13.73 MP/s
Alternatives:
geo::Haversine distance f32 → f32 █████████████ 136.96 MP/s
geo::Haversine distance f64 → f64 █████████ 92.48 MP/s
geo::Vincenty distance f64 → f64 █ 2.76 MP/s
Compared to Python and its alternatives:
NumKong:
numkong.haversine f32 → f32 ████████████████████████████████████████ 444.38 MP/s
numkong.haversine f64 → f64 ████████████ 132.85 MP/s
numkong.vincenty f32 → f32 ██████ 65.89 MP/s
numkong.vincenty f64 → f64 █ 11.93 MP/s
Alternatives:
geopy.distance.great_circle f64 → f64 0.47 MP/s
geopy.distance.geodesic f64 → f64 0.03 MP/s
See geospatial/README.md for details.
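For context, the haversine rows above all evaluate the same great-circle formula. A self-contained NumPy version (the mean Earth radius is an assumed constant; libraries differ in which radius they use):

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between coordinates given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# One degree of longitude along the equator is roughly 111.2 km.
print(float(haversine(0.0, 0.0, 0.0, 1.0)))
```

Vincenty (and geopy's `geodesic`) instead iterates on an ellipsoidal model, which explains both its higher accuracy and its much lower MP/s in the charts.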
Throughput over point clouds with 2048 3D points each. The unit is MP/s, or million 3D points per second. The labels include the full return signature so RMSD and Kabsch can share one sorted list cleanly. In Rust:
NumKong:
numkong::MeshAlignment::rmsd f16 → f16 ████████████████████████████████ 2,864.47 MP/s
numkong::MeshAlignment::rmsd bf16 → bf16 ████████████████████████████████ 2,861.70 MP/s
numkong::MeshAlignment::rmsd f64 → f64 █████████████████████ 1,859.32 MP/s
numkong::MeshAlignment::rmsd f32 → f32 ███████████████████ 1,626.67 MP/s
numkong::MeshAlignment::kabsch f16 → f16 ████████ 696.00 MP/s
numkong::MeshAlignment::kabsch bf16 → bf16 ████████ 691.01 MP/s
numkong::MeshAlignment::umeyama bf16 → bf16 ████████ 673.50 MP/s
numkong::MeshAlignment::umeyama f16 → f16 ███████ 614.06 MP/s
numkong::MeshAlignment::kabsch f32 → f32 █████ 396.52 MP/s
numkong::MeshAlignment::umeyama f32 → f32 █████ 376.48 MP/s
numkong::MeshAlignment::kabsch f64 → f64 ████ 331.70 MP/s
numkong::MeshAlignment::umeyama f64 → f64 ████ 325.16 MP/s
Alternatives:
nalgebra-based RMSD f32 → f32 ███████ 634.04 MP/s
nalgebra-based Kabsch f32 → f64 ████ 283.16 MP/s
nalgebra-based Umeyama f32 → f64 ███ 255.14 MP/s
Compared to Python and its alternatives:
NumKong:
numkong.rmsd f64 → f64 ███████████████████████████████ 1,311.77 MP/s
numkong.rmsd f32 → f64 █████████████████████████████ 1,228.00 MP/s
numkong.kabsch f32 → f64 █████████ 360.08 MP/s
numkong.umeyama f32 → f64 ████████ 327.01 MP/s
numkong.umeyama f64 → f64 ███████ 296.67 MP/s
numkong.kabsch f64 → f64 ███████ 285.81 MP/s
Alternatives:
numpy-based RMSD f32 → f64 ███ 124.48 MP/s
numpy-based RMSD f64 → f64 ███ 117.30 MP/s
biopython SVDSuperimposer (Kabsch) f64 → f64 2.92 MP/s
biopython SVDSuperimposer (Kabsch) f32 → f64 2.88 MP/s
See mesh/README.md for details.
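The Kabsch rows above all solve the same problem: find the rotation that best aligns one point cloud onto another. The standard SVD-based construction is short enough to sketch in NumPy (this is the generic algorithm, not NumKong's implementation):

```python
import numpy as np

def kabsch(p, q):
    """Optimal rotation aligning point cloud p onto q (both n x 3)."""
    p0, q0 = p - p.mean(axis=0), q - q.mean(axis=0)  # center both clouds
    h = p0.T @ q0                                    # 3x3 cross-covariance
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))           # guard against reflections
    return vt.T @ np.diag([1.0, 1.0, d]) @ u.T

# Sanity check: recover a known rotation from a rotated, translated cloud.
rng = np.random.default_rng(3)
p = rng.standard_normal((2048, 3))
theta = 0.5
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
q = p @ rot.T + np.array([1.0, -2.0, 0.5])

print(np.allclose(kabsch(p, q), rot, atol=1e-6))
```

Umeyama extends the same construction with a scale factor, and RMSD can skip the rotation solve entirely when only the residual is needed, which is why it tops the sorted list.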
Every Rust benchmark is a Criterion harness behind a Cargo feature gate. Run one suite at a time or all at once:

```sh
# One suite: default 2048-element workload
RUSTFLAGS="-C target-cpu=native" \
    cargo bench --features bench_similarity --bench bench_similarity

# All suites
RUSTFLAGS="-C target-cpu=native" \
    cargo bench --features all
```

Tuning knobs (environment variables):
| Variable | Default | Purpose |
|---|---|---|
| `NUMWARS_DIMS` | 2048 | Vector / matrix dimension shared by most suites |
| `NUMWARS_DIMS_HEIGHT` | 2048 | Row count for GEMM workloads (dots, maxsim) |
| `NUMWARS_DIMS_WIDTH` | 2048 | Column count for GEMM workloads (dots, maxsim) |
| `NUMWARS_DIMS_DEPTH` | 2048 | Shared (contraction) dimension for GEMM workloads |
| `NUMWARS_FILTER` | (none) | Regex to select benchmarks by name |
| `NUMWARS_WARMUP_SECONDS` | 3.0 | Criterion warm-up time |
| `NUMWARS_PROFILE_SECONDS` | 10.0 | Criterion measurement time |
| `NUMWARS_SAMPLE_SIZE` | 50 | Criterion sample count |
Install with uv and run any suite directly:

```sh
uv run --with "numkong,numpy,scipy,tabulate,ml_dtypes" \
    python similarity/bench.py
```

Or install all extras and run from the repo root:

```sh
pip install -e ".[similarity,each,dots,geospatial,mesh,reduce,similarities]"
python dots/bench.py
python similarities/bench.py
```

- similarity/README.md
- similarities/README.md
- dots/README.md
- each/README.md
- reduce/README.md
- maxsim/README.md
- geospatial/README.md
- mesh/README.md
Apache 2.0. See LICENSE.
