cuTile Rust (cutile-rs) is a research project providing a safe, tile-based kernel programming DSL for the Rust programming language.
It features a safe host-side API for passing tensors to asynchronously executed kernel functions.
We are excited to release this research project as a demonstration of how GPU programming can be made available in the Rust ecosystem. The software is in an early stage (-alpha) and under active development: you should expect bugs, incomplete features, and API breakage as we work to improve it. That being said, we hope you'll be interested to try it in your work and help shape its direction by providing feedback on your experience.
Please see Contributing.md if you're interested in contributing to the project.
- NVIDIA GPU with sm_80 or >= sm_100 compute capability. sm_90 is not yet supported.
- CUDA 13.2.
- LLVM 21 with MLIR.
- Rust 1.75+ (nightly required for some features)
- Linux (tested on Ubuntu 24.04)
To install Rust:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup default nightly
Install CUDA 13.2 on your OS by following these instructions: https://developer.nvidia.com/cuda-downloads
To install LLVM-21 with MLIR (see https://apt.llvm.org/ for details):
wget https://apt.llvm.org/llvm.sh
chmod +x llvm.sh
sudo ./llvm.sh 21
sudo apt-get install libmlir-21-dev mlir-21-tools
- Set the env var CUDA_TOOLKIT_PATH to CUDA 13.2.
- Ensure llvm-config points to LLVM 21. Required by melior.
- Set the env var CUDA_TILE_USE_LLVM_INSTALL_DIR to llvm-21 (e.g. /usr/lib/llvm-21). Required by cuda-tile-rs.
The environment needs access to llvm-config in order to resolve llvm (and mlir)-related dependencies.
You can configure multiple llvm builds using update-alternatives:
sudo update-alternatives --install /usr/bin/llvm-config llvm-config /usr/lib/llvm-21/bin/llvm-config 1
sudo update-alternatives --config llvm-config
Example .env/config.toml:
[env]
CUDA_TOOLKIT_PATH = { value = "/usr/local/cuda-13", relative = false }
CUDA_TILE_USE_LLVM_INSTALL_DIR = { value = "/usr/lib/llvm-21", relative = false }
This project depends on the cuda-tile MLIR dialect. Please follow the instructions here to set it up.
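Before building, it can help to confirm that the environment described above is actually in place. The following is a minimal sketch using only the Rust standard library; it is not part of cutile-rs, and the helper names (is_llvm_21, existing_dir_from_env, llvm_config_version, check_environment) are hypothetical. It assumes that llvm-config --version prints a dotted version string such as 21.1.0.

```rust
use std::env;
use std::path::Path;
use std::process::Command;

/// Returns true if a version string such as "21.1.0" has major version 21.
fn is_llvm_21(version: &str) -> bool {
    version.trim().split('.').next() == Some("21")
}

/// Returns the value of `name` if it is set and points to an existing directory.
fn existing_dir_from_env(name: &str) -> Result<String, String> {
    let value = env::var(name).map_err(|_| format!("{name} is not set"))?;
    if Path::new(&value).is_dir() {
        Ok(value)
    } else {
        Err(format!("{name}={value} is not a directory"))
    }
}

/// Runs `llvm-config --version` and returns its trimmed standard output.
fn llvm_config_version() -> Result<String, String> {
    let out = Command::new("llvm-config")
        .arg("--version")
        .output()
        .map_err(|e| format!("could not run llvm-config: {e}"))?;
    Ok(String::from_utf8_lossy(&out.stdout).trim().to_string())
}

/// Hypothetical pre-build check: collects a message for every problem found.
/// An empty result means both variables and llvm-config look as expected.
fn check_environment() -> Vec<String> {
    let mut problems = Vec::new();
    for name in ["CUDA_TOOLKIT_PATH", "CUDA_TILE_USE_LLVM_INSTALL_DIR"] {
        if let Err(msg) = existing_dir_from_env(name) {
            problems.push(msg);
        }
    }
    match llvm_config_version() {
        Ok(v) if is_llvm_21(&v) => {}
        Ok(v) => problems.push(format!("llvm-config reports {v}, expected 21.x")),
        Err(msg) => problems.push(msg),
    }
    problems
}
```

Calling check_environment() from a small main function and printing each returned message gives a quick report of what still needs to be configured.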
Run the hello world example via cargo run --example hello_world.
If everything works, you should see: Hello, I am tile <0, 0, 0> in a kernel with <1, 1, 1> tiles.
use cuda_async::device_operation::DeviceOperation;
use cutile::{self, api, tile_kernel::IntoDeviceOperationPartition};
use my_module::add_async as add;

#[cutile::module]
mod my_module {
    use cutile::core::*;

    #[cutile::entry()]
    fn add<const S: [i32; 2]>(
        z: &mut Tensor<f32, S>,
        x: &Tensor<f32, { [-1, -1] }>,
        y: &Tensor<f32, { [-1, -1] }>,
    ) {
        let tile_x = load_tile_like_2d(x, z);
        let tile_y = load_tile_like_2d(y, z);
        z.store(tile_x + tile_y);
    }
}

fn main() -> () {
    let x = api::ones([32, 32]).arc();
    let y = api::ones([32, 32]).arc();
    let z = api::zeros([32, 32]).partition([4, 4]);
    let (_z, _x, _y) = add(z, x, y).sync();
}
The above example defines a device-side module named my_module, which contains the tile kernel add.
The add kernel is marked as an entry point, allowing it to be executed from the host-side (e.g. the main function).
Our kernel is defined such that x and y are input tensors, and z is an output tensor.
On the host-side, we allocate our device-side tensors x, y and z.
The kernel indicates that z must be mutable. Since the same tile kernel is executed in parallel by many tile threads, we need a way to give each tile thread exclusive access to its own part of z. It is enough to wrap x and y in an Arc (see cuda-async for details). The tensor z, however, is partitioned into sub-tensors of shape 4x4. In cuTile Rust,
any &mut Tensor<...> parameter requires the host to pass a Partition<Tensor<T>> as the argument, and any &Tensor<...> parameter requires the
host to pass an Arc<Tensor<...>> as the argument.
The expression add(z, x, y) constructs a representation of a kernel launcher: a structure which encodes how the GPU applies the kernel to the given arguments. By default, because we have partitioned the 32x32 tensor z into sub-tensors of shape 4x4, the kernel launcher will pick a launch grid of (8, 8, 1), that is, 32 / 4 = 8 tiles along each of the first two axes. Each (x, y, z) coordinate in the launch grid corresponds to a tile thread.
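The arithmetic behind the default launch grid can be sketched in plain Rust. This is an illustration only, not the cutile launcher: launch_grid and tile_coords are hypothetical helpers, and the sketch assumes that the tile shape must divide the tensor shape evenly, which the source does not state.

```rust
/// Hypothetical helper (not part of cutile): the number of tiles along each
/// axis when a 2D tensor of `shape` is split into sub-tensors of shape `tile`.
/// Assumption of this sketch: uneven splits and zero-sized tiles are rejected.
fn launch_grid(shape: [usize; 2], tile: [usize; 2]) -> Option<(usize, usize, usize)> {
    if tile[0] == 0 || tile[1] == 0 {
        return None;
    }
    if shape[0] % tile[0] != 0 || shape[1] % tile[1] != 0 {
        return None;
    }
    // A 2D partition leaves the third grid dimension at 1.
    Some((shape[0] / tile[0], shape[1] / tile[1], 1))
}

/// Enumerates every (x, y, z) coordinate in the grid, one per tile thread.
fn tile_coords(grid: (usize, usize, usize)) -> Vec<(usize, usize, usize)> {
    let mut coords = Vec::with_capacity(grid.0 * grid.1 * grid.2);
    for x in 0..grid.0 {
        for y in 0..grid.1 {
            for z in 0..grid.2 {
                coords.push((x, y, z));
            }
        }
    }
    coords
}
```

For the 32x32 tensor and 4x4 sub-tensors in the example, launch_grid([32, 32], [4, 4]) evaluates to Some((8, 8, 1)), and tile_coords yields 64 coordinates, one for each tile thread.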
The sync method picks the default device on the system and synchronously JIT-compiles the kernel to the default device's architecture and immediately executes the kernel with the provided arguments.
Before executing the user-defined kernel on the device-side, each tile thread is initialized by selecting
a distinct sub-tensor from the partitioning of z as the &mut Tensor<...> kernel parameter.
Each tile thread has exclusive access to a distinct sub-tensor within the partition of z,
allowing for safe parallel mutable access.
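The same ownership pattern can be illustrated on the CPU with nothing but the Rust standard library. The sketch below is an analogy, not cutile code: add_partitioned is a hypothetical function, flat slices stand in for tensors, and scoped OS threads stand in for tile threads. The read-only inputs are shared through Arc, while chunks_mut hands each thread a non-overlapping mutable slice of the output, so the borrow checker can verify that no two threads write to the same element.

```rust
use std::sync::Arc;
use std::thread;

/// CPU-side analogy only: computes z = x + y with one scoped thread per
/// chunk of `z`. `x` and `y` are shared read-only; `z` is split exclusively.
fn add_partitioned(z: &mut [f32], x: Arc<Vec<f32>>, y: Arc<Vec<f32>>, chunk_len: usize) {
    assert!(chunk_len > 0, "chunk_len must be non-zero");
    assert!(x.len() == z.len() && y.len() == z.len(), "length mismatch");
    thread::scope(|s| {
        // chunks_mut yields disjoint &mut slices, so each thread has
        // exclusive access to its own part of the output.
        for (i, chunk) in z.chunks_mut(chunk_len).enumerate() {
            let x = Arc::clone(&x);
            let y = Arc::clone(&y);
            s.spawn(move || {
                let start = i * chunk_len;
                for (j, out) in chunk.iter_mut().enumerate() {
                    *out = x[start + j] + y[start + j];
                }
            });
        }
    });
    // All spawned threads have been joined when the scope returns.
}
```

Trying to give two threads the same mutable chunk would be rejected at compile time, which is the kind of guarantee the Partition argument is described as providing for &mut Tensor<...> kernel parameters.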
- To run the above example, run cargo run --example add_basic within the cutile-examples crate.
- More kernels and usage examples of the host-side API can be found here.
- CUDA Tile dialect bindings: cargo test --package cuda-tile-rs
- cuTile Rust Compiler: cargo test --package cutile-compiler
- cuTile Rust Library: cargo test --package cutile
- Examples: bash cutile-examples/run.sh
- Benchmarks: cargo bench
- Everything: ./scripts/run_all_tests.sh (or pipe to a log file: ./scripts/run_all_tests.sh 2>&1 | tee test_run.log)
The cuda-bindings crate is licensed under the NVIDIA Software License: LICENSE-NVIDIA.
All other crates are licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0