The name says moe_cuda, but this repo is really for experimenting with a bunch of different parallelism strategies: EP, TP + SP, ring attention, etc.
```bash
./run_cmake.sh
./develop.sh
```

This mirrors the h100_gdn_cuda flow: it builds a `moe_cuda` Python extension through `setup.py` + CMake, then symlinks the built `.so` into the repo root.

Or manually:

```bash
cmake -S . -B build
cmake --build build -j"$(nproc)"
```

- NCCL is mandatory and is discovered via `find_package(NCCL)` first, with an `NCCL_ROOT` fallback search.
- If `Torch_DIR` points to a pip/venv torch install, CMake also checks the sibling `site-packages/nvidia/nccl` automatically.
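The `site-packages/nvidia/nccl` fallback above is just path arithmetic relative to `Torch_DIR`. Here is a hedged sketch of that derivation, not the actual CMake logic; `nccl_sibling_dir` is a hypothetical helper name, and it assumes the standard pip layout (`.../site-packages/torch/share/cmake/Torch`):

```python
import os

def nccl_sibling_dir(torch_dir: str) -> str:
    """Derive the sibling site-packages/nvidia/nccl dir that CMake probes.

    Assumes Torch_DIR has the standard pip layout:
    .../site-packages/torch/share/cmake/Torch
    """
    # Walk up four levels: Torch -> cmake -> share -> torch -> site-packages.
    site_packages = os.path.abspath(os.path.join(torch_dir, *[".."] * 4))
    return os.path.join(site_packages, "nvidia", "nccl")

print(nccl_sibling_dir("/venv/lib/python3.10/site-packages/torch/share/cmake/Torch"))
# -> /venv/lib/python3.10/site-packages/nvidia/nccl
```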
JIT include root is environment-driven. There is no hardcoded library subfolder in include path resolution.
`Compiler::init_static_vars(library_root, cuda_home)` resolves the include path as:

1. `LIBRARY_INCLUDE_PATH` (if set)
2. `${library_root}/include` (fallback)
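That lookup order can be mirrored in a small Python sketch; this is a hedged illustration (`resolve_include_path` is a hypothetical helper, not part of the repo), showing only the precedence rule:

```python
import os

def resolve_include_path(library_root: str) -> str:
    # LIBRARY_INCLUDE_PATH wins when set; otherwise fall back to
    # ${library_root}/include, mirroring Compiler::init_static_vars.
    explicit = os.environ.get("LIBRARY_INCLUDE_PATH")
    return explicit if explicit else os.path.join(library_root, "include")

os.environ.pop("LIBRARY_INCLUDE_PATH", None)
print(resolve_include_path("/opt/moe"))  # -> /opt/moe/include (fallback path)
```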
Recommended .env variables:
```bash
export LIBRARY_ROOT_PATH="/abs/path/to/library/root"
export LIBRARY_INCLUDE_PATH="/abs/path/to/include/root"
export CUDA_HOME_PATH="/usr/local/cuda"
export JIT_CACHE_DIR="/tmp/.moe_cuda"
export JIT_USE_NVRTC=1
```

TODO:

- fix the weird bug happening at high K for kernel2
- implement a fused dispatch + swiglu grouped gemm
- inside here, enforce transposed sfa majors
- keep linking this together, and also remember to zero out the comm_comp_barriers after each dispatch (these are actually initialized inside so it's ok)
- also remember to zero out the tk parallel tensor barrier after each call in an epilogue kernel, as well as any other metadata
- check that all a2a kernels work with dp_size > 1
- [ ] fuse gate matmul + grouped topk kernels into one kernel
- look at fusing SP + TP kernels to match megatron forward pass
- ncu swiglu grouped gemm across different shapes and tactics
- work on testing and benchmarking swiglu + full a2a, find optimization spots
- I suspect a dispatch gemm is what's next
- compare performance of fused swiglu with liger kernel