The name says moe_cuda, but this repo is really for experimenting with a bunch of different parallelism strategies: EP, TP + SP, ring attention, etc.
```bash
./run_cmake.sh
./develop.sh
```

This mirrors the h100_gdn_cuda flow: it builds a `moe_cuda` Python extension through `setup.py` + CMake, then symlinks the built `.so` into the repo root.

Or manually:

```bash
cmake -S . -B build
cmake --build build -j"$(nproc)"
```

- NCCL is mandatory and is discovered via `find_package(NCCL)` first, with an `NCCL_ROOT` fallback search.
- If `Torch_DIR` points to a pip/venv torch install, CMake also checks the sibling `site-packages/nvidia/nccl` automatically.
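The `site-packages/nvidia/nccl` fallback above is just path arithmetic relative to `Torch_DIR`. Here is a hedged sketch of that derivation, not the actual CMake logic; `nccl_sibling_dir` is a hypothetical helper name, and it assumes the standard pip layout (`.../site-packages/torch/share/cmake/Torch`):

```python
import os

def nccl_sibling_dir(torch_dir: str) -> str:
    """Derive the sibling site-packages/nvidia/nccl dir that CMake probes.

    Assumes Torch_DIR has the standard pip layout:
    .../site-packages/torch/share/cmake/Torch
    """
    # Walk up four levels: Torch -> cmake -> share -> torch -> site-packages.
    site_packages = os.path.abspath(os.path.join(torch_dir, *[".."] * 4))
    return os.path.join(site_packages, "nvidia", "nccl")

print(nccl_sibling_dir("/venv/lib/python3.10/site-packages/torch/share/cmake/Torch"))
# -> /venv/lib/python3.10/site-packages/nvidia/nccl
```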
JIT include root is environment-driven. There is no hardcoded library subfolder in include path resolution.
`Compiler::init_static_vars(library_root, cuda_home)` resolves the include path as:

1. `LIBRARY_INCLUDE_PATH` (if set)
2. `${library_root}/include` (fallback)
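That lookup order can be mirrored in a small Python sketch; this is a hedged illustration (`resolve_include_path` is a hypothetical helper, not part of the repo), showing only the precedence rule:

```python
import os

def resolve_include_path(library_root: str) -> str:
    # LIBRARY_INCLUDE_PATH wins when set; otherwise fall back to
    # ${library_root}/include, mirroring Compiler::init_static_vars.
    explicit = os.environ.get("LIBRARY_INCLUDE_PATH")
    return explicit if explicit else os.path.join(library_root, "include")

os.environ.pop("LIBRARY_INCLUDE_PATH", None)
print(resolve_include_path("/opt/moe"))  # -> /opt/moe/include (fallback path)
```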
Recommended .env variables:
```bash
export LIBRARY_ROOT_PATH="/abs/path/to/library/root"
export LIBRARY_INCLUDE_PATH="/abs/path/to/include/root"
export CUDA_HOME_PATH="/usr/local/cuda"
export JIT_CACHE_DIR="/tmp/.moe_cuda"
export JIT_USE_NVRTC=1
```

TODO:

- fix the weird bug happening at high K for kernel2
- implement a fused dispatch + swiglu grouped gemm
- inside here, enforce transposed sfa majors
- keep linking this together, and also remember to zero out the comm_comp_barriers after each dispatch (these are actually initialized inside so it's ok)
- also remember to zero out the tk parallel tensor barrier after each call in an epilogue kernel, as well as any other metadata
- check that all a2a kernels work with dp_size > 1
- [ ] fuse gate matmul + grouped topk kernels into one kernel
- look at fusing SP + TP kernels to match megatron forward pass
- ncu swiglu grouped gemm across different shapes and tactics
- work on testing and benchmarking swiglu + full a2a, find optimization spots
- I suspect a dispatch gemm is what's next
- compare performance of fused swiglu with liger kernel