Current status: CI uses the tokenizer from "./test/assets/test_tiktoken.model", which has vocab size = 2256. FP8 GEMM requires matrix dims to be divisible by 16. Without sharding this holds (2256 / 16 = 141), but 141 is not divisible by world size 2/4/8, so under TP the per-rank vocab shard is no longer a multiple of 16.
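As a quick sanity check of the arithmetic (plain Python; the helper and the alignment constant are illustrative, not torchtitan code), the snippet below also covers the vocab size 2560 proposed in Option 1:

```python
# Illustrative check: the per-rank vocab shard must stay divisible by 16 for FP8 GEMM.
FP8_ALIGNMENT = 16

def shard_ok(vocab_size: int, world_size: int) -> bool:
    # vocab must split evenly across ranks AND each shard must be 16-aligned
    return vocab_size % world_size == 0 and (vocab_size // world_size) % FP8_ALIGNMENT == 0

for vocab_size in (2256, 2560):
    print(vocab_size, [ws for ws in (1, 2, 4, 8) if shard_ok(vocab_size, ws)])
# 2256 -> [1]           (2256 / 16 = 141 is odd, so any 2/4/8-way split breaks alignment)
# 2560 -> [1, 2, 4, 8]  (2560 / 16 = 160, divisible by 2/4/8)
```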
Option 1: customize or add a new tokenizer in CI with vocab size = 2560 (2560 / 16 = 160, which is divisible by 2/4/8). That is enough to test 2-way, 4-way, and 8-way TP.
Option 2: enable padding for FP8 GEMM, at the cost of a memory spike and a 20% perf regression.
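For context on the Option 2 cost, here is a minimal sketch of what padding looks like at the tensor level. This is not the actual torchtitan/torchao FP8 padding code, just an illustration of where the extra memory comes from:

```python
# Hedged sketch: zero-pad the last dim to a multiple of 16 before an FP8 GEMM.
import torch
import torch.nn.functional as F

def pad_last_dim_to_multiple(x: torch.Tensor, multiple: int = 16) -> torch.Tensor:
    pad = (-x.size(-1)) % multiple
    # When pad > 0, F.pad allocates a new, larger tensor -- that copy is the memory
    # spike, and doing it around every GEMM is where the perf regression comes from.
    return F.pad(x, (0, pad)) if pad else x

# e.g. a 2-way TP shard of the vocab dim: 2256 / 2 = 1128 -> padded to 1136
x = torch.randn(4, 1128)
print(pad_last_dim_to_multiple(x).shape)  # torch.Size([4, 1136])
```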
Repro:
Run CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh and look for the following log:
Building llama3 debugmodel with ModelArgs(dim=256, n_layers=8, n_heads=16, n_kv_heads=None, vocab_size=2256