RuntimeError: Cannot launch Triton kernel since n = 46336 exceeds the maximum CUDA blocksize = 65535.
File /usr/local/lib/python3.10/dist-packages/unsloth/kernels/utils.py:23, in calculate_settings(n)
21 # CUDA only supports 65535 - 2^16-1 threads per block
22 if BLOCK_SIZE > MAX_FUSED_SIZE:
---> 23 raise RuntimeError(f"Cannot launch Triton kernel since n = {n} exceeds "\
24 f"the maximum CUDA blocksize = {MAX_FUSED_SIZE}.")
25 num_warps = 4
26 if BLOCK_SIZE >= 32768: num_warps = 32
GPU: H100 80GB
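
Note that n = 46336 is itself smaller than 65535, so the check at line 22 is presumably firing on a rounded-up BLOCK_SIZE rather than on n directly. The sketch below reproduces that arithmetic in plain Python. It assumes BLOCK_SIZE is n rounded up to the next power of two (the line that sets it is not shown in the traceback), takes MAX_FUSED_SIZE = 65535 from the error message, and uses a hypothetical `next_power_of_2` helper as a stand-in for whatever the library actually calls. The return value and anything past line 26 are also assumptions.

```python
# Minimal reproduction of the check in the traceback, without Triton or a GPU.
MAX_FUSED_SIZE = 65535  # value reported in the error message


def next_power_of_2(n):
    """Smallest power of two >= n, for n >= 1 (hypothetical stand-in helper)."""
    return 1 << (n - 1).bit_length()


def calculate_settings(n):
    # Assumption: the block size is n rounded up to a power of two.
    BLOCK_SIZE = next_power_of_2(n)
    if BLOCK_SIZE > MAX_FUSED_SIZE:
        raise RuntimeError(
            f"Cannot launch Triton kernel since n = {n} exceeds "
            f"the maximum CUDA blocksize = {MAX_FUSED_SIZE}."
        )
    num_warps = 4
    if BLOCK_SIZE >= 32768:
        num_warps = 32
    # Only the branches visible in the traceback are reproduced here.
    return BLOCK_SIZE, num_warps


if __name__ == "__main__":
    print(next_power_of_2(46336))     # 46336 rounds up to 65536
    print(calculate_settings(32768))  # largest power of two under the limit
    try:
        calculate_settings(46336)
    except RuntimeError as err:
        print(err)
```

Under these assumptions, any n above 32768 rounds up to 65536 and trips the check, because 65536 is one more than the 65535 limit. That would explain why a value well under the stated maximum still raises.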