The current MFU estimator assumes all layers share the same structure. Models like Qwen3.5-397B-A17B have non-uniform layers, so the FLOPs and memory-bandwidth estimates it produces would be inaccurate for such architectures. Refer to the following comment:
> Other models with varying layer widths or mixed configurations may also not fit the current assumptions.
>
> Should we consider cases where the model's layer structures differ? For example, the latest Qwen3.5-397B-A17B: https://huggingface.co/Qwen/Qwen3.5-397B-A17B. Of course, I think this issue can be addressed later.

Originally posted by @sufeng-buaa in #19395 (comment)
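One way to handle this would be to accumulate a per-layer estimate instead of multiplying one layer's cost by the layer count. The sketch below is illustrative only: the `LayerSpec` fields, function names, and FLOPs formula (attention projections plus a gated MLP, 2 FLOPs per multiply-accumulate, ignoring attention-score and embedding FLOPs) are assumptions, not the estimator's actual API.

```python
# Minimal sketch of per-layer FLOPs accounting for models whose layers
# differ (e.g. mixed dense/MoE layers or varying widths).
# All names and the cost formula are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class LayerSpec:
    hidden_size: int              # layer input/output width
    intermediate_size: int        # MLP width (per expert for MoE layers)
    num_active_experts: int = 1   # 1 for a dense layer


def layer_flops_per_token(layer: LayerSpec) -> int:
    """Approximate forward FLOPs per token for one transformer layer.

    Counts the four attention projections (Q, K, V, O) and a gated MLP
    (up, gate, down), at 2 FLOPs per multiply-accumulate. Attention-score
    FLOPs are sequence-length dependent and omitted here for brevity.
    """
    h, m = layer.hidden_size, layer.intermediate_size
    attn = 2 * 4 * h * h                                  # Q, K, V, O projections
    mlp = 2 * 3 * h * m * layer.num_active_experts        # up/gate/down per active expert
    return attn + mlp


def model_flops_per_token(layers: list[LayerSpec]) -> int:
    # Sum per-layer estimates rather than assuming
    # flops(layer_0) * num_layers, which breaks when layers differ.
    return sum(layer_flops_per_token(layer) for layer in layers)


# Example: a model whose first layer is dense and whose remaining
# layers are MoE with narrower per-expert MLPs.
layers = [LayerSpec(4096, 12288)] + [LayerSpec(4096, 1536, num_active_experts=2)] * 3
print(model_flops_per_token(layers))
```

The same per-layer loop would apply to the memory-bandwidth estimate: sum each layer's parameter and activation bytes instead of scaling a single representative layer.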