Issue Description
Loading meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 with transformers==4.53.0.dev0 or 4.51.0 initially failed with TypeError: argument of type 'NoneType' is not iterable, raised from post_init in transformers.modeling_utils. The failure occurred with both Llama4ForConditionalGeneration and AutoModelForCausalLM on a 4x NVIDIA RTX A6000 setup (~196GB VRAM, CUDA 12.4, Python 3.12.3, Ubuntu 24.04.2); a similar TypeError was seen previously with 2x GPUs. Loading succeeded with transformers==4.51.0 and config={"parallel_style": "none"}.
Steps to Reproduce
1. Install dependencies:
pip install torch==2.4.1 torchvision==0.19.1 accelerate==1.7.0 compressed-tensors==0.9.4 transformers==4.51.0
2. Confirm the model files (~389GB, 84 .safetensors) at /mnt/data/ai_super_palace/models/llama4/.
3. Run (failed):
import torch
from transformers import Llama4ForConditionalGeneration

model = Llama4ForConditionalGeneration.from_pretrained(
    '/mnt/data/ai_super_palace/models/llama4',
    torch_dtype=torch.float16,
    device_map="auto",
    low_cpu_mem_usage=True,
)
print('Model OK')
4. Error:
TypeError: argument of type 'NoneType' is not iterable
5. Workaround (succeeded):
import os
import torch

os.environ["TORCHVISION_DISABLE_NMS"] = "1"

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    '/mnt/data/ai_super_palace/models/llama4',
    torch_dtype=torch.float16,
    device_map="auto",
    low_cpu_mem_usage=True,
    offload_folder="/mnt/data/ai_super_palace/models/llama4/offload",
    config={"parallel_style": "none"},
)
print('Model OK')
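For reference, the error message itself comes from a membership test against a value that is still None. A minimal, framework-free illustration of that failure mode (a hypothetical sketch, not the actual transformers post_init code; lookup_plan and the layer name are made up):

```python
def lookup_plan(tp_plan, layer_name):
    # If the plan was never populated (stayed None), "in" raises the
    # exact TypeError reported above.
    return layer_name in tp_plan

try:
    lookup_plan(None, "model.layers.0")
except TypeError as exc:
    print(exc)  # argument of type 'NoneType' is not iterable
```

This is consistent with the workaround helping: supplying config={"parallel_style": "none"} presumably leaves no parallelism plan unset, but I have not traced the actual code path.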
Environment
Transformers: 4.51.0 (or 4.53.0.dev0, commit b369a65)
Python: 3.12.3
PyTorch: 2.4.1
CUDA: 12.4
Accelerate: 1.7.0
Compressed-tensors: 0.9.4
OS: Ubuntu 24.04.2 LTS
Hardware: 4x NVIDIA RTX A6000 (~196GB VRAM)
Model: meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
Additional Details
Model card requires transformers>=4.51.0, supports FP8 via compressed-tensors.
Previous errors: NameError: init_empty_weights, ValueError: requires accelerate, ImportError: requires compressed-tensors.
A config check (cat config.json | grep -E "parallel|tp_plan") returned no matches, i.e. neither key is set in config.json.
Warnings: Uninitialized weights (feed_forward.experts.*) and offloaded parameters (expected due to VRAM).
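The grep check above can also be done in Python, which avoids false positives from values (grep matches anywhere in the line, including inside strings). A small helper; the stand-in config dict below is an assumption that mirrors the empty grep result:

```python
import json  # used when reading the real file, as in the comment below

def find_parallel_keys(cfg: dict) -> list[str]:
    """Return top-level config keys that mention parallelism or a TP plan."""
    return [k for k in cfg if "parallel" in k or "tp_plan" in k]

# Against the real file it would be:
#     with open("/mnt/data/ai_super_palace/models/llama4/config.json") as f:
#         cfg = json.load(f)
# Here, a stand-in config consistent with the empty grep result:
cfg = {"model_type": "llama4", "hidden_size": 5120}
print(find_parallel_keys(cfg))  # []
```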
Request
Is the TypeError a known llama4 bug? Should parallel_style be set explicitly in config.json?
Guidance on fine-tuning this model for sentiment analysis (~100–150GB of data per day, targeting ~85–90% accuracy) would be appreciated.
Sharing the workaround above in case it helps others.
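Until the root cause is fixed, the workaround can be packaged as a small fallback loader so others hit it automatically. A hedged sketch (the helper name and structure are mine; the commented-out usage assumes the same model path and kwargs as the snippets above):

```python
def load_with_fallback(load_fn, **fallback_kwargs):
    """Call load_fn(); on the known NoneType TypeError, retry with overrides."""
    try:
        return load_fn()
    except TypeError as exc:
        if "NoneType" not in str(exc):
            raise  # an unrelated TypeError: let it surface
        return load_fn(**fallback_kwargs)

# Intended use with the calls above (untested sketch):
#     model = load_with_fallback(
#         lambda **kw: AutoModelForCausalLM.from_pretrained(
#             '/mnt/data/ai_super_palace/models/llama4',
#             torch_dtype=torch.float16, device_map="auto",
#             low_cpu_mem_usage=True, **kw),
#         config={"parallel_style": "none"},
#     )
```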
Logs
See traceback above. config.json (40KB) available if needed.
Thank you!