
TypeError in Llama-4-Maverick-17B-128E-Instruct-FP8 Resolved with Workaround #38283

@pchu2025


Issue Description

Loading meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 with transformers==4.53.0.dev0 or 4.51.0 initially failed with TypeError: argument of type 'NoneType' is not iterable in transformers.modeling_utils.post_init. This occurred with Llama4ForConditionalGeneration and AutoModelForCausalLM on a 4x NVIDIA RTX A6000 setup (~196GB VRAM, CUDA 12.4, Python 3.12.3, Ubuntu 24.04.2). A similar TypeError was seen previously with 2x GPUs. The issue was resolved using transformers==4.51.0 and config={"parallel_style": "none"}.
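The failure pattern behind this TypeError can be sketched minimally (this is an assumption about the cause, not traced code: a membership test like `key in plan` inside `post_init`, where a parallelization plan was never populated and is still None):

```python
# Minimal sketch of the suspected failure pattern (assumption: post_init
# runs a membership test against a plan that is still None, e.g. because
# config.json defines no parallel/tp_plan entries).
def lookup_style(plan, key):
    if key in plan:  # raises TypeError when plan is None
        return plan[key]
    return "none"

try:
    lookup_style(None, "parallel_style")
except TypeError as exc:
    print(exc)  # argument of type 'NoneType' is not iterable

# Defensive variant that tolerates a missing plan:
def lookup_style_safe(plan, key):
    return (plan or {}).get(key, "none")

print(lookup_style_safe(None, "parallel_style"))  # none
```

This would also be consistent with the workaround: supplying `{"parallel_style": "none"}` explicitly sidesteps the None lookup.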

Steps to Reproduce

  1. Install dependencies:

    pip install torch==2.4.1 torchvision==0.19.1 accelerate==1.7.0 compressed-tensors==0.9.4 transformers==4.51.0
    
  2. Confirm model files (~389GB, 84 .safetensors) at /mnt/data/ai_super_palace/models/llama4/.

  3. Run (failed):

     import torch
     from transformers import Llama4ForConditionalGeneration

     model = Llama4ForConditionalGeneration.from_pretrained(
         '/mnt/data/ai_super_palace/models/llama4',
         torch_dtype=torch.float16,
         device_map="auto",
         low_cpu_mem_usage=True,
     )
     print('Model OK')

  4. Error:
    TypeError: argument of type 'NoneType' is not iterable

  5. Workaround (succeeded):

     import os
     import torch

     os.environ["TORCHVISION_DISABLE_NMS"] = "1"
     from transformers import AutoModelForCausalLM

     model = AutoModelForCausalLM.from_pretrained(
         '/mnt/data/ai_super_palace/models/llama4',
         torch_dtype=torch.float16,
         device_map="auto",
         low_cpu_mem_usage=True,
         offload_folder="/mnt/data/ai_super_palace/models/llama4/offload",
         config={"parallel_style": "none"}
     )
     print('Model OK')
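The file check in step 2 can be scripted; a small sketch (the directory path is the one from this report, and 84 shards / ~389 GB is the expected result, not something this helper assumes):

```python
# Sanity-check the local checkpoint before loading: count *.safetensors
# shards and their total size in GiB.
from pathlib import Path

def shard_summary(model_dir):
    """Return (shard_count, total_size_gib) for *.safetensors in model_dir."""
    shards = sorted(Path(model_dir).glob("*.safetensors"))
    total = sum(p.stat().st_size for p in shards)
    return len(shards), total / 1024**3

# Example against the path from this report:
# count, gib = shard_summary("/mnt/data/ai_super_palace/models/llama4")
# Expected per the report: count == 84, gib ~= 389
```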

Environment
Transformers: 4.51.0 (or 4.53.0.dev0, commit b369a65)
Python: 3.12.3
PyTorch: 2.4.1
CUDA: 12.4
Accelerate: 1.7.0
Compressed-tensors: 0.9.4
OS: Ubuntu 24.04.2 LTS
Hardware: 4x NVIDIA RTX A6000 (~196GB VRAM)
Model: meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
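Given roughly 48 GB per A6000 against a ~389 GB checkpoint, the offloading warnings are expected. One way to make the split explicit is a `max_memory` map passed to `from_pretrained` alongside `device_map="auto"` (the GiB values below are illustrative assumptions, not figures from this report):

```python
# Illustrative sketch only: explicit per-device memory caps for
# device_map="auto" loading. The GiB values are assumptions, not measured.
max_memory = {i: "44GiB" for i in range(4)}  # leave headroom on each A6000
max_memory["cpu"] = "200GiB"                 # spill the remainder to CPU RAM

# model = AutoModelForCausalLM.from_pretrained(
#     "/mnt/data/ai_super_palace/models/llama4",
#     device_map="auto",
#     max_memory=max_memory,
#     offload_folder="/mnt/data/ai_super_palace/models/llama4/offload",
# )
print(max_memory)
```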
Additional Details
Model card requires transformers>=4.51.0, supports FP8 via compressed-tensors.
Previous errors: NameError: init_empty_weights, ValueError: requires accelerate, ImportError: requires compressed-tensors.
Config check (cat config.json | grep -E "parallel|tp_plan") empty.
Warnings: Uninitialized weights (feed_forward.experts.*) and offloaded parameters (expected due to VRAM).
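The grep check above can be mirrored in Python to confirm which (if any) parallel-related keys the config carries; `parallel_keys` below is a hypothetical helper, not part of transformers:

```python
# Python equivalent of: cat config.json | grep -E "parallel|tp_plan"
import json

def parallel_keys(config_path):
    """Return top-level config.json keys mentioning 'parallel' or 'tp_plan'."""
    with open(config_path) as f:
        cfg = json.load(f)
    return [k for k in cfg if "parallel" in k or "tp_plan" in k]

# An empty result matches this report: no parallel/tp_plan entries present,
# which would leave any internal plan lookup with nothing to populate.
```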
Request
Is the TypeError a known llama4 bug? Should parallel_style be explicitly set in config.json?
Guidance on fine-tuning for sentiment analysis (~100–150GB/day, ~85–90% accuracy).
Sharing workaround for community benefit.
Logs
See traceback above. config.json (40KB) available if needed.

Thank you!
