Skip to content

AttributeError: 'Parameter' object has no attribute 'ds_numel' #24792

@vecorro

Description

@vecorro

System Info

Python 3.10
CUDA 11.8
torch 2.0.1
transformers 4.30.2
bitsandbytes 0.39.1
datasets 2.13.0
einops 0.6.1
trl 0.4.4
accelerate 0.20.3
deepspeed 0.9.5

Who can help?

@pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Hi,

I'm trying to reproduce the Falcon LLM fine-tuning by using a modified version of the HF Colab script.

The Jupyter notebook runs well when DeepSpeed is not in the mix, but when I introduce DeepSpeed ZeRO-3 in TrainingArguments (which gets fed into SFTTrainer), the trainer.train() call fails with the error AttributeError: 'Parameter' object has no attribute 'ds_numel'.

Here is the DeepSpeed config dict I'm using:

# DeepSpeed ZeRO-3 configuration (https://www.deepspeed.ai/docs/config-json/).
# Every "auto" entry is filled in by the HF Trainer from TrainingArguments at
# launch time.  NOTE: boolean options must be real booleans -- the strings
# "true"/"false" are both truthy when DeepSpeed reads them from a Python dict.
ds_config = {
    "fp16": {
        "enabled": "auto",          # mirrors TrainingArguments.fp16
        "loss_scale": 0,            # 0 => dynamic loss scaling
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,                 # full parameter + gradient + optimizer sharding
        "offload_optimizer": {
            "device": "none",       # keep optimizer state on GPU
            "pin_memory": True
        },
        "offload_param": {
            "device": "none",       # keep parameters on GPU
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": True
    },

    "gradient_accumulation_steps": GRADIENT_ACCUMULATION_STEPS,
    "gradient_clipping": "auto",
    "steps_per_print": 10,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": False
}  # closing brace was missing in the original paste (SyntaxError)
**Stack trace:**

File ~/miniconda3/envs/falcon/lib/python3.10/site-packages/transformers/trainer.py:1793, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1791 logger.info(f"  Gradient Accumulation steps = {args.gradient_accumulation_steps}")
   1792 logger.info(f"  Total optimization steps = {max_steps:,}")
-> 1793 logger.info(f"  Number of trainable parameters = {get_model_param_count(model, trainable_only=True):,}")
   1795 self.state.epoch = 0
   1796 start_time = time.time()

File ~/miniconda3/envs/falcon/lib/python3.10/site-packages/transformers/trainer_pt_utils.py:1053, in get_model_param_count(model, trainable_only)
   1050     def numel(p):
   1051         return p.numel()
-> 1053 return sum(numel(p) for p in model.parameters() if not trainable_only or p.requires_grad)

File ~/miniconda3/envs/falcon/lib/python3.10/site-packages/transformers/trainer_pt_utils.py:1053, in <genexpr>(.0)
   1050     def numel(p):
   1051         return p.numel()
-> 1053 return sum(numel(p) for p in model.parameters() if not trainable_only or p.requires_grad)

File ~/miniconda3/envs/falcon/lib/python3.10/site-packages/transformers/trainer_pt_utils.py:1046, in get_model_param_count.<locals>.numel(p)
   1045 def numel(p):
-> 1046     return p.ds_numel

AttributeError: 'Parameter' object has no attribute 'ds_numel'

Here is the core section of the code:

# ---------------------------------------------------------------------------
# Dataset
# ---------------------------------------------------------------------------
DATASET_PATH = "timdettmers/openassistant-guanaco"

# ---------------------------------------------------------------------------
# AutoModelForCausalLM parameters
# ---------------------------------------------------------------------------
DEVICE_MAP = "auto"              # Let Accelerate spread the model over every GPU in the node.
LOAD_IN_8BIT = True              # 8-bit weights need roughly 1.2-1.4 GB per 1B parameters.
MODEL_NAME = "tiiuae/falcon-7b"  # Swap in "tiiuae/falcon-40b" for the larger variant.
TRUST_REMOTE_CODE = True         # Needed while the architecture lives outside Transformers.

# ---------------------------------------------------------------------------
# LoRA (https://huggingface.co/docs/peft/conceptual_guides/lora): fine-tune
# small low-rank update matrices instead of the full weight tensors.
# ---------------------------------------------------------------------------
LORA_ALPHA = 16     # Scaling factor applied to the LoRA updates.
LORA_DROPOUT = 0.1  # Dropout probability inside the LoRA layers.
LORA_R = 32         # Rank of the update matrices; lower rank => fewer trainable parameters.
# Modules (besides the LoRA layers themselves) to adapt and keep in the final checkpoint.
LORA_TARGET_MODULES = [
    "query_key_value",
    "dense",
    "dense_h_to_4h",
    "dense_4h_to_h",
]

# ---------------------------------------------------------------------------
# Trainer configuration
# ---------------------------------------------------------------------------
BF16 = True        # Use bf16 mixed precision; requires Ampere or newer NVIDIA hardware.
FP16 = not BF16    # Fall back to fp16 mixed precision only when bf16 is off.

EVAL_STRATEGY = 'steps'  # Evaluate (and log) every EVAL_STEPS update steps.
EVAL_STEPS = 8

LOGGING_STRATEGY = 'steps'  # Log every LOGGING_STEPS update steps.
LOGGING_STEPS = 4

SAVE_STRATEGY = 'steps'  # Checkpoint every SAVE_STEPS update steps.
SAVE_STEPS = 8
SAVE_TOTAL_LIMIT = 2     # Keep only the last and the best checkpoints.
LOAD_BEST = True         # Reload the lowest-loss checkpoint when training ends.

LR = 2e-4                        # Initial learning rate.
LR_SCHEDULER_TYPE = 'constant'   # Alternatives: 'cosine', 'linear'.
OPTIMIZER = "paged_adamw_32bit"  # Optimizer function.
WEIGHT_DECAY = 0.001             # AdamW regularization strength.
WARMUP_RATIO = 0.03              # Fraction of steps spent ramping LR from 0 to LR.
MAX_GRAD_NORM = 0.3              # Gradient-clipping threshold.

PER_DEV_TRAIN_BATCH_SIZE = 4     # Lower this on out-of-memory errors.
GRADIENT_ACCUMULATION_STEPS = 4  # Batches accumulated before each optimizer step.
MAX_STEPS = 184                  # Start small (e.g. 64), then scale up to several epochs.
GROUP_BY_LENGTH = True           # Batch similar-length samples to minimize padding.

OUTPUT_DIR = "./results"    # Checkpoint destination.
REPORT_ENDPOINT = "wandb"   # Comment out to skip wandb; requires a prior 'wandb login'.
USE_CACHE = False           # KV cache is incompatible with gradient checkpointing.

# SFTTrainer config (see https://huggingface.co/docs/trl/main/en/sft_trainer)
MAX_SEQ_LENGTH = 512  # Maximum token-sequence length per example.

# Load the base model with 8-bit quantized weights sharded across GPUs.
# NOTE(review): load_in_8bit + device_map="auto" hands placement to Accelerate,
# which conflicts with DeepSpeed ZeRO-3's own parameter partitioning -- this
# combination is the likely trigger of the 'ds_numel' AttributeError; confirm
# against the HF DeepSpeed-integration docs.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    load_in_8bit = LOAD_IN_8BIT,
    trust_remote_code = TRUST_REMOTE_CODE,
    device_map = DEVICE_MAP,
)
model.config.use_cache = USE_CACHE  # KV cache off: incompatible with gradient checkpointing

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME,
                                          trust_remote_code = TRUST_REMOTE_CODE)
# Falcon's tokenizer ships without a pad token; reuse EOS so batching can pad.
tokenizer.pad_token = tokenizer.eos_token

# Low-rank adapter (LoRA) configuration for causal-LM fine-tuning.
peft_config = LoraConfig(
    task_type = "CAUSAL_LM",
    r = LORA_R,                            # rank of the update matrices
    lora_alpha = LORA_ALPHA,               # scaling factor for the updates
    lora_dropout = LORA_DROPOUT,           # dropout inside the adapter layers
    target_modules = LORA_TARGET_MODULES,  # which Falcon modules get adapters
    bias = "none",                         # leave bias terms frozen
)

# Hyper-parameters handed to the HF Trainer; the "auto" entries inside
# ds_config are resolved from these values when training launches.
training_arguments = TrainingArguments(
    # --- output & checkpointing ---
    output_dir = OUTPUT_DIR,
    save_strategy = SAVE_STRATEGY,
    save_steps = SAVE_STEPS,
    save_total_limit  = SAVE_TOTAL_LIMIT,
    load_best_model_at_end = LOAD_BEST,
    greater_is_better = False,  # "best" means lowest eval loss
    # --- optimization ---
    learning_rate = LR,
    per_device_train_batch_size = PER_DEV_TRAIN_BATCH_SIZE,
    gradient_accumulation_steps = GRADIENT_ACCUMULATION_STEPS,
    max_grad_norm = MAX_GRAD_NORM,
    max_steps = MAX_STEPS,
    warmup_ratio = WARMUP_RATIO,
    group_by_length = GROUP_BY_LENGTH,
    #optim = OPTIMIZER,
    #lr_scheduler_type = LR_SCHEDULER_TYPE,
    # --- precision ---
    fp16 = FP16,
    bf16 = BF16,
    # --- evaluation & logging ---
    evaluation_strategy = EVAL_STRATEGY,
    eval_steps = EVAL_STEPS,
    logging_strategy = LOGGING_STRATEGY,
    logging_steps = LOGGING_STEPS,
    report_to = REPORT_ENDPOINT,
    disable_tqdm=True,
    #log_level= "error",
    # --- DeepSpeed ---
    deepspeed=ds_config,
)

# Build the supervised fine-tuning trainer; SFTTrainer wraps the model with
# the LoRA adapters from peft_config before training.
trainer = SFTTrainer(
    model = model,
    train_dataset = train_dataset,
    eval_dataset = eval_dataset,
    peft_config = peft_config,
    dataset_text_field = "text",   # column of the dataset holding raw text
    max_seq_length = MAX_SEQ_LENGTH,
    tokenizer = tokenizer,
    args = training_arguments,

)

# Upcast normalization layers to fp32 for numerical stability under mixed
# precision.  Module.to() casts parameters in place, so rebinding the loop
# variable still affects the model.
for name, module in trainer.model.named_modules():

    if "norm" in name:
        module = module.to(torch.float32)
  

# Fine-tune the model
trainer.train()

Thanks!

Expected behavior

I expected the training process to run with DeepSpeed in the mix, just as it did when DeepSpeed wasn't used.

Thanks in advance for your help!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions