Description
System Info
Python 3.10
CUDA 11.8
torch 2.0.1
transformers 4.30.2
bitsandbytes 0.39.1
datasets 2.13.0
einops 0.6.1
trl 0.4.4
accelerate 0.20.3
deepspeed 0.9.5
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Hi,
I'm trying to reproduce the Falcon LLM fine-tuning by using a modified version of the HF Collab script.
The Jupyter notebook runs well when DeepSpeed is not in the mix, but when I add a DeepSpeed ZeRO-3 config to TrainingArguments (which gets fed into SFTTrainer), the trainer.train() call fails with AttributeError: 'Parameter' object has no attribute 'ds_numel'.
Here is the DeepSpeed config dict I'm using:
ds_config = {
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": True
        },
        "offload_param": {
            "device": "none",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": True
    },
    "gradient_accumulation_steps": GRADIENT_ACCUMULATION_STEPS,
    "gradient_clipping": "auto",
    "steps_per_print": 10,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": False
}
Stack trace:
File ~/miniconda3/envs/falcon/lib/python3.10/site-packages/transformers/trainer.py:1793, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
1791 logger.info(f" Gradient Accumulation steps = {args.gradient_accumulation_steps}")
1792 logger.info(f" Total optimization steps = {max_steps:,}")
-> 1793 logger.info(f" Number of trainable parameters = {get_model_param_count(model, trainable_only=True):,}")
1795 self.state.epoch = 0
1796 start_time = time.time()
File ~/miniconda3/envs/falcon/lib/python3.10/site-packages/transformers/trainer_pt_utils.py:1053, in get_model_param_count(model, trainable_only)
1050 def numel(p):
1051 return p.numel()
-> 1053 return sum(numel(p) for p in model.parameters() if not trainable_only or p.requires_grad)
File ~/miniconda3/envs/falcon/lib/python3.10/site-packages/transformers/trainer_pt_utils.py:1053, in <genexpr>(.0)
1050 def numel(p):
1051 return p.numel()
-> 1053 return sum(numel(p) for p in model.parameters() if not trainable_only or p.requires_grad)
File ~/miniconda3/envs/falcon/lib/python3.10/site-packages/transformers/trainer_pt_utils.py:1046, in get_model_param_count.<locals>.numel(p)
1045 def numel(p):
-> 1046 return p.ds_numel
AttributeError: 'Parameter' object has no attribute 'ds_numel'
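To make the failure mode concrete: in transformers 4.30, get_model_param_count switches to reading p.ds_numel whenever ZeRO-3 is detected as enabled, and that attribute only exists on parameters DeepSpeed itself has partitioned via deepspeed.zero.Init. A minimal dependency-free sketch of that branch (FakeParam is a hypothetical stand-in for torch.nn.Parameter):

```python
class FakeParam:
    """Hypothetical stand-in for torch.nn.Parameter: never partitioned
    by DeepSpeed, so it carries no ds_numel attribute."""
    def __init__(self, n):
        self._n = n

    def numel(self):
        return self._n

def param_count(params, zero3_enabled):
    # Mirrors the branch in transformers' get_model_param_count: under
    # ZeRO-3 it trusts ds_numel, which DeepSpeed attaches at init time.
    if zero3_enabled:
        return sum(p.ds_numel for p in params)
    return sum(p.numel() for p in params)

params = [FakeParam(16), FakeParam(8)]
print(param_count(params, zero3_enabled=False))  # 24

try:
    param_count(params, zero3_enabled=True)
except AttributeError as exc:
    print(exc)  # same shape of error as in the stack trace above
```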
Here is the core section of the code:
# Imports used by this snippet
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer

# Dataset loader
DATASET_PATH = "timdettmers/openassistant-guanaco"
# Params for AutoModelForCausalLM
DEVICE_MAP = "auto" # Instructs Accelerate to use all GPUs available in the node.
LOAD_IN_8BIT = True # 8-bit precision requires ~ 1.2-1.4GB memory per 1B parameters
MODEL_NAME = "tiiuae/falcon-7b" # Could use "tiiuae/falcon-40b" or "tiiuae/falcon-7b"
TRUST_REMOTE_CODE = True # Required when a model is not yet part of the Transformers library
# LoRA configuration (see https://huggingface.co/docs/peft/conceptual_guides/lora)
# LoRA allows efficient fine-tuning of LLMs by training low rank (small) matrices
LORA_ALPHA = 16 # LoRA scaling factor.
LORA_DROPOUT = 0.1 # Dropout probability applied inside the LoRA layers
LORA_R = 32 # Rank of update matrices. Lower rank results in smaller update matrices with fewer trainable parameters.
# Modules to which the LoRA update matrices are applied.
LORA_TARGET_MODULES = ["query_key_value",
"dense",
"dense_h_to_4h",
"dense_4h_to_h"]
# Trainer configuration
BF16 = True # Whether to use bf16 precision. Requires Ampere or higher NVIDIA architecture.
EVAL_STEPS = 8 # Number of update steps between two evaluations if evaluation_strategy="steps"
EVAL_STRATEGY = 'steps' # Evaluation is done (and logged) every eval_steps.
FP16 = not BF16 # Whether to use fp16 16-bit (mixed) precision training instead of 32-bit training.
GRADIENT_ACCUMULATION_STEPS = 4 # Accumulates gradients from 'n' batches before stepping the optimizer
GROUP_BY_LENGTH = True # group samples of similar length to minimize padding and be more efficient.
LOAD_BEST = True # Load the checkpoint with the lowest loss at the end.
LOGGING_STEPS = 4 # Number of update steps between two logs if logging_strategy="steps".
LOGGING_STRATEGY = 'steps' # Logging is done every logging_steps
LR = 2e-4 # The initial learning rate.
LR_SCHEDULER_TYPE = 'constant' # Other options are 'cosine' or 'linear'
MAX_GRAD_NORM = 0.3 # Maximum gradient norm (for gradient clipping).
MAX_STEPS = 184 # Start with a small test (64) then increase the number to multiple epochs
OPTIMIZER = "paged_adamw_32bit" # Optimizer function
OUTPUT_DIR = "./results" # Where checkpoints will be saved
PER_DEV_TRAIN_BATCH_SIZE = 4 # Use a low number if getting out of memory errors
REPORT_ENDPOINT = "wandb" # Comment out if you don't want to use wandb. Ensure you have run 'wandb login' previously.
SAVE_STEPS = 8 # Number of update steps between two checkpoint saves if save_strategy="steps"
SAVE_STRATEGY = 'steps' # Save is done every save_steps.
SAVE_TOTAL_LIMIT = 2 # Only save the last and the best checkpoints
USE_CACHE = False # Can't use cache with gradient checkpointing
WARMUP_RATIO = 0.03 # Ratio of total training steps used for a linear warmup from 0 to learning_rate.
WEIGHT_DECAY = 0.001 # AdamW regularization parameter
# SFTTrainer config (see https://huggingface.co/docs/trl/main/en/sft_trainer)
MAX_SEQ_LENGTH = 512 # Max length of the token sequence in an example
model = AutoModelForCausalLM.from_pretrained(
MODEL_NAME,
load_in_8bit = LOAD_IN_8BIT,
trust_remote_code = TRUST_REMOTE_CODE,
device_map = DEVICE_MAP,
)
model.config.use_cache = USE_CACHE
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME,
trust_remote_code = TRUST_REMOTE_CODE)
tokenizer.pad_token = tokenizer.eos_token
# Setup LoRA
peft_config = LoraConfig(
lora_alpha = LORA_ALPHA,
lora_dropout = LORA_DROPOUT,
r = LORA_R,
bias = "none",
task_type = "CAUSAL_LM",
target_modules = LORA_TARGET_MODULES
)
# Setup training arguments
training_arguments = TrainingArguments(
output_dir = OUTPUT_DIR,
per_device_train_batch_size = PER_DEV_TRAIN_BATCH_SIZE,
gradient_accumulation_steps = GRADIENT_ACCUMULATION_STEPS,
#optim = OPTIMIZER,
save_steps = SAVE_STEPS,
save_strategy = SAVE_STRATEGY,
logging_steps = LOGGING_STEPS,
logging_strategy = LOGGING_STRATEGY,
learning_rate = LR,
#lr_scheduler_type = LR_SCHEDULER_TYPE,
fp16 = FP16,
bf16 = BF16,
max_grad_norm = MAX_GRAD_NORM,
max_steps = MAX_STEPS,
warmup_ratio = WARMUP_RATIO,
group_by_length = GROUP_BY_LENGTH,
report_to = REPORT_ENDPOINT,
evaluation_strategy = EVAL_STRATEGY,
eval_steps = EVAL_STEPS,
load_best_model_at_end = LOAD_BEST,
greater_is_better = False,
save_total_limit = SAVE_TOTAL_LIMIT,
deepspeed=ds_config,
disable_tqdm=True,
#log_level= "error",
)
trainer = SFTTrainer(
model = model,
train_dataset = train_dataset,
eval_dataset = eval_dataset,
peft_config = peft_config,
dataset_text_field = "text",
max_seq_length = MAX_SEQ_LENGTH,
tokenizer = tokenizer,
args = training_arguments,
)
# Cast the normalization layers to float32 for numerical stability
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module = module.to(torch.float32)
# Fine-tune the model
trainer.train()
Thanks!
Expected behavior
I expected the training process to run with DeepSpeed in the mix, just as it does when DeepSpeed isn't used.
Thanks in advance for your help!