New Colab: Fine-tune LLMs with Axolotl (End-to-end guide to the state-of-the-art tool for fine-tuning)

Open g-i-o-r-g-i-o opened this issue 2 years ago • 5 comments

Hi, I've uploaded a Colab notebook that follows your article.

The merge step is missing because I didn't know whether you were interested in it.

Downloads.zip

g-i-o-r-g-i-o avatar Jan 26 '24 12:01 g-i-o-r-g-i-o

Thanks, I completed it and added it to the LLM course, see https://colab.research.google.com/drive/1Xu0BrCB7IShwSWKVcfAfhehwjDrDMH5m?usp=sharing

mlabonne avatar Jan 27 '24 22:01 mlabonne

If this line of code:

!pip install -qqq -e '.[flash-attn,deepspeed]' --progress-bar off

gives you an error, you should downgrade Torch to version 2.1.1:

!pip install torch==2.1.1
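
Before reinstalling, it can help to confirm which Torch version the runtime actually has. The following is a minimal sketch, not part of the notebook: `parse_release` and `installed_matches` are hypothetical helper names, and the comparison only looks at the numeric release part of the version string (so a local suffix such as `+cu121` is ignored).

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(version_string):
    """Turn '2.2.1+cu121' into (2, 2, 1), ignoring local and dev suffixes."""
    public = version_string.split("+", 1)[0]  # drop a local tag such as '+cu121'
    parts = []
    for piece in public.split("."):
        if not piece.isdigit():  # stop at 'dev0', 'rc1', and similar
            break
        parts.append(int(piece))
    return tuple(parts)


def installed_matches(package, pinned):
    """Return True if the installed package has the same release as `pinned`."""
    try:
        found = version(package)
    except PackageNotFoundError:
        return False  # package is not installed at all
    return parse_release(found) == parse_release(pinned)


if __name__ == "__main__":
    # Hypothetical usage: only downgrade when the pin is not already satisfied.
    if installed_matches("torch", "2.1.1"):
        print("torch 2.1.1 is already installed")
    else:
        print("torch differs from 2.1.1; consider: pip install torch==2.1.1")
```

In a Colab cell, the result of `installed_matches("torch", "2.1.1")` could decide whether the downgrade command above needs to run at all.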

kukedlc87 avatar Feb 15 '24 14:02 kukedlc87

Thanks @kukedlc87, I added it.

mlabonne avatar Feb 18 '24 22:02 mlabonne

@mlabonne the Fine_tune_LLMs_with_Axolotl.ipynb notebook does not work. These are the dependency versions:

****************************************
**** Axolotl Dependency Versions *****
  accelerate: 0.28.0         
        peft: 0.10.0         
transformers: 4.40.0.dev0    
         trl: 0.8.5          
       torch: 2.2.1+cu121    
bitsandbytes: 0.43.0         
****************************************

Training fails on a Colab T4 with RuntimeError: "_amp_foreach_non_finite_check_and_unscale_cuda" not implemented for 'BFloat16'. This is the full stack trace:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/content/axolotl/src/axolotl/cli/train.py", line 59, in <module>
    fire.Fire(do_cli)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/content/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
    return do_train(parsed_cfg, parsed_cli_args)
  File "/content/axolotl/src/axolotl/cli/train.py", line 55, in do_train
    return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
  File "/content/axolotl/src/axolotl/train.py", line 170, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1837, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2227, in _inner_training_loop
    _grad_norm = self.accelerator.clip_grad_norm_(
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2145, in clip_grad_norm_
    self.unscale_gradients()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2095, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 336, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 277, in _unscale_grads_
    torch._amp_foreach_non_finite_check_and_unscale_(
RuntimeError: "_amp_foreach_non_finite_check_and_unscale_cuda" not implemented for 'BFloat16'
  0% 0/20 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1057, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 673, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-m', 'axolotl.cli.train', 'config.yaml']' returned non-zero exit status 1.
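
The traceback shows the gradient scaler, which is used for fp16 mixed precision, being handed bfloat16 gradients. The thread does not confirm the root cause. One plausible factor is that the T4 has compute capability 7.5, while native bfloat16 support arrived with compute capability 8.0 (Ampere). The following is a minimal sketch, under that assumption, of choosing precision flags from the compute capability. `pick_precision` and `detect_compute_capability` are hypothetical helpers, and the `bf16`, `fp16`, and `tf32` key names are assumed to match the Axolotl config in use.

```python
def pick_precision(compute_capability):
    """Return precision flags for a (major, minor) CUDA compute capability.

    Assumption: GPUs with major version >= 8 (Ampere and newer) support
    bfloat16 and TF32 natively; older GPUs such as the T4 (7, 5) fall back
    to fp16.
    """
    major, _minor = compute_capability
    supports_bf16 = major >= 8
    return {
        "bf16": supports_bf16,
        "fp16": not supports_bf16,
        "tf32": supports_bf16,
    }


def detect_compute_capability():
    """Return the (major, minor) capability of GPU 0, or None if unavailable."""
    try:
        import torch  # optional: only needed for live detection
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    capability = detect_compute_capability()
    if capability is None:
        print("No CUDA GPU detected; leaving precision settings unchanged")
    else:
        print(capability, pick_precision(capability))
```

The returned dictionary could be merged into the loaded config before training starts, so the same notebook runs on both T4 and Ampere-class runtimes.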

You also need to remove mlflow reporting from the config; otherwise it complains, because mlflow is not installed.
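
Removing the mlflow settings can also be done programmatically once the config has been loaded into a dictionary. This is only a sketch: `strip_mlflow` is a hypothetical helper, the key names in the example are assumptions about what the config contains, and the rule "drop every key whose name mentions mlflow" is a simplification.

```python
def strip_mlflow(config):
    """Return a copy of `config` without any key whose name mentions mlflow."""
    return {
        key: value
        for key, value in config.items()
        if "mlflow" not in key.lower()
    }


if __name__ == "__main__":
    # Hypothetical config contents; the real config.yaml may differ.
    example = {
        "base_model": "some/model",
        "mlflow_tracking_uri": "http://localhost:5000",
        "mlflow_experiment_name": "axolotl-run",
    }
    print(strip_mlflow(example))
```

With PyYAML available, the dictionary would typically come from `yaml.safe_load` on config.yaml and be written back with `yaml.safe_dump` after cleaning.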

dzlab avatar May 05 '24 09:05 dzlab

Thanks, I upgraded the PyTorch version and removed mlflow.

mlabonne avatar May 05 '24 21:05 mlabonne