
[Llama2] Add disabling TP behavior#728

Merged
younesbelkada merged 3 commits intohuggingface:mainfrom
younesbelkada:fix-llama-tp
Jul 19, 2023

Conversation

younesbelkada (Contributor) commented Jul 19, 2023

Fixes #726

This PR is on par with huggingface/transformers#24906

The TP paradigm currently supported by transformers is technically not real Tensor Parallelism, but rather a simulation of TP that manually splits the layers into chunks and concatenates the results.

The motivation for that implementation is to mimic the TP paradigm used during pre-training of these models, since slicing the weight tensors and inputs leads to slight numerical differences: pytorch/pytorch#76232
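To make the "simulated TP" idea concrete, here is a minimal pure-Python sketch (not the actual transformers code): a linear layer's weight matrix is split into row chunks, each chunk is applied separately, and the partial outputs are concatenated. Splitting along the output rows like this is mathematically exact; it is other split directions (summing partial products over the input dimension) that can introduce the tiny floating-point differences mentioned above.

```python
# Illustrative sketch only -- the function names here are hypothetical,
# not the transformers implementation.

def matmul(w, x):
    """Plain matrix-vector product: w is a list of rows, x is a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def simulated_tp_linear(w, x, n_chunks):
    """Split w's rows into n_chunks, apply each chunk, concatenate results."""
    chunk = len(w) // n_chunks
    out = []
    for i in range(n_chunks):
        out.extend(matmul(w[i * chunk:(i + 1) * chunk], x))
    return out

w = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
x = [0.5, -1.0]

full = matmul(w, x)                   # TP disabled: one full matmul
split = simulated_tp_linear(w, x, 2)  # "simulated TP": two chunks
print(full == split)                  # True: row-splitting is exact
```

Disabling the simulation simply collapses the chunked path back to the single matmul.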

I would argue that this is not that important for training, since the model will be fine-tuned and its weights adapted accordingly.

cc @pacman100 @BenjaminBossan

I propose to properly support TP in the future once it is properly implemented upstream; the TP currently in place is more of a patch to match the logits of the original implementation.

HuggingFaceDocBuilderDev commented Jul 19, 2023

The documentation is not available anymore as the PR was closed or merged.

pacman100 (Contributor) left a comment

Thank you @younesbelkada for the quick fix, LGTM!

BenjaminBossan (Member) left a comment

It sounds reasonable to me that for fine-tuning, TP is disabled, especially if it is just simulated. I wonder if this should be documented somewhere, since, as you mentioned, it can lead to small numerical differences. Perhaps a comment above these new lines of code?

younesbelkada (Contributor, Author):

> Perhaps a comment above these new lines of code?

Sure, yes, will add this!

@younesbelkada younesbelkada merged commit a09f66c into huggingface:main Jul 19, 2023
@younesbelkada younesbelkada deleted the fix-llama-tp branch July 19, 2023 12:29
Guy-Bilitski pushed a commit to Guy-Bilitski/peft that referenced this pull request May 13, 2025
* add disabling TP behavior

* add comments

* adapt from new changes of transformers PR
cyyever pushed a commit to cyyever/peft that referenced this pull request Sep 4, 2025
* update to `prepare_model_for_kbit_training`

switch from the deprecated `prepare_model_for_int8_training`
and add `use_gradient_checkpointing=args.gradient_checkpointing` to
automatically follow the gradient checkpointing choice

this is also a workaround for huggingface#694

* workaround for gradient checkpointing issue

calling model.gradient_checkpointing_enable() twice causes issues;
this workaround calls it in prepare_model_for_kbit_training and then
sets the arg to False to make sure it isn't called again in the
Hugging Face Trainer inner loop

also changes the stack_llama_2 SFT trainer to use the correct device
map for DDP training so that this issue can be tested
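The call-once workaround described in that commit message can be sketched with a dummy model. All names here (DummyModel, prepare_model, trainer_inner_loop) are hypothetical stand-ins for peft's `prepare_model_for_kbit_training` and the Hugging Face Trainer; only the pattern itself is taken from the commit.

```python
# Hedged sketch: a dummy model counts how often checkpointing is enabled,
# to show how flipping the flag avoids the problematic second call.

class DummyModel:
    def __init__(self):
        self.enable_calls = 0

    def gradient_checkpointing_enable(self):
        # On a real model, calling this twice caused the reported issue.
        self.enable_calls += 1

class Args:
    gradient_checkpointing = True

def prepare_model(model, use_gradient_checkpointing):
    # Stand-in for prepare_model_for_kbit_training: enables once here.
    if use_gradient_checkpointing:
        model.gradient_checkpointing_enable()
    return model

def trainer_inner_loop(model, args):
    # Stand-in for the Trainer: would enable again if the arg were True.
    if args.gradient_checkpointing:
        model.gradient_checkpointing_enable()

model, args = DummyModel(), Args()
model = prepare_model(model, use_gradient_checkpointing=args.gradient_checkpointing)
args.gradient_checkpointing = False  # the workaround: skip the second call
trainer_inner_loop(model, args)
print(model.enable_calls)  # 1
```

Without the flag flip, `enable_calls` would reach 2, reproducing the double-call problem the commit works around.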


Development

Successfully merging this pull request may close these issues.

Support Tensor Parallelism, which is used in LLaMA-2

4 participants