
[OPT/Galactica] Load large galactica models#20390

Merged
younesbelkada merged 2 commits intohuggingface:mainfrom
younesbelkada:fix-opt-bias
Nov 30, 2022

Conversation

@younesbelkada
Copy link
Contributor

@younesbelkada younesbelkada commented Nov 22, 2022

What does this PR do?

This PR fixes a small bug in OPT. Before, the bias term was always set to True, forcing external implementations to hardcode around it if they wanted to train an OPT model without bias terms (see for example here). This PR gives more control over whether bias terms are used in OPT's Linear layers.
The PR also fixes the same issue for nn.LayerNorm. Some derivatives of OPT do not use learnable parameters for the layer norm's weights and biases (i.e., they set elementwise_affine to False), so making this configurable avoids hardcoded hacks in the future.
This PR should not be a breaking change, as the default values of these booleans are set to True (matching the previous behavior).
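A minimal sketch of the knobs this PR describes (illustrative only, not the actual Transformers modeling code; the helper name and flag names here are assumptions based on the PR description):

```python
import torch.nn as nn

def make_opt_like_layer(hidden, ffn, enable_bias=True, elementwise_affine=True):
    """Illustrative OPT-style sub-layer: bias on the Linear layers and the
    LayerNorm affine parameters become configurable, defaulting to the old
    hardcoded behavior (both True)."""
    return nn.ModuleDict({
        "fc1": nn.Linear(hidden, ffn, bias=enable_bias),
        "fc2": nn.Linear(ffn, hidden, bias=enable_bias),
        "final_layer_norm": nn.LayerNorm(hidden, elementwise_affine=elementwise_affine),
    })

# Galactica-style configuration: no biases, no learnable LayerNorm parameters.
galactica_style = make_opt_like_layer(8, 32, enable_bias=False, elementwise_affine=False)
print(galactica_style["fc1"].bias)                 # None
print(galactica_style["final_layer_norm"].weight)  # None
```

With the defaults left at True, the layer is parameter-for-parameter identical to the previous hardcoded version, which is why existing OPT checkpoints keep loading unchanged.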

This PR should also fix: https://huggingface.co/facebook/galactica-30b/discussions/4 (of course, after updating the relevant config files)

cc @sgugger @ydshieh @mrm8488

All slow tests pass

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Nov 22, 2022

The documentation is not available anymore as the PR was closed or merged.

@ydshieh
Copy link
Collaborator

ydshieh commented Nov 22, 2022

I am not 100% sure this is the approach we want, though I understand the intention. I would like to hear from @sgugger.

For reference, class OPTDecoderLayer from galai does pass bias to OPTAttention

https://github.com/paperswithcode/galai/blob/c1e16979c1748e7e823fe96da941d6df60f1006b/galai/architecture.py#L280

@younesbelkada
Copy link
Contributor Author

younesbelkada commented Nov 22, 2022

Yes, I think it was a mistake on our side. We should either port a new model (with controllable bias and layer norm) and remove the bias boolean from OPTAttention, since it is always set to True, or go with this fix.

Copy link
Collaborator

@sgugger sgugger left a comment


Those two changes go against the general philosophy of Transformers (not being a modular toolbox). The test such a change usually has to pass to be accepted is: "does it work with the existing canonical checkpoint for the model", which is not the case here.

However, I could consider allowing an exception here, mainly because the change does not pollute any of the forward methods with code that doesn't benefit the "canonical" OPT. Since it's tangential though, I'd like to hear from @patrickvonplaten and @LysandreJik to make sure we are aligned.

@younesbelkada
Copy link
Contributor Author

younesbelkada commented Nov 22, 2022

Thanks!
Sorry for the last-minute clarification; I just realized the description and title are not clear. The main goal of this PR is to support loading and using the large Galactica models that use the OPT architecture, initially reported in https://huggingface.co/facebook/galactica-30b/discussions/4, so the title and description were slightly misleading.
The snippet to reproduce:

import torch
from transformers import AutoTokenizer, OPTForCausalLM, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-30b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-30b", device_map="auto", torch_dtype=torch.float16)

input_text = "The Transformer architecture [START_REF]"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

In case we don't merge this PR, we may want to add Galactica as a separate new architecture, as some Galactica models (such as 30b) do not use bias on linear layers and have no learnable weights in their LayerNorm.

@younesbelkada younesbelkada changed the title [OPT] We should allow not using bias terms [OPT/Galactica] Load large galactica models Nov 22, 2022
@sgugger
Copy link
Collaborator

sgugger commented Nov 22, 2022

I understood, and yes, that will be the alternative if this PR is declined :-)

@AnthonyHartshorn
Copy link

Hi @sgugger and @younesbelkada, it's one of the Galactica authors here. We think there might be something wrong with the 30bn model specifically on HuggingFace. We're currently migrating our galai library to use the huggingface model without our custom OPT config. There seems to have been a conversion process applied to our models that gave null weights to the biases (or something else similar to what was done for the OPT models), but specifically not to the 30bn file. Hopefully this can be resolved without a PR by fixing the model file. See the great investigation done by @Jackmin801 on this ticket paperswithcode/galai#37 (comment)

@mkardas
Copy link

mkardas commented Nov 25, 2022

For reference, class OPTDecoderLayer from galai does pass bias to OPTAttention

Hi @ydshieh, the bias flag is passed only so that the Galactica extension of the OPT architecture is backward compatible. We set all the additional config parameters to the values used by OPT (see https://github.com/paperswithcode/galai/blob/main/galai/config.py#L92-L95) so that OPT checkpoints work as before, but we set them accordingly in the Galactica configs (see e.g. https://huggingface.co/mrm8488/galactica-125m/blob/main/config.json#L18). Whether these changes should be ported back to modeling_opt, or Galactica should be forked out from it, depends on how much it deviates from the general philosophy of Transformers, as @sgugger noted.

@younesbelkada
Copy link
Contributor Author

Hi @AnthonyHartshorn
Thanks a lot for your message. Indeed, big kudos to @Jackmin801; his investigation in https://huggingface.co/facebook/galactica-30b/discussions/4#637e90571dbae0919104b582 helped me identify the root cause of the bug.
I guess it could also be fixed by saving zero biases and ones for the layer norms; updating the weights on the Hub with new ones would do the trick too, yes.
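The alternative fix mentioned here (patching the checkpoint rather than the modeling code) could look roughly like the sketch below. The function name and the example key are hypothetical; real OPT checkpoints have longer, nested key names:

```python
import torch

def add_missing_biases(state_dict):
    """For every 2-D linear weight that has no matching bias entry, add a
    zero bias. A zero bias is a functional no-op for nn.Linear, so the
    patched checkpoint computes exactly the same outputs."""
    patched = dict(state_dict)
    for key, tensor in state_dict.items():
        if key.endswith(".weight") and tensor.dim() == 2:
            bias_key = key[: -len(".weight")] + ".bias"
            if bias_key not in state_dict:
                patched[bias_key] = torch.zeros(tensor.shape[0], dtype=tensor.dtype)
    return patched

sd = {"fc1.weight": torch.randn(8, 4)}
patched = add_missing_biases(sd)
print(sorted(patched))  # ['fc1.bias', 'fc1.weight']
```

LayerNorm weights would analogously be filled with ones and LayerNorm biases with zeros, since that is the identity affine transform.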

@LysandreJik
Copy link
Member

As @sgugger said above, this goes very clearly against the foundation of transformers to add configurable parameters to a previous model architecture to support a new model architecture.

However, fixing this any other way would break some setups in the wild: it would require us to rename the architecture from opt to galactica, which would break every existing setup that currently uses these models unless they upgrade to the latest version.

Given that, I'm also inclined to accept this change even if it goes against our design decisions. If we could do it all over again however, I would heavily push for a new model architecture.

@mkardas
Copy link

mkardas commented Nov 29, 2022

Thanks @LysandreJik for approving this PR. I have another related question. As pointed out by Jackmin801 in the comment linked above by Anthony (paperswithcode/galai#37 (comment)), almost all of the checkpoints were converted from our float16 checkpoints and uploaded to the hub in full float32 precision (except for 30B, which is an exact copy). That's not the best user experience: download time, disk usage and loading time double for no benefit. I wonder if we can fix it; there are a couple of options I see:

  • upload our float16 checkpoints once this PR is merged. This would not be backward compatible as this PR is required,
  • do the same conversion that @mrm8488 did, but .half() the models before exporting. This would be almost backward compatible, except for the case where a user doesn't specify torch_dtype when loading a model, as after that the models would load as float16 by default,
  • keep the existing checkpoints, potentially fix the 30B to be float32 as well for consistency (it wasn't working before this PR anyway). Not the best user experience,
  • add new checkpoints galactica-125m-fp16, ..., galactica-120b-fp16. Might be too confusing for users.

What do you think? I'm in favor of the second option as it's the best for backward compatibility and user experience.

@sgugger
Copy link
Collaborator

sgugger commented Nov 29, 2022

PyTorch automatically converts checkpoint weights to the dtype of the model when you load the state_dict, so option 2 is actually 100% backward compatible.
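The behavior @sgugger describes can be seen on a tiny module: load_state_dict copies incoming tensors into the existing parameters, casting to the parameters' dtype along the way.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)                               # parameters are float32
fp16_sd = {k: v.half() for k, v in model.state_dict().items()}

# Loading a float16 state dict into a float32 module: the copy casts the
# tensors, so the module's parameters stay float32.
model.load_state_dict(fp16_sd)
print(model.weight.dtype)  # torch.float32
```

This is why shipping float16 bin files does not change what a user gets when the model itself is instantiated in float32.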

@mkardas
Copy link

mkardas commented Nov 29, 2022

Thanks @sgugger, I missed the fact that torch_dtype is part of config.json.

@patrickvonplaten
Copy link
Contributor

This PR is OK for me. Galactica is built on top of OPT, so one could fine-tune OPT using these two configs => so this PR is definitely OK for me.

@younesbelkada younesbelkada merged commit b75255c into huggingface:main Nov 30, 2022
@younesbelkada
Copy link
Contributor Author

Thanks everyone!
@mkardas @mrm8488: https://huggingface.co/facebook/galactica-30b/discussions/5 — now that this PR has been merged, can you merge that Hub PR to fix the initial issue for 30b?

@mkardas
Copy link

mkardas commented Dec 1, 2022

@younesbelkada I'm not a member of the org yet. I've verified my work email address, but wasn't auto-added. How can I find out who the admins are?

@mkardas
Copy link

mkardas commented Dec 2, 2022

@patrickvonplaten can you add me to https://huggingface.co/facebook (same username)?

@ArthurZucker
Copy link
Collaborator

I can merge the PR if this is the only thing needed! 🤗

@mkardas
Copy link

mkardas commented Dec 5, 2022

Thanks @ArthurZucker. I was working on providing float16 weights in a backward-compatible way, as discussed above. I think it's best to just fix all the checkpoints to make them float16 and keep the zero biases for backward compatibility with HF 4.21.0-4.24.0. I'm in the middle of preparing a new HF Hub PR for this; I'll let you know if I'm not able to merge it.

@sgugger
From my tests on backward compatibility, it seems that calling OPTForCausalLM.from_pretrained with torch_dtype=None, device_map=None results in float32 weights regardless of what's in the checkpoint bin files and config.json. However, torch_dtype=None, device_map="auto" results in the same weight dtype as in the checkpoint bin files, regardless of config.json. Is this expected?

@younesbelkada
Copy link
Contributor Author

younesbelkada commented Dec 5, 2022

I think this is expected: if you load a model natively without Accelerate (i.e. without device_map="auto"), transformers automatically loads the weights in fp32. Whenever you want to load a model with the native dtype of its weights, you need to use torch_dtype="auto".

@sgugger
Copy link
Collaborator

sgugger commented Dec 5, 2022

Mmm, no. If it is indeed the case then it's a bug. Do you have a small reproducer/repo ID I could look at?

@mkardas
Copy link

mkardas commented Dec 5, 2022

This is what I used:

import torch
from transformers import OPTForCausalLM

for device_map in [None, "auto"]:
    for dtype in [None, torch.float16, torch.float32]:
        model = OPTForCausalLM.from_pretrained(
            "facebook/galactica-125m",
            revision="refs/pr/6",
            torch_dtype=dtype,
            device_map=device_map
        )
        print(f"[device_map={device_map}]: {dtype} -> {model.lm_head.weight.dtype}")
    print()

What I get for refs/pr/6 (which has torch_dtype=float32 in config.json and float16 bin files):

[device_map=None]: None -> torch.float32
[device_map=None]: torch.float16 -> torch.float16
[device_map=None]: torch.float32 -> torch.float32

[device_map=auto]: None -> torch.float16
[device_map=auto]: torch.float16 -> torch.float16
[device_map=auto]: torch.float32 -> torch.float32

For facebook/opt-125m the output is the same, even though opt-125m has float16 both in config.json and bin files.

@sgugger
Copy link
Collaborator

sgugger commented Dec 5, 2022

Found the issue. The PR mentioned above should make the result consistent between device_map=None and device_map="auto".

mpierrau pushed a commit to mpierrau/transformers that referenced this pull request Dec 15, 2022
* fix `opt` bias

* revert unneeded assignment
