[OPT/Galactica] Load large galactica models #20390
younesbelkada merged 2 commits into huggingface:main
|
The documentation is not available anymore as the PR was closed or merged. |
|
I am not 100% sure this is the approach we want to take, though I can understand the intention. Would like to hear from @sgugger. For reference, |
|
Yes, I think it was a mistake on our side. We should either port a new model (with controllable bias and layer norm) and remove the
sgugger
left a comment
Those two changes go against the general philosophy of Transformers (not being a modular toolbox). The test such a change usually has to pass to be accepted is: "does it work with the existing canonical checkpoint for the model", which is not the case here.
However, I could consider allowing an exception here, mainly because the change does not pollute any of the forward methods with code that doesn't benefit the "canonical" OPT. Since it's tangential though, I'd like to hear from @patrickvonplaten and @LysandreJik to make sure we are aligned.
|
Thanks! In case we don't merge this PR we may want to add bias terms to the galactica models
|
I understood, and yes, that will be the alternative if this PR is declined :-) |
|
Hi @sgugger and @younesbelkada, it's one of the Galactica authors here. We think there might be something wrong specifically with the 30bn model on HuggingFace. We're currently migrating our galai library to use the HuggingFace model without our custom OPT config. There seems to have been a conversion process applied to our models that gave null weights to the biases (or something else similar to the OPT models), but specifically not to the 30bn file. Hopefully this can be resolved without a PR by fixing the model file. See the great investigation done by @Jackmin801 on this ticket: paperswithcode/galai#37 (comment) |
Hi @ydshieh, the |
|
Hi @AnthonyHartshorn |
|
As @sgugger said above, this goes very clearly against the foundation of Transformers. However, fixing this any other way would result in some setups breaking in the wild; it would require us to update the architecture name. Given that, I'm also inclined to accept this change even if it goes against our design decisions. If we could do it all over again, however, I would heavily push for a new model architecture. |
|
Thanks @LysandreJik for approving this PR. I have another related question. As pointed out by Jackmin801 in the comment linked above by Anthony (paperswithcode/galai#37 (comment)), almost all of the checkpoints were converted from our float16 checkpoints and uploaded to the hub in full float32 precision (except for 30B, which is an exact copy). That's not the best user experience: download time, disk usage and loading time double for no benefit. I wonder if we can fix it; there are a couple of options I see:
What do you think? I'm in favor of the second option as it's the best for backward compatibility and user experience. |
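The doubling described above is simply the per-element storage cost: float32 uses 4 bytes per parameter versus 2 bytes for float16. A quick illustration with a toy tensor (not an actual Galactica checkpoint):

```python
import torch

# A million-element tensor as a stand-in for a checkpoint shard.
w = torch.zeros(1000, 1000)  # float32 by default

# Bytes on disk/in memory: nelement * bytes-per-element.
print(w.nelement() * w.element_size())                # 4000000 (float32)
print(w.half().nelement() * w.half().element_size())  # 2000000 (float16)
```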
|
PyTorch automatically converts checkpoint weights to the dtype of the model when you load the state_dict, so option 2 is actually 100% backward compatible. |
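A minimal sketch of that point: `load_state_dict` copies each checkpoint tensor into the existing parameter with `Tensor.copy_`, which casts to the parameter's dtype, so loading a float16 state dict into a float32 model leaves the model in float32.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)  # parameters are float32 by default

# Simulate a checkpoint that was saved in float16.
half_state = {k: v.half() for k, v in model.state_dict().items()}
model.load_state_dict(half_state)

# The fp16 checkpoint tensors were upcast back to the model's dtype.
print(model.weight.dtype)  # torch.float32
```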
|
Thanks @sgugger, I missed the fact that |
|
This PR is ok for me - galactica is built on top of OPT, so one could fine-tune OPT using these two configs => def ok for me |
|
Thanks everyone! |
|
@younesbelkada I'm not a member of the org yet. I've verified my work email address, but wasn't auto-added. How can I learn who are the admins? |
|
@patrickvonplaten can you add me to https://huggingface.co/facebook (same username)? |
|
I can merge the PR if this is the only thing needed! 🤗 |
|
Thanks @ArthurZucker. I was working on providing float16 weights in a backward-compatible way as discussed above. I think it's best to just fix all the checkpoints to make them float16 and keep zero biases for backward compatibility with HF 4.21.0-4.24.0. I'm in the middle of preparing a new HF hub PR for this; I'll let you know in case I'm not able to merge it. @sgugger |
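A minimal sketch of that checkpoint fix, assuming a plain PyTorch state dict (the helper name and the `.bias` key convention are illustrative, not from the actual hub PR):

```python
import torch
import torch.nn as nn

def convert_checkpoint(state_dict):
    """Cast floating-point weights to float16 and force bias tensors to
    zero, so that older transformers versions (4.21.0-4.24.0), which
    expect the bias terms to exist, can still load the checkpoint."""
    out = {}
    for name, tensor in state_dict.items():
        if name == "bias" or name.endswith(".bias"):
            out[name] = torch.zeros_like(tensor, dtype=torch.float16)
        elif tensor.is_floating_point():
            out[name] = tensor.half()
        else:
            out[name] = tensor
    return out

converted = convert_checkpoint(nn.Linear(4, 4).state_dict())
print(converted["weight"].dtype)  # torch.float16
```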
|
I think this is expected, as if you want to load a model natively without using |
|
Mmm, no. If it is indeed the case then it's a bug. Do you have a small reproducer/repo ID I could look at? |
|
This is what I used:

```python
import torch
from transformers import OPTForCausalLM

for device_map in [None, "auto"]:
    for dtype in [None, torch.float16, torch.float32]:
        model = OPTForCausalLM.from_pretrained(
            "facebook/galactica-125m",
            revision="refs/pr/6",
            torch_dtype=dtype,
            device_map=device_map,
        )
        print(f"[device_map={device_map}]: {dtype} -> {model.lm_head.weight.dtype}")
    print()
```

What I get for `device_map=None` and for `device_map="auto"`: |
|
PRs replacing the existing https://huggingface.co/facebook/galactica-125m/discussions/6 |
|
Found the issue. The PR mentioned above should make the result consistent between |
* fix `opt` bias
* revert unneeded assignment
What does this PR do?

This PR fixes a small bug in OPT. Before, the `bias` term was always set to `True`, leading some external implementations to hardcode it if they wanted to train an OPT model without bias terms. See for example here. This PR aims to give more control over whether or not to use `bias` terms on the `Linear` layers of OPT.

The PR also fixes the same issue with `nn.LayerNorm`. Some derivatives of OPT do not use learnable parameters for the layer norm's weights and biases (i.e., they set `elementwise_affine` to `False`); making this configurable avoids hardcoded hacks in the future.

This PR should not be a breaking change, as the default values of these booleans are set to `True` (matching the previous behavior).

This PR should also fix: https://huggingface.co/facebook/galactica-30b/discussions/4 (of course, after updating the relevant config files)
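As a hedged sketch of what the PR enables (this is illustrative PyTorch, not the actual OPT modeling code; the class and parameter names are assumptions):

```python
import torch
import torch.nn as nn

class DecoderFFN(nn.Module):
    """Toy feed-forward block where the Linear bias terms and the layer
    norm's learnable parameters are configurable instead of hardcoded."""

    def __init__(self, dim, ffn_dim, enable_bias=True, elementwise_affine=True):
        super().__init__()
        self.fc1 = nn.Linear(dim, ffn_dim, bias=enable_bias)
        self.fc2 = nn.Linear(ffn_dim, dim, bias=enable_bias)
        # elementwise_affine=False drops the norm's learnable weight/bias.
        self.final_layer_norm = nn.LayerNorm(dim, elementwise_affine=elementwise_affine)

    def forward(self, x):
        return self.final_layer_norm(x + self.fc2(torch.relu(self.fc1(x))))

# Galactica-style configuration: no biases, no affine layer norm.
ffn = DecoderFFN(8, 16, enable_bias=False, elementwise_affine=False)
print(ffn.fc1.bias, ffn.final_layer_norm.weight)  # None None
```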
cc @sgugger @ydshieh @mrm8488
All slow tests pass