
[OPT/Galactica] Load large galactica models#20390

Merged
younesbelkada merged 2 commits intohuggingface:mainfrom
younesbelkada:fix-opt-bias
Nov 30, 2022

Conversation

@younesbelkada
Copy link
Contributor

@younesbelkada younesbelkada commented Nov 22, 2022

What does this PR do?

This PR fixes a small bug in OPT. Before, the bias term was always set to True, forcing external implementations to hardcode around it if they wanted to train an OPT model without bias terms (see for example here). This PR gives more control over whether bias terms are used in OPT's Linear layers.
The PR also fixes the same issue for nn.LayerNorm. Some derivatives of OPT do not use learnable parameters for the layer norm's weights and biases (i.e., they set elementwise_affine to False), so making this configurable avoids hardcoded hacks in the future.
This PR should not be a breaking change, as the default values of these booleans are set to True (matching the previous behavior).
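A minimal sketch of the knobs this PR describes (illustrative only, not the actual Transformers modeling code; the helper name and flag names here are assumptions based on the PR description):

```python
import torch.nn as nn

def make_opt_like_layer(hidden, ffn, enable_bias=True, elementwise_affine=True):
    """Illustrative OPT-style sub-layer: bias on the Linear layers and the
    LayerNorm affine parameters become configurable, defaulting to the old
    hardcoded behavior (both True)."""
    return nn.ModuleDict({
        "fc1": nn.Linear(hidden, ffn, bias=enable_bias),
        "fc2": nn.Linear(ffn, hidden, bias=enable_bias),
        "final_layer_norm": nn.LayerNorm(hidden, elementwise_affine=elementwise_affine),
    })

# Galactica-style configuration: no biases, no learnable LayerNorm parameters.
galactica_style = make_opt_like_layer(8, 32, enable_bias=False, elementwise_affine=False)
print(galactica_style["fc1"].bias)                 # None
print(galactica_style["final_layer_norm"].weight)  # None
```

With the defaults left at True, the layer is parameter-for-parameter identical to the previous hardcoded version, which is why existing OPT checkpoints keep loading unchanged.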

This PR should also fix: https://huggingface.co/facebook/galactica-30b/discussions/4 (of course, after updating the relevant config files)

cc @sgugger @ydshieh @mrm8488

All slow tests pass

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Nov 22, 2022

The documentation is not available anymore as the PR was closed or merged.

@ydshieh
Copy link
Collaborator

ydshieh commented Nov 22, 2022

I am not 100% sure this is the approach we want, though I understand the intention. I would like to hear from @sgugger.

For reference, class OPTDecoderLayer from galai does pass bias to OPTAttention

https://github.com/paperswithcode/galai/blob/c1e16979c1748e7e823fe96da941d6df60f1006b/galai/architecture.py#L280

@younesbelkada
Copy link
Contributor Author

younesbelkada commented Nov 22, 2022

Yes, I think it was a mistake on our side. We should either port a new model (with controllable bias and layer norm) and remove the bias boolean from OPTAttention, since it is always set to True, or go with this fix.

Copy link
Collaborator

@sgugger sgugger left a comment


Those two changes go against the general philosophy of Transformers (not being a modular toolbox). The test such a change usually has to pass to be accepted is: "does it work with the existing canonical checkpoint for the model", which is not the case here.

However, I could consider allowing an exception here, mainly because the change does not pollute any of the forward methods with code that doesn't benefit the "canonical" OPT. Since it's tangential though, I'd like to hear from @patrickvonplaten and @LysandreJik to make sure we are aligned.

@younesbelkada
Copy link
Contributor Author

younesbelkada commented Nov 22, 2022

Thanks!
Sorry for the last-minute clarification; I just realized the description and title are not clear. The main goal of this PR is to support loading and using the large Galactica models that use the OPT architecture, initially reported in https://huggingface.co/facebook/galactica-30b/discussions/4, so the title and description were slightly misleading.
The snippet to reproduce:

import torch
from transformers import AutoTokenizer, OPTForCausalLM, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-30b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-30b", device_map="auto", torch_dtype=torch.float16)

input_text = "The Transformer architecture [START_REF]"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

In case we don't merge this PR, we may want to add Galactica as a separate new architecture, as some Galactica models (such as 30b) do not use bias on linear layers and have no learnable weights in their LayerNorm.

@younesbelkada younesbelkada changed the title [OPT] We should allow not using bias terms [OPT/Galactica] Load large galactica models Nov 22, 2022
@sgugger
Copy link
Collaborator

sgugger commented Nov 22, 2022

I understood, and yes, that will be the alternative if this PR is declined :-)

@AnthonyHartshorn
Copy link

Hi @sgugger and @younesbelkada, it's one of the Galactica authors here. We think there might be something wrong with the 30bn model specifically on HuggingFace. We're currently migrating our galai library to use the huggingface model without our custom OPT config. There seems to have been a conversion process applied to our models that gave null weights to the biases (or something else similar to what was done for the OPT models), but specifically not to the 30bn file. Hopefully this can be resolved without a PR by fixing the model file. See the great investigation done by @Jackmin801 on this ticket paperswithcode/galai#37 (comment)

@mkardas
Copy link

mkardas commented Nov 25, 2022

For reference, class OPTDecoderLayer from galai does pass bias to OPTAttention

Hi @ydshieh, the bias flag is passed only so that the Galactica extension of the OPT architecture is backward compatible. We set all the additional config parameters to the values used by OPT (see https://github.com/paperswithcode/galai/blob/main/galai/config.py#L92-L95) so that OPT checkpoints work as before, but we set them accordingly in the Galactica configs (see e.g. https://huggingface.co/mrm8488/galactica-125m/blob/main/config.json#L18). Whether these changes should be ported back to modeling_opt, or Galactica should be forked out from it, depends on how much it deviates from the general philosophy of Transformers, as @sgugger noted.

@younesbelkada
Copy link
Contributor Author

Hi @AnthonyHartshorn
Thanks a lot for your message. Indeed, big kudos to @Jackmin801; his investigation in https://huggingface.co/facebook/galactica-30b/discussions/4#637e90571dbae0919104b582 helped me identify the root cause of the bug.
I guess it could also be fixed by saving zero biases and ones for the layer norms; updating the weights on the Hub with new ones would do the trick too, yes.
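The alternative fix mentioned here (patching the checkpoint rather than the modeling code) could look roughly like the sketch below. The function name and the example key are hypothetical; real OPT checkpoints have longer, nested key names:

```python
import torch

def add_missing_biases(state_dict):
    """For every 2-D linear weight that has no matching bias entry, add a
    zero bias. A zero bias is a functional no-op for nn.Linear, so the
    patched checkpoint computes exactly the same outputs."""
    patched = dict(state_dict)
    for key, tensor in state_dict.items():
        if key.endswith(".weight") and tensor.dim() == 2:
            bias_key = key[: -len(".weight")] + ".bias"
            if bias_key not in state_dict:
                patched[bias_key] = torch.zeros(tensor.shape[0], dtype=tensor.dtype)
    return patched

sd = {"fc1.weight": torch.randn(8, 4)}
patched = add_missing_biases(sd)
print(sorted(patched))  # ['fc1.bias', 'fc1.weight']
```

LayerNorm weights would analogously be filled with ones and LayerNorm biases with zeros, since that is the identity affine transform.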

@LysandreJik
Copy link
Member

As @sgugger said above, this goes very clearly against the foundation of transformers to add configurable parameters to a previous model architecture to support a new model architecture.

However, fixing this any other way would break some setups in the wild: it would require us to rename the architecture from opt to galactica, which would break every existing setup that currently uses these models unless they upgrade to the latest version.

Given that, I'm also inclined to accept this change even if it goes against our design decisions. If we could do it all over again however, I would heavily push for a new model architecture.

@mkardas
Copy link

mkardas commented Nov 29, 2022

Thanks @LysandreJik for approving this PR. I have another related question. As pointed out by Jackmin801 in the comment linked above by Anthony (paperswithcode/galai#37 (comment)), almost all of the checkpoints were converted from our float16 checkpoints and uploaded to the hub in full float32 precision (except for 30B, which is an exact copy). That's not the best user experience: download time, disk usage and loading time double for no benefit. I wonder if we can fix it; there are a couple of options I see:

  • upload our float16 checkpoints once this PR is merged. This would not be backward compatible as this PR is required,
  • do the same conversion that @mrm8488 did, but .half() the models before exporting. This would be almost backward compatible, except for the case where a user doesn't specify torch_dtype when loading a model, as after that the models would load as float16 by default,
  • keep the existing checkpoints, potentially fix the 30B to be float32 as well for consistency (it wasn't working before this PR anyway). Not the best user experience,
  • add new checkpoints galactica-125m-fp16, ..., galactica-120b-fp16. Might be too confusing for users.

What do you think? I'm in favor of the second option as it's the best for backward compatibility and user experience.

@sgugger
Copy link
Collaborator

sgugger commented Nov 29, 2022

PyTorch automatically converts checkpoint weights to the dtype of the model when you load the state_dict, so option 2 is actually 100% backward compatible.
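The behavior @sgugger describes can be seen on a tiny module: load_state_dict copies incoming tensors into the existing parameters, casting to the parameters' dtype along the way.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)                               # parameters are float32
fp16_sd = {k: v.half() for k, v in model.state_dict().items()}

# Loading a float16 state dict into a float32 module: the copy casts the
# tensors, so the module's parameters stay float32.
model.load_state_dict(fp16_sd)
print(model.weight.dtype)  # torch.float32
```

This is why shipping float16 bin files does not change what a user gets when the model itself is instantiated in float32.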

@mkardas
Copy link

mkardas commented Nov 29, 2022

Thanks @sgugger, I missed the fact that torch_dtype is part of config.json.

@patrickvonplaten
Copy link
Contributor

This PR is OK for me. Galactica is built on top of OPT, so one could fine-tune OPT using these two configs => so this PR is definitely OK for me.

@younesbelkada younesbelkada merged commit b75255c into huggingface:main Nov 30, 2022
@younesbelkada
Copy link
Contributor Author

Thanks everyone!
@mkardas @mrm8488: https://huggingface.co/facebook/galactica-30b/discussions/5 — now that this PR has been merged, can you merge that Hub PR to fix the initial issue for 30b?

@mkardas
Copy link

mkardas commented Dec 1, 2022

@younesbelkada I'm not a member of the org yet. I've verified my work email address, but wasn't auto-added. How can I find out who the admins are?

@mkardas
Copy link

mkardas commented Dec 2, 2022

@patrickvonplaten can you add me to https://huggingface.co/facebook (same username)?

@ArthurZucker
Copy link
Collaborator

I can merge the PR if this is the only thing needed! 🤗

@mkardas
Copy link

mkardas commented Dec 5, 2022

Thanks @ArthurZucker. I was working on providing float16 weights in a backward-compatible way, as discussed above. I think it's best to just fix all the checkpoints to make them float16 and keep the zero biases for backward compatibility with HF 4.21.0-4.24.0. I'm in the middle of preparing a new HF Hub PR for this; I'll let you know if I'm not able to merge it.

@sgugger
From my tests on backward compatibility, it seems that calling OPTForCausalLM.from_pretrained with torch_dtype=None, device_map=None results in float32 weights regardless of what's in the checkpoint bin files and config.json. However, torch_dtype=None, device_map="auto" results in the same weight dtype as in the checkpoint bin files, regardless of config.json. Is this expected?

@younesbelkada
Copy link
Contributor Author

younesbelkada commented Dec 5, 2022

I think this is expected: if you load a model natively without Accelerate (i.e. without device_map="auto"), transformers automatically loads the weights in fp32. Whenever you want to load a model with the native dtype of its weights, you need to use torch_dtype="auto".

@sgugger
Copy link
Collaborator

sgugger commented Dec 5, 2022

Mmm, no. If it is indeed the case then it's a bug. Do you have a small reproducer/repo ID I could look at?

@mkardas
Copy link

mkardas commented Dec 5, 2022

This is what I used:

import torch
from transformers import OPTForCausalLM

for device_map in [None, "auto"]:
    for dtype in [None, torch.float16, torch.float32]:
        model = OPTForCausalLM.from_pretrained(
            "facebook/galactica-125m",
            revision="refs/pr/6",
            torch_dtype=dtype,
            device_map=device_map
        )
        print(f"[device_map={device_map}]: {dtype} -> {model.lm_head.weight.dtype}")
    print()

What I get for refs/pr/6 (which has torch_dtype=float32 in config.json and float16 bin files):

[device_map=None]: None -> torch.float32
[device_map=None]: torch.float16 -> torch.float16
[device_map=None]: torch.float32 -> torch.float32

[device_map=auto]: None -> torch.float16
[device_map=auto]: torch.float16 -> torch.float16
[device_map=auto]: torch.float32 -> torch.float32

For facebook/opt-125m the output is the same, even though opt-125m has float16 both in config.json and bin files.

@sgugger
Copy link
Collaborator

sgugger commented Dec 5, 2022

Found the issue. The PR mentioned above should make the result consistent between device_map=None and device_map="auto".

mpierrau pushed a commit to mpierrau/transformers that referenced this pull request Dec 15, 2022
* fix `opt` bias

* revert unneeded assignment
