Conversation

@younesbelkada
Contributor

@younesbelkada younesbelkada commented Mar 15, 2023

What does this PR do?

Before this PR, it was not possible to save an 8-bit model or to load one from the Hub. This PR makes that possible. Once merged, users can upload 8-bit models to the Hub and/or load 8-bit models from the Hub, saving 2x memory compared to half-precision models.

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m", device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt").to(0)
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0]))

>>> Hello my name is Nate, I am a professional photographer and I am a member of the

model.save_pretrained("./saved_int8")

model = AutoModelForCausalLM.from_pretrained("./saved_int8")

outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0]))
>>> Hello my name is Nate, I am a professional photographer and I am a member of the

Depends on bitsandbytes-foundation/bitsandbytes#159

Let's keep it as a draft until I address the last TODOs and open questions, and until bitsandbytes-foundation/bitsandbytes#159 gets merged.

TODOs and open questions:

  • ability to push BitsAndBytesConfig
  • Do we want to save the serialized model under the name pytorch_model.bin? I would say yes for simplicity, but we need to make sure that a user calls from_pretrained with load_in_8bit, hence add a warning if there is a quantization_config.json on the Hub repo and the user is not passing load_in_8bit=True (see the sketch after this list).
  • Force load_in_8bit=True if there is a quantization_config.json on the Hub repo?
  • Update docs
  • Update warnings
  • Safety checkers for bnb versions
  • Add a test to check if it works using sharded fp16 weights
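
A minimal sketch of the warning mentioned in the pytorch_model.bin item above; the helper name and the exact hook into from_pretrained are assumptions, and list_repo_files from huggingface_hub is used only for illustration:

import warnings

from huggingface_hub import list_repo_files


def maybe_warn_about_quantized_repo(repo_id: str, load_in_8bit: bool) -> None:
    # Illustrative only: warn if the repo ships a quantization config
    # but the user did not ask for 8-bit loading.
    repo_files = list_repo_files(repo_id)
    if "quantization_config.json" in repo_files and not load_in_8bit:
        warnings.warn(
            f"{repo_id} contains a quantization_config.json, which suggests the weights "
            "are serialized in 8-bit. Pass load_in_8bit=True to from_pretrained to load them correctly."
        )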

cc @sgugger I left a few open questions, would love to hear your thoughts on these!

@younesbelkada younesbelkada marked this pull request as draft March 15, 2023 08:46
@younesbelkada younesbelkada requested a review from sgugger March 15, 2023 08:49
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Mar 15, 2023

The documentation is not available anymore as the PR was closed or merged.

@sgugger
Collaborator

sgugger commented Mar 15, 2023

The design is not easy enough to use. If a user saves a quantized model and pushes it to the Hub, it should work directly with from_pretrained. This is why I insisted that the quantization config should be saved inside the model config. This way you won't need to have the user pass load_in_8bit=True, as you can read it from the config.
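
A minimal sketch of the user-facing flow this suggestion aims for, assuming the quantization config ends up embedded in the model config so that from_pretrained can pick it up (the target repo name is hypothetical):

from transformers import AutoModelForCausalLM

# Quantize once and push; under this design the quantization config
# travels inside the model config rather than as a separate file or flag.
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m", device_map="auto", load_in_8bit=True
)
model.push_to_hub("my-user/bloom-560m-8bit")  # hypothetical repo name

# Later, a plain from_pretrained is enough: the 8-bit setting is read from the config.
model = AutoModelForCausalLM.from_pretrained("my-user/bloom-560m-8bit", device_map="auto")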

@younesbelkada
Contributor Author

Awesome, ok, I'll work on that. So if there is a quantization config on the repo, we should force device_map="auto" & load_in_8bit=True in that case.

Comment on lines 2132 to 2135
# TODO: uncomment this after the next release of bitsandbytes
# can_serialize_bnb = version.parse(
# importlib_metadata.version("bitsandbytes")
# ) >= version.parse("0.37.2")
Contributor Author

@younesbelkada younesbelkada Mar 29, 2023


This will be changed and I will remove the hardcoded value below on the next bitsandbytes release. Of course, I don't expect to merge this before the next bnb release.
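
For reference, a sketch of the gate once it is uncommented, assuming the serialization support lands in bitsandbytes 0.37.2 as discussed above:

import importlib.metadata as importlib_metadata

from packaging import version

# True only if the installed bitsandbytes version supports int8 serialization.
can_serialize_bnb = version.parse(
    importlib_metadata.version("bitsandbytes")
) >= version.parse("0.37.2")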

@younesbelkada younesbelkada requested review from sgugger and removed request for sgugger March 29, 2023 11:58
@younesbelkada younesbelkada marked this pull request as ready for review March 29, 2023 12:02
@younesbelkada
Contributor Author

The PR is ready for review @sgugger!
Of course, this PR is not mergeable before the next bnb release.

Collaborator

@sgugger sgugger left a comment


I think we didn't understand each other ;-)
The quantization config should be saved as part of the model config (the model weights cannot be used without it anyway; once the model is quantized, it's really part of the information necessary to load it). That will make all the download code irrelevant and this PR much easier.


### Load a quantized model from the 🤗 Hub

You can load a quantized model from the Hub by using the `from_pretrained` method. Make sure that the pushed weights are quantized by checking that the file `quantization_config.json` is present in the model repository.
Collaborator


No, it should be part of the model config; it doesn't need to be in its own file.

importlib_metadata.version("bitsandbytes")
) >= version.parse("0.37.0")

model.quantization_config = quantization_config
Collaborator


Everything would be much simpler if you set model.config.quantization_config = quantization_config.

@younesbelkada
Contributor Author

Thanks for the heads-up! :D
It should be much better now! As far as I'm concerned, the PR is ready for review.

@younesbelkada younesbelkada requested a review from sgugger March 31, 2023 09:18
Collaborator

@sgugger sgugger left a comment


Nice, thanks a lot for iterating!

Collaborator

@sgugger sgugger left a comment


LGTM with a couple last nits!

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@sgugger sgugger merged commit 370f0ca into huggingface:main Apr 12, 2023
novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023
…#22177)

* make serialization of int8 models possible

* make fixup

* add docs

* add ability to push to hub and save pretrained

* fixes

* more addition

* more tests

* fix issues

* change variable

* clearer message

* adapt from suggestions

* few fixes

* remove unused function

* Update src/transformers/utils/quantization_config.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* address last comments

* last warning

* clarify doc

* protect import

* Update src/transformers/modeling_utils.py

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

---------

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>