Description
System Info
8x A100 GPUs with CUDA 11.3 driver
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Use the following script to save an INT8-quantized model, then try to load it back.
```python
import os
import torch
import logging
import math

from transformers import AutoConfig, pipeline, AutoModelForCausalLM, AutoTokenizer


def get_max_memory_per_gpu_dict(dtype, model_name):
    """try to generate the memory map based on what we know about the model and the available hardware"""
    # figure out the memory map - the minimum per gpu required to load the model
    n_gpus = torch.cuda.device_count()
    try:
        # model_params calculation, as we don't have a model yet to do:
        # model_params = sum(dict((p.data_ptr(), p.numel()) for p in model.parameters()).values())
        config = AutoConfig.from_pretrained(model_name)
        h = config.hidden_size
        l = config.n_layer
        v = config.vocab_size
        # from https://github.com/bigscience-workshop/bigscience/tree/6917a3b5fefcf439d3485ca184b4d9f6ab605150/math#model-sizing
        model_params = l * (12 * h ** 2 + 13 * h) + v * h + 4 * h
    except Exception:
        logging.info(f"The model {model_name} has a broken config file. Please notify the owner")
        raise

    if dtype == torch.int8:
        bytes_per_param = 1
    else:
        bytes_per_param = torch.finfo(dtype).bits / 8
    param_memory_total_in_bytes = model_params * bytes_per_param
    # add 10% since weight sizes aren't the same and some GPU may need more memory
    param_memory_per_gpu_in_bytes = int(param_memory_total_in_bytes / n_gpus * 1.10)
    logging.info(f"Estimating {param_memory_per_gpu_in_bytes / 2 ** 30:0.2f}GB per gpu for weights")

    # check the real available memory
    # load cuda kernels first and only measure the real free memory after loading (shorter by ~2GB)
    torch.ones(1).cuda()
    max_memory_per_gpu_in_bytes = torch.cuda.mem_get_info(0)[0]
    if max_memory_per_gpu_in_bytes < param_memory_per_gpu_in_bytes:
        raise ValueError(
            f"Unable to generate the memory map automatically as the needed estimated memory per gpu "
            f"({param_memory_per_gpu_in_bytes / 2 ** 30:0.2f}GB) is bigger than the available per gpu "
            f"memory ({max_memory_per_gpu_in_bytes / 2 ** 30:0.2f}GB)"
        )

    max_memory_per_gpu = {i: param_memory_per_gpu_in_bytes for i in range(torch.cuda.device_count())}
    print("Max memory per gpu:", max_memory_per_gpu)
    return max_memory_per_gpu


def load_model():
    world_size = torch.cuda.device_count()
    model_name = "bigscience/bloom"
    logging.info(f"Using {world_size} gpus")
    logging.info(f"Loading model {model_name}")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    dtype = torch.int8
    kwargs = dict(
        device_map="auto",
        max_memory=get_max_memory_per_gpu_dict(dtype, model_name),
    )
    logging.info("Using `load_in_8bit=True` to use quantized model")
    kwargs["load_in_8bit"] = True
    model = AutoModelForCausalLM.from_pretrained(model_name, **kwargs)
    return model, tokenizer


model, tokenizer = load_model()
model.save_pretrained("int8_model/", max_shard_size="8GB")
```
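As a sanity check on the sizing formula used above, plugging in BLOOM's config values (hidden_size=14336, n_layer=70, vocab_size=250880; these figures are assumed from the published `bigscience/bloom` config, not stated in this issue) yields roughly 176B parameters, which matches the model's advertised size:

```python
# Sanity check of model_params = l * (12*h**2 + 13*h) + v*h + 4*h.
# Config values below are assumed from the published bigscience/bloom config.
h, l, v = 14336, 70, 250880
model_params = l * (12 * h ** 2 + 13 * h) + v * h + 4 * h
print(f"{model_params / 1e9:.1f}B parameters")  # 176.2B parameters
print(f"{model_params / 2 ** 30:.0f}GiB at 1 byte/param (INT8)")  # 164GiB
```

At one byte per parameter this is why the INT8 checkpoint is split across all 8 GPUs rather than fitting on one.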
When loading the model back from that directory, initialization fails with:

```
RuntimeError: Only Tensors of floating point dtype can require gradients
```

The loading script:
```python
import torch
import torch.distributed as dist

from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

model_name = 'int8_model/'
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.int8)
```
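The error appears to come from PyTorch itself rather than transformers: `nn.Parameter` defaults to `requires_grad=True`, and PyTorch only allows gradients on floating-point (and complex) tensors, so building the model skeleton with `torch_dtype=torch.int8` fails as soon as the first parameter is created. A minimal reproduction outside transformers:

```python
import torch

# nn.Parameter defaults to requires_grad=True, and only floating-point
# (or complex) tensors may require gradients, so an int8 parameter fails.
try:
    torch.nn.Parameter(torch.zeros(4, dtype=torch.int8))
except RuntimeError as e:
    print(type(e).__name__, e)  # RuntimeError about floating point dtype

# The same tensor is accepted once gradients are disabled:
p = torch.nn.Parameter(torch.zeros(4, dtype=torch.int8), requires_grad=False)
print(p.dtype)  # torch.int8
```

This suggests any fix or workaround has to create the int8 parameters with `requires_grad=False` (or let bitsandbytes manage them) instead of passing `torch_dtype=torch.int8` to `from_pretrained`.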
Expected behavior
Loading should succeed. Looking for a workaround in the meantime...