
[INT8] BLOOM series model loading back issue #19480

@lanking520


System Info

8x A100 GPUs with CUDA 11.3 driver

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Use the following script to save an INT8-quantized model, then try to load it back.

import os
import torch
import logging
import math
from transformers import AutoConfig, pipeline, AutoModelForCausalLM, AutoTokenizer

def get_max_memory_per_gpu_dict(dtype, model_name):
    """try to generate the memory map based on what we know about the model and the available hardware"""

    # figure out the memory map - the minimum per gpu required to load the model
    n_gpus = torch.cuda.device_count()

    try:
        # model_params calculation, as we don't have a model yet to do:
        # model_params = sum(dict((p.data_ptr(), p.numel()) for p in model.parameters()).values())

        config = AutoConfig.from_pretrained(model_name)
        h = config.hidden_size
        l = config.n_layer
        v = config.vocab_size
        # from https://github.com/bigscience-workshop/bigscience/tree/6917a3b5fefcf439d3485ca184b4d9f6ab605150/math#model-sizing
        model_params = l * (12 * h ** 2 + 13 * h) + v * h + 4 * h
    except:
        logging.info(f"The model {model_name} has a broken config file. Please notify the owner")
        raise

    if dtype == torch.int8:
        bytes = 1
    else:
        bytes = torch.finfo(dtype).bits / 8
    param_memory_total_in_bytes = model_params * bytes
    # add 10% headroom since weight shards aren't evenly sized and some GPUs may need more memory
    param_memory_per_gpu_in_bytes = int(param_memory_total_in_bytes / n_gpus * 1.10)
    logging.info(f"Estimating {param_memory_per_gpu_in_bytes / 2 ** 30:0.2f}GB per gpu for weights")

    # check the real available memory
    # load cuda kernels first and only measure the real free memory after loading (shorter by ~2GB)
    torch.ones(1).cuda()
    max_memory_per_gpu_in_bytes = torch.cuda.mem_get_info(0)[0]
    if max_memory_per_gpu_in_bytes < param_memory_per_gpu_in_bytes:
        raise ValueError(
            f"Unable to generate the memory map automatically as the needed estimated memory per gpu ({param_memory_per_gpu_in_bytes / 2 ** 30:0.2f}GB) is bigger than the available per gpu memory ({max_memory_per_gpu_in_bytes / 2 ** 30:0.2f}GB)"
        )

    max_memory_per_gpu = {i: param_memory_per_gpu_in_bytes for i in range(torch.cuda.device_count())}
    print("Max memory per gpu:", max_memory_per_gpu)
    return max_memory_per_gpu


def load_model():
    world_size = torch.cuda.device_count()
    model_name = "bigscience/bloom"
    logging.info(f"Using {world_size} gpus")
    logging.info(f"Loading model {model_name}")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    dtype = torch.int8
    kwargs = dict(
        device_map="auto",
        max_memory=get_max_memory_per_gpu_dict(dtype, model_name),
    )
    logging.info("Using `load_in_8bit=True` to use quanitized model")
    kwargs["load_in_8bit"] = True
    model = AutoModelForCausalLM.from_pretrained(model_name, **kwargs)
    return model, tokenizer

model, tokenizer = load_model()

model.save_pretrained("int8_model/", max_shard_size="8GB")
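
As a sanity check of the sizing formula above, plugging in the published bigscience/bloom config values (hidden_size=14336, n_layer=70, vocab_size=250880; these numbers are assumed from the hub config, not stated in this report) gives roughly 176B parameters, i.e. about 22-23 GiB of int8 weights per GPU on the 8-GPU node:

# Sanity check of the sizing formula, assuming the published bigscience/bloom
# config values (hidden_size=14336, n_layer=70, vocab_size=250880).
h, l, v = 14336, 70, 250880
model_params = l * (12 * h ** 2 + 13 * h) + v * h + 4 * h
print(f"{model_params / 1e9:.1f}B parameters")              # ~176.2B
param_memory_total_in_bytes = model_params * 1              # 1 byte per param for int8
per_gpu = int(param_memory_total_in_bytes / 8 * 1.10)       # 8 GPUs + 10% headroom
print(f"{per_gpu / 2 ** 30:.1f} GiB of weights per GPU")    # ~22.6 GiB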

When loading the model back from that directory, the following error is raised during model initialization:

RuntimeError: Only Tensors of floating point dtype can require gradients

import torch
from transformers import AutoModelForCausalLM

model_name = 'int8_model/'
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.int8)

Expected behavior

Loading should succeed. In the meantime, looking for a workaround.
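
The error itself comes from passing `torch_dtype=torch.int8` to `from_pretrained`: the weights are instantiated as `nn.Parameter`s with `requires_grad=True`, and PyTorch only allows gradients on floating point tensors. One possible interim workaround (a sketch, not a confirmed fix, and it assumes the original checkpoint is still reachable) is to skip the serialized int8 weights entirely and re-quantize from the original checkpoint at load time, the same way the save script does:

# Workaround sketch: re-quantize on the fly from the original checkpoint
# instead of reloading the serialized int8 weights. Assumes the original
# bigscience/bloom checkpoint (or a local fp16 copy of it) is available.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    load_in_8bit=True,  # bitsandbytes applies INT8 quantization at load time
)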
