
[INT8] BLOOM series model loading back issue #19480

@lanking520


System Info

8x A100 GPUs with CUDA 11.3 driver

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Use the following script to save an INT8-quantized model, then try to load it back.

import os
import torch
import logging
import math
from transformers import AutoConfig, pipeline, AutoModelForCausalLM, AutoTokenizer

def get_max_memory_per_gpu_dict(dtype, model_name):
    """try to generate the memory map based on what we know about the model and the available hardware"""

    # figure out the memory map - the minimum per gpu required to load the model
    n_gpus = torch.cuda.device_count()

    try:
        # model_params calculation, as we don't have a model yet to do:
        # model_params = sum(dict((p.data_ptr(), p.numel()) for p in model.parameters()).values())

        config = AutoConfig.from_pretrained(model_name)
        h = config.hidden_size
        l = config.n_layer
        v = config.vocab_size
        # from https://github.com/bigscience-workshop/bigscience/tree/6917a3b5fefcf439d3485ca184b4d9f6ab605150/math#model-sizing
        model_params = l * (12 * h ** 2 + 13 * h) + v * h + 4 * h
    except:
        logging.info(f"The model {model_name} has a broken config file. Please notify the owner")
        raise

    if dtype == torch.int8:
        bytes = 1
    else:
        bytes = torch.finfo(dtype).bits / 8
    param_memory_total_in_bytes = model_params * bytes
    # add 10% headroom since weight shards aren't evenly sized and some GPUs may need more memory
    param_memory_per_gpu_in_bytes = int(param_memory_total_in_bytes / n_gpus * 1.10)
    logging.info(f"Estimating {param_memory_per_gpu_in_bytes / 2 ** 30:0.2f}GB per gpu for weights")

    # check the real available memory
    # load cuda kernels first and only measure the real free memory after loading (shorter by ~2GB)
    torch.ones(1).cuda()
    max_memory_per_gpu_in_bytes = torch.cuda.mem_get_info(0)[0]
    if max_memory_per_gpu_in_bytes < param_memory_per_gpu_in_bytes:
        raise ValueError(
            f"Unable to generate the memory map automatically as the needed estimated memory per gpu ({param_memory_per_gpu_in_bytes / 2 ** 30:0.2f}GB) is bigger than the available per gpu memory ({max_memory_per_gpu_in_bytes / 2 ** 30:0.2f}GB)"
        )

    max_memory_per_gpu = {i: param_memory_per_gpu_in_bytes for i in range(torch.cuda.device_count())}
    print("Max memory per gpu:", max_memory_per_gpu)
    return max_memory_per_gpu


def load_model():
    world_size = torch.cuda.device_count()
    model_name = "bigscience/bloom"
    logging.info(f"Using {world_size} gpus")
    logging.info(f"Loading model {model_name}")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    dtype = torch.int8
    kwargs = dict(
        device_map="auto",
        max_memory=get_max_memory_per_gpu_dict(dtype, model_name),
    )
    logging.info("Using `load_in_8bit=True` to use quanitized model")
    kwargs["load_in_8bit"] = True
    model = AutoModelForCausalLM.from_pretrained(model_name, **kwargs)
    return model, tokenizer

model, tokenizer = load_model()

model.save_pretrained("int8_model/", max_shard_size="8GB")
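
As a sanity check of the sizing formula above, plugging in the published bigscience/bloom config values (hidden_size=14336, n_layer=70, vocab_size=250880; these numbers are assumed from the hub config, not stated in this report) gives roughly 176B parameters, i.e. about 22-23 GiB of int8 weights per GPU on the 8-GPU node:

# Sanity check of the sizing formula, assuming the published bigscience/bloom
# config values (hidden_size=14336, n_layer=70, vocab_size=250880).
h, l, v = 14336, 70, 250880
model_params = l * (12 * h ** 2 + 13 * h) + v * h + 4 * h
print(f"{model_params / 1e9:.1f}B parameters")              # ~176.2B
param_memory_total_in_bytes = model_params * 1              # 1 byte per param for int8
per_gpu = int(param_memory_total_in_bytes / 8 * 1.10)       # 8 GPUs + 10% headroom
print(f"{per_gpu / 2 ** 30:.1f} GiB of weights per GPU")    # ~22.6 GiB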

When loading the model back from that directory, the following error is raised during model initialization:

RuntimeError: Only Tensors of floating point dtype can require gradients

import torch
from transformers import AutoModelForCausalLM

model_name = 'int8_model/'
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.int8)

Expected behavior

Loading should succeed. In the meantime, looking for a workaround.
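
The error itself comes from passing `torch_dtype=torch.int8` to `from_pretrained`: the weights are instantiated as `nn.Parameter`s with `requires_grad=True`, and PyTorch only allows gradients on floating point tensors. One possible interim workaround (a sketch, not a confirmed fix, and it assumes the original checkpoint is still reachable) is to skip the serialized int8 weights entirely and re-quantize from the original checkpoint at load time, the same way the save script does:

# Workaround sketch: re-quantize on the fly from the original checkpoint
# instead of reloading the serialized int8 weights. Assumes the original
# bigscience/bloom checkpoint (or a local fp16 copy of it) is available.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    load_in_8bit=True,  # bitsandbytes applies INT8 quantization at load time
)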
