Error while loading a model on 8bit #21371

@toma-x

Description

I'm trying to run inference on a model that doesn't fit on my GPU, using this code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device_map = {'transformer.wte': 0,
 'transformer.drop': 0,
 'transformer.h.0': 0,
 'transformer.h.1': 0,
 'transformer.h.2': 0,
 'transformer.h.3': 0,
 'transformer.h.4': 0,
 'transformer.h.5': 0,
 'transformer.h.6': 0,
 'transformer.h.7': 0,
 'transformer.h.8': 0,
 'transformer.h.9': 0,
 'transformer.h.10': 0,
 'transformer.h.11': 0,
 'transformer.h.12': 0,
 'transformer.h.13': 0,
 'transformer.h.14': 0,
 'transformer.h.15': 0,
 'transformer.h.16': 0,
 'transformer.h.17': 0,
 'transformer.h.18': 0,
 'transformer.h.19': 0,
 'transformer.h.20': 0,
 'transformer.h.21': 0,
 'transformer.h.22': 0,
 'transformer.h.23': 'cpu',
 'transformer.h.24': 'cpu',
 'transformer.h.25': 'cpu',
 'transformer.h.26': 'cpu',
 'transformer.h.27': 'cpu',
 'transformer.ln_f': 'cpu',
 'lm_head': 'cpu'}
tokenizer = AutoTokenizer.from_pretrained("tomaxe/fr-boris-sharded")
model = AutoModelForCausalLM.from_pretrained(
    "tomaxe/fr-boris-sharded",
    load_in_8bit=True,
    load_in_8bit_skip_modules=['lm_head', 'transformer.ln_f',
                               'transformer.h.27', 'transformer.h.26',
                               'transformer.h.25', 'transformer.h.24',
                               'transformer.h.23'],
    device_map=device_map,
)
input_text = "salut"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(input_ids, max_length = 20)
print(tokenizer.decode(outputs[0]))
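As a side note, the hand-written device_map above can also be generated programmatically, which makes the GPU/CPU split easier to adjust. A minimal sketch, assuming the GPT-J module names used above (layers 0-22 on GPU 0, layers 23-27 plus the final norm and lm_head on CPU):

```python
# Same split as the hand-written dict: first 23 transformer blocks on GPU 0,
# the last 5 blocks plus ln_f and lm_head offloaded to CPU.
device_map = {'transformer.wte': 0, 'transformer.drop': 0}
device_map.update({f'transformer.h.{i}': 0 for i in range(23)})
device_map.update({f'transformer.h.{i}': 'cpu' for i in range(23, 28)})
device_map.update({'transformer.ln_f': 'cpu', 'lm_head': 'cpu'})

# The modules kept out of 8-bit quantization are exactly the CPU ones.
skip_modules = ['lm_head', 'transformer.ln_f'] + \
               [f'transformer.h.{i}' for i in range(23, 28)]
```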

And I'm running into this error:
@younesbelkada Do you know what I could do? Thanks

Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
================================================================================
CUDA SETUP: CUDA runtime path found: /home/thomas/anaconda3/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/thomas/anaconda3/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Loading checkpoint shards:   0%|          | 0/30 [00:00<?, ?it/s]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A: torch.Size([2, 4096]), B: torch.Size([4096, 4096]), C: (2, 4096); (lda, ldb, ldc): (c_int(64), c_int(131072), c_int(64)); (m, n, k): (c_int(2), c_int(4096), c_int(4096))
Traceback (most recent call last):

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/spyder_kernels/py3compat.py", line 356, in compat_exec
    exec(code, globals, locals)

  File "/home/thomas/Downloads/infersharded.py", line 46, in <module>
    outputs = model.generate(input_ids, max_length = 20)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/transformers/generation/utils.py", line 1391, in generate
    return self.greedy_search(

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/transformers/generation/utils.py", line 2179, in greedy_search
    outputs = self(

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/accelerate/hooks.py", line 156, in new_forward
    output = old_forward(*args, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/transformers/models/gptj/modeling_gptj.py", line 813, in forward
    transformer_outputs = self.transformer(

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/transformers/models/gptj/modeling_gptj.py", line 668, in forward
    outputs = block(

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/accelerate/hooks.py", line 156, in new_forward
    output = old_forward(*args, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/transformers/models/gptj/modeling_gptj.py", line 302, in forward
    attn_outputs = self.attn(

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/accelerate/hooks.py", line 156, in new_forward
    output = old_forward(*args, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/transformers/models/gptj/modeling_gptj.py", line 203, in forward
    query = self.q_proj(hidden_states)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/accelerate/hooks.py", line 156, in new_forward
    output = old_forward(*args, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/bitsandbytes/nn/modules.py", line 254, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 405, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 311, in forward
    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/bitsandbytes/functional.py", line 1410, in igemmlt
    raise Exception('cublasLt ran into an error!')

Exception: cublasLt ran into an error!

cuBLAS API failed with status 15
error detected
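For anyone landing here later: the cublasLt failure happens inside an int8 matmul, which suggests the 8-bit kernels are being invoked on modules that were offloaded to CPU. One thing that may be worth trying (an assumption on my part, not a confirmed fix from this thread, and it requires a transformers version that ships `BitsAndBytesConfig`) is to request fp32 CPU offload explicitly instead of listing skip modules by hand:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Hypothetical sketch: keep CPU-offloaded modules in fp32 so bitsandbytes
# never tries to run an int8 kernel on weights living on the CPU.
quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "tomaxe/fr-boris-sharded",
    quantization_config=quant_config,
    device_map=device_map,  # the same GPU/CPU split defined earlier
)
```

With this flag, the modules that `device_map` places on `'cpu'` are loaded in fp32 and only the GPU-resident modules are quantized, which is the supported way to mix 8-bit GPU layers with CPU offload.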
