Closed
System Info
- transformers version: 4.44.0
- Platform: Linux-6.8.0-36-generic-x86_64-with-glibc2.39
- Python version: 3.10.14
- Huggingface_hub version: 0.23.5
- Safetensors version: 0.4.3
- Accelerate version: 0.30.1
- Accelerate config: not found
- PyTorch version (GPU?): 2.3.0+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: distributed
- GPU type: NVIDIA A100 80GB PCIe
Who can help?
@ArthurZucker
@Narsil
@SunMarc
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
This model was supervised fine-tuned (SFT) from gemma-2-9b (the base model) on my own SFT dataset:
https://huggingface.co/kaki-paper/gemma-2-9b-test-model-for-debug/tree/main
The model was trained with the default ChatML template in this format:

```
<|im_start|>user
{question}<|im_end|>
<|im_start|>assistant
{answer}
```
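The template separates the role markers and message bodies with newlines. As a point of reference, here is a tiny helper (hypothetical, not part of the repro script) that renders a single-turn prompt in this ChatML format:

```python
def build_chatml_prompt(question: str) -> str:
    """Render a single-turn ChatML prompt, leaving the assistant turn open."""
    return (
        "<|im_start|>user\n"
        f"{question}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

print(build_chatml_prompt("What is Gemma?"))
```

Note the newlines: the repro script below concatenates the markers without them, which can tokenize differently from the training-time format.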
This is the inference pipeline for this model (imports added and `skip_special_tokens` typo fixed for readability; logic unchanged):

```python
import time

import torch
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.multiprocessing.set_start_method("spawn")

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    # load_in_8bit=True,
    use_cache=False,
    attn_implementation="eager",
)
model = torch.compile(model)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_path, model_max_len=4096, truncation_side="left")

start = time.time()
terminators = [
    tokenizer.eos_token_id,
    # tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

generator = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    repetition_penalty=1.2,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=terminators,
)

with torch.no_grad():
    for i in tqdm(range(len(df))):
        print("start Eval")
        doc1 = df.iloc[i, 0]
        description = doc1
        text = ""
        text += description
        format_finished = "<|im_start|>user" + text + "<|im_end|>" + "<|im_start|>assistant"
        print("===============================================================")
        print(format_finished)
        print("===============================================================")
        encoded_input = tokenizer(format_finished, truncation=True, return_tensors="pt").to("cuda")
        prompt = tokenizer.decode(encoded_input["input_ids"][0], skip_special_tokens=False)
        output = generator(prompt, return_full_text=False)
        print(output)
        generated_answer = output[0]["generated_text"]
        df.loc[i, "generated_text"] = generated_answer
```

and got this error:
```
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [26,0,0], thread: [1,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [26,0,0], thread: [2,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [26,0,0], thread: [3,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [26,0,0], thread: [4,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [26,0,0], thread: [5,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
{...}
Traceback (most recent call last):
  File "/home/howard/workspace/new_workspace/gemma2_9b_sft/trained_gemma2_9b_eval.py", line 177, in <module>
    output = generator(prompt, return_full_text=False)
  File "/data/howard/new_workspace/.venv/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 262, in __call__
    return super().__call__(text_inputs, **kwargs)
  File "/data/howard/new_workspace/.venv/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1257, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
  File "/data/howard/new_workspace/.venv/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1264, in run_single
    model_outputs = self.forward(model_inputs, **forward_params)
  File "/data/howard/new_workspace/.venv/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1164, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "/data/howard/new_workspace/.venv/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 351, in _forward
    generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
  File "/data/howard/new_workspace/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/howard/new_workspace/.venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2024, in generate
    result = self._sample(
  File "/data/howard/new_workspace/.venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2982, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/data/howard/new_workspace/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/howard/new_workspace/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/howard/new_workspace/.venv/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/data/howard/new_workspace/.venv/lib/python3.10/site-packages/transformers/models/gemma2/modeling_gemma2.py", line 999, in forward
    outputs = self.model(
  File "/data/howard/new_workspace/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/howard/new_workspace/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/howard/new_workspace/.venv/lib/python3.10/site-packages/transformers/models/gemma2/modeling_gemma2.py", line 847, in forward
    layer_outputs = decoder_layer(
  File "/data/howard/new_workspace/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/howard/new_workspace/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/howard/new_workspace/.venv/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/data/howard/new_workspace/.venv/lib/python3.10/site-packages/transformers/models/gemma2/modeling_gemma2.py", line 599, in forward
    hidden_states = self.post_attention_layernorm(hidden_states)
  File "/data/howard/new_workspace/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/howard/new_workspace/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/howard/new_workspace/.venv/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/data/howard/new_workspace/.venv/lib/python3.10/site-packages/transformers/models/gemma2/modeling_gemma2.py", line 121, in forward
    output = self._norm(x.float())
  File "/data/howard/new_workspace/.venv/lib/python3.10/site-packages/transformers/models/gemma2/modeling_gemma2.py", line 118, in _norm
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
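The device-side assert in `IndexKernel.cu` is the signature of an out-of-range index on the GPU, typically a token id greater than or equal to the embedding table size. A toy CPU reproduction of that failure mode (pure PyTorch; nothing here comes from the Gemma model itself):

```python
import torch
import torch.nn as nn

# An embedding table with a vocabulary of 10 entries.
embed = nn.Embedding(num_embeddings=10, embedding_dim=4)

# In-range ids look up fine.
ok = embed(torch.tensor([0, 3, 9]))
print(ok.shape)  # torch.Size([3, 4])

# An id >= num_embeddings fails: on CPU this raises IndexError,
# while on CUDA it triggers an asynchronous device-side assert like the one above.
try:
    embed(torch.tensor([0, 3, 12]))
except IndexError as exc:
    print("IndexError:", exc)
```

On CUDA the bad lookup can surface later (here inside the layernorm) because kernel launches are asynchronous, which is why the traceback points at `_norm` rather than at the embedding.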
Expected behavior
I know there were similar issues before (#31848), but I still hit this error on the latest transformers version.
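One likely culprit (an assumption, not confirmed from the report): the ChatML markers `<|im_start|>`/`<|im_end|>` were added to the tokenizer during SFT without a matching `model.resize_token_embeddings(...)`, so their ids index past the embedding matrix. A quick consistency check, sketched on a tiny randomly initialized GPT-2 so it runs without downloading weights (the same calls apply to the Gemma-2 checkpoint):

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny random model standing in for the real checkpoint.
model = GPT2LMHeadModel(GPT2Config(vocab_size=16, n_positions=32,
                                   n_embd=8, n_layer=1, n_head=2))

vocab_rows = model.get_input_embeddings().num_embeddings
print(vocab_rows)  # 16

# If len(tokenizer) exceeded vocab_rows, resizing would fix the mismatch:
model.resize_token_embeddings(20)
print(model.get_input_embeddings().num_embeddings)  # 20
```

For the real setup, compare `len(tokenizer)` against `model.get_input_embeddings().num_embeddings` before generating; every prompt token id must be strictly below the latter.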