Closed
Description
System Info
- `transformers` version: 4.31.0
- Platform: Linux-5.10.112-005.ali5000.alios7.x86_64-x86_64-with-glibc2.17
- Python version: 3.8.13
- Huggingface_hub version: 0.12.1
- PyTorch version (GPU?): 1.13.1+cu117 (False)
- Tensorflow version (GPU?): 2.11.0 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
Text models: @ArthurZucker and @younesbelkada
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
In `transformers/models/gpt_neox/modeling_gpt_neox.py`, line 320:

`return self.cos_cached[:seq_len, ...].to(x.device), self.sin_cached[:seq_len, ...].to(x.device)`

The slice is missing the two leading dimensions before `seq_len`: the cached tensors have two size-1 dimensions in front of the sequence dimension, so `[:seq_len, ...]` slices the wrong axis. This does not cause a crash, because `seq_len` is always at least 1, but the cache is never actually truncated — the whole cached tensor is returned every time, which may lead to poor performance at inference.
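The slicing behavior described above can be shown with a minimal NumPy sketch. The shape `(1, 1, 2048, 64)` and the names `cos_cached`/`seq_len` are stand-ins for illustration, assuming the cache is laid out as `(1, 1, max_seq_len, dim)`:

```python
import numpy as np

max_seq_len, dim = 2048, 64
# stand-in for self.cos_cached, shaped (1, 1, max_seq_len, dim)
cos_cached = np.zeros((1, 1, max_seq_len, dim))

seq_len = 128
# buggy slice: it indexes the leading dimension (size 1),
# so any seq_len >= 1 keeps the entire cache
buggy = cos_cached[:seq_len, ...]
print(buggy.shape)  # (1, 1, 2048, 64) — nothing was trimmed
```

Because the first axis has size 1, `[:seq_len]` is a no-op for every realistic `seq_len`, which is why the bug is silent.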
Expected behavior
Maybe the right code should be:

`return self.cos_cached[:, :, :seq_len, ...].to(x.device), self.sin_cached[:, :, :seq_len, ...].to(x.device)`
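Continuing the same NumPy sketch (shapes and names are illustrative, not taken from the library), the proposed slice trims the sequence axis as intended:

```python
import numpy as np

# stand-in cache shaped (1, 1, max_seq_len, dim)
cos_cached = np.zeros((1, 1, 2048, 64))

seq_len = 128
# proposed fix: skip the two leading size-1 axes and slice the sequence axis
fixed = cos_cached[:, :, :seq_len, ...]
print(fixed.shape)  # (1, 1, 128, 64) — cache truncated to seq_len
```

With the explicit `[:, :, :seq_len, ...]` indexing, only the first `seq_len` positions of the cache are returned, which is what the downstream rotary-embedding application expects.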