Performance Regression from commit 7dcd870 #22683

@fpgaminer

Description

@fpgaminer

System Info

  • transformers version: 4.28.0.dev0 (656e869)
  • Platform: Linux-5.15.0-67-generic-x86_64-with-glibc2.35
  • Python version: 3.10.10
  • Huggingface_hub version: 0.13.4
  • Safetensors version: 0.3.0
  • PyTorch version (GPU?): 2.0.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: True
  • Using distributed or parallel set-up in script?: False

Who can help?

@ArthurZucker @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I have a benchmark script that measures the generation speed of different LLaMA models. Before commit 7dcd870, my generation speed averaged around 48 tokens/s in ideal cases on an RTX 3090. After that commit, the average speed is 43 tokens/s.

The specific issue seems to be the change to apply_rotary_pos_emb. My guess is that the culprit is the switch from a rather simple slicing of two tensors to scatter-gather style indexing.

To test my theory, I patched apply_rotary_pos_emb back to its pre-7dcd870 state and minimally modified LlamaAttention accordingly, with no other changes. Speed jumped back to 48 tokens/s.
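For illustration, the two approaches can be sketched roughly as follows. This is a simplified approximation, not the exact transformers code; the function names, shapes, and the cos/sin cache layout are assumptions modeled on the LLaMA implementation around that commit:

```python
import torch

def rotate_half(x):
    # Rotate half the hidden dims: split the last dimension in
    # two and swap the halves with a sign flip.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_slice(q, k, cos, sin, offset=0):
    # Pre-7dcd870 style (approximate): select the needed positions
    # from the cached (1, 1, max_seq, head_dim) cos/sin tables with
    # a cheap contiguous slice, which is just a view.
    cos = cos[..., offset : offset + q.shape[-2], :]
    sin = sin[..., offset : offset + q.shape[-2], :]
    q_emb = (q * cos) + (rotate_half(q) * sin)
    k_emb = (k * cos) + (rotate_half(k) * sin)
    return q_emb, k_emb

def apply_rotary_gather(q, k, cos, sin, position_ids):
    # Post-7dcd870 style (approximate): index the tables with a
    # (batch, seq) position_ids tensor. Advanced indexing like this
    # lowers to a gather kernel and materializes a copy.
    cos = cos.squeeze(0).squeeze(0)[position_ids].unsqueeze(1)
    sin = sin.squeeze(0).squeeze(0)[position_ids].unsqueeze(1)
    q_emb = (q * cos) + (rotate_half(q) * sin)
    k_emb = (k * cos) + (rotate_half(k) * sin)
    return q_emb, k_emb
```

Both variants produce the same result when position_ids matches the sliced range, but the gather version launches extra indexing kernels on every call; with 32 decoder layers invoking this per generated token, that per-call overhead is a plausible source of the regression.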

The problem should apply generally, but the specific script I'm using is: https://github.com/fpgaminer/GPTQ-triton/blob/99ec4a3adb7fad9de33ff026bbfb64cbb3bab2f8/benchmark_generate.py
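To isolate the indexing cost from full-model generation, a minimal micro-benchmark along these lines can compare a contiguous slice against advanced indexing directly. This is a sketch under assumed shapes (a cos table sized for a 2048-context, 128-dim-head model, single-token decode); on CUDA you would additionally need torch.cuda.synchronize() around the timed regions:

```python
import time
import torch

def bench(fn, iters=1000, warmup=100):
    # Average wall-clock time per call after a warmup phase.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

# Cached cos table and a single query position (one decode step).
table = torch.randn(2048, 128)
position_ids = torch.tensor([[1023]])

t_slice = bench(lambda: table[1023:1024])      # contiguous slice (a view)
t_gather = bench(lambda: table[position_ids])  # advanced indexing (a gather/copy)
print(f"slice:  {t_slice * 1e6:.2f} us/call")
print(f"gather: {t_gather * 1e6:.2f} us/call")
```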

Expected behavior

I would not expect a 10% drop in performance.
