[Qwen2Audio] handle input ids expansion during processing by eustlb · Pull Request #35534 · huggingface/transformers

eustlb · 2025-01-06T15:54:06Z

What does this PR do?

This PR adds input_ids expansion to the processor.

Before:

1. embedding of the input_ids "<|audio_bos|><|AUDIO|><|audio_eos|>Generate the caption in English:" → inputs_embeds tensor
2. creation of a new embedding tensor in the model forward, sized to account for the number of audio tokens to be merged with inputs_embeds
3. replacement of the audio embedding vectors

Now:

1. expansion of the input ids at the processing stage: "<|audio_bos|><|AUDIO|><|audio_eos|>Generate the caption in English:" → "<|audio_bos|><|AUDIO|>...<|AUDIO|><|audio_eos|>Generate the caption in English:", with as many <|AUDIO|> tokens as there are audio embedding vectors after encoding and projection
2. creation of inputs_embeds, directly with the correct shape
3. replacement of the audio embedding vectors
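
The expansion step can be sketched in plain Python. This is a minimal illustration, not the processor's actual code: `expand_audio_tokens` is a hypothetical helper, and the number of embedding vectors per audio clip is assumed to be computed elsewhere and passed in.

```python
AUDIO_TOKEN = "<|AUDIO|>"


def expand_audio_tokens(text: str, num_audio_tokens: list[int]) -> str:
    """Repeat each AUDIO_TOKEN placeholder as many times as its clip needs.

    num_audio_tokens[i] is the number of embedding vectors the i-th audio clip
    yields after encoding and projection (assumed given, not computed here).
    """
    pieces = text.split(AUDIO_TOKEN)
    num_placeholders = len(pieces) - 1
    if num_placeholders != len(num_audio_tokens):
        raise ValueError(
            f"Found {num_placeholders} audio placeholders but {len(num_audio_tokens)} audio clips."
        )
    expanded = [pieces[0]]
    for count, tail in zip(num_audio_tokens, pieces[1:]):
        expanded.append(AUDIO_TOKEN * count)  # one placeholder per audio vector
        expanded.append(tail)
    return "".join(expanded)


prompt = "<|audio_bos|><|AUDIO|><|audio_eos|>Generate the caption in English:"
expanded_prompt = expand_audio_tokens(prompt, [3])
print(expanded_prompt)
```

With three audio vectors, the single placeholder becomes three consecutive <|AUDIO|> tokens, so the tokenized prompt already has the final sequence length. The sketch assumes the text contains one unexpanded placeholder per clip; it does not handle prompts that were already expanded.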

This approach removes the unnecessary step of creating a new embedding tensor in the model forward. It also simplifies the code, since correct padding is handled directly at the processing stage.
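
Once the input ids are expanded, the replacement step becomes a position-wise overwrite. The sketch below uses plain Python lists in place of tensors to stay dependency-free; `merge_audio_embeddings` is a hypothetical name, and the token id 151646 for <|AUDIO|> is inferred from the test in this PR rather than taken from the model code. The count check mirrors the "incorrect number of audio tokens" error the PR tests for, but the exact condition and message in the model are assumptions.

```python
AUDIO_TOKEN_ID = 151646  # assumed id of <|AUDIO|>, inferred from the PR's test


def merge_audio_embeddings(input_ids, inputs_embeds, audio_embeds):
    """Overwrite the embedding at each audio placeholder with the next audio vector.

    input_ids and inputs_embeds have the same length because the placeholders
    were already expanded at the processing stage.
    """
    positions = [i for i, tok in enumerate(input_ids) if tok == AUDIO_TOKEN_ID]
    if len(positions) != len(audio_embeds):
        raise ValueError(
            f"Expected {len(audio_embeds)} audio tokens in input_ids, found {len(positions)}."
        )
    merged = list(inputs_embeds)  # shallow copy; no resizing needed
    for pos, vec in zip(positions, audio_embeds):
        merged[pos] = vec
    return merged


input_ids = [151647, AUDIO_TOKEN_ID, AUDIO_TOKEN_ID, 151648, 198]
inputs_embeds = [[float(i), float(i)] for i in range(len(input_ids))]
audio_embeds = [[9.0, 9.0], [8.0, 8.0]]
merged = merge_audio_embeddings(input_ids, inputs_embeds, audio_embeds)
```

No new tensor of a different length is built: the output has the same shape as inputs_embeds, which is the simplification described above.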

ArthurZucker

Nice!

src/transformers/models/qwen2_audio/processing_qwen2_audio.py

src/transformers/models/qwen2_audio/modeling_qwen2_audio.py

tests/models/qwen2_audio/test_modeling_qwen2_audio.py

HuggingFaceDocBuilderDev · 2025-01-06T16:46:20Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker

Thanks 🤗

ArthurZucker · 2025-01-07T10:57:01Z

tests/models/qwen2_audio/test_modeling_qwen2_audio.py

+        # test the error when incorrect number of audio tokens
+        inputs["input_ids"] = torch.tensor(
+            [
+                [
+                    151644,
+                    8948,
+                    198,
+                    2610,
+                    525,
+                    264,
+                    10950,
+                    17847,
+                    13,
+                    151645,
+                    198,
+                    151644,
+                    872,
+                    198,
+                    14755,
+                    220,
+                    16,
+                    25,
+                    220,
+                    151647,
+                ]
+                + [151646] * 200
+                + [
+                    151648,
+                    198,
+                    3838,
+                    594,
+                    429,
+                    5112,
+                    30,
+                    151645,
+                    198,
+                    151644,
+                    77091,
+                    198,
+                ]
+            ]
+        )


use # fmt: skip
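
A sketch of what the suggestion could look like: a trailing `# fmt: skip` comment tells the formatter to leave that statement alone, so the ids can stay on one line instead of one per line. The token ids are copied from the diff above; the `prefix`/`suffix` split and variable names are illustrative, and the `torch.tensor(...)` wrapper from the test is omitted to keep the snippet dependency-free.

```python
AUDIO_TOKEN_ID = 151646  # the id repeated 200 times in the test above

# Token ids before and after the audio placeholders, kept on one line each.
prefix = [151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 14755, 220, 16, 25, 220, 151647]  # fmt: skip
suffix = [151648, 198, 3838, 594, 429, 5112, 30, 151645, 198, 151644, 77091, 198]  # fmt: skip

# Batch of one sequence with 200 audio placeholder tokens, as in the test.
input_ids = [prefix + [AUDIO_TOKEN_ID] * 200 + suffix]
```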

eustlb added 4 commits January 6, 2025 16:25

add audio_token attribute to proc

1cd050c

expand input_ids

4de1294

and legacy and expanded input_ids

9b82708

test update

89d0d1b

ArthurZucker mentioned this pull request Jan 6, 2025

VLMs: major clean up 🧼 #34502

Merged

ArthurZucker reviewed Jan 6, 2025

split lines

bd4ec1c

DarkLight1337 mentioned this pull request Jan 6, 2025

[Model] Future-proof Qwen2-Audio multi-modal processor vllm-project/vllm#11776

Merged

eustlb added 4 commits January 6, 2025 19:02

add possibility not to provide eos and bos audio tokens

fa85ac4

raise errors

71ce83f

test incorrect number of audio tokens

09870d8

add example

4569047

ArthurZucker approved these changes Jan 7, 2025

eustlb and others added 3 commits January 7, 2025 16:35

fmt

d8227e4

typo

e7b4826

Merge branch 'main' into refactor-qwenaudio

416fe37

eustlb merged commit 7f76773 into huggingface:main Jan 7, 2025
4 checks passed

superfan89 mentioned this pull request Jan 18, 2025

[Bug]: Unable to serve Qwen2-audio in V1 vllm-project/vllm#12168

Closed

1 task

achartier mentioned this pull request Mar 25, 2025

chore: Handle qwen2audio inputs ids expansion during processing NVIDIA/TensorRT-LLM#3080

Merged

eustlb merged 12 commits into huggingface:main from eustlb:refactor-qwenaudio
