Conversation
```python
def __init__(self, *args, **kwargs):
    is_causal = kwargs.pop("is_causal", False)
    super().__init__(*args, **kwargs)
    self.is_causal = is_causal
```
To directly provide `is_causal` to `F.scaled_dot_product_attention`.
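As a rough, self-contained sketch of what that enables (toy tensors, not the actual CLIP code): once `is_causal` is stored on the module, it can be handed straight to `F.scaled_dot_product_attention` instead of materializing a causal mask, and is equivalent to passing an explicit lower-triangular boolean mask:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(1, 2, 4, 8)  # (batch, heads, seq_len, head_dim)
k = torch.randn(1, 2, 4, 8)
v = torch.randn(1, 2, 4, 8)

# Let SDPA build the causal mask internally.
out_flag = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Equivalent explicit boolean mask (True = position may be attended to).
mask = torch.tril(torch.ones(4, 4, dtype=torch.bool))
out_mask = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

print(torch.allclose(out_flag, out_mask, atol=1e-6))  # True
```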
```python
if config._attn_implementation == "sdpa":
    self.self_attn = CLIP_ATTENTION_CLASSES[config._attn_implementation](config, is_causal=is_causal)
else:
    self.self_attn = CLIP_ATTENTION_CLASSES[config._attn_implementation](config)
```
We don't use causal masking in the vision tower of CLIP, hence this conditioning.
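A toy illustration of the idea (class names here are hypothetical stand-ins, not the real transformers classes): dispatch on the implementation string, and opt into causal masking only for the text tower:

```python
# Hypothetical mini-version of the CLIP_ATTENTION_CLASSES dispatch pattern.
class EagerAttention:
    def __init__(self, config, is_causal=False):
        self.config = config
        self.is_causal = is_causal

class SdpaAttention(EagerAttention):
    pass

ATTENTION_CLASSES = {"eager": EagerAttention, "sdpa": SdpaAttention}

def build_attention(attn_implementation, is_text_tower):
    cls = ATTENTION_CLASSES[attn_implementation]
    # Only the text tower of CLIP uses causal masking; the vision tower never does.
    return cls(config=None, is_causal=is_text_tower)

print(build_attention("sdpa", is_text_tower=True).is_causal)   # True
print(build_attention("sdpa", is_text_tower=False).is_causal)  # False
```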
```python
text_config = config.text_config
text_config._attn_implementation = config._attn_implementation
vision_config = config.vision_config
vision_config._attn_implementation = config._attn_implementation
```
The image classification class has:
If we don't propagate `vision_config._attn_implementation = config._attn_implementation`, the vision config won't have any way to know about the actual `_attn_implementation`.
However, correct me if I am wrong.
You're completely right! However, the way it's propagated in other models is by passing it in at model construction, e.g.

```python
self.text_model = CLIPTextTransformer.from_config(text_config, attn_implementation=config._attn_implementation)
```

I'd encourage doing it this way for two reasons:

- It's the pattern used for other models. If there's an issue we need to update in the code, we're more likely to find it if it matches.
- The way the attention implementation is set on the config is less than ideal and (imo) prone to unexpected behaviour. Annoyingly, there's a bunch of magic which happens in the setter, which can cause it to appear to magically change or revert back. This way, I'm sure, (atm) works; I'm not sure that setting it on the config like this and then passing it to the model will always work, depending on the attention implementation in the original config, e.g. `vision_config`.
Done in 89dab66. Keeping this comment open because it seems like an important thing for us to consider.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for working on this @sayakpaul! Let us know when it's ready for review, i.e. when all tests are passing and there are no more changes to be made.
Oh I thought the PR description made it clear. Sorry if I didn't.
So, concretely, it would be great to have some initial reviews first so that we can localize why the basic logit assertion test isn't passing.
IMO we should avoid the complex `prepare_4d` etc., and just use `_update_causal_mask`. Now, that's not super possible as it would require deprecating, but anyway, I'm not against this PR; let's just make sure we have equivalence and support dispatching to the appropriate kernels!
```python
query_states,
key_states,
value_states,
attn_mask=attention_mask,
```
You seem to only be using the attention mask, rather than combining the causal mask and the attention mask.
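A hedged, standalone sketch (toy shapes, not the actual CLIP tensors) of combining the two: build both as additive biases and sum them, so a position is visible only where both masks allow it:

```python
import torch
import torch.nn.functional as F

bsz, n_heads, seq_len, head_dim = 2, 4, 5, 8
q = torch.randn(bsz, n_heads, seq_len, head_dim)
k = torch.randn(bsz, n_heads, seq_len, head_dim)
v = torch.randn(bsz, n_heads, seq_len, head_dim)

# Padding mask: 1 = real token, 0 = padding (second sequence has one pad).
padding = torch.tensor([[1, 1, 1, 1, 1],
                        [1, 1, 1, 1, 0]])

# Additive causal mask: -inf above the diagonal.
causal = torch.full((seq_len, seq_len), float("-inf")).triu(1)

# Additive padding bias over the key dimension, broadcastable to
# (bsz, n_heads, seq_len, seq_len).
pad_bias = torch.where(padding.bool(), 0.0, float("-inf"))[:, None, None, :]

# Combine: a key is attendable only if allowed by BOTH masks.
combined = causal + pad_bias

out = F.scaled_dot_product_attention(q, k, v, attn_mask=combined)
print(out.shape)  # torch.Size([2, 4, 5, 8])
```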
@ArthurZucker if it's easier/beneficial for the library, happy to use the methods you are suggesting. But I don't get why things would need deprecating, etc. If you could provide more references, that would be helpful.
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
I disabled … However, this leads to: … Amongst these, failures for …
This passes:

```python
from transformers import AutoTokenizer, CLIPTextModel
import torch

model_sdpa = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32", attn_implementation="sdpa").to("cuda")
model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32", attn_implementation="eager").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")

inputs = tokenizer(["a photo of a cat", "a photo of a dog"], return_tensors="pt")
print(inputs["attention_mask"].tolist())
inputs = {k: v.to("cuda") for k, v in inputs.items()}

with torch.no_grad():
    outputs_sdpa = model_sdpa(**inputs)
    last_hidden_state_sdpa = outputs_sdpa.last_hidden_state
    outputs_eager = model(**inputs)
    last_hidden_state = outputs_eager.last_hidden_state

print(last_hidden_state_sdpa[0, :3, -1].flatten())
print(last_hidden_state[0, :3, -1].flatten())
print(torch.allclose(last_hidden_state_sdpa, last_hidden_state, rtol=1e-3, atol=1e-3))
```
ArthurZucker
left a comment
Thanks for the PR, could you share the expected speed boost benchmark? (and add it to the clip.md) 🤗
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
@ArthurZucker done :)
@amyeroberts am I doing something wrong when running …? Here's what I have done: …
@sayakpaul Could you try running …?
@sayakpaul Huh, weird, I haven't seen that before. I'm going to try and re-trigger a fresh CI run, as it doesn't seem to be anything related to this PR |
amyeroberts
left a comment
Just some final comments on the propagation of `attn_implementation`.
amyeroberts
left a comment
Thanks for adding this!
Sorry for the bad suggestion re `from_config` - I didn't realise it's from the autoclass. Switching back to `_from_config` as you had it before should resolve it!
@amyeroberts I had to touch a couple of loading utilities to make sure the equivalence tests pass. LMK if they should have been approached differently.
Ah oh, the Flax tests still fail the equivalence check. I tried a bunch of state-dict rejigging in order for the PT-related changes to propagate in the Flax model, but none of them worked out. Would appreciate some guidance.
Pinging @sanchit-gandhi here - who knows most of the intricacies of our flax models :) |
```python
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)

query_states = query_states.view(bsz, -1, self.num_heads, self.head_dim).transpose(1, 2)
key_states = key_states.view(bsz, -1, self.num_heads, self.head_dim).transpose(1, 2)
value_states = value_states.view(bsz, -1, self.num_heads, self.head_dim).transpose(1, 2)
```
We should be careful about using these in non-contiguous form, since torch has a bug in at least version 2.1.2. See the reference given in the Llama implementation:
transformers/src/transformers/models/llama/modeling_llama.py
Lines 637 to 642 in 6bd511a
So calling `.contiguous()` here (with the relevant checks) could be a solution, as demonstrated there. You could also move the whole projection into something like

```python
query_states = self._shape(self.q_proj(hidden_states), -1, bsz)
key_states = self._shape(self.k_proj(hidden_states), -1, bsz)
value_states = self._shape(self.v_proj(hidden_states), -1, bsz)
```

which automatically calls `.contiguous()` inside the `_shape` function.
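For reference, a minimal standalone sketch of what such a `_shape` helper does (signature simplified here; the real one is a method on the attention module):

```python
import torch

def _shape(tensor: torch.Tensor, seq_len: int, bsz: int,
           num_heads: int, head_dim: int) -> torch.Tensor:
    # Reshape (bsz, seq_len, embed_dim) -> (bsz, num_heads, seq_len, head_dim)
    # and call .contiguous() so downstream SDPA kernels are safe on the
    # affected torch versions.
    return tensor.view(bsz, seq_len, num_heads, head_dim).transpose(1, 2).contiguous()

x = torch.randn(2, 5, 32)  # (bsz, seq_len, embed_dim)
out = _shape(x, 5, 2, num_heads=4, head_dim=8)
print(out.shape, out.is_contiguous())  # torch.Size([2, 4, 5, 8]) True
```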
Closing in favor of #31940.

What does this PR do?
CLIP is heavily used in the diffusion modeling space for text encoding. So, having native SDPA support for CLIP would be beneficial for diffusion models both for training and inference.
The test failures can be tackled later once we match the logits to the non-SDPA path. Here's my test script:
These don't pass:
I inspected the `key_states`, `value_states`, and the `query_states` in `CLIPAttention` and `CLIPSdpaAttention`, respectively. In both classes, these matrices have the same values. The differences start arising from `attn_output`.

I suspect this is happening because of how masking is handled in `CLIPAttention`. We first apply the causal mask:

transformers/src/transformers/models/clip/modeling_clip.py
Line 285 in 8b02bb6

And then we apply the attention mask:

transformers/src/transformers/models/clip/modeling_clip.py
Line 295 in 8b02bb6

But I think this deviates a bit from `CLIPSdpaAttention`. Not very sure, though. Hence I am opening this PR to seek feedback.

I have added some comments inline to provide further clarification on some points.
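To localize this kind of divergence, a hedged standalone check (toy shapes, names assumed) comparing an eager-style masked softmax against SDPA can help; since both masks are additive biases, applying the causal mask first and the attention mask second is the same as summing them up front:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
bsz, heads, seq, dim = 1, 2, 4, 8
q = torch.randn(bsz, heads, seq, dim)
k = torch.randn(bsz, heads, seq, dim)
v = torch.randn(bsz, heads, seq, dim)

# Additive causal mask, then an attention (padding) mask, mirroring the
# order in CLIPAttention; the toy example has no padding.
causal = torch.full((seq, seq), float("-inf")).triu(1)
pad = torch.zeros(bsz, 1, 1, seq)

scores = q @ k.transpose(-1, -2) / dim**0.5
scores = scores + causal  # step 1: causal mask
scores = scores + pad     # step 2: attention mask
eager_out = torch.softmax(scores, dim=-1) @ v

# SDPA takes a single combined additive mask.
sdpa_out = F.scaled_dot_product_attention(q, k, v, attn_mask=causal + pad)
print(torch.allclose(eager_out, sdpa_out, atol=1e-5))  # True
```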