[BE][Attention] Code de-dup #139784
Conversation
The only difference between `convert_boolean_attn_mask_cudnn` and `convert_boolean_attn_mask` is the value we initialize the boolean tensor to. Reduce duplication by introducing `convert_boolean_attn_mask_` that takes a `neg_inf` value, making the abovementioned implementations trivial one-line calls. [ghstack-poisoned]
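A Python sketch of the refactor described above (the real change is in ATen C++; the helper name `convert_boolean_attn_mask_` is from the PR, but the exact finite value used by the cuDNN variant is an assumption here, with `torch.finfo(dtype).min` standing in):

```python
import math
import torch

def convert_boolean_attn_mask_(attn_mask, dtype, neg_inf):
    # Shared helper: True means "keep" a position, False means "mask it out".
    # Non-boolean masks pass through unchanged.
    if attn_mask is None or attn_mask.dtype != torch.bool:
        return attn_mask
    return torch.where(attn_mask,
                       torch.tensor(0.0, dtype=dtype),
                       torch.tensor(neg_inf, dtype=dtype))

def convert_boolean_attn_mask(attn_mask, dtype):
    # Generic path uses true -inf.
    return convert_boolean_attn_mask_(attn_mask, dtype, -math.inf)

def convert_boolean_attn_mask_cudnn(attn_mask, dtype):
    # Assumption: the cuDNN path wants a large finite negative value
    # rather than -inf.
    return convert_boolean_attn_mask_(attn_mask, dtype, torch.finfo(dtype).min)
```

With the shared helper, each public function reduces to a one-line call that differs only in the `neg_inf` argument, which is the point of the de-dup.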
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139784
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit 2ed9300 with merge base 4d5cc1b. This comment was automatically generated by Dr. CI and updates every 15 minutes.
```diff
   // to mask *out*.
   if (attn_mask->dtype() == at::kBool) {
-    return at::where(attn_mask->logical_not(), -std::numeric_limits<double>::infinity(), at::scalar_tensor(0.0, at::TensorOptions().dtype(dtype).device(attn_mask->device())));
+    return at::where(*attn_mask, 0.0, at::scalar_tensor(neg_inf, at::TensorOptions().dtype(dtype).device(attn_mask->device())));
```
we should probably also have 0.0 be a scalar tensor, right? I am not sure why this didn't also need to be updated
I suspect only one scalar_tensor is needed to preserve the output dtype; the other operand can be inferred from the first.
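The review exchange above can be checked directly in Python: the inverted-mask form and the swapped-branch form produce identical tensors, and a plain Python `0.0` is promoted to the dtype of the tensor operand, so a second scalar tensor is not strictly required (a quick sketch, not the ATen code itself):

```python
import torch

mask = torch.tensor([[True, False], [False, True]])
neg_inf = torch.tensor(float("-inf"), dtype=torch.float32)

# Original form: invert the mask, then write -inf where the inverted mask is True.
a = torch.where(mask.logical_not(), neg_inf, torch.tensor(0.0))
# Simplified form suggested in review: drop logical_not() and swap the branches.
b = torch.where(mask, torch.tensor(0.0), neg_inf)
# Plain Python scalar: the result dtype is inferred from the tensor operand.
c = torch.where(mask, 0.0, neg_inf)
```

All three produce the same float32 tensor, which supports keeping a single scalar tensor for the `neg_inf` operand.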
@pytorchbot merge

Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The only difference between `convert_boolean_attn_mask_cudnn` and `convert_boolean_attn_mask` is the value we initialize the boolean tensor to. Reduce duplication by introducing `convert_boolean_attn_mask_` that takes a `neg_inf` value, making the abovementioned implementations trivial one-line calls. Also, as suggested by Skylion007, replace `at::where(foo->logical_not(), -inf, 0)` with `at::where(*foo, 0, -inf)`. [ghstack-poisoned]
Merge failed. Reason: New commits were pushed while merging. Please rerun the merge command. (Details for Dev Infra team: raised by workflow job.)
@pytorchbot merge

Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Maybe I'm missing some vital piece of information, but it feels like

```c++
const auto neg_inf = at::scalar_tensor(-std::numeric_limits<float>::infinity(), at::TensorOptions().dtype(out.dtype()).device(out.device()));
const auto masked = self.eq(neg_inf);
```

should be equivalent to a [`torch.isneginf`](https://pytorch.org/docs/stable/generated/torch.isneginf.html) call.

Pull Request resolved: #139763
Approved by: https://github.com/Skylion007
ghstack dependencies: #139788, #139784
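The suggested equivalence is easy to verify from Python (a small check, not the kernel code; note both forms agree even for NaN inputs):

```python
import torch

x = torch.tensor([float("-inf"), -1.0, 0.0, float("inf"), float("nan")])

# Comparison against a -inf scalar tensor, as in the C++ snippet above...
neg_inf = torch.tensor(float("-inf"), dtype=x.dtype)
masked = x.eq(neg_inf)
# ...versus the dedicated op.
via_isneginf = torch.isneginf(x)
```

Both return the same boolean mask: `NaN == -inf` is False, matching `isneginf`'s behavior on NaN.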
As MacOS-15 or newer supports those out of the box. This significantly reduces memory requirements and improves performance for some stable diffusion networks.
Test plan: Run
```python
from diffusers import StableDiffusionXLPipeline, AutoencoderKL, EulerAncestralDiscreteScheduler
import torch
import time
vae = AutoencoderKL.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0",
subfolder='vae',
torch_dtype=torch.bfloat16,
force_upcast=False).to('mps')
pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", vae=vae,
torch_dtype=torch.bfloat16, variant="fp16").to('mps')
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
start_time = time.time()
start_mps_mem = torch.mps.driver_allocated_memory()
image = pipe(prompt="Spherical cow in vacuum",
num_inference_steps=10,
guidance_scale=8,
generator=torch.Generator("mps").manual_seed(42),
).images[0]
end_mps_mem = torch.mps.driver_allocated_memory()
run_time = time.time() - start_time
print(f"run time in {run_time:.2f} sec, end_mps_mem {end_mps_mem/1024.0**2:.2f} Mb mem increase {(end_mps_mem-start_mps_mem)/1024.0**2:.2f} Mb")
image.save('bfloat16.png')
```
Before the change, total memory use was 16Gb and the run needed 65 sec to complete; after it, usage drops to 14Gb and the run takes 50 sec to finish on an M2 Pro, though the generated image remains the same:

Fixes #139389
Pull Request resolved: #139791
Approved by: https://github.com/drisspg, https://github.com/Skylion007
ghstack dependencies: #139788, #139784, #139763
Pull Request resolved: pytorch#139784
Approved by: https://github.com/Skylion007, https://github.com/drisspg
ghstack dependencies: pytorch#139788
Stack from ghstack (oldest at bottom):

- `isneginf` #139763

The only difference between `convert_boolean_attn_mask_cudnn` and `convert_boolean_attn_mask` is the value we initialize the boolean tensor to.

Reduce duplication by introducing `convert_boolean_attn_mask_` that takes a `neg_inf` value, making the abovementioned implementations trivial one-line calls.

Also, as suggested by @Skylion007, replace `at::where(foo->logical_not(), -inf, 0)` with `at::where(*foo, 0, -inf)`.