
[ROCm] Directly access scalars if largeBar is enabled #177023

Closed
pschlan-amd wants to merge 12 commits into pytorch:main from pschlan-amd:rocm_scalar_largebar

Conversation

pschlan-amd (Contributor) commented Mar 10, 2026

When using ROCm and the device has largeBar enabled, scalars can be read directly from GPU memory, which removes the need for a host-side allocation and an explicit memcpy for every scalar read.
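
For context on the mechanism: on a large-BAR system the GPU's VRAM is mapped into the host address space, so a device pointer can be dereferenced directly from the CPU instead of being staged through hipMemcpy. The C++/HIP sketch below only illustrates the idea and is not the actual PyTorch code path; the `has_large_bar` and `read_scalar` helpers are made up for illustration, while `hipGetDeviceProperties`, `hipDeviceProp_t::isLargeBar`, `hipDeviceSynchronize`, and `hipMemcpy` are the public HIP APIs being assumed.

```cpp
// Conceptual sketch only -- not the actual PyTorch implementation.
#include <hip/hip_runtime.h>
#include <cstddef>
#include <cstring>

// Query whether the device exposes all of its VRAM over the PCIe BAR,
// which makes device allocations directly readable from the host.
static bool has_large_bar(int device) {
  hipDeviceProp_t prop{};
  if (hipGetDeviceProperties(&prop, device) != hipSuccess) return false;
  return prop.isLargeBar != 0;
}

// Read `nbytes` of a scalar from device memory into `dst` on the host.
static void read_scalar(void* dst, const void* device_src, size_t nbytes, int device) {
  // Stand-in for the stream synchronization that .item() performs anyway:
  // the scalar must be fully written before the host reads it.
  (void)hipDeviceSynchronize();
  if (has_large_bar(device)) {
    // Large BAR: the device pointer is host-dereferenceable, so a plain
    // memcpy avoids the staging allocation and the hipMemcpy round trip.
    std::memcpy(dst, device_src, nbytes);
  } else {
    // Fallback: conventional device-to-host copy.
    (void)hipMemcpy(dst, device_src, nbytes, hipMemcpyDeviceToHost);
  }
}
```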

Benchmark

Simple synthetic benchmark:

```python
import torch
import time

device = torch.device("cuda")

N = 1000000
tensor = torch.arange(N)
tensor = tensor.to(device)

start = time.time()
for i in range(N):
    x = tensor[i].item()
    assert x == i
end = time.time()

print(f"retrieving {N} scalars took {end - start} s")
```

Result before the change on an MI355X:

```
# python test-noprof.py
retrieving 1000000 scalars took 15.402856349945068 s
```

After the change:

```
# python test-noprof.py
retrieving 1000000 scalars took 2.8192553520202637 s
```

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang

pytorch-bot (Bot) commented Mar 10, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177023

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit b5351d6 with merge base c8917e2:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot (Bot) added the module: rocm and release notes: cuda labels Mar 10, 2026
linux-foundation-easycla (Bot) commented Mar 10, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

pschlan-amd force-pushed the rocm_scalar_largebar branch from 77d818d to 3e51a29 on March 10, 2026 13:15
pschlan-amd marked this pull request as ready for review March 10, 2026 13:40
jeffdaily added the release notes: rocm, ciflow/rocm-mi300, ciflow/rocm-navi31, ciflow/rocm-mi200, and ciflow/trunk labels and removed the release notes: cuda label Mar 10, 2026
pschlan-amd reopened this Mar 11, 2026
pschlan-amd force-pushed the rocm_scalar_largebar branch from 3e51a29 to 107ced3 on March 11, 2026 09:22
pytorch-bot (Bot) removed the ciflow/trunk, ciflow/rocm-mi300, ciflow/rocm-navi31, and ciflow/rocm-mi200 labels Mar 11, 2026
jeffdaily (Collaborator) commented:

Need to wait on this PR until #173330 is relanded.

soulitzer requested a review from jeffdaily March 16, 2026 21:58
soulitzer added the triaged label Mar 16, 2026
jeffdaily added the ciflow/inductor-rocm-mi300 label Mar 24, 2026
jeffdaily (Collaborator) commented:

@pytorchbot merge

pytorch-bot (Bot) added the ciflow/trunk label Mar 25, 2026
pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot (Collaborator) commented:

Merge failed

Reason: 1 job failed: trunk / win-vs2022-cuda12.8-py3 / build

Details for Dev Infra team (raised by workflow job)

pytorch-bot (Bot) removed the ciflow/trunk, ciflow/rocm-mi300, ciflow/rocm-mi200, ciflow/inductor-rocm-mi200, and ciflow/inductor-rocm-mi300 labels Mar 26, 2026
jeffdaily (Collaborator) commented:

@pytorchbot merge

pytorch-bot (Bot) added the ciflow/trunk label Mar 26, 2026
pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Copilot AI pushed a commit that referenced this pull request Mar 27, 2026
AaronWang04 pushed a commit to AaronWang04/pytorch that referenced this pull request Mar 31, 2026
pytorch-bot Bot pushed a commit that referenced this pull request Apr 2, 2026
IvanKobzarev pushed a commit to IvanKobzarev/pytorch that referenced this pull request Apr 3, 2026
nklshy-aws pushed a commit to nklshy-aws/pytorch that referenced this pull request Apr 7, 2026

Labels

ciflow/trunk, Merged, module: rocm, open source, release notes: rocm, triaged

5 participants