[ROCm] Directly access scalars if largeBar is enabled#177023
[ROCm] Directly access scalars if largeBar is enabled#177023pschlan-amd wants to merge 12 commits intopytorch:mainfrom
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177023
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit b5351d6 with merge base c8917e2 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
77d818d to
3e51a29
Compare
3e51a29 to
107ced3
Compare
|
Need to wait on this PR until #173330 is relanded. |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 1 jobs have failed, first few of them are: trunk / win-vs2022-cuda12.8-py3 / build Details for Dev Infra teamRaised by workflow job |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
When using ROCm and the device has largeBar enabled, we can read scalars directly from GPU memory. That way, there's no need for a malloc and memcpy.
## Benchmark
Simple synthetic benchmark:
```python
import torch
import time
device = torch.device("cuda")
N = 1000000
tensor = torch.arange(N)
tensor = tensor.to(device)
start = time.time()
for i in range(0, N):
x = tensor[i].item()
assert(x == i)
end = time.time()
print('retrieving {} scalars took {} s'.format(N, end - start))
```
Result before the change on MI-355X:
```
# python test-noprof.py
retrieving 1000000 scalars took 15.402856349945068 s
```
After the change:
```
# python test-noprof.py
retrieving 1000000 scalars took 2.8192553520202637 s
```
Pull Request resolved: #177023
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Xia-Weiwen <12522207+Xia-Weiwen@users.noreply.github.com>
When using ROCm and the device has largeBar enabled, we can read scalars directly from GPU memory. That way, there's no need for a malloc and memcpy.
## Benchmark
Simple synthetic benchmark:
```python
import torch
import time
device = torch.device("cuda")
N = 1000000
tensor = torch.arange(N)
tensor = tensor.to(device)
start = time.time()
for i in range(0, N):
x = tensor[i].item()
assert(x == i)
end = time.time()
print('retrieving {} scalars took {} s'.format(N, end - start))
```
Result before the change on MI-355X:
```
# python test-noprof.py
retrieving 1000000 scalars took 15.402856349945068 s
```
After the change:
```
# python test-noprof.py
retrieving 1000000 scalars took 2.8192553520202637 s
```
Pull Request resolved: pytorch#177023
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
When using ROCm and the device has largeBar enabled, we can read scalars directly from GPU memory. That way, there's no need for a malloc and memcpy.
## Benchmark
Simple synthetic benchmark:
```python
import torch
import time
device = torch.device("cuda")
N = 1000000
tensor = torch.arange(N)
tensor = tensor.to(device)
start = time.time()
for i in range(0, N):
x = tensor[i].item()
assert(x == i)
end = time.time()
print('retrieving {} scalars took {} s'.format(N, end - start))
```
Result before the change on MI-355X:
```
# python test-noprof.py
retrieving 1000000 scalars took 15.402856349945068 s
```
After the change:
```
# python test-noprof.py
retrieving 1000000 scalars took 2.8192553520202637 s
```
Pull Request resolved: #177023
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
When using ROCm and the device has largeBar enabled, we can read scalars directly from GPU memory. That way, there's no need for a malloc and memcpy.
## Benchmark
Simple synthetic benchmark:
```python
import torch
import time
device = torch.device("cuda")
N = 1000000
tensor = torch.arange(N)
tensor = tensor.to(device)
start = time.time()
for i in range(0, N):
x = tensor[i].item()
assert(x == i)
end = time.time()
print('retrieving {} scalars took {} s'.format(N, end - start))
```
Result before the change on MI-355X:
```
# python test-noprof.py
retrieving 1000000 scalars took 15.402856349945068 s
```
After the change:
```
# python test-noprof.py
retrieving 1000000 scalars took 2.8192553520202637 s
```
Pull Request resolved: pytorch#177023
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
When using ROCm and the device has largeBar enabled, we can read scalars directly from GPU memory. That way, there's no need for a malloc and memcpy.
## Benchmark
Simple synthetic benchmark:
```python
import torch
import time
device = torch.device("cuda")
N = 1000000
tensor = torch.arange(N)
tensor = tensor.to(device)
start = time.time()
for i in range(0, N):
x = tensor[i].item()
assert(x == i)
end = time.time()
print('retrieving {} scalars took {} s'.format(N, end - start))
```
Result before the change on MI-355X:
```
# python test-noprof.py
retrieving 1000000 scalars took 15.402856349945068 s
```
After the change:
```
# python test-noprof.py
retrieving 1000000 scalars took 2.8192553520202637 s
```
Pull Request resolved: pytorch#177023
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
When using ROCm and the device has largeBar enabled, we can read scalars directly from GPU memory. That way, there's no need for a malloc and memcpy.
Benchmark
Simple synthetic benchmark:
Result before the change on MI-355X:
After the change:
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang