[ROCm] Directly access scalars if largeBar is enabled by pschlan-amd · Pull Request #177023 · pytorch/pytorch

pschlan-amd · 2026-03-10T13:14:33Z

When using ROCm and the device has largeBar enabled, we can read scalars directly from GPU memory. That way, there's no need for a malloc and memcpy.

Benchmark

Simple synthetic benchmark:

import torch
import time

device = torch.device("cuda")

N = 1000000
tensor = torch.arange(N)
tensor = tensor.to(device)

start = time.time()
for i in range(0, N):
    x = tensor[i].item()
    assert(x == i)
end = time.time()

print('retrieving {} scalars took {} s'.format(N, end - start))

Result before the change on MI-355X:

# python test-noprof.py
retrieving 1000000 scalars took 15.402856349945068 s

After the change:

# python test-noprof.py
retrieving 1000000 scalars took 2.8192553520202637 s

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang

pytorch-bot · 2026-03-10T13:14:38Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177023

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit b5351d6 with merge base c8917e2 ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

linux-foundation-easycla · 2026-03-10T13:14:40Z

The committers listed above are authorized under a signed CLA.

✅ login: pschlan-amd / name: Patrick Schlangen (3d1dc0e, 75d2d54, b5351d6)

jeffdaily · 2026-03-11T22:30:02Z

Need to wait on this PR until #173330 is relanded.

jeffdaily · 2026-03-25T18:10:51Z

@pytorchbot merge

pytorchmergebot · 2026-03-25T18:13:16Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

pytorchmergebot · 2026-03-25T19:48:25Z

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / win-vs2022-cuda12.8-py3 / build

Details for Dev Infra team

Raised by workflow job

jeffdaily · 2026-03-26T14:26:46Z

@pytorchbot merge

pytorchmergebot · 2026-03-26T14:29:06Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

When using ROCm and the device has largeBar enabled, we can read scalars directly from GPU memory. That way, there's no need for a malloc and memcpy. ## Benchmark Simple synthetic benchmark: ```python import torch import time device = torch.device("cuda") N = 1000000 tensor = torch.arange(N) tensor = tensor.to(device) start = time.time() for i in range(0, N): x = tensor[i].item() assert(x == i) end = time.time() print('retrieving {} scalars took {} s'.format(N, end - start)) ``` Result before the change on MI-355X: ``` # python test-noprof.py retrieving 1000000 scalars took 15.402856349945068 s ``` After the change: ``` # python test-noprof.py retrieving 1000000 scalars took 2.8192553520202637 s ``` Pull Request resolved: #177023 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com> Co-authored-by: Xia-Weiwen <12522207+Xia-Weiwen@users.noreply.github.com>

When using ROCm and the device has largeBar enabled, we can read scalars directly from GPU memory. That way, there's no need for a malloc and memcpy. ## Benchmark Simple synthetic benchmark: ```python import torch import time device = torch.device("cuda") N = 1000000 tensor = torch.arange(N) tensor = tensor.to(device) start = time.time() for i in range(0, N): x = tensor[i].item() assert(x == i) end = time.time() print('retrieving {} scalars took {} s'.format(N, end - start)) ``` Result before the change on MI-355X: ``` # python test-noprof.py retrieving 1000000 scalars took 15.402856349945068 s ``` After the change: ``` # python test-noprof.py retrieving 1000000 scalars took 2.8192553520202637 s ``` Pull Request resolved: pytorch#177023 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>

When using ROCm and the device has largeBar enabled, we can read scalars directly from GPU memory. That way, there's no need for a malloc and memcpy. ## Benchmark Simple synthetic benchmark: ```python import torch import time device = torch.device("cuda") N = 1000000 tensor = torch.arange(N) tensor = tensor.to(device) start = time.time() for i in range(0, N): x = tensor[i].item() assert(x == i) end = time.time() print('retrieving {} scalars took {} s'.format(N, end - start)) ``` Result before the change on MI-355X: ``` # python test-noprof.py retrieving 1000000 scalars took 15.402856349945068 s ``` After the change: ``` # python test-noprof.py retrieving 1000000 scalars took 2.8192553520202637 s ``` Pull Request resolved: #177023 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>

When using ROCm and the device has largeBar enabled, we can read scalars directly from GPU memory. That way, there's no need for a malloc and memcpy. ## Benchmark Simple synthetic benchmark: ```python import torch import time device = torch.device("cuda") N = 1000000 tensor = torch.arange(N) tensor = tensor.to(device) start = time.time() for i in range(0, N): x = tensor[i].item() assert(x == i) end = time.time() print('retrieving {} scalars took {} s'.format(N, end - start)) ``` Result before the change on MI-355X: ``` # python test-noprof.py retrieving 1000000 scalars took 15.402856349945068 s ``` After the change: ``` # python test-noprof.py retrieving 1000000 scalars took 2.8192553520202637 s ``` Pull Request resolved: pytorch#177023 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>

pytorch-bot Bot added module: rocm AMD GPU support for Pytorch release notes: cuda release notes category labels Mar 10, 2026

pschlan-amd force-pushed the rocm_scalar_largebar branch from 77d818d to 3e51a29 Compare March 10, 2026 13:15

pytorchbot added the open source label Mar 10, 2026

pschlan-amd marked this pull request as ready for review March 10, 2026 13:40

pschlan-amd requested review from Aidyn-A, eqy and syed-ahmed as code owners March 10, 2026 13:40

pschlan-amd closed this Mar 11, 2026

pschlan-amd reopened this Mar 11, 2026

[ROCm] Directly access scalars if largeBar is enabled

107ced3

pschlan-amd force-pushed the rocm_scalar_largebar branch from 3e51a29 to 107ced3 Compare March 11, 2026 09:22

pytorch-bot Bot removed ciflow/trunk Trigger trunk jobs on your pull request ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 ciflow/rocm-navi31 Trigger "default" config CI on ROCm Navi31 ciflow/rocm-mi200 Trigger "default" config CI on ROCm MI200 labels Mar 11, 2026

Fix expandable segment support

40a2e0d

Merge branch 'main' into rocm_scalar_largebar

a1c16c3

soulitzer requested a review from jeffdaily March 16, 2026 21:58

soulitzer added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Mar 16, 2026

Fix build error

453b05b

jeffdaily added the ciflow/inductor-rocm-mi300 Trigger "inductor" config CI on ROCm MI300/MI325 label Mar 24, 2026

pytorch-bot Bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 25, 2026

pytorchmergebot added the merging label Mar 25, 2026

pytorchmergebot removed the merging label Mar 25, 2026

Refactor to fix Windows build issues

223a042

pschlan-amd added 3 commits March 26, 2026 11:27

Merge remote-tracking branch 'origin/main' into rocm_scalar_largebar

3d1dc0e

Fix scalar_t not being defined

75d2d54

Fix build error

b5351d6

pytorch-bot Bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 26, 2026

pytorchmergebot added the merging label Mar 26, 2026

pytorchmergebot added the Merged label Mar 26, 2026

pytorchmergebot closed this in c835ce3 Mar 26, 2026

pytorchmergebot removed the merging label Mar 26, 2026

jeffdaily mentioned this pull request Apr 3, 2026

[ROCm] Fix MIOpen CTC loss crash on Windows #179264

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ROCm] Directly access scalars if largeBar is enabled#177023

[ROCm] Directly access scalars if largeBar is enabled#177023
pschlan-amd wants to merge 12 commits intopytorch:mainfrom
pschlan-amd:rocm_scalar_largebar

pschlan-amd commented Mar 10, 2026 •

edited by pytorch-bot Bot

Loading

Uh oh!

pytorch-bot Bot commented Mar 10, 2026 •

edited

Loading

Uh oh!

linux-foundation-easycla Bot commented Mar 10, 2026 •

edited

Loading

Uh oh!

jeffdaily commented Mar 11, 2026

Uh oh!

jeffdaily commented Mar 25, 2026

Uh oh!

pytorchmergebot commented Mar 25, 2026

Uh oh!

pytorchmergebot commented Mar 25, 2026

Uh oh!

jeffdaily commented Mar 26, 2026

Uh oh!

pytorchmergebot commented Mar 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

pschlan-amd commented Mar 10, 2026 • edited by pytorch-bot Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark

Uh oh!

pytorch-bot Bot commented Mar 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177023

✅ No Failures

Uh oh!

linux-foundation-easycla Bot commented Mar 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeffdaily commented Mar 11, 2026

Uh oh!

jeffdaily commented Mar 25, 2026

Uh oh!

pytorchmergebot commented Mar 25, 2026

Merge started

Uh oh!

pytorchmergebot commented Mar 25, 2026

Merge failed

Uh oh!

jeffdaily commented Mar 26, 2026

Uh oh!

pytorchmergebot commented Mar 26, 2026

Merge started

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

pschlan-amd commented Mar 10, 2026 •

edited by pytorch-bot Bot

Loading

pytorch-bot Bot commented Mar 10, 2026 •

edited

Loading

linux-foundation-easycla Bot commented Mar 10, 2026 •

edited

Loading