[MPS] Initialize `MPSDevice::_mtl_device` property to `nil` (#78136) by atalman · Pull Request #78204 · pytorch/pytorch

atalman · 2022-05-24T20:54:29Z

This prevents `import torch` from accidentally crashing on machines with no Metal devices.

Should prevent crashes reported in #77662 (comment) and https://github.com/pytorch/functorch/runs/6560056366?check_suite_focus=true

Backtrace to the crash:

(lldb) bt
* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff7202be57 libobjc.A.dylib`objc_msgSend + 23
    frame #1: 0x000000010fd9f524 libtorch_cpu.dylib`at::mps::HeapAllocator::MPSHeapAllocatorImpl::MPSHeapAllocatorImpl() + 436
    frame #2: 0x000000010fda011d libtorch_cpu.dylib`_GLOBAL__sub_I_MPSAllocator.mm + 125
    frame #3: 0x000000010ada81e3 dyld`ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&) + 535
    frame #4: 0x000000010ada85ee dyld`ImageLoaderMachO::doInitialization(ImageLoader::LinkContext const&) + 40
(lldb) up
frame #1: 0x000000010fd9f524 libtorch_cpu.dylib`at::mps::HeapAllocator::MPSHeapAllocatorImpl::MPSHeapAllocatorImpl() + 436
libtorch_cpu.dylib`at::mps::HeapAllocator::MPSHeapAllocatorImpl::MPSHeapAllocatorImpl:
->  0x10fd9f524 <+436>: movq   %rax, 0x1b0(%rbx)
    0x10fd9f52b <+443>: movw   $0x0, 0x1b8(%rbx)
    0x10fd9f534 <+452>: addq   $0x8, %rsp
    0x10fd9f538 <+456>: popq   %rbx
(lldb) disassemble
 ...
    0x10fd9f514 <+420>: movq   0xf19ad15(%rip), %rsi     ; "maxBufferLength"
    0x10fd9f51b <+427>: movq   %r14, %rdi
    0x10fd9f51e <+430>: callq  *0xeaa326c(%rip)          ; (void *)0x00007fff7202be40: objc_msgSend

This corresponds to the `[m_device maxBufferLength]` call, where `m_device` is not initialized, in

pytorch/aten/src/ATen/mps/MPSAllocator.h

Line 171 in 2ae3c59

m_total_allocated_memory(0), m_max_buffer_size([m_device maxBufferLength]),
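The failure mode above can be sketched in plain C++. This is only an analogy under stated assumptions, not the actual Objective-C++ from `MPSAllocator.h`: `FakeDevice`, `DeviceHolder`, and `Allocator` are hypothetical names standing in for the Metal device handle, `MPSDevice`, and the heap allocator. The idea is that a handle explicitly initialized to null makes "no device found" a well-defined state, so a query made during construction can return 0 instead of reading through a garbage pointer. In Objective-C, sending a message to `nil` returns 0 without an explicit check; the C++ sketch has to spell that check out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a Metal device handle (not the real MTLDevice API).
struct FakeDevice {
    std::uint64_t max_buffer_length;
};

// Hypothetical stand-in for MPSDevice. The handle defaults to nullptr, so a
// machine with no device yields a well-defined "empty" state.
class DeviceHolder {
public:
    DeviceHolder() = default;
    explicit DeviceHolder(FakeDevice* found) : device_(found) {}
    FakeDevice* device() const { return device_; }

private:
    FakeDevice* device_ = nullptr;  // the explicit initialization is the point
};

// Hypothetical stand-in for the allocator, which queries the device while it
// is being constructed (as the member initializer quoted above does).
class Allocator {
public:
    explicit Allocator(const DeviceHolder& holder)
        : device_(holder.device()),
          max_buffer_size_(query_max_buffer_length(device_)) {}

    bool has_device() const { return device_ != nullptr; }
    std::uint64_t max_buffer_size() const { return max_buffer_size_; }

private:
    // Mirrors Objective-C nil-messaging semantics: a null handle yields 0.
    static std::uint64_t query_max_buffer_length(const FakeDevice* d) {
        return d != nullptr ? d->max_buffer_length : 0;
    }

    FakeDevice* device_;             // declared first, so initialized first
    std::uint64_t max_buffer_size_;  // safe to compute from device_
};

// Constructing the allocator with no device must not crash.
inline void check_no_device_case() {
    DeviceHolder none;
    Allocator alloc(none);
    assert(!alloc.has_device());
    assert(alloc.max_buffer_size() == 0);
}
```

Because the allocator in the backtrace is constructed from a global initializer (`_GLOBAL__sub_I_MPSAllocator.mm`), this path runs as soon as the library is loaded, which is why the handle needs a defined value even on machines that never use MPS.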

Pull Request resolved: #78136
Approved by: https://github.com/seemethere

facebook-github-bot · 2022-05-24T20:54:36Z

🔗 Helpful links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/78204

❌ 1 New Failures

As of commit affdb65 (more details on the Dr. CI page):

1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages

pull / linux-bionic-rocm5.1-py3.7 / test (default, 2, 2, linux.rocm.gpu) (1/1)

Step: "Test"

2022-05-24T22:57:45.8325649Z RuntimeError: test_sparse_csr failed!

2022-05-24T22:57:42.3587604Z 
2022-05-24T22:57:42.3587865Z Generating XML reports...
2022-05-24T22:57:42.6221853Z Generated XML report: test-reports/python-unittest/test_sparse_csr/TEST-TestSparseCSRCUDA-20220524225709.xml
2022-05-24T22:57:42.6223824Z Generated XML report: test-reports/python-unittest/test_sparse_csr/TEST-TestSparseCSRSampler-20220524225709.xml
2022-05-24T22:57:42.6793143Z Generated XML report: test-reports/python-unittest/test_sparse_csr/TEST-TestSparseCompressedCUDA-20220524225709.xml
2022-05-24T22:57:45.8312019Z Traceback (most recent call last):
2022-05-24T22:57:45.8312826Z   File "test/run_test.py", line 1074, in <module>
2022-05-24T22:57:45.8318550Z     main()
2022-05-24T22:57:45.8319225Z   File "test/run_test.py", line 1052, in main
2022-05-24T22:57:45.8324897Z     raise RuntimeError(err_message)
2022-05-24T22:57:45.8325649Z RuntimeError: test_sparse_csr failed!
2022-05-24T22:57:47.8139845Z 
2022-05-24T22:57:47.8140864Z real	42m42.613s
2022-05-24T22:57:47.8141481Z user	77m7.536s
2022-05-24T22:57:47.8142019Z sys	46m11.644s
2022-05-24T22:57:47.8142571Z + cleanup
2022-05-24T22:57:47.8143108Z + retcode=1
2022-05-24T22:57:47.8143656Z + set +x
2022-05-24T22:57:47.8268985Z ##[error]Process completed with exit code 1.
2022-05-24T22:57:47.8346166Z ##[group]Run # copy test results back to the mounted workspace, needed sudo, resulting permissions were correct

This comment was automatically generated by Dr. CI (expand for details).

facebook-github-bot added the cla signed label May 24, 2022

atalman mentioned this pull request May 24, 2022

[v.1.12.0] Release Tracker #78005

Closed

malfet approved these changes May 24, 2022

atalman merged commit 2ad18ab into pytorch:release/1.12 May 24, 2022

atalman merged 1 commit into pytorch:release/1.12 from atalman:cherry_pick_3
