[MPS] Initialize `MPSDevice::_mtl_device` property to `nil` by malfet · Pull Request #78136 · pytorch/pytorch

malfet · 2022-05-23T23:09:13Z

This prevents `import torch` from accidentally crashing on machines with no Metal devices.

Should prevent crashes reported in #77662 (comment) and https://github.com/pytorch/functorch/runs/6560056366?check_suite_focus=true

Backtrace to the crash:

(lldb) bt
* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff7202be57 libobjc.A.dylib`objc_msgSend + 23
    frame #1: 0x000000010fd9f524 libtorch_cpu.dylib`at::mps::HeapAllocator::MPSHeapAllocatorImpl::MPSHeapAllocatorImpl() + 436
    frame #2: 0x000000010fda011d libtorch_cpu.dylib`_GLOBAL__sub_I_MPSAllocator.mm + 125
    frame #3: 0x000000010ada81e3 dyld`ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&) + 535
    frame #4: 0x000000010ada85ee dyld`ImageLoaderMachO::doInitialization(ImageLoader::LinkContext const&) + 40
(lldb) up
frame #1: 0x000000010fd9f524 libtorch_cpu.dylib`at::mps::HeapAllocator::MPSHeapAllocatorImpl::MPSHeapAllocatorImpl() + 436
libtorch_cpu.dylib`at::mps::HeapAllocator::MPSHeapAllocatorImpl::MPSHeapAllocatorImpl:
->  0x10fd9f524 <+436>: movq   %rax, 0x1b0(%rbx)
    0x10fd9f52b <+443>: movw   $0x0, 0x1b8(%rbx)
    0x10fd9f534 <+452>: addq   $0x8, %rsp
    0x10fd9f538 <+456>: popq   %rbx
(lldb) disassemble 
 ...
    0x10fd9f514 <+420>: movq   0xf19ad15(%rip), %rsi     ; "maxBufferLength"
    0x10fd9f51b <+427>: movq   %r14, %rdi
    0x10fd9f51e <+430>: callq  *0xeaa326c(%rip)          ; (void *)0x00007fff7202be40: objc_msgSend

which corresponds to the `[m_device maxBufferLength]` call, where `m_device` is not initialized, in

pytorch/aten/src/ATen/mps/MPSAllocator.h

Line 171 in 2ae3c59

m_total_allocated_memory(0), m_max_buffer_size([m_device maxBufferLength]),


facebook-github-bot · 2022-05-23T23:09:19Z

🔗 Helpful links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/78136

❌ 2 New Failures

As of commit 7d3cb81 (more details on the Dr. CI page):


2/2 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages

trunk / linux-bionic-cuda10.2-py3.9-gcc7 / test (slow, 1, 1, linux.4xlarge.nvidia.gpu) (1/2)

Step: "Test"

2022-05-24T01:31:14.8993660Z FAIL [2.244s]: tes..._errors_case9_cuda (__main__.TestNNDeviceTypeCUDA)

2022-05-24T01:31:14.8989919Z     return test(*args, **kwargs)
2022-05-24T01:31:14.8990422Z   File "/opt/conda/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 1129, in wrapper
2022-05-24T01:31:14.8990797Z     fn(*args, **kwargs)
2022-05-24T01:31:14.8991304Z   File "/opt/conda/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 979, in only_fn
2022-05-24T01:31:14.8991690Z     return fn(self, *args, **kwargs)
2022-05-24T01:31:14.8992075Z   File "/var/lib/jenkins/workspace/test/test_nn.py", line 15136, in test_MaxUnpool_index_errors
2022-05-24T01:31:14.8992432Z     self.assertIn(
2022-05-24T01:31:14.8992870Z AssertionError: b'Assertion `index >= 0 && index < outputImageSize` failed' not found in b''
2022-05-24T01:31:14.8993138Z 
2022-05-24T01:31:14.8993281Z ======================================================================
2022-05-24T01:31:14.8993660Z FAIL [2.244s]: test_MaxUnpool_index_errors_case9_cuda (__main__.TestNNDeviceTypeCUDA)
2022-05-24T01:31:14.8994166Z ----------------------------------------------------------------------
2022-05-24T01:31:14.8994503Z Traceback (most recent call last):
2022-05-24T01:31:14.8995026Z   File "/opt/conda/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 1808, in wrapper
2022-05-24T01:31:14.8995417Z     method(*args, **kwargs)
2022-05-24T01:31:14.8995901Z   File "/opt/conda/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 1808, in wrapper
2022-05-24T01:31:14.8996278Z     method(*args, **kwargs)
2022-05-24T01:31:14.8996900Z   File "/opt/conda/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 377, in instantiated_test
2022-05-24T01:31:14.8997306Z     result = test(self, **param_kwargs)
2022-05-24T01:31:14.8997833Z   File "/opt/conda/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 340, in test_wrapper
2022-05-24T01:31:14.8998230Z     return test(*args, **kwargs)

trunk / macos-11-py3-x86-64 / test (default, 2, 2, macos-12) (2/2)

Step: "Test"

2022-05-24T02:14:43.8160100Z FAIL [1.363s]: tes...ented_with_fallback (__main__.TestFallbackWarning)

2022-05-24T02:14:43.8158440Z ERROR [0.016s]: test_smooth_l1_loss_reduction_sum (__main__.TestSmoothL1Loss)
2022-05-24T02:14:43.8158710Z ----------------------------------------------------------------------
2022-05-24T02:14:43.8158830Z Traceback (most recent call last):
2022-05-24T02:14:43.8158980Z   File "test_mps.py", line 1309, in test_smooth_l1_loss_reduction_sum
2022-05-24T02:14:43.8159120Z     self._smooth_l1_loss_helper(reduction="sum")
2022-05-24T02:14:43.8159260Z   File "test_mps.py", line 1287, in _smooth_l1_loss_helper
2022-05-24T02:14:43.8159640Z     input_mps = input_cpu.detach().clone().to('mps').requires_grad_()
2022-05-24T02:14:43.8159780Z RuntimeError: Invalid buffer size: 112 bytes
2022-05-24T02:14:43.8159780Z 
2022-05-24T02:14:43.8159900Z ======================================================================
2022-05-24T02:14:43.8160100Z FAIL [1.363s]: test_warn_on_not_implemented_with_fallback (__main__.TestFallbackWarning)
2022-05-24T02:14:43.8160380Z ----------------------------------------------------------------------
2022-05-24T02:14:43.8160500Z Traceback (most recent call last):
2022-05-24T02:14:43.8160670Z   File "test_mps.py", line 4086, in test_warn_on_not_implemented_with_fallback
2022-05-24T02:14:43.8160780Z     subprocess.check_output(
2022-05-24T02:14:43.8162070Z subprocess.CalledProcessError: Command '['/Users/runner/miniconda3/envs/build/bin/python', '-W', 'all', '-c', '\nimport os\n# MUST happen before pytorch\'s import\nos.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"\nimport warnings\n\nwith warnings.catch_warnings(record=True) as w:\n    import torch\n\nif len(w) > 0:\n    exit(1)\n\n# This should run just fine and raise warning about perf\nwith warnings.catch_warnings(record=True) as w:\n    torch.eye(2, device=\'mps\')\n\nif len(w) != 1:\n    exit(2)\n\n']' returned non-zero exit status 1.
2022-05-24T02:14:43.8162300Z 
2022-05-24T02:14:43.8162520Z During handling of the above exception, another exception occurred:
2022-05-24T02:14:43.8162520Z 
2022-05-24T02:14:43.8162640Z Traceback (most recent call last):
2022-05-24T02:14:43.8162810Z   File "test_mps.py", line 4094, in test_warn_on_not_implemented_with_fallback
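The escaped command in the log above is hard to read. As a rough sketch, the first half of that check (import a module in a fresh interpreter with all warnings enabled, and exit non-zero if any warning was recorded) might look like the following. The helper name `import_is_silent` and the use of `json` as a stand-in module are assumptions for illustration only; the test in the log imports `torch` after setting `PYTORCH_ENABLE_MPS_FALLBACK`, and then additionally expects exactly one warning from `torch.eye(2, device='mps')`.

```python
import subprocess
import sys
import textwrap


def import_is_silent(module_name: str) -> bool:
    """Return True if importing `module_name` in a fresh interpreter records no warnings.

    Hypothetical helper: it mirrors only the import half of the logged check.
    """
    # The child script records warnings raised during the import and
    # signals their presence through its exit status (1 = warnings seen).
    script = textwrap.dedent(
        f"""
        import importlib
        import sys
        import warnings

        with warnings.catch_warnings(record=True) as w:
            importlib.import_module({module_name!r})

        sys.exit(1 if len(w) > 0 else 0)
        """
    )
    # "-W all" enables every warning category, as in the logged command.
    result = subprocess.run([sys.executable, "-W", "all", "-c", script])
    return result.returncode == 0


if __name__ == "__main__":
    # "json" is only a stand-in for the module under test.
    print(import_is_silent("json"))
```

In the logged run the child process returned exit status 1, which in the original script means at least one warning was recorded during `import torch`.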

This comment was automatically generated by Dr. CI.

malfet · 2022-05-24T00:40:07Z

@pytorchbot merge this

pytorchmergebot · 2022-05-24T00:41:35Z

Merge failed due to Refusing to merge as mandatory check(s) linux-docs / build-docs (cpp) are pending/not yet run for rule superuser
Raised by https://github.com/pytorch/pytorch/actions/runs/2374761372

malfet · 2022-05-24T02:06:54Z

@pytorchbot merge this

github-actions · 2022-05-24T02:08:39Z

Hey @malfet.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.


…78204) This prevents `import torch` from accidentally crashing on machines with no Metal devices.

Pull Request resolved: #78136
Approved by: https://github.com/seemethere
Co-authored-by: Nikita Shulga <nshulga@fb.com>

…78136) Summary: This prevents `import torch` from accidentally crashing on machines with no Metal devices.

Pull Request resolved: #78136
Approved by: https://github.com/seemethere
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/c7ce4fcc619fab5c82071eac934b505b396ee015
Reviewed By: mehtanirav
Differential Revision: D36633657
Pulled By: malfet
fbshipit-source-id: 535c94ab2ef7cc80e22a5086e0d12453a899f53a


facebook-github-bot added the cla signed label May 23, 2022

malfet added the ciflow/trunk Trigger trunk jobs on your pull request label May 23, 2022

seemethere approved these changes May 23, 2022


malfet added the topic: bug fixes topic category label May 24, 2022

pytorchmergebot added the Merged label May 24, 2022

pytorchmergebot closed this in c7ce4fc May 24, 2022

malfet deleted the malfet-patch-11 branch May 24, 2022 02:09

pmeier mentioned this pull request May 24, 2022

MacOS unittest and binary jobs failing pytorch/vision#6070

Closed

This was referenced May 24, 2022

[MPS] Initialize MPSDevice::_mtl_device property to nil (#78136) #78204

Merged

[v.1.12.0] Release Tracker #78005

Closed

malfet mentioned this pull request May 25, 2022

Move x86 binaries builder to macos-12 to enable MPS build #77662

Closed

malfet mentioned this pull request Aug 1, 2022

[MPS] Register index.Tensor_out #82507

Closed

malfet wants to merge 1 commit into master from malfet-patch-11


4 participants
