
[pytorch][mobile] support for custom mobile build with dynamic dispatch #34055

Closed

ljk53 wants to merge 6 commits into gh/ljk53/110/base from gh/ljk53/110/head

Conversation

Contributor

ljk53 commented Mar 2, 2020

Stack from ghstack:

Summary:
Enable custom mobile build with dynamic dispatch for the OSS build.

It calls a Python util script to calculate the transitive dependencies from the op dependency graph and the list of used root ops, then passes the result as the op registration whitelist to the ATen codegen, so that only the used ops are registered and kept at link time.

For custom build with dynamic dispatch to work correctly, it's critical to have an accurate list of used ops. The current assumption is that only the ops referenced by the TorchScript model are used. This works well as long as client code doesn't call the libtorch API (e.g. tensor methods) directly; otherwise the extra used ops need to be added to the whitelist manually, as shown by the HACK in prepare_model.py.

Also, if the JIT starts calling extra ops independently of the specific model, those extra ops need to be added to the whitelist as well.

Verified the correctness of the whole process with MobileNetV2:
```
TEST_CUSTOM_BUILD_DYNAMIC=1 test/mobile/custom_build/build.sh
```

Differential Revision: [D20193327](https://our.internmc.facebook.com/intern/diff/D20193327)
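
For reference, the dependency resolution here amounts to a transitive closure over the op dependency graph. A minimal sketch of the idea (the graph format, function name, and example ops are illustrative assumptions, not the actual util script added by this PR):

```
# Hypothetical sketch: BFS transitive closure from the model's root ops.
from collections import deque

def resolve_used_ops(dep_graph, root_ops):
    """Return root_ops plus every op transitively reachable from them."""
    used = set(root_ops)
    queue = deque(root_ops)
    while queue:
        for dep in dep_graph.get(queue.popleft(), ()):
            if dep not in used:
                used.add(dep)
                queue.append(dep)
    return sorted(used)

# Example graph: conv2d internally calls convolution, which needs empty/copy_.
dep_graph = {
    'aten::conv2d': ['aten::convolution'],
    'aten::convolution': ['aten::empty', 'aten::copy_'],
}
print(resolve_used_ops(dep_graph, ['aten::conv2d']))
# ['aten::conv2d', 'aten::convolution', 'aten::copy_', 'aten::empty']
```

The closed set is what gets passed to the codegen as the registration whitelist.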

ljk53 added a commit that referenced this pull request Mar 2, 2020
ghstack-source-id: d571825
Pull Request resolved: #34055
ljk53 requested review from bhosmer, dreiss, ezyang and iseeyuan March 2, 2020 09:26

dr-ci Bot commented Mar 2, 2020

💊 CircleCI build failures summary and remediations

As of commit 6cf4d94 (more details on the Dr. CI page):


  • 1/6 failures introduced in this PR

  • 3/6 tentatively recognized as flaky ❄️

  • 2/6 broken upstream at merge base 99e211e since Mar 04

Please rebase on the viable/strict branch:

    If your commit is newer than viable/strict, you can try basing on an older, stable commit:

```
git fetch origin viable/strict
git rebase --onto viable/strict $(git merge-base origin/master HEAD)
```

    If your commit is older than viable/strict:

```
git fetch origin viable/strict
git rebase viable/strict
```

    Check out the recency history of this "viable master" tracking branch.


🕵️ 1 new failure recognized by patterns

The following build failures do not appear to be due to upstream breakages:

See CircleCI build pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-x86_32 (1/1)

Step: "pytorch android gradle build only x86_32 (for PR)" (full log | pattern match details)

```
Mar 04 01:51:16 01:51:16.483 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] FAILED: ../../../../build/intermediates/cmake/release/obj/x86/libpytorch_jni.so
Mar 04 01:51:16 01:51:16.482 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] 	at org.gradle.internal.operations.DefaultBuildOperationExecutor.run(DefaultBuildOperationExecutor.java:91)
Mar 04 01:51:16 01:51:16.483 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] 	at org.gradle.internal.operations.DelegatingBuildOperationExecutor.run(DelegatingBuildOperationExecutor.java:31)
Mar 04 01:51:16 01:51:16.483 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] 	at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.executeAction(ExecuteActionsTaskExecuter.java:120)
Mar 04 01:51:16 01:51:16.483 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] 	at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.executeActions(ExecuteActionsTaskExecuter.java:99)
Mar 04 01:51:16 01:51:16.483 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] 	... 31 more
Mar 04 01:51:16 01:51:16.483 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] Caused by: Build command failed.
Mar 04 01:51:16 01:51:16.483 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] Error while executing process /usr/local/bin/cmake with arguments {--build /var/lib/jenkins/workspace/android/pytorch_android/.externalNativeBuild/cmake/release/x86 --target pytorch_jni}
Mar 04 01:51:16 01:51:16.483 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] [1/3] Building CXX object CMakeFiles/pytorch_jni.dir/src/main/cpp/pytorch_jni_jit.cpp.o
Mar 04 01:51:16 01:51:16.483 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] [2/3] Building CXX object CMakeFiles/pytorch_jni.dir/src/main/cpp/pytorch_jni_common.cpp.o
Mar 04 01:51:16 01:51:16.483 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] [3/3] Linking CXX shared library ../../../../build/intermediates/cmake/release/obj/x86/libpytorch_jni.so
Mar 04 01:51:16 01:51:16.483 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] FAILED: ../../../../build/intermediates/cmake/release/obj/x86/libpytorch_jni.so
Mar 04 01:51:16 01:51:16.483 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] 01:51:16.484 [LIFECYCLE] [org.gradle.internal.buildevents.BuildResultLogger] 45 actionable tasks: 45 executed
Mar 04 01:51:16 01:51:16.499 [INFO] [org.gradle.api.Task] AAPT2 aapt2-3.3.2-5309881-linux Daemon #0: shutdown
Mar 04 01:51:16 01:51:16.504 [DEBUG] [org.gradle.internal.work.DefaultWorkerLeaseService] Worker lease root.1 completed (0 worker(s) in use)
Mar 04 01:51:16 01:51:16.504 [DEBUG] [org.gradle.internal.resources.AbstractTrackedResourceLock] Daemon worker: released lock on root.1
Mar 04 01:51:16 01:51:16.504 [DEBUG] [org.gradle.internal.operations.DefaultBuildOperationExecutor] Completing Build operation 'Run build'
Mar 04 01:51:16 01:51:16.505 [DEBUG] [org.gradle.internal.operations.DefaultBuildOperationExecutor] Build operation 'Run build' completed
Mar 04 01:51:16 01:51:16.523 [DEBUG] [org.gradle.cache.internal.LockOnDemandCrossProcessCacheAccess] Releasing file lock for file content cache (/var/lib/jenkins/workspace/android/.gradle/4.10.3/fileContent)
Mar 04 01:51:16 01:51:16.523 [DEBUG] [org.gradle.cache.internal.DefaultFileLockManager] Releasing lock on file content cache (/var/lib/jenkins/workspace/android/.gradle/4.10.3/fileContent).
Mar 04 01:51:16 01:51:16.526 [DEBUG] [org.gradle.cache.internal.LockOnDemandCrossProcessCacheAccess] Releasing file lock for task history cache (/var/lib/jenkins/workspace/android/.gradle/4.10.3/taskHistory)
Mar 04 01:51:16 01:51:16.527 [DEBUG] [org.gradle.cache.internal.btree.BTreePersistentIndexedCache] Closing cache taskHistory.bin (/var/lib/jenkins/workspace/android/.gradle/4.10.3/taskHistory/taskHistory.bin)
```

❄️ 3 tentatively flaky failures

3 failures tentatively classified as flaky; reruns have not been launched to confirm:

See CircleCI build pytorch_linux_xenial_cuda10_1_cudnn7_py3_slow_test (1/3)

Step: "Test" (full log | pattern match details) ❄️

```
Mar 04 02:12:25 RuntimeError: Error downloading resource!
Mar 04 02:12:25
Mar 04 02:12:25 During handling of the above exception, another exception occurred:
Mar 04 02:12:25
Mar 04 02:12:25 Traceback (most recent call last):
Mar 04 02:12:25   File "tools/download_mnist.py", line 87, in <module>
Mar 04 02:12:25     main()
Mar 04 02:12:25   File "tools/download_mnist.py", line 80, in main
Mar 04 02:12:25     download(path, url, options.quiet)
Mar 04 02:12:25   File "tools/download_mnist.py", line 41, in download
Mar 04 02:12:25     raise RuntimeError('Error downloading resource!')
Mar 04 02:12:25 RuntimeError: Error downloading resource!
Mar 04 02:12:25 + cleanup
Mar 04 02:12:25 + retcode=1
Mar 04 02:12:25 + set +x
Mar 04 02:12:25 =================== sccache compilation log ===================
Mar 04 02:12:26 ERROR:sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/tmp/torch_extensions/test_compilation_error_formatting/main.cpp: In function \'int main()\':\n/tmp/torch_extensions/test_compilation_error_formatting/main.cpp:2:23: error: expected \';\' before \'}\' token\n int main() { return 0 }\n                       ^\n" }
Mar 04 02:12:26
Mar 04 02:12:26 =========== If your build fails, please take a look at the log above for possible reasons ===========
Mar 04 02:12:26 Compile requests               161
Mar 04 02:12:26 Compile requests executed       57
Mar 04 02:12:26 Cache hits                      46
```

See CircleCI build pytorch_linux_xenial_cuda10_1_cudnn7_py3_nogpu_test (2/3)

Step: "Test" (full log | pattern match details) ❄️

```
Mar 04 02:21:29 RuntimeError: Error downloading resource!
Mar 04 02:21:29
Mar 04 02:21:29 During handling of the above exception, another exception occurred:
Mar 04 02:21:29
Mar 04 02:21:29 Traceback (most recent call last):
Mar 04 02:21:29   File "tools/download_mnist.py", line 87, in <module>
Mar 04 02:21:29     main()
Mar 04 02:21:29   File "tools/download_mnist.py", line 80, in main
Mar 04 02:21:29     download(path, url, options.quiet)
Mar 04 02:21:29   File "tools/download_mnist.py", line 41, in download
Mar 04 02:21:29     raise RuntimeError('Error downloading resource!')
Mar 04 02:21:29 RuntimeError: Error downloading resource!
Mar 04 02:21:29 + cleanup
Mar 04 02:21:29 + retcode=1
Mar 04 02:21:29 + set +x
Mar 04 02:21:29 =================== sccache compilation log ===================
Mar 04 02:21:29 ERROR:sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/tmp/torch_extensions/test_compilation_error_formatting/main.cpp: In function \'int main()\':\n/tmp/torch_extensions/test_compilation_error_formatting/main.cpp:2:23: error: expected \';\' before \'}\' token\n int main() { return 0 }\n                       ^\n" }
Mar 04 02:21:29
Mar 04 02:21:29 =========== If your build fails, please take a look at the log above for possible reasons ===========
Mar 04 02:21:29 Compile requests                 68
Mar 04 02:21:29 Compile requests executed        28
Mar 04 02:21:29 Cache hits                       21
```

See CircleCI build pytorch_linux_xenial_cuda10_1_cudnn7_py3_NO_AVX2_test (3/3)

Step: "Test" (full log | pattern match details) ❄️

```
Mar 04 02:57:53 RuntimeError: Error downloading resource!
Mar 04 02:57:53
Mar 04 02:57:53 During handling of the above exception, another exception occurred:
Mar 04 02:57:53
Mar 04 02:57:53 Traceback (most recent call last):
Mar 04 02:57:53   File "tools/download_mnist.py", line 87, in <module>
Mar 04 02:57:53     main()
Mar 04 02:57:53   File "tools/download_mnist.py", line 80, in main
Mar 04 02:57:53     download(path, url, options.quiet)
Mar 04 02:57:53   File "tools/download_mnist.py", line 41, in download
Mar 04 02:57:53     raise RuntimeError('Error downloading resource!')
Mar 04 02:57:53 RuntimeError: Error downloading resource!
Mar 04 02:57:53 + cleanup
Mar 04 02:57:53 + retcode=1
Mar 04 02:57:53 + set +x
Mar 04 02:57:53 =================== sccache compilation log ===================
Mar 04 02:57:53 =========== If your build fails, please take a look at the log above for possible reasons ===========
Mar 04 02:57:53 Compile requests                19
Mar 04 02:57:53 Compile requests executed        0
Mar 04 02:57:53 Cache hits                       0
Mar 04 02:57:53 Cache misses                     0
Mar 04 02:57:53 Cache timeouts                   0
```

🚧 2 upstream failures recognized by patterns:

These were probably caused by upstream breakages:



Contributor

iseeyuan commented Mar 2, 2020

"Also, if JIT starts calling extra ops independent of specific model,
then the extra ops need to be added to the whitelist as well." I just noticed there are some extra ops in the actual bytecode than from the get_opnames. Do you mean this? I could try some fixes on get_opnames.


```
# Dump root ops used by the model (for custom build optimization).
ops = torch.jit.export_opnames(traced_script_module)
ops.append('aten::ones') # HACK because predictor.cpp explicitly calls this!
```
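
For context, here is a minimal self-contained sketch of the model-preparation flow around this snippet (the torchvision model and the output file names are assumptions for illustration, not the actual prepare_model.py):

```
import torch
import torchvision

# Trace the test model to TorchScript (MobileNetV2, as used by the CI test).
model = torchvision.models.mobilenet_v2(pretrained=True).eval()
traced_script_module = torch.jit.trace(model, torch.rand(1, 3, 224, 224))

# Dump root ops used by the model (for custom build optimization).
ops = torch.jit.export_opnames(traced_script_module)
ops.append('aten::ones')  # HACK because predictor.cpp explicitly calls this!

# Write the op list for the build and save the model for the mobile runtime
# (file names are hypothetical).
with open('mobilenetv2_ops.yaml', 'w') as f:
    f.write('\n'.join(sorted(set(ops))))
traced_script_module.save('mobilenetv2.pt')
```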


Just a thought: if our eventual setup still needs to explicitly push roots into ops like this, we'll probably want to institutionalize it, like have a global list called EXTRA_ROOT_OPS or whatever

Contributor Author


Yes, as I replied to iseeyuan, we need to run static analysis against the JIT code to find out the common EXTRA_ROOT_OPS, if any.

However, this case is different - aten::ones is directly called from client code (the dummy predictor.cpp calls it to create an all-ones tensor for testing purposes). Generally, users could write pure C++ client code that directly calls functions or tensor methods, which won't be captured by extracting ops from the model's bytecode.

This is less of a problem for Android, where we expect users to use the limited set of Java APIs. To actually solve it, we probably need to ask users to run the code analyzer against their client code to dump these extra root ops - that's why I feel there will be more support work when switching to dynamic dispatch: in the static dispatch case these extra ops are kept by the linker automatically.

For CI purposes this one-off hack is probably fine? :)



Question, does this mean every custom build is gonna get aten::ones whether they use it or not? Or just binaries that use the dummy predictor.cpp? (I'm guessing the former since this is a general script, but plz correct :)

A single op isn't that big a deal I guess, but I think it's important to describe the situation explicitly - something like your description above, in a comment over a global that declares the op. The text below is just your comment from above, so it may not be entirely correct, but this is the kind of thing I mean:

```
# We need to run static analysis against the JIT code to find out the common
# EXTRA_ROOT_OPS, if any.
#
# However, this case is different - aten::ones is directly called from client
# code (the dummy predictor.cpp calls it to create an all-ones tensor for
# testing purposes). Generally, users could write pure C++ client code that
# directly calls functions or tensor methods, which won't be captured by
# extracting ops from the model's bytecode.
#
# This is less of a problem for Android, where we expect users to use the
# limited set of Java APIs. To actually solve it, we probably need to ask
# users to run the code analyzer against their client code to dump these
# extra root ops - that's why I feel there will be more support work when
# switching to dynamic dispatch: in the static dispatch case these extra ops
# are kept by the linker automatically.
#
# For CI purposes this one-off hack is probably fine? :)
EXTRA_CI_ROOT_OPS = ['aten::ones']

...
ops.extend(EXTRA_CI_ROOT_OPS)
```

Contributor Author


> Question, does this mean every custom build is gonna get aten::ones whether they use it or not? Or just binaries that use the dummy predictor.cpp? (I'm guessing the former since this is a general script, but plz correct :)

It's the latter - the script is under test/mobile/... which is for mobile CI specifically. That's why I don't feel so guilty about adding this hack :)

Contributor Author


Added the comment to the test script to keep a record.

ljk53 added a commit that referenced this pull request Mar 3, 2020
ghstack-source-id: c94a3db
Pull Request resolved: #34055
Contributor Author

ljk53 commented Mar 3, 2020

"Also, if JIT starts calling extra ops independent of specific model,
then the extra ops need to be added to the whitelist as well." I just noticed there are some extra ops in the actual bytecode than from the get_opnames. Do you mean this? I could try some fixes on get_opnames.

The problem you mentioned should be fixed (regardless of static dispatch custom build or dynamic dispatch custom build) :)

What I meant was potential case where model's bytecode doesn't fully cover all ops called from JIT, e.g. what if during model loading time some JIT code decides to call some ops directly? I can imagine some common preprocessing pass might do this.

One way to address this problem is to run static analysis against JIT codebase and find out. It will require a little work.

ljk53 requested a review from bhosmer March 3, 2020 06:12
Contributor

iseeyuan commented Mar 3, 2020

"Also, if JIT starts calling extra ops independent of specific model,
then the extra ops need to be added to the whitelist as well." I just noticed there are some extra ops in the actual bytecode than from the get_opnames. Do you mean this? I could try some fixes on get_opnames.

The problem you mentioned should be fixed (regardless of static dispatch custom build or dynamic dispatch custom build) :)

What I meant was potential case where model's bytecode doesn't fully cover all ops called from JIT, e.g. what if during model loading time some JIT code decides to call some ops directly? I can imagine some common preprocessing pass might do this.

One way to address this problem is to run static analysis against JIT codebase and find out. It will require a little work.

Are load-time operations embedded in __setstate__, which can also be collected in bytecode level? I'm not aware of other run-time op calls.


bhosmer left a comment


One inline thing about commenting the aten::ones dependency, but LGTM :)

ljk53 added a commit that referenced this pull request Mar 4, 2020

ghstack-source-id: 9bf078d
Pull Request resolved: #34055
Contributor

mrshenli commented Mar 4, 2020

Hey @ljk53, looks like either this PR or the one below it has broken master. Do we have any fix coming for it?

https://app.circleci.com/jobs/github/pytorch/pytorch/4683824
https://app.circleci.com/jobs/github/pytorch/pytorch/4683798

Contributor

@ljk53 merged this pull request in 3c042a6.

Contributor Author

ljk53 commented Mar 4, 2020

@mrshenli - looking - this might affect the build but shouldn't break tests.

ljk53 added a commit that referenced this pull request Mar 4, 2020
…c dispatch (#34055)"

This reverts commit 3c042a6.

ljk53 added a commit that referenced this pull request Mar 4, 2020
…c dispatch (#34055)"

This reverts commit 3c042a6.

ghstack-source-id: 6a1b1f7
Pull Request resolved: #34189
Contributor Author

ljk53 commented Mar 4, 2020

@mrshenli created #34188 and #34189 to revert #32814 and this PR, respectively: neither of them fixed the failing CI :(

Contributor

mrshenli commented Mar 4, 2020

@ljk53 Thanks a lot for investigating. Looks like that test is flaky. It also fails on #33860 but succeeded in several subsequent runs. Let me disable that.

ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020
…ch (pytorch#34055)

Summary:
Pull Request resolved: pytorch#34055

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D20193327

Pulled By: ljk53

fbshipit-source-id: 9d369b8864856b098342aea79e0ac8eec04149aa
facebook-github-bot deleted the gh/ljk53/110/head branch March 7, 2020 15:18
ljk53 added a commit to ljk53/pytorch that referenced this pull request Mar 9, 2020
…ch (pytorch#34055)

Contributor Author

ljk53 commented Mar 13, 2020

"Also, if JIT starts calling extra ops independent of specific model,
then the extra ops need to be added to the whitelist as well." I just noticed there are some extra ops in the actual bytecode than from the get_opnames. Do you mean this? I could try some fixes on get_opnames.

The problem you mentioned should be fixed (regardless of static dispatch custom build or dynamic dispatch custom build) :)
What I meant was potential case where model's bytecode doesn't fully cover all ops called from JIT, e.g. what if during model loading time some JIT code decides to call some ops directly? I can imagine some common preprocessing pass might do this.
One way to address this problem is to run static analysis against JIT codebase and find out. It will require a little work.

Are load-time operations embedded in __setstate__, which can also be collected in bytecode level? I'm not aware of other run-time op calls.

@iseeyuan Here is one example of calling ops during model loading time: https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/serialization/unpickler.cpp#L425

```
PickleOpCode Unpickler::readInstruction() {
  ...
  case PickleOpCode::BINPERSID: {
    ...
    // Rebuilds the tensor from its storage at load time:
    tensor = at::empty({0}, options).set_(storage);
```

It calls aten::empty() and aten::set_().

Today it works because these ops are pretty common and are also indirectly called by regular models - but there is no guarantee it will stay this way. For example, #34641 breaks the dynamic dispatch CI in a subtle way. We probably need to run the code analyzer against torch/csrc to find out these common ops used by the JIT.
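
To make that concrete, a crude sketch of such a scan (purely illustrative, not the actual code analyzer: a naive regex over sources misses tensor method calls like .set_() and picks up false positives that a real analyzer would avoid):

```
# Hypothetical sketch: regex-scan C++ sources for direct at::<op>( calls,
# approximating the extra root ops the JIT can invoke independent of a model.
import re
from pathlib import Path

AT_CALL = re.compile(r'\bat::([a-z_][a-zA-Z0-9_]*)\s*\(')

def scan_direct_op_calls(root):
    """Collect aten::* names invoked directly from C++ files under root."""
    ops = set()
    for path in Path(root).rglob('*.cpp'):
        for name in AT_CALL.findall(path.read_text(errors='ignore')):
            ops.add('aten::' + name)
    return sorted(ops)

# e.g. this should surface aten::empty from unpickler.cpp, among others:
print(scan_direct_op_calls('torch/csrc/jit'))
```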

laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
…ch (pytorch#34055)

