Added option to update parameters using state_dict in AveragedModel (#65495) #65755

Merged
malfet merged 2 commits into pytorch:release/1.10 from prabhat00155:prabhat00155/cherrypick_pr on Oct 6, 2021

@prabhat00155
Contributor

Summary:
While implementing EMA (which extends AveragedModel) in torchvision, update_parameters() from AveragedModel could not be used because it did not handle state_dict(), so a custom update_parameters() had to be defined in the EMA class. This PR handles that scenario, removing the need for the custom update_parameters() implementation.
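
To illustrate the gap described above, here is a minimal sketch that does not use torch at all: plain dicts of floats stand in for parameters and buffers, and the flag name `use_state_dict`, the helper `ema`, and the decay of 0.5 are hypothetical choices for illustration, not the signature this PR adds. It only shows why averaging over parameters alone leaves state_dict-only entries (such as running statistics) untouched, and how averaging over the full state covers them.

```python
from typing import Callable, Dict

# Stand-in for a state_dict: entry name -> scalar value (real entries are tensors).
State = Dict[str, float]


def ema(avg: float, new: float, decay: float = 0.5) -> float:
    """Exponential moving average of one value (decay chosen for illustration)."""
    return decay * avg + (1.0 - decay) * new


def update_parameters(
    averaged: State,
    params: State,
    buffers: State,
    n_averaged: int,
    use_state_dict: bool = False,  # hypothetical flag name
    avg_fn: Callable[[float, float], float] = ema,
) -> int:
    """Update `averaged` in place and return the new update count."""
    # With the flag set, buffers are averaged together with parameters.
    source = {**params, **buffers} if use_state_dict else dict(params)
    for name, value in source.items():
        if n_averaged == 0:
            averaged[name] = value  # first update copies the model's values
        else:
            averaged[name] = avg_fn(averaged[name], value)
    return n_averaged + 1


averaged = {"weight": 0.0, "running_mean": 0.0}
n = update_parameters(averaged, {"weight": 1.0}, {"running_mean": 10.0}, 0, use_state_dict=True)
n = update_parameters(averaged, {"weight": 2.0}, {"running_mean": 20.0}, n, use_state_dict=True)
print(averaged)  # {'weight': 1.5, 'running_mean': 15.0}
```

With `use_state_dict=False` the same calls would leave `running_mean` at its initial value, which is the behaviour that forced the custom override in the EMA class.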

Discussion: pytorch/vision#4406 (review)

Pull Request resolved: #65495

Reviewed By: datumbox

Differential Revision: D31176742

Pulled By: prabhat00155

fbshipit-source-id: 326d14876018f21cf602bab5eaba344678dbabe2
(cherry picked from commit 2ea724b)

@facebook-github-bot
Contributor

facebook-github-bot commented Sep 28, 2021

💊 CI failures summary and remediations

As of commit 109d701 (more details on the Dr. CI page):



🕵️ 15 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build linux-bionic-py3.8-gcc9-coverage / test (default, 2, 2, linux.2xlarge) (1/15)

Step: "Unknown"

2021-10-05T18:14:09.2276322Z test_add_done_ca...arg() takes 0 positional arguments but 1 was given
2021-10-05T18:14:09.2229547Z   /opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py(605): run_tests
2021-10-05T18:14:09.2230602Z   test_futures.py(329): <module>
2021-10-05T18:14:09.2231753Z   /opt/conda/lib/python3.8/site-packages/coverage/execfile.py(247): run
2021-10-05T18:14:09.2233153Z   /opt/conda/lib/python3.8/site-packages/coverage/cmdline.py(746): do_run
2021-10-05T18:14:09.2234586Z   /opt/conda/lib/python3.8/site-packages/coverage/cmdline.py(588): command_line
2021-10-05T18:14:09.2235995Z   /opt/conda/lib/python3.8/site-packages/coverage/cmdline.py(871): main
2021-10-05T18:14:09.2237229Z   /opt/conda/bin/coverage(8): <module>
2021-10-05T18:14:09.2237689Z 
2021-10-05T18:14:09.2238101Z ok (0.002s)
2021-10-05T18:14:09.2264870Z   test_add_done_callback_maintains_callback_order (__main__.TestFuture) ... ok (0.005s)
2021-10-05T18:14:09.2276322Z   test_add_done_callback_no_arg_error_is_ignored (__main__.TestFuture) ... [E pybind_utils.h:201] Got the following error when running the callback: TypeError: no_arg() takes 0 positional arguments but 1 was given
2021-10-05T18:14:09.2277736Z ok (0.001s)
2021-10-05T18:14:09.2300256Z   test_add_done_callback_simple (__main__.TestFuture) ... ok (0.002s)
2021-10-05T18:14:09.2371815Z   test_chained_then (__main__.TestFuture) ... ok (0.007s)
2021-10-05T18:14:09.3405409Z   test_collect_all (__main__.TestFuture) ... ok (0.103s)
2021-10-05T18:14:09.3420988Z   test_done (__main__.TestFuture) ... ok (0.002s)
2021-10-05T18:14:09.3446560Z   test_done_exception (__main__.TestFuture) ... ok (0.002s)
2021-10-05T18:14:09.3481507Z   test_interleaving_then_and_add_done_callback_maintains_callback_order (__main__.TestFuture) ... ok (0.003s)
2021-10-05T18:14:09.3502663Z   test_interleaving_then_and_add_done_callback_propagates_error (__main__.TestFuture) ... [E pybind_utils.h:201] Got the following error when running the callback: ValueError: Expected error
2021-10-05T18:14:09.3503988Z 
2021-10-05T18:14:09.3504493Z At:

See GitHub Actions build linux-xenial-cuda11.3-py3.6-gcc7 / test (default, 2, 2, linux.8xlarge.nvidia.gpu) (2/15)

Step: "Unknown"

2021-10-05T18:20:15.0776264Z CONTINUE_THROUGH_ERROR: false
2021-10-05T18:20:15.0747391Z   PR_LABELS: [
  "cla signed"
]
2021-10-05T18:20:15.0769472Z   GITHUB_TOKEN: ***
2021-10-05T18:20:15.0771161Z   DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7:74e757e8b0cf750d2f91db6aa4c29640abce32ea
2021-10-05T18:20:15.0773171Z   JOB_BASE_NAME: linux-xenial-cuda11.3-py3.6-gcc7-test
2021-10-05T18:20:15.0774056Z   TEST_CONFIG: default
2021-10-05T18:20:15.0774557Z   SHARD_NUMBER: 2
2021-10-05T18:20:15.0775051Z   NUM_TEST_SHARDS: 2
2021-10-05T18:20:15.0775638Z   PYTORCH_IGNORE_DISABLED_ISSUES: 
2021-10-05T18:20:15.0776264Z   CONTINUE_THROUGH_ERROR: false
2021-10-05T18:20:15.0776837Z   GPU_FLAG: --gpus all
2021-10-05T18:20:15.0777324Z   SHM_SIZE: 2g
2021-10-05T18:20:15.0777806Z ##[endgroup]
2021-10-05T18:20:16.0578347Z 66af12523485
2021-10-05T18:20:16.4805795Z Deleted Containers:
2021-10-05T18:20:16.4807044Z 66af12523485db0e26a396e469aae2b5c8349a6200717bc20219a57a3fbf791c
2021-10-05T18:20:16.4807722Z 
2021-10-05T18:20:20.5560938Z Deleted Images:
2021-10-05T18:20:20.5563228Z untagged: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7:74e757e8b0cf750d2f91db6aa4c29640abce32ea
2021-10-05T18:20:20.5566245Z untagged: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7@sha256:44f979255f5f29448c4ab091295c81e442d5fc85f4c85813fd48198dfec15f0e

See GitHub Actions build linux-bionic-py3.8-gcc9-coverage / test (default, 1, 2, linux.2xlarge) (3/15)

Step: "Unknown"

2021-10-05T18:20:06.0030988Z CONTINUE_THROUGH_ERROR: false
2021-10-05T18:20:06.0021877Z   PR_LABELS: [
  "cla signed"
]
2021-10-05T18:20:06.0023497Z   GITHUB_TOKEN: ***
2021-10-05T18:20:06.0025193Z   DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-py3.8-gcc9:74e757e8b0cf750d2f91db6aa4c29640abce32ea
2021-10-05T18:20:06.0027349Z   JOB_BASE_NAME: linux-bionic-py3.8-gcc9-coverage-test
2021-10-05T18:20:06.0028375Z   TEST_CONFIG: default
2021-10-05T18:20:06.0028937Z   SHARD_NUMBER: 1
2021-10-05T18:20:06.0029476Z   NUM_TEST_SHARDS: 2
2021-10-05T18:20:06.0030113Z   PYTORCH_IGNORE_DISABLED_ISSUES: 
2021-10-05T18:20:06.0030988Z   CONTINUE_THROUGH_ERROR: false
2021-10-05T18:20:06.0031563Z   SHM_SIZE: 1g
2021-10-05T18:20:06.0032066Z ##[endgroup]
2021-10-05T18:20:06.5003083Z e20112018260
2021-10-05T18:20:06.8947894Z Deleted Containers:
2021-10-05T18:20:06.8948844Z e201120182601144db1ae48b7930bdbe95f2eaa984122b0441bbe593878b3b70
2021-10-05T18:20:06.8949347Z 
2021-10-05T18:20:08.8324068Z Deleted Images:
2021-10-05T18:20:08.8325791Z untagged: 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine:latest
2021-10-05T18:20:08.8327743Z untagged: 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine@sha256:def822f9851ca422481ec6fee59a9966f12b351c62ccb9aca841526ffaa9f748
2021-10-05T18:20:08.8329704Z deleted: sha256:6dbb9cc54074106d46d4ccb330f2a40a682d49dda5f4844962b7dce9fe44aaec

See GitHub Actions build win-vs2019-cpu-py3 / build (4/15)

Step: "Upload artifacts to s3"

2021-10-05T18:20:38.5251423Z ninja: error: remo...dOpKernels.cpp.DEFAULT.cpp.obj): Permission denied
2021-10-05T18:20:38.1447853Z Terminate batch job (Y/N)? 
2021-10-05T18:20:38.1736257Z ninja: error: GetFileAttributesEx(caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/SumKernel.cpp.DEFAULT.cpp.obj): Access is denied.
2021-10-05T18:20:38.3024957Z 
2021-10-05T18:20:38.3665057Z 
2021-10-05T18:20:38.5040730Z ninja: error: remove(caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/SumKernel.cpp.DEFAULT.cpp.obj): Permission denied
2021-10-05T18:20:38.5242022Z ninja: error: remove(caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/TensorCompareKernel.cpp.DEFAULT.cpp.obj): Permission denied
2021-10-05T18:20:38.5245268Z ninja: error: remove(caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/SortingKernel.cpp.DEFAULT.cpp.obj): Permission denied
2021-10-05T18:20:38.5248003Z ninja: error: GetFileAttributesEx(caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp.DEFAULT.cpp.obj): Access is denied.
2021-10-05T18:20:38.5249589Z 
2021-10-05T18:20:38.5249860Z 
2021-10-05T18:20:38.5251423Z ninja: error: remove(caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp.DEFAULT.cpp.obj): Permission denied
2021-10-05T18:20:38.6778395Z ##[error]The operation was canceled.
2021-10-05T18:20:38.7412005Z ##[group]Run actions/upload-artifact@v2
2021-10-05T18:20:38.7412449Z with:
2021-10-05T18:20:38.7412741Z   retention-days: 14
2021-10-05T18:20:38.7413618Z   if-no-files-found: error
2021-10-05T18:20:38.7413991Z   name: win-vs2019-cpu-py3
2021-10-05T18:20:38.7414432Z   path: C:\1308713065\build-results
2021-10-05T18:20:38.7414746Z env:
2021-10-05T18:20:38.7415085Z   BUILD_ENVIRONMENT: win-vs2019-cpu-py3
2021-10-05T18:20:38.7415456Z   BUILD_WHEEL: 1

See GitHub Actions build linux-xenial-cuda11.3-py3.6-gcc7 / test (distributed, 1, 1, linux.8xlarge.nvidia.gpu) (5/15)

Step: "Unknown"

2021-10-05T18:20:12.5024898Z CONTINUE_THROUGH_ERROR: false
2021-10-05T18:20:12.5017456Z   PR_LABELS: [
  "cla signed"
]
2021-10-05T18:20:12.5018980Z   GITHUB_TOKEN: ***
2021-10-05T18:20:12.5020487Z   DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7:74e757e8b0cf750d2f91db6aa4c29640abce32ea
2021-10-05T18:20:12.5022247Z   JOB_BASE_NAME: linux-xenial-cuda11.3-py3.6-gcc7-test
2021-10-05T18:20:12.5023021Z   TEST_CONFIG: distributed
2021-10-05T18:20:12.5023466Z   SHARD_NUMBER: 1
2021-10-05T18:20:12.5023879Z   NUM_TEST_SHARDS: 1
2021-10-05T18:20:12.5024372Z   PYTORCH_IGNORE_DISABLED_ISSUES: 
2021-10-05T18:20:12.5024898Z   CONTINUE_THROUGH_ERROR: false
2021-10-05T18:20:12.5025382Z   GPU_FLAG: --gpus all
2021-10-05T18:20:12.5025787Z   SHM_SIZE: 2g
2021-10-05T18:20:12.5026185Z ##[endgroup]
2021-10-05T18:20:13.5050114Z 6939eb89059d
2021-10-05T18:20:13.8298703Z Deleted Containers:
2021-10-05T18:20:13.8299841Z 6939eb89059d11a2bcd8aacaf186d2adf8b3b0abb37d5d1e0b721a65b971fdb1
2021-10-05T18:20:13.8300671Z 
2021-10-05T18:20:17.9892467Z Deleted Images:
2021-10-05T18:20:17.9894853Z untagged: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7:74e757e8b0cf750d2f91db6aa4c29640abce32ea
2021-10-05T18:20:17.9897879Z untagged: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7@sha256:44f979255f5f29448c4ab091295c81e442d5fc85f4c85813fd48198dfec15f0e

See GitHub Actions build linux-xenial-py3.6-gcc5.4 / test (distributed, 1, 1, linux.2xlarge) (6/15)

Step: "Unknown"

2021-10-05T18:09:56.9520544Z test_udf_remote_...yUniqueId(created_on=0, local_id=0) to be created.
2021-10-05T18:09:18.4815830Z frame #12: c10::ThreadPool::main_loop(unsigned long) + 0x2a3 (0x7fecf54b24a3 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
2021-10-05T18:09:18.4817680Z frame #13: <unknown function> + 0xc92bd (0x7fecf53e32bd in /opt/conda/lib/libstdc++.so.6)
2021-10-05T18:09:18.4819577Z frame #14: <unknown function> + 0x76ba (0x7fed0a7ce6ba in /lib/x86_64-linux-gnu/libpthread.so.0)
2021-10-05T18:09:18.4821294Z frame #15: clone + 0x6d (0x7fed0a50451d in /lib/x86_64-linux-gnu/libc.so.6)
2021-10-05T18:09:18.4822020Z 
2021-10-05T18:09:18.7305382Z ok (3.316s)
2021-10-05T18:09:33.4643097Z   test_rpc_builtin_timeout (__main__.FaultyFaultyAgentRpcTest) ... ok (14.734s)
2021-10-05T18:09:42.2893177Z   test_rpc_script_timeout (__main__.FaultyFaultyAgentRpcTest) ... ok (8.825s)
2021-10-05T18:09:45.6052475Z   test_rref_to_here_timeout (__main__.FaultyFaultyAgentRpcTest) ... ok (3.316s)
2021-10-05T18:09:52.9270605Z   test_udf_remote_message_delay_timeout (__main__.FaultyFaultyAgentRpcTest) ... ok (7.322s)
2021-10-05T18:09:56.9520544Z   test_udf_remote_message_delay_timeout_to_self (__main__.FaultyFaultyAgentRpcTest) ... [E request_callback_no_python.cpp:559] Received error while processing request type 261: falseINTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp":387, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.
2021-10-05T18:09:56.9523461Z Exception raised from getOwnerRRef at /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp:387 (most recent call first):
2021-10-05T18:09:56.9526223Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x69 (0x7fec915fa669 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
2021-10-05T18:09:56.9529775Z frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xd2 (0x7fec915f6c12 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
2021-10-05T18:09:56.9533379Z frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x4e (0x7fec915f85ae in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
2021-10-05T18:09:56.9536891Z frame #3: torch::distributed::rpc::RRefContext::getOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, bool) + 0x4a4 (0x7fec958790d4 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
2021-10-05T18:09:56.9541484Z frame #4: torch::distributed::rpc::RequestCallbackNoPython::assignOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, torch::distributed::rpc::GloballyUniqueId const&, c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> >) const + 0x71 (0x7fec95869591 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
2021-10-05T18:09:56.9546796Z frame #5: torch::distributed::rpc::RequestCallbackImpl::processPythonRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0xc8 (0x7fec9dccd808 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
2021-10-05T18:09:56.9551478Z frame #6: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x194 (0x7fec9586dde4 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
2021-10-05T18:09:56.9556426Z frame #7: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x65 (0x7fec9dccce05 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
2021-10-05T18:09:56.9559789Z frame #8: <unknown function> + 0x403593a (0x7fec9586a93a in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)

See GitHub Actions build linux-xenial-cuda11.3-py3.6-gcc7 / test (default, 1, 2, linux.8xlarge.nvidia.gpu) (7/15)

Step: "Unknown"

2021-10-05T18:20:46.8511747Z CONTINUE_THROUGH_ERROR: false
2021-10-05T18:20:46.8497030Z   PR_LABELS: [
  "cla signed"
]
2021-10-05T18:20:46.8499762Z   GITHUB_TOKEN: ***
2021-10-05T18:20:46.8502627Z   DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7:74e757e8b0cf750d2f91db6aa4c29640abce32ea
2021-10-05T18:20:46.8506036Z   JOB_BASE_NAME: linux-xenial-cuda11.3-py3.6-gcc7-test
2021-10-05T18:20:46.8507555Z   TEST_CONFIG: default
2021-10-05T18:20:46.8508382Z   SHARD_NUMBER: 1
2021-10-05T18:20:46.8509738Z   NUM_TEST_SHARDS: 2
2021-10-05T18:20:46.8510692Z   PYTORCH_IGNORE_DISABLED_ISSUES: 
2021-10-05T18:20:46.8511747Z   CONTINUE_THROUGH_ERROR: false
2021-10-05T18:20:46.8512705Z   GPU_FLAG: --gpus all
2021-10-05T18:20:46.8513509Z   SHM_SIZE: 2g
2021-10-05T18:20:46.8514287Z ##[endgroup]
2021-10-05T18:20:48.0765707Z 15f4fdcb2808
2021-10-05T18:20:48.4186693Z Deleted Containers:
2021-10-05T18:20:48.4188075Z 15f4fdcb2808ecbd75f2cb357704aa16016b7506954273da28321bceb8c949c8
2021-10-05T18:20:48.4189083Z 
2021-10-05T18:20:52.4930039Z Deleted Images:
2021-10-05T18:20:52.4932484Z untagged: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7:74e757e8b0cf750d2f91db6aa4c29640abce32ea
2021-10-05T18:20:52.4935776Z untagged: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7@sha256:44f979255f5f29448c4ab091295c81e442d5fc85f4c85813fd48198dfec15f0e

See GitHub Actions build linux-bionic-py3.8-gcc9-coverage / test (distributed, 1, 1, linux.2xlarge) (8/15)

Step: "Unknown"

2021-10-05T18:17:46.2502844Z test_udf_remote_...yUniqueId(created_on=0, local_id=0) to be created.
2021-10-05T18:17:05.5924873Z frame #15: <unknown function> + 0x48a6a (0x7fcdfff17a6a in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
2021-10-05T18:17:05.5926600Z frame #16: <unknown function> + 0xc9039 (0x7fcdffe23039 in /opt/conda/lib/libstdc++.so.6)
2021-10-05T18:17:05.5928121Z frame #17: <unknown function> + 0x76db (0x7fce23b476db in /lib/x86_64-linux-gnu/libpthread.so.0)
2021-10-05T18:17:05.5929861Z frame #18: clone + 0x3f (0x7fce2387071f in /lib/x86_64-linux-gnu/libc.so.6)
2021-10-05T18:17:05.5930612Z 
2021-10-05T18:17:06.0063018Z ok (3.725s)
2021-10-05T18:17:21.2553854Z   test_rpc_builtin_timeout (__main__.FaultyFaultyAgentRpcTest) ... ok (15.249s)
2021-10-05T18:17:30.4956212Z   test_rpc_script_timeout (__main__.FaultyFaultyAgentRpcTest) ... ok (9.238s)
2021-10-05T18:17:34.2203982Z   test_rref_to_here_timeout (__main__.FaultyFaultyAgentRpcTest) ... ok (3.727s)
2021-10-05T18:17:41.9526683Z   test_udf_remote_message_delay_timeout (__main__.FaultyFaultyAgentRpcTest) ... ok (7.732s)
2021-10-05T18:17:46.2502844Z   test_udf_remote_message_delay_timeout_to_self (__main__.FaultyFaultyAgentRpcTest) ... [E request_callback_no_python.cpp:559] Received error while processing request type 261: falseINTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp":385, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.
2021-10-05T18:17:46.2505416Z Exception raised from getOwnerRRef at /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp:385 (most recent call first):
2021-10-05T18:17:46.2508073Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x59 (0x7f661c8b42d9 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
2021-10-05T18:17:46.2510184Z frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xa3 (0x7f661c88ae44 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
2021-10-05T18:17:46.2511998Z frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x61 (0x7f661c8b16c1 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
2021-10-05T18:17:46.2513765Z frame #3: torch::distributed::rpc::RRefContext::getOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, bool) + 0x628 (0x7f6625e9a288 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
2021-10-05T18:17:46.2516061Z frame #4: torch::distributed::rpc::RequestCallbackNoPython::assignOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, torch::distributed::rpc::GloballyUniqueId const&, c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> >) const + 0x8c (0x7f6625e808cc in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
2021-10-05T18:17:46.2518725Z frame #5: torch::distributed::rpc::RequestCallbackImpl::processPythonRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0xf5 (0x7f6636866505 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
2021-10-05T18:17:46.2521101Z frame #6: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x1f0 (0x7f6625e87670 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
2021-10-05T18:17:46.2523473Z frame #7: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x60 (0x7f6636865dd0 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
2021-10-05T18:17:46.2525061Z frame #8: <unknown function> + 0x935b400 (0x7f6625e7c400 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)

See GitHub Actions build linux-bionic-py3.6-clang9 / test (default, 1, 2, linux.2xlarge) (9/15)

Step: "Unknown"

2021-10-05T18:20:31.7759337Z CONTINUE_THROUGH_ERROR: false
2021-10-05T18:20:31.7754299Z   PR_LABELS: [
  "cla signed"
]
2021-10-05T18:20:31.7755480Z   GITHUB_TOKEN: ***
2021-10-05T18:20:31.7756426Z   DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-py3.6-clang9:74e757e8b0cf750d2f91db6aa4c29640abce32ea
2021-10-05T18:20:31.7757514Z   JOB_BASE_NAME: linux-bionic-py3.6-clang9-test
2021-10-05T18:20:31.7757998Z   TEST_CONFIG: default
2021-10-05T18:20:31.7758301Z   SHARD_NUMBER: 1
2021-10-05T18:20:31.7758600Z   NUM_TEST_SHARDS: 2
2021-10-05T18:20:31.7758944Z   PYTORCH_IGNORE_DISABLED_ISSUES: 
2021-10-05T18:20:31.7759337Z   CONTINUE_THROUGH_ERROR: false
2021-10-05T18:20:31.7759659Z   SHM_SIZE: 1g
2021-10-05T18:20:31.7759932Z ##[endgroup]
2021-10-05T18:20:32.1711197Z 02363de730e6
2021-10-05T18:20:32.5385620Z Deleted Containers:
2021-10-05T18:20:32.5386563Z 02363de730e6bd5ecc84d0878b16500c7b027876d9bb2e12655766e796475853
2021-10-05T18:20:32.5387072Z 
2021-10-05T18:20:35.6009876Z Deleted Images:
2021-10-05T18:20:35.6011870Z untagged: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-py3.6-clang9:74e757e8b0cf750d2f91db6aa4c29640abce32ea
2021-10-05T18:20:35.6013902Z untagged: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-py3.6-clang9@sha256:d66820550f9cb19925b0bdfd57b420ad3bf0c896f71fec4359d1286ebe5786ce
2021-10-05T18:20:35.6015494Z deleted: sha256:a392f6283b11fc2d2c39cdc94e4804a5fce5bfdafa8fcf70c9c75911b3cbbcd3

See GitHub Actions build linux-xenial-py3.6-gcc5.4 / test (default, 2, 2, linux.2xlarge) (10/15)

Step: "Unknown"

2021-10-05T18:20:17.1741398Z CONTINUE_THROUGH_ERROR: false
2021-10-05T18:20:17.1736371Z   PR_LABELS: [
  "cla signed"
]
2021-10-05T18:20:17.1737392Z   GITHUB_TOKEN: ***
2021-10-05T18:20:17.1738381Z   DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4:74e757e8b0cf750d2f91db6aa4c29640abce32ea
2021-10-05T18:20:17.1739513Z   JOB_BASE_NAME: linux-xenial-py3.6-gcc5.4-test
2021-10-05T18:20:17.1740014Z   TEST_CONFIG: default
2021-10-05T18:20:17.1740325Z   SHARD_NUMBER: 2
2021-10-05T18:20:17.1740634Z   NUM_TEST_SHARDS: 2
2021-10-05T18:20:17.1741006Z   PYTORCH_IGNORE_DISABLED_ISSUES: 
2021-10-05T18:20:17.1741398Z   CONTINUE_THROUGH_ERROR: false
2021-10-05T18:20:17.1741735Z   SHM_SIZE: 1g
2021-10-05T18:20:17.1742014Z ##[endgroup]
2021-10-05T18:20:17.5597554Z 62448c72dd35
2021-10-05T18:20:17.8892020Z Deleted Containers:
2021-10-05T18:20:17.8892828Z 62448c72dd35889ddec809a16b3682d123b74e1a4d4474885373771a644d844c
2021-10-05T18:20:17.8893329Z 
2021-10-05T18:20:22.1716524Z Deleted Images:
2021-10-05T18:20:22.1718089Z untagged: 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine:latest
2021-10-05T18:20:22.1720591Z untagged: 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine@sha256:def822f9851ca422481ec6fee59a9966f12b351c62ccb9aca841526ffaa9f748
2021-10-05T18:20:22.1723048Z deleted: sha256:6dbb9cc54074106d46d4ccb330f2a40a682d49dda5f4844962b7dce9fe44aaec

See GitHub Actions build linux-bionic-py3.6-clang9 / test (default, 2, 2, linux.2xlarge) (11/15)

Step: "Unknown"

2021-10-05T18:20:37.5705221Z CONTINUE_THROUGH_ERROR: false
2021-10-05T18:20:37.5696512Z   PR_LABELS: [
  "cla signed"
]
2021-10-05T18:20:37.5698147Z   GITHUB_TOKEN: ***
2021-10-05T18:20:37.5699815Z   DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-py3.6-clang9:74e757e8b0cf750d2f91db6aa4c29640abce32ea
2021-10-05T18:20:37.5701765Z   JOB_BASE_NAME: linux-bionic-py3.6-clang9-test
2021-10-05T18:20:37.5702645Z   TEST_CONFIG: default
2021-10-05T18:20:37.5703246Z   SHARD_NUMBER: 2
2021-10-05T18:20:37.5703809Z   NUM_TEST_SHARDS: 2
2021-10-05T18:20:37.5704481Z   PYTORCH_IGNORE_DISABLED_ISSUES: 
2021-10-05T18:20:37.5705221Z   CONTINUE_THROUGH_ERROR: false
2021-10-05T18:20:37.5705828Z   SHM_SIZE: 1g
2021-10-05T18:20:37.5706380Z ##[endgroup]
2021-10-05T18:20:37.9914522Z 9851c8176b64
2021-10-05T18:20:38.5851242Z Deleted Containers:
2021-10-05T18:20:38.5852033Z 9851c8176b6495fb020ea1b05ed5da10699feaea89a901ca84bd3715d5b66aa6
2021-10-05T18:20:38.5852582Z 
2021-10-05T18:20:44.7334079Z Deleted Images:
2021-10-05T18:20:44.7335535Z untagged: 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine:latest
2021-10-05T18:20:44.7337662Z untagged: 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine@sha256:def822f9851ca422481ec6fee59a9966f12b351c62ccb9aca841526ffaa9f748
2021-10-05T18:20:44.7339747Z deleted: sha256:6dbb9cc54074106d46d4ccb330f2a40a682d49dda5f4844962b7dce9fe44aaec

See GitHub Actions build linux-xenial-py3.6-gcc5.4 / test (default, 1, 2, linux.2xlarge) (12/15)

Step: "Unknown"

2021-10-05T18:20:31.3406530Z CONTINUE_THROUGH_ERROR: false
2021-10-05T18:20:31.3401200Z   PR_LABELS: [
  "cla signed"
]
2021-10-05T18:20:31.3402389Z   GITHUB_TOKEN: ***
2021-10-05T18:20:31.3403400Z   DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4:74e757e8b0cf750d2f91db6aa4c29640abce32ea
2021-10-05T18:20:31.3404556Z   JOB_BASE_NAME: linux-xenial-py3.6-gcc5.4-test
2021-10-05T18:20:31.3405072Z   TEST_CONFIG: default
2021-10-05T18:20:31.3405399Z   SHARD_NUMBER: 1
2021-10-05T18:20:31.3405725Z   NUM_TEST_SHARDS: 2
2021-10-05T18:20:31.3406113Z   PYTORCH_IGNORE_DISABLED_ISSUES: 
2021-10-05T18:20:31.3406530Z   CONTINUE_THROUGH_ERROR: false
2021-10-05T18:20:31.3406891Z   SHM_SIZE: 1g
2021-10-05T18:20:31.3407186Z ##[endgroup]
2021-10-05T18:20:31.7117747Z 606db4cf66a6
2021-10-05T18:20:31.9188630Z Deleted Containers:
2021-10-05T18:20:31.9190279Z 606db4cf66a617be0129669f45dd5fc6a2bbb7f1840d975a42b827028e6414a4
2021-10-05T18:20:31.9191438Z 
2021-10-05T18:20:34.3892423Z Deleted Images:
2021-10-05T18:20:34.3894418Z untagged: 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine:latest
2021-10-05T18:20:34.3896524Z untagged: 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine@sha256:def822f9851ca422481ec6fee59a9966f12b351c62ccb9aca841526ffaa9f748
2021-10-05T18:20:34.3898522Z deleted: sha256:6dbb9cc54074106d46d4ccb330f2a40a682d49dda5f4844962b7dce9fe44aaec

See GitHub Actions build win-vs2019-cuda11.3-py3 / build (13/15)

Step: "Upload artifacts to s3"

2021-10-05T18:20:03.2593794Z ninja: error: remo...native/mkldnn/UnaryOps.cpp.obj): Permission denied
2021-10-05T18:19:54.0549210Z Microsoft (R) C/C++ Optimizing Compiler Version 19.28.29337 for x64
2021-10-05T18:19:54.2995218Z Copyright (C) Microsoft Corporation.  All rights reserved.
2021-10-05T18:19:54.5448766Z 
2021-10-05T18:20:02.8303433Z ^Cninja: error: remove(caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/mkldnn/Utils.cpp.obj): Permission denied
2021-10-05T18:20:02.8306463Z ninja: build stopped: interrupted by user.
2021-10-05T18:20:02.9894788Z Terminate batch job (Y/N)? 
2021-10-05T18:20:03.0068733Z ninja: error: remove(caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/utils/Factory.cpp.obj): Permission denied
2021-10-05T18:20:03.0378672Z ninja: error: remove(caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/mkldnn/UnaryOps.cpp.obj): Permission denied
2021-10-05T18:20:03.1980772Z ninja: error: remove(caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/mkldnn/Utils.cpp.obj): Permission denied
2021-10-05T18:20:03.2264139Z ninja: error: remove(caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/utils/Factory.cpp.obj): Permission denied
2021-10-05T18:20:03.2593794Z ninja: error: remove(caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/mkldnn/UnaryOps.cpp.obj): Permission denied
2021-10-05T18:20:03.5417337Z ##[error]The operation was canceled.
2021-10-05T18:20:03.5905041Z ##[group]Run actions/upload-artifact@v2
2021-10-05T18:20:03.5905607Z with:
2021-10-05T18:20:03.5905909Z   retention-days: 14
2021-10-05T18:20:03.5906296Z   if-no-files-found: error
2021-10-05T18:20:03.5906720Z   name: win-vs2019-cuda11.3-py3
2021-10-05T18:20:03.5907129Z   path: C:\1308713073\build-results
2021-10-05T18:20:03.5907440Z env:
2021-10-05T18:20:03.5907815Z   BUILD_ENVIRONMENT: win-vs2019-cuda11.3-py3
2021-10-05T18:20:03.5908217Z   BUILD_WHEEL: 1

See GitHub Actions build linux-bionic-py3.6-clang9 / test (noarch, 1, 1, linux.2xlarge) (14/15)

Step: "Unknown"

2021-10-05T18:20:40.1186687Z CONTINUE_THROUGH_ERROR: false
2021-10-05T18:20:40.1181604Z   PR_LABELS: [
  "cla signed"
]
2021-10-05T18:20:40.1182612Z   GITHUB_TOKEN: ***
2021-10-05T18:20:40.1183580Z   DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-py3.6-clang9:74e757e8b0cf750d2f91db6aa4c29640abce32ea
2021-10-05T18:20:40.1184727Z   JOB_BASE_NAME: linux-bionic-py3.6-clang9-test
2021-10-05T18:20:40.1185210Z   TEST_CONFIG: noarch
2021-10-05T18:20:40.1185531Z   SHARD_NUMBER: 1
2021-10-05T18:20:40.1185839Z   NUM_TEST_SHARDS: 1
2021-10-05T18:20:40.1186196Z   PYTORCH_IGNORE_DISABLED_ISSUES: 
2021-10-05T18:20:40.1186687Z   CONTINUE_THROUGH_ERROR: false
2021-10-05T18:20:40.1187010Z   SHM_SIZE: 1g
2021-10-05T18:20:40.1187305Z ##[endgroup]
2021-10-05T18:20:40.7325238Z cb43ea8ba474
2021-10-05T18:20:40.7326155Z 4467d15c7de0
2021-10-05T18:20:42.1586193Z Deleted Containers:
2021-10-05T18:20:42.1587295Z cb43ea8ba474171df9c6cadc87270459fc30bcb728907159351506d3a9552270
2021-10-05T18:20:42.1588418Z 4467d15c7de077548ee453b5ff2945379e1b518143ea54907577a057dd998cbe
2021-10-05T18:20:42.1588957Z 
2021-10-05T18:20:52.9139350Z Deleted Images:
2021-10-05T18:20:52.9141393Z untagged: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-py3.6-clang9:74e757e8b0cf750d2f91db6aa4c29640abce32ea

See GitHub Actions build linux-xenial-py3.6-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge) (15/15)

Step: "Test"

2021-10-05T18:04:29.4030730Z The PR is introduc...m to confirm whether this change is wanted or not.
2021-10-05T18:04:29.4017485Z processing existing schema:  alltoall_base(__torch__.torch.classes.dist_c10d.ProcessGroup _0, Tensor _1, Tensor _2, int[] _3, int[] _4) -> (__torch__.torch.classes.dist_c10d.Work _0)
2021-10-05T18:04:29.4018898Z processing existing schema:  alltoall(__torch__.torch.classes.dist_c10d.ProcessGroup _0, Tensor[] _1, Tensor[] _2) -> (__torch__.torch.classes.dist_c10d.Work _0)
2021-10-05T18:04:29.4020207Z processing existing schema:  send(__torch__.torch.classes.dist_c10d.ProcessGroup _0, Tensor[] _1, int _2, int _3) -> (__torch__.torch.classes.dist_c10d.Work _0)
2021-10-05T18:04:29.4021496Z processing existing schema:  recv(__torch__.torch.classes.dist_c10d.ProcessGroup _0, Tensor[] _1, int _2, int _3) -> (__torch__.torch.classes.dist_c10d.Work _0)
2021-10-05T18:04:29.4022789Z processing existing schema:  recv_anysource(__torch__.torch.classes.dist_c10d.ProcessGroup _0, Tensor[] _1, int _2) -> (__torch__.torch.classes.dist_c10d.Work _0)
2021-10-05T18:04:29.4024170Z processing existing schema:  barrier(__torch__.torch.classes.dist_c10d.ProcessGroup _0) -> (__torch__.torch.classes.dist_c10d.Work _0)
2021-10-05T18:04:29.4025231Z processing existing schema:  __init__(__torch__.torch.classes.dist_c10d.frontend _0) -> (NoneType _0)
2021-10-05T18:04:29.4026622Z processing existing schema:  new_process_group_helper(__torch__.torch.classes.dist_c10d.frontend _0, int _1, int _2, int[] _3, str _4, __torch__.torch.classes.dist_c10d.Store _5, str? _6, int _7) -> (__torch__.torch.classes.dist_c10d.ProcessGroup _0)
2021-10-05T18:04:29.4028226Z processing existing schema:  get_process_group_by_name(__torch__.torch.classes.dist_c10d.frontend _0, str _1) -> (__torch__.torch.classes.dist_c10d.ProcessGroup _0)
2021-10-05T18:04:29.4029588Z processing existing schema:  get_name_of_process_group(__torch__.torch.classes.dist_c10d.frontend _0, __torch__.torch.classes.dist_c10d.ProcessGroup _1) -> (str _0)
2021-10-05T18:04:29.4030730Z The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not. 
2021-10-05T18:04:29.4031317Z 
2021-10-05T18:04:29.4031565Z Broken ops: [
2021-10-05T18:04:29.4032455Z 	aten::fft_ihfftn(Tensor self, int[1]? s=None, int[1]? dim=None, str? norm=None) -> (Tensor)
2021-10-05T18:04:29.4033335Z 	aten::fft_ihfftn.out(Tensor self, int[1]? s=None, int[1]? dim=None, str? norm=None, *, Tensor(a!) out) -> (Tensor(a!))
2021-10-05T18:04:29.4034164Z 	aten::special_softmax(Tensor self, int dim, int? dtype=None) -> (Tensor)
2021-10-05T18:04:29.4034943Z 	aten::fft_hfftn(Tensor self, int[1]? s=None, int[1]? dim=None, str? norm=None) -> (Tensor)
2021-10-05T18:04:29.4035787Z 	aten::fft_hfftn.out(Tensor self, int[1]? s=None, int[1]? dim=None, str? norm=None, *, Tensor(a!) out) -> (Tensor(a!))
2021-10-05T18:04:29.4036633Z 	aten::fft_ihfft2(Tensor self, int[1]? s=None, int[1] dim=[-2, -1], str? norm=None) -> (Tensor)
2021-10-05T18:04:29.4037483Z 	aten::fft_ihfft2.out(Tensor self, int[1]? s=None, int[1] dim=[-2, -1], str? norm=None, *, Tensor(a!) out) -> (Tensor(a!))
2021-10-05T18:04:29.4038312Z 	aten::fft_hfft2(Tensor self, int[1]? s=None, int[1] dim=[-2, -1], str? norm=None) -> (Tensor)

16 failures not recognized by patterns:

| Job | Step | Action |
| --- | --- | --- |
| CircleCI pytorch_ios_12_5_1_x86_64_build | Build | 🔁 rerun |
| CircleCI pytorch_linux_xenial_py3_clang5_mobile_custom_build_static | Build | 🔁 rerun |
| CircleCI pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build | Build | 🔁 rerun |
| CircleCI pytorch_linux_xenial_py3_6_gcc5_4_test | Download Docker image | 🔁 rerun |
| CircleCI pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit | pytorch android gradle custom build single architecture (for PR) | 🔁 rerun |
| CircleCI pytorch_linux_xenial_py3_clang7_asan_build | Build | 🔁 rerun |
| CircleCI pytorch_vulkan_linux_bionic_py3_6_clang9_test | Set Up CI Environment After attach_workspace | 🔁 rerun |
| CircleCI pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single | pytorch android gradle custom build single architecture (for PR) | 🔁 rerun |
| CircleCI pytorch_linux_pytorch_linux_xenial_py3_6_gcc5_4_distributed_test | Download Docker image | 🔁 rerun |
| CircleCI pytorch_linux_xenial_py3_clang5_mobile_custom_build_dynamic | Build | 🔁 rerun |
| CircleCI pytorch_macos_10_13_py3_lite_interpreter_build_test | Unknown | 🔁 rerun |
| CircleCI pytorch_linux_xenial_py3_clang7_onnx_ort_test1 | Download Docker image | 🔁 rerun |
| CircleCI pytorch_xla_linux_bionic_py3_6_clang9_build | Build | 🔁 rerun |
| CircleCI pytorch_ios_12_5_1_x86_64_full_jit_build | Build | 🔁 rerun |
| CircleCI pytorch_macos_10_13_py3_test | Unknown | 🔁 rerun |
| CircleCI pytorch_linux_xenial_py3_clang7_onnx_ort_test2 | Download Docker image | 🔁 rerun |

❄️ 1 failure tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See GitHub Actions build linux-xenial-py3.6-gcc5.4 / build-docs (cpp) (1/1)

Step: "Unknown" (full log | diagnosis details | 🔁 rerun) ❄️

2021-10-05T18:04:07.6845477Z E: Failed to fetch...: /etc/ssl/certs/ca-certificates.crt CRLfile: none
2021-10-05T18:04:07.6411056Z 
2021-10-05T18:04:07.6411376Z Reading package lists... 99%
2021-10-05T18:04:07.6411619Z 
2021-10-05T18:04:07.6606815Z Reading package lists... 99%
2021-10-05T18:04:07.6607176Z 
2021-10-05T18:04:07.6607484Z Reading package lists... Done
2021-10-05T18:04:07.6607736Z 
2021-10-05T18:04:07.6842167Z W: The repository 'https://deb.nodesource.com/node_12.x xenial Release' does not have a Release file.
2021-10-05T18:04:07.6843271Z N: Data from such a repository can't be authenticated and is therefore potentially dangerous to use.
2021-10-05T18:04:07.6844174Z N: See apt-secure(8) manpage for repository creation and user configuration details.
2021-10-05T18:04:07.6845477Z E: Failed to fetch https://deb.nodesource.com/node_12.x/dists/xenial/main/source/Sources  server certificate verification failed. CAfile: /etc/ssl/certs/ca-certificates.crt CRLfile: none
2021-10-05T18:04:07.6846576Z E: Some index files failed to download. They have been ignored, or old ones used instead.
2021-10-05T18:04:07.6924225Z 
2021-10-05T18:04:07.7363626Z Reading package lists... 0%
2021-10-05T18:04:07.7363982Z 
2021-10-05T18:04:07.7470337Z Reading package lists... 0%
2021-10-05T18:04:07.7470595Z 
2021-10-05T18:04:07.7941201Z Reading package lists... 1%
2021-10-05T18:04:07.7941490Z 
2021-10-05T18:04:07.7941789Z Reading package lists... 8%
2021-10-05T18:04:07.7942088Z 

This comment was automatically generated by Dr. CI (expand for details). Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.



@datumbox datumbox left a comment


LGTM, thanks!


@zhouzhuojie zhouzhuojie left a comment


lgtm, waiting for all the signals and will merge after that

@zhouzhuojie

Looks like there's a change we need to fix in the release branch. @prabhat00155, can you try rebasing after this commit? #65787

cc @malfet

@prabhat00155

@zhouzhuojie I see a couple of unrelated failures. Is there something preventing the merge?

datumbox commented Oct 4, 2021

How can we unblock this? This needs to be followed up by cherrypicking PR #65921 which contains minor doc corrections and refactorings.

datumbox commented Oct 5, 2021

@zhouzhuojie Sorry for the multiple pings, just being conscious about the deadline for merging this in the release branch. Could you let us know if there is anything we can do on our side?

…ytorch#65495)

Summary:
While implementing EMA (pytorch/vision#4381, which extends AveragedModel) in torchvision, update_parameters() from AveragedModel could not be used because it did not handle state_dict(), so a custom update_parameters() had to be defined in the EMA class (pytorch/vision#4406). This PR aims to handle this scenario, removing the need for the custom update_parameters() implementation.

Discussion: pytorch/vision#4406 (review)

Pull Request resolved: pytorch#65495

Reviewed By: datumbox

Differential Revision: D31176742

Pulled By: prabhat00155

fbshipit-source-id: 326d14876018f21cf602bab5eaba344678dbabe2
(cherry picked from commit 2ea724b)
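
A minimal sketch of the gap described in the summary above, written in plain Python rather than torch so it runs anywhere. In PyTorch, a module's state_dict() exposes buffers (for example BatchNorm running statistics) alongside its parameters, so an average taken over parameters alone leaves buffers untouched. `ToyModule` and `ema_update` are hypothetical stand-ins invented for this illustration, and only the flag name `use_state_dict` is taken from the discussion; none of this is the actual AveragedModel code.

```python
# Toy stand-in for a module. This is NOT the torch API; it only models the
# split between learnable parameters and buffers.
class ToyModule:
    def __init__(self, weight, running_mean):
        self.params = {"weight": weight}                # learnable parameters
        self.buffers = {"running_mean": running_mean}   # e.g. BatchNorm statistics

    def state_dict(self):
        # Like a real state_dict, expose parameters and buffers together.
        return {**self.params, **self.buffers}


def ema_update(avg, model, decay, use_state_dict=False):
    """Blend `model` into `avg` in place: new = decay * old + (1 - decay) * current."""
    # Hypothetical flag: choose whether buffers take part in the average.
    source = model.state_dict() if use_state_dict else model.params
    for name, value in source.items():
        target = avg.params if name in avg.params else avg.buffers
        target[name] = decay * target[name] + (1 - decay) * value


model = ToyModule(weight=2.0, running_mean=10.0)

# Averaging parameters only: the buffer in the averaged copy never moves.
params_only = ToyModule(weight=0.0, running_mean=0.0)
ema_update(params_only, model, decay=0.5)

# Averaging over the state_dict: the buffer is averaged as well.
with_buffers = ToyModule(weight=0.0, running_mean=0.0)
ema_update(with_buffers, model, decay=0.5, use_state_dict=True)

print(params_only.state_dict())   # {'weight': 1.0, 'running_mean': 0.0}
print(with_buffers.state_dict())  # {'weight': 1.0, 'running_mean': 5.0}
```

The stale `running_mean` in the first result is the kind of mismatch that the custom update_parameters() in torchvision's EMA class was written to avoid.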
pytorch-probot bot commented Oct 5, 2021

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/prabhat00155/pytorch/blob/449a40a0b2bad8792e5667e707c0ac9c206c61ac/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

| Workflows | Labels (bold enabled) | Status |
| --- | --- | --- |
| **Triggered Workflows** | | |
| linux-bionic-py3.6-clang9 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/xla | ✅ triggered |
| linux-bionic-py3.8-gcc9-coverage | ciflow/all, ciflow/coverage, ciflow/cpu, ciflow/default, ciflow/linux | ✅ triggered |
| linux-xenial-cuda11.3-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux | ✅ triggered |
| linux-xenial-py3.6-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux | ✅ triggered |
| linux-xenial-py3.6-gcc7-bazel-test | ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux | ✅ triggered |
| win-vs2019-cpu-py3 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/win | ✅ triggered |
| win-vs2019-cuda11.3-py3 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/win | ✅ triggered |
| **Skipped Workflows** | | |
| libtorch-linux-xenial-cuda10.2-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux | 🚫 skipped |
| libtorch-linux-xenial-cuda11.3-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux | 🚫 skipped |
| linux-bionic-cuda10.2-py3.9-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow | 🚫 skipped |
| linux-xenial-cuda10.2-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow | 🚫 skipped |
| parallelnative-linux-xenial-py3.6-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux | 🚫 skipped |
| periodic-libtorch-linux-xenial-cuda11.1-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-linux-xenial-cuda11.1-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-win-vs2019-cuda11.1-py3 | ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win | 🚫 skipped |
| puretorch-linux-xenial-py3.6-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux | 🚫 skipped |
| win-vs2019-cuda10.2-py3 | ciflow/all, ciflow/cuda, ciflow/win | 🚫 skipped |

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

Summary:
Discussion: pytorch#65495 (comment)

Pull Request resolved: pytorch#65921

Reviewed By: albanD

Differential Revision: D31310105

Pulled By: prabhat00155

fbshipit-source-id: 417691832a7c793744830c11e0ce53e3972d21a3
(cherry picked from commit c7748fc)
codecov bot commented Oct 5, 2021

Codecov Report

Merging #65755 (449a40a) into release/1.10 (6aadfda) will increase coverage by 46.11%.
The diff coverage is 87.50%.

@@                Coverage Diff                @@
##           release/1.10   #65755       +/-   ##
=================================================
+ Coverage         20.20%   66.32%   +46.11%     
=================================================
  Files                23      738      +715     
  Lines              5232    94259    +89027     
=================================================
+ Hits               1057    62515    +61458     
- Misses             4175    31744    +27569     

@malfet malfet merged commit 5f1a434 into pytorch:release/1.10 Oct 6, 2021
@prabhat00155 prabhat00155 deleted the prabhat00155/cherrypick_pr branch October 6, 2021 18:19
datumbox commented Oct 6, 2021

Thanks a lot for your help merging this. :)

gchanan commented Oct 6, 2021

Sorry to jump in here, but I don't understand why this satisfies the cherry-pick criteria.

First off, @malfet we should clarify how the criteria should be specified. In #65438, it says:

  1. Critical fixes for: silent correctness, backwards compatibility, crashes, deadlocks, (large) memory leaks

The criteria for this one is given as:

Criteria Category: 2

which isn't informative. We should update the description to something like:

  1. Critical fixes for: silent correctness, backwards compatibility, crashes, deadlocks, (large) memory leaks (please specify which one and detail the rationale).

In this particular case, from my quick reading it seems like:

  1. we are unsure if "use_state_dict" is the right specifier, as opposed to something focused on buffers ("include_buffers" ?). Including this now could set us up for BC problems later.
  2. This issue has existed for 1.5 years, from the initial version, with no prioritization. That seems difficult to describe as critical.
  3. The specific example has a workaround, as described in pytorch/vision#4406 ("Added update_parameters to EMA to fix calculation").

datumbox commented Oct 7, 2021

@gchanan Thanks for the feedback.

Let me provide a summary of the what/why so that we can more easily decide on next steps:

  • The contribution was meant to come earlier but got delayed as there was no assigned code-owner. Note that now this issue is resolved and going forwards we will stay in close contact with Alban.
  • This PR contains both @prabhat00155's original PR (#65495, "Added option to update parameters using state_dict in AveragedModel") and the improvements requested by @albanD (#65921, "Added validation of mode parameter in AveragedModel").
  • This PR unblocks some of the use-cases of the "Batteries Included" project of torchvision and that's why we requested to include it in the release. After missing the deadline due to the above reason, Prabhat opened a request using the recommended process.
  • Indeed, the issue with the specific mechanism has existed for a very long time. In this comment, I provide more information about why we think it's important to get it fixed. You are right to say that torchvision has a workaround in place, but that is a temporary solution.

Let us know how you would like us to proceed. Would you rather roll back the PR on the release branch, leave it as is, or bring a quick PR that addresses your concerns around the use_state_dict param?

malfet commented Oct 7, 2021

@gchanan good point, will update template for 1.10 and use this verbiage going forward.

gchanan commented Oct 7, 2021

my preference would be that we improve the use_state_dict on the main branch and release it with 1.11. If I understand correctly, you can apply the temporary workaround to torchvision in 1.10 so aren't really worse off, and we can have a cleaned up version of everything released in 1.11 with proper bake time. What do you think?

datumbox commented Oct 7, 2021

@gchanan Sounds good. Just to confirm, your preference is to:

  1. Revert the PR only on the pytorch:release/1.10 branch.
  2. Follow up with a new PR on pytorch:main that improves the use_state_dict parameter.
  3. Release the feature on 1.11 and for now use our workaround on TorchVision.

If that's the case that's OK with me.

@malfet If the above is confirmed, who is responsible for reverting the PR? You or us?
