Fix permission error for ORTModule lock file#7814
Conversation
tlh20
left a comment
There was a problem hiding this comment.
If I've followed through the underlying PyTorch code there is an existing environment variable TORCH_EXTENSIONS_DIR for overriding the default location. Should we give that priority (if set)?
https://github.com/pytorch/pytorch/blob/aad2ad883ad70e2512d5d6da976a5a8f656cfb05/torch/utils/cpp_extension.py#L1600-L1603
Thanks, that is a reasonable idea, although I usually try not to depend on environment variables. It can get complicated when customers use Azure ML or other computes in which they don't have full control over the environment. OS differences might lead to slight differences too. My original intent was to add Maybe we could do both. Prefer explicit configuration but ultimately fallback to env var |
c7b82ed to
aa37eee
Compare
| print("WARNING: ORTModule detected PyTorch CPP extension's lock file during initialization, " | ||
| "which can cause unexpected hangs. " | ||
| f"Delete {os.path.join(TORCH_CPP_BUILD_DIR,'lock')} to prevent unexpected behavior.") | ||
| f"Delete {os.path.join(TORCH_CPP_BUILD_DIR,'lock')} if a hang occurs.") |
There was a problem hiding this comment.
We should avoid using the word "hang" in our code. We can say it stops responding.
* Fix bug in Transpose CUDA kernel (#7329) * Fix permission error for ORTModule lock file (#7814) * fix topo sort in quant tool (#7833) * fix topo sort in quant tool * add unit test and make the topo sort stable * Relax tol for Conv1D fp16 test (#7844) * Relax tol for Conv1D fp16 test Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net> * Resolve issue with wrapped ORTModule load_state_dict (#7847) * Encapsulate children modules inside a ModuleAccessor object to prevent erroneuos iteration over children while loading the state dictionary * Add named_models, models, apply methods, change ModuleAccessor to ModuleMetadata and modify unit tests * Change ModuleMetadata module getter logic, raise NotImplementedError for add_modules * Add comment explaining why overriding _load_from_state_dict method is needed * fixed bugs in packed mode and enable pack mode tests in ci (#7848) * fixed bugs in packed mode and enable pack mode tests in ci * removed unnecessary space * pr comments * pr comments * disable an average pool test * try disabling another avg pool * disable more avg pool tests * disable maxpool tests * add environment variable to control default training package's local version (#7849) * [js] update documents (#7852) * [js] update documents * escape double quotes * update operators.md * resolve comments * Support bool type for Pad CPU (#7856) * Initial commit * update * nit * Include ORT C/C++ API headers in the ORT Mobile AAR package (#7858) * Add header files of ort c/c++ api to aar package * Move header file selection to cmake based on EP choice * fix duplicated node name (#7865) * Clean up CPU kernel definition for opset 13 Pad (#7867) Co-authored-by: Hariharan Seshadri <shariharan91@gmail.com> Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com> Co-authored-by: Yufeng Li <liyufeng1987@gmail.com> Co-authored-by: Sherlock <baihan.huang@gmail.com> Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net> Co-authored-by: baijumeswani <bmeswani@microsoft.com> Co-authored-by: Tixxx <tix@microsoft.com> Co-authored-by: liqunfu <liqfu@microsoft.com> Co-authored-by: Yulong Wang <yulongw@microsoft.com> Co-authored-by: Guoyu Wang <62914304+gwang-msft@users.noreply.github.com> Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
PR #7740 naively assumed the Python Package folder where ORTModule was installed was writable, which caused loading issues when it was read-only.
This PR tries to fallback $HOME if it is writable, but ultimately raises an exception letting the user know both standard folders are not writable