Skip to content

Fix permission error for ORTModule lock file#7814

Merged
thiagocrepaldi merged 3 commits intomasterfrom
thiagofc/fallback-lock
May 26, 2021
Merged

Fix permission error for ORTModule lock file#7814
thiagocrepaldi merged 3 commits intomasterfrom
thiagofc/fallback-lock

Conversation

@thiagocrepaldi
Copy link
Contributor

PR #7740 naively assumed the Python Package folder where ORTModule was installed was writable, which caused loading issues when it was read-only.

This PR tries to fallback $HOME if it is writable, but ultimately raises an exception letting the user know both standard folders are not writable

Copy link
Contributor

@tlh20 tlh20 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I've followed through the underlying PyTorch code there is an existing environment variable TORCH_EXTENSIONS_DIR for overriding the default location. Should we give that priority (if set)?
https://github.com/pytorch/pytorch/blob/aad2ad883ad70e2512d5d6da976a5a8f656cfb05/torch/utils/cpp_extension.py#L1600-L1603

@thiagocrepaldi
Copy link
Contributor Author

If I've followed through the underlying PyTorch code there is an existing environment variable TORCH_EXTENSIONS_DIR for overriding the default location. Should we give that priority (if set)?
https://github.com/pytorch/pytorch/blob/aad2ad883ad70e2512d5d6da976a5a8f656cfb05/torch/utils/cpp_extension.py#L1600-L1603

Thanks, that is a reasonable idea, although I usually try not to depend on environment variables. It can get complicated when customers use Azure ML or other computes in which they don't have full control over the environment. OS differences might lead to slight differences too. My original intent was to add build_directory as an advanced configuration to ORTModule (which is a pending task yet). This flag is explicitly set to torch.cpp_extension.load_inline method. The missing part is allowing user the user to specify it, which would not depend modifying dockerfiles, azure yaml configuration, etc

Maybe we could do both. Prefer explicit configuration but ultimately fallback to env var

@thiagocrepaldi thiagocrepaldi force-pushed the thiagofc/fallback-lock branch from c7b82ed to aa37eee Compare May 26, 2021 00:01
@thiagocrepaldi thiagocrepaldi requested a review from a team as a code owner May 26, 2021 16:51
@thiagocrepaldi thiagocrepaldi merged commit c5ea590 into master May 26, 2021
@thiagocrepaldi thiagocrepaldi deleted the thiagofc/fallback-lock branch May 26, 2021 21:18
print("WARNING: ORTModule detected PyTorch CPP extension's lock file during initialization, "
"which can cause unexpected hangs. "
f"Delete {os.path.join(TORCH_CPP_BUILD_DIR,'lock')} to prevent unexpected behavior.")
f"Delete {os.path.join(TORCH_CPP_BUILD_DIR,'lock')} if a hang occurs.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should avoid using the word "hang" in our code. We can say it stops responding.

xzhu1900 added a commit that referenced this pull request May 28, 2021
* Fix bug in Transpose CUDA kernel (#7329)

* Fix permission error for ORTModule lock file (#7814)

* fix topo sort in quant tool (#7833)

* fix topo sort in quant tool

* add unit test and make the topo sort stable

* Relax tol for Conv1D fp16 test (#7844)

* Relax tol for Conv1D fp16 test

Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>

* Resolve issue with wrapped ORTModule load_state_dict (#7847)

* Encapsulate children modules inside a ModuleAccessor object to prevent erroneuos iteration over children while loading the state dictionary

* Add named_models, models, apply methods, change ModuleAccessor to ModuleMetadata and modify unit tests

* Change ModuleMetadata module getter logic, raise NotImplementedError for add_modules

* Add comment explaining why overriding _load_from_state_dict method is needed

* fixed bugs in packed mode and enable pack mode tests in ci (#7848)

* fixed bugs in packed mode and enable pack mode tests in ci

* removed unnecessary space

* pr comments

* pr comments

* disable an average pool test

* try disabling another avg pool

* disable more avg pool tests

* disable maxpool tests

* add environment variable to control default training package's local version (#7849)

* [js] update documents (#7852)

* [js] update documents

* escape double quotes

* update operators.md

* resolve comments

* Support bool type for Pad CPU (#7856)

* Initial commit

* update

* nit

* Include ORT C/C++ API headers in the ORT Mobile AAR package (#7858)

* Add header files of ort c/c++ api to aar package

* Move header file selection to cmake based on EP choice

* fix duplicated node name (#7865)

* Clean up CPU kernel definition for opset 13 Pad (#7867)

Co-authored-by: Hariharan Seshadri <shariharan91@gmail.com>
Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>
Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
Co-authored-by: Sherlock <baihan.huang@gmail.com>
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: baijumeswani <bmeswani@microsoft.com>
Co-authored-by: Tixxx <tix@microsoft.com>
Co-authored-by: liqunfu <liqfu@microsoft.com>
Co-authored-by: Yulong Wang <yulongw@microsoft.com>
Co-authored-by: Guoyu Wang <62914304+gwang-msft@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants