Skip to content

Add cpp ext lock file check during ORTModule init#7740

Merged
thiagocrepaldi merged 2 commits intomasterfrom
thiagofc/handle-cpp-extension-lock
May 18, 2021
Merged

Add cpp ext lock file check during ORTModule init#7740
thiagocrepaldi merged 2 commits intomasterfrom
thiagofc/handle-cpp-extension-lock

Conversation

@thiagocrepaldi
Copy link
Contributor

@thiagocrepaldi thiagocrepaldi commented May 18, 2021

ORTModule compiles at least 2 Torch CPP extensions in runtime. When ORTModule is interrupted during such compilation, a lock file is left behind, which will cause ORTModule to hang next time it runs

Deleting this lock is tricky as it can break CPP extension framework, so a warning is raised to the user to inform them about the possible issue. They have the choice to delete the lock and resume run themselves

CPP extensions were moved to a separate file

baijumeswani
baijumeswani previously approved these changes May 18, 2021
@snnn
Copy link
Contributor

snnn commented May 18, 2021

I don't know python too much. But if it was done in C/C++, the good practice should be:

  1. Create the lock file if it doesn't exit
  2. (optional) write down the pid into the file
  3. Try lock it. If it failed, read the lock file and report the pid so that the user could know who is using it.

It's ok if the lock file is left behind when the process exited unexpectedly. Because the operating system will auto release the file lock. This is very reliable because the atomicity is guaranteed by the OS.

@thiagocrepaldi
Copy link
Contributor Author

I don't know python too much. But if it was done in C/C++, the good practice should be:

  1. Create the lock file if it doesn't exit
  2. (optional) write down the pid into the file
  3. Try lock it. If it failed, read the lock file and report the pid so that the user could know who is using it.

It's ok if the lock file is left behind when the process exited unexpectedly. Because the operating system will auto release the file lock. This is very reliable because the atomicity is guaranteed by the OS.

This is a PyTorch CPP+ extension framework which we don't have control. This PR just warns user about a possible issue. There are other scenarios in which this can happen, including several instances running in parallel, in which case we can't delete the lock and it is fine to block to serialize the build process

@thiagocrepaldi thiagocrepaldi merged commit e05b151 into master May 18, 2021
@thiagocrepaldi thiagocrepaldi deleted the thiagofc/handle-cpp-extension-lock branch May 18, 2021 19:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants