Add cpp ext lock file check during ORTModule init #7740
thiagocrepaldi merged 2 commits into master
I don't know Python very well, but if this were done in C/C++, the good practice would be:
It's OK if the lock file is left behind when the process exits unexpectedly, because the operating system automatically releases the file lock. This is very reliable because the atomicity is guaranteed by the OS.
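To illustrate the OS-level locking idea in Python, here is a minimal sketch, assuming a Unix-like system where `fcntl.flock` is available. The helper names `try_acquire` and `release` are hypothetical, and this is not how the PyTorch extension framework implements its lock. The point is that the lock lives on the open file handle, so a leftover file on disk does not block the next run.

```python
import fcntl


def try_acquire(lock_path):
    """Try to take an exclusive, non-blocking advisory lock on lock_path.

    Returns the open file object on success, or None if another open
    handle currently holds the lock. (Hypothetical helper, Unix only.)
    """
    f = open(lock_path, "a+")  # creates the file if it does not exist
    try:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Someone else holds the lock right now.
        f.close()
        return None
    return f


def release(f):
    """Release the lock and close the handle. The file stays on disk."""
    fcntl.flock(f.fileno(), fcntl.LOCK_UN)
    f.close()
```

If the process holding the lock dies, the OS closes its file descriptors and the lock is dropped with them, so a later `try_acquire` on the same path succeeds even though the file is still there.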
The locking is done by the PyTorch C++ extension framework, which we don't control. This PR just warns the user about a possible issue. A lock file can also be present in other scenarios, such as several instances running in parallel; in that case we can't delete the lock, and it is fine to block so that the build process is serialized.
ORTModule compiles at least 2 Torch C++ extensions at runtime. When ORTModule is interrupted during such a compilation, a lock file is left behind, which will cause ORTModule to hang the next time it runs.
Deleting this lock is tricky, as it can break the C++ extension framework, so a warning is raised to inform the user about the possible issue. They can choose to delete the lock themselves and resume the run.
The C++ extensions were moved to a separate file.
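The lock-file warning described above could look roughly like the following. This is a sketch under stated assumptions, not the code from this PR: the function name `warn_if_lock_present`, the `build_dir` argument, and the default lock file name `lock` are all illustrative.

```python
import os
import warnings


def warn_if_lock_present(build_dir, lock_name="lock"):
    """Warn, but do not delete, if a lock file exists in build_dir.

    The lock may be stale (a previous run was interrupted) or may belong
    to another instance that is building right now, so removing it is
    left to the user. Both parameters are assumptions for illustration.
    Returns True if a lock file was found.
    """
    lock_path = os.path.join(build_dir, lock_name)
    if os.path.exists(lock_path):
        warnings.warn(
            f"Lock file found at {lock_path}. If no other process is "
            "building the extensions, a previous run was probably "
            "interrupted; delete the file manually to avoid a hang."
        )
        return True
    return False
```

Only warning keeps the parallel case safe: if another instance really is compiling, the lock stays in place and this process simply waits for the build to finish.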