Skip to content

Introduce ExtensionVersioner for C++ extensions#11725

Closed
goldsborough wants to merge 1 commit intopytorch:masterfrom
goldsborough:reload-cpp-ext
Closed

Introduce ExtensionVersioner for C++ extensions#11725
goldsborough wants to merge 1 commit intopytorch:masterfrom
goldsborough:reload-cpp-ext

Conversation

@goldsborough
Copy link
Contributor

@goldsborough goldsborough commented Sep 14, 2018

Python never closes shared library it dlopens. This means that calling load or load_inline (i.e. building a JIT C++ extension) with the same C++ extension name twice in the same Python process will never re-load the library, even if the compiled source code and the underlying shared library have changed. The only way to circumvent this is to create a new library and load it under a new module name.

I fix this, of course, by introducing a layer of indirection. Loading a JIT C++ extension now goes through an ExtensionVersioner, which hashes the contents of the source files as well as build flags, and if this hash changed, bumps an internal version stored for each module name. A bump in the version will result in the ninja file being edited and a new shared library and effectively a new C++ extension to be compiled. For this the version name is appended as _v<version> to the extension name for all versions greater zero.

One caveat is that if you were to update your code many times and always re-load it in the same process, you may end up with quite a lot of shared library objects in your extension's folder under /tmp. I imagine this isn't too bad, since extensions are typically small and there isn't really a good way for us to garbage collect old libraries, since we don't know what still has handles to them.

Fixes #11398 CC @t-vi

@ezyang @gchanan @soumith @fmassa

@t-vi
Copy link
Collaborator

t-vi commented Sep 17, 2018

Thank you!

@goldsborough goldsborough changed the title Introduce ExtensionCache for C++ extensions Introduce ExtensionVersioner for C++ extensions Sep 17, 2018
@goldsborough
Copy link
Contributor Author

@peterjc123 could you do some digging for the Windows failure? I'm seeing

01:10:44 [2/2] "C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.11.25503\bin\Hostx86\x64/link.exe" main.o /nologo /DLL caffe2.lib torch.lib _C.lib /LIBPATH:C:\Jenkins\Miniconda3\libs /LIBPATH:C:\Jenkins\workspace\pytorch-builds\pytorch-win-ws2016-cuda9-cudnn7-py3-test2\test\torch\lib /out:reloaded_jit_extension_v1.pyd

01:10:44 FAILED: reloaded_jit_extension_v1.pyd 

01:10:44 "C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.11.25503\bin\Hostx86\x64/link.exe" main.o /nologo /DLL caffe2.lib torch.lib _C.lib /LIBPATH:C:\Jenkins\Miniconda3\libs /LIBPATH:C:\Jenkins\workspace\pytorch-builds\pytorch-win-ws2016-cuda9-cudnn7-py3-test2\test\torch\lib /out:reloaded_jit_extension_v1.pyd

01:10:44 LINK : fatal error LNK1104: cannot open file 'reloaded_jit_extension_v1.pyd'

01:10:44 ninja: build stopped: subcommand failed.

which means it can't link the second version for some reason. I don't quite understand why it only fails on Windows like this. It seems to compile the first version reloaded_jit_extension_v0.pyd fine but then fails for the second version.

@peterjc123
Copy link
Collaborator

@goldsborough It's probably a permission issue. Is this library already in use during build?

@goldsborough
Copy link
Contributor Author

@peterjc123 Sorry could you elaborate on what you mean?

@peterjc123
Copy link
Collaborator

@goldsborough It seems that reloaded_jit_extension_v1.pyd is being used when compiling. On Windows, deleting a file that is used by another process is not allowed.

@goldsborough
Copy link
Contributor Author

@peterjc123 I see. I don't know why that would be happening. It is going through the exact same code path as version 0, with the exact same function (_jit_compile). It's true that the version 0 library would be in use at the same time, and the ninja file is overwritten for version 1, but time-wise version 1 is built sequentially after version 0, so I don't think there would be interference there.

@ezyang
Copy link
Contributor

ezyang commented Sep 18, 2018

One day, someone is going to deploy hot loadable C++ extensions with PyTorch in prod, and then wonder why after a week the Python process crashes XD

@goldsborough
Copy link
Contributor Author

@ezyang as long as they don't modify the source code they reference from their JIT extension running in prod and re-load it, they should be ok :)

@t-vi
Copy link
Collaborator

t-vi commented Sep 18, 2018 via email

@peterjc123
Copy link
Collaborator

peterjc123 commented Sep 18, 2018

@goldsborough Is reloaded_jit_extension_v1 built twice? When built for the first time, it should be okay from the log. But the second time, it would fail because it is still being used(loaded in python). We could not delete one file under this circumstances.
Possible solution:

  1. Unload the DLL (impossible)
  2. Append hash into filename (hard to load)
  3. Generate libraries in new directories and use the latest one by appending PATHs or using some sorting method (feasible)
  4. Skip this test (not good but it is the last way to go)
17:45:28 Building extension module reloaded_jit_extension...
17:45:28 Loading extension module reloaded_jit_extension...
17:45:28 Using C:\Windows\TEMP\torch_extensions as PyTorch extensions root...
17:45:28 Emitting ninja build file C:\Windows\TEMP\torch_extensions\reloaded_jit_extension\build.ninja...
17:45:28 Building extension module reloaded_jit_extension_v1... (first time: okay)
17:45:28 Loading extension module reloaded_jit_extension_v1...
17:45:28 Using C:\Windows\TEMP\torch_extensions as PyTorch extensions root...
17:45:28 Emitting ninja build file C:\Windows\TEMP\torch_extensions\reloaded_jit_extension\build.ninja...
17:45:28 Building extension module reloaded_jit_extension_v1... (second time: fail)
17:45:28 Traceback (most recent call last):
17:45:28   File "run_test.py", line 392, in <module>
17:45:28     main()
17:45:28   File "run_test.py", line 384, in main
17:45:28     raise RuntimeError(message)
17:45:28 RuntimeError: test_cpp_extensions failed!

@goldsborough goldsborough force-pushed the reload-cpp-ext branch 3 times, most recently from cc69f24 to 0ec2c32 Compare September 19, 2018 15:45
@goldsborough
Copy link
Contributor Author

This is looking healthy. Hooray! I think the one failing test is spurious. Let me retest it

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

goldsborough has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@goldsborough
Copy link
Contributor Author

I made a change such that we don't even attempt to re-build an unchanged library. Previously we were just relying on Ninja for this, but now we know this messes with Windows due to access to a file from multiple processes. Thanks for your help @peterjc123.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

goldsborough is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

torch.utils.cpp_extension.load/load_inline doesn't reload when rerun with the same name

6 participants