classification
Title: Add sys.module_names: list of stdlib module names (Python and extension modules)
Type: Stage: patch review
Components: Library (Lib) Versions: Python 3.10
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: corona10, ronaldoussoren, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2021-01-18 10:08 by vstinner, last changed 2021-01-19 22:04 by vstinner.

Pull Requests
URL Status Linked Edit
PR 24238 open vstinner, 2021-01-18 12:18
PR 24254 vstinner, 2021-01-18 23:28
PR 24258 merged vstinner, 2021-01-19 21:45
Messages (18)
msg385180 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-01-18 10:08
Some use cases require knowing whether a module comes from the stdlib or not. For example, I would like to only dump extension modules which don't come from the stdlib in bpo-42923: "Py_FatalError(): dump the list of extension modules".

Stdlib modules are special. For example, the maintenance and updates are connected to the Python lifecycle. Stdlib modules cannot be updated with "pip install --upgrade". They are shipped with the system ("system" Python). They are usually "read only": on Unix, only the root user can write into the /usr directory where the stdlib is installed, whereas modules installed with "pip install --user" can be modified by the current user.

There is a third party project on PyPI which contains the list of stdlib modules:
https://pypi.org/project/stdlib-list/

There is already sys.builtin_module_names:
"A tuple of strings giving the names of all modules that are compiled into this Python interpreter."
https://docs.python.org/dev/library/sys.html#sys.builtin_module_names

I propose to add a similar sys.module_names tuple of strings (module names).

There are different constraints:

* If we add a public sys attribute, users will likely expect the list to be (1) exhaustive (2) correct
* Some extensions are not built if there are missing dependencies. Dependencies are checked after Python "core" (the sys module) is built.
* Some extensions are not available on some platforms.
* This list should be maintained.

Should we only list top level packages, or also submodules? For example, only list "asyncio", or list the 31 submodules (asyncio.base_events, asyncio.futures, ...)? Maybe it can be decided on a case by case basis. For example, I consider that "os.path" is a stdlib module, even if it's just an alias to "posixpath" or "ntpath" depending on the platform.

I propose to include all extensions in the list, even if they are not built/available on some platforms. For example, "winsound" would also be listed on Linux, even if the extension is specific to Windows.

I also propose to include stdlib module names even if they are overridden at runtime using PYTHONPATH with a different implementation. For example, "asyncio" would be in the list, even if a user creates an "asyncio.py" file. The list would not depend on sys.path.

--

Another option is to add an attribute to modules to mark them as coming from the stdlib. The API would be an attribute: module.__stdlib__ (bool).

The attribute could be set directly in the module code. For example, add "__stdlib__ = True" in Python modules. Similar idea for C extension modules.

Or the attribute could be set after importing the module, in the import site. But we don't control how stdlib modules are imported.

--

For the specific case of bpo-42923, another option is to use a list of stdlib paths, check module.__file__ to decide whether a module is a stdlib module, and also use sys.builtin_module_names. That way, no new API is needed.
msg385181 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-01-18 10:15
On Fedora 33, the stdlib lives in two main directories:

* /usr/lib64/python3.9: Python modules
* /usr/lib64/python3.9/lib-dynload: C extension modules

Example:

>>> import os.path
>>> import _asyncio
>>> os.path.dirname(os.__file__) # Python
'/usr/lib64/python3.9'
>>> os.path.dirname(_asyncio.__file__)
'/usr/lib64/python3.9/lib-dynload'

The Python stdlib path can be retrieved with:

>>> import sysconfig; sysconfig.get_paths()['stdlib']
'/usr/lib64/python3.9'

But I'm not sure how to retrieve the /usr/lib64/python3.9/lib-dynload path. "platstdlib" is not what I expect:

>>> import sysconfig; sysconfig.get_paths()['platstdlib']
'/usr/lib64/python3.9'

I found DESTSHARED in Makefile:

>>> sysconfig.get_config_var('DESTSHARED')
'/usr/lib64/python3.9/lib-dynload'
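
The path-based option discussed in the two messages above could be sketched roughly as follows. This is only a heuristic under stated assumptions: `stdlib_dirs` and `looks_like_stdlib` are hypothetical helper names, DESTSHARED may be unset on platforms without a Makefile (for example Windows, where extension modules live elsewhere), and site-packages is excluded explicitly because it often sits inside the stdlib directory.

```python
import os
import sys
import sysconfig


def stdlib_dirs():
    """Return directories expected to hold stdlib modules (may be incomplete)."""
    paths = sysconfig.get_paths()
    dirs = {paths["stdlib"], paths["platstdlib"]}
    # DESTSHARED comes from the Makefile; it is None where no Makefile exists.
    dynload = sysconfig.get_config_var("DESTSHARED")
    if dynload:
        dirs.add(dynload)
    return {os.path.normcase(os.path.realpath(d)) for d in dirs}


def looks_like_stdlib(module):
    """Heuristic: True if the module object appears to come from the stdlib."""
    top_level = module.__name__.partition(".")[0]
    if top_level in sys.builtin_module_names:
        return True
    filename = getattr(module, "__file__", None)
    if not filename:
        # No file to inspect (for example a namespace package).
        return False
    path = os.path.normcase(os.path.realpath(filename))
    parts = path.split(os.sep)
    # Third party packages are frequently installed below the stdlib directory.
    if "site-packages" in parts or "dist-packages" in parts:
        return False
    return any(path.startswith(d + os.sep) for d in stdlib_dirs())
```

Unlike a hardcoded list of names, this check needs the module to be imported already and depends on how the installation is laid out.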
msg385186 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-01-18 10:42
There is no technical difference between stdlib modules and other modules. Stdlib modules are only special in context of copyright and responsibility.
What if we set __author__ = 'PSF' in every stdlib module? Or other attributes which would allow to group modules by origin. Perhaps other authors would use that feature too.
msg385194 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-01-18 12:36
I wrote PR 24238 which implements a hardcoded list of stdlib module names. The PR uses the CI to ensure that the list always remains up to date. If tomorrow a new module is added, a CI job fails ("file not up to date").

> What if we set __author__ = 'PSF' in every stdlib module? Or other attributes which would allow to group modules by origin. Perhaps other authors would use that feature too.

Hum, that would go against my bpo-42923 use case.

Also for bpo-42923, I would prefer a tuple of strings ;-)
msg385197 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-01-18 13:54
Use cases for sys.module_names:

* When computing dependencies of a project, ignore modules which are part of the stdlib: https://github.com/jackmaney/pypt/issues/3

* Trace the execution of third party code, but ignore stdlib, use --ignore-module option of trace: https://stackoverflow.com/questions/6463918/how-can-i-get-a-list-of-all-the-python-standard-library-modules

* When reformatting a Python source file, group imports of stdlib modules. The isort module contains the list of stdlib modules, one list per Python version: https://github.com/PyCQA/isort/tree/develop/isort/stdlibs which is generated from the online Python documentation.
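
As an illustration of the import-grouping use case, here is a minimal sketch. It is not isort's implementation: `group_imports` is a hypothetical helper, and the collection of stdlib names is passed in by the caller rather than read from any particular attribute.

```python
import ast


def group_imports(source, stdlib_names):
    """Split the top-level names imported by `source` into two sorted lists."""
    stdlib, third_party = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]
        else:
            # Not an import, or a relative import (local to the project).
            continue
        for name in names:
            top_level = name.partition(".")[0]
            if top_level in stdlib_names:
                stdlib.add(top_level)
            else:
                third_party.add(top_level)
    return sorted(stdlib), sorted(third_party)
```

Because the check is a plain membership test on names, "import winsound" can be classified on Linux even though the module cannot be imported there.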

---

isort uses the following script to generate the list of stdlib modules:
https://github.com/PyCQA/isort/blob/develop/scripts/mkstdlibs.py

The script uses sphinx.ext.intersphinx.fetch_inventory(...)["py:module"]. This API uses objects.inv from the online Python documentation. Example of Python 3.9:

   https://docs.python.org/3.9/objects.inv

On the "dev" version, mkstdlibs.py lists 211 modules.
msg385200 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-01-18 14:54
(...) This API uses objects.inv from the online Python documentation.

By the way, there is also this documentation page listing Python stdlib modules:
https://docs.python.org/dev/py-modindex.html
msg385201 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-01-18 15:06
Another way to gather the list of Python modules: pydoc help("modules") uses pkgutil.walk_packages().

pydoc uses ModuleScanner which uses sys.builtin_module_names and pkgutil.walk_packages().

pkgutil.walk_packages() calls pkgutil.iter_modules() which iterates on sys.meta_path and sys.path, and calls iter_modules() on each importer.

For FileImporter, it iterates on os.listdir() on the importer path.

For zipimporter, it iterates on zipimport._zip_directory_cache[importer.archive].
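
A rough sketch of the scanning approach described above, restricted to the stdlib directories. It assumes the stdlib is installed as plain files on disk; `scan_stdlib_names` is a hypothetical name, and DESTSHARED is only set on platforms that have a Makefile.

```python
import pkgutil
import sys
import sysconfig


def scan_stdlib_names():
    """Collect top-level module names found in the stdlib directories."""
    paths = [sysconfig.get_paths()["stdlib"]]
    dynload = sysconfig.get_config_var("DESTSHARED")
    if dynload:
        paths.append(dynload)
    # Built-in modules have no file on disk, so start from their names.
    names = set(sys.builtin_module_names)
    # iter_modules() lists modules and packages without importing them.
    for module_info in pkgutil.iter_modules(paths):
        names.add(module_info.name)
    return names
```

The result only reflects what is installed on the current system, so modules that were not built or were unbundled by a packager are missing, which is the main difference from a hardcoded list.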
msg385208 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-01-18 17:41
Note for myself: regen-keyword should be included in "make regen-all".
msg385250 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2021-01-19 10:03
A list of stdlib modules/extensions is IMHO problematic for maintenance, esp. if you consider that not all modules/extensions are installed on all systems (both because dependencies aren't present and because packagers have decided to unbundle parts of the stdlib).

Wouldn't it be sufficient to somehow mark the stdlib entries on sys.path? Although that might give misleading answers with tools like pyinstaller/py2exe/py2app that package an application and its dependencies into a single zipfile.
msg385251 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-01-19 11:09
Ronald Oussoren:
> A list of stdlib modules/extensions is IMHO problematic for maintenance, esp. if you consider that not all modules/extensions are installed on all systems (both because dependencies aren't present and because packagers have decided to unbundle parts of the stdlib).

My PR 24238 adds a list of module names (tuple of str): sys.module_names. It includes modules which are not available on the platform. For example, "winsound" is listed on Linux but not available, and "tkinter" is listed even if the extension could not be built (missing dependency).

I tried to make it clear in the sys.module_names documentation (see my PR).

Making the list depend on whether each module is built causes other issues. For example, for the "isort" (sort and group imports) use case, you want to know on Linux if "import winsound" is a stdlib import or a third party import (same for Linux-only modules on Windows). Moreover, there is a practical issue: extension modules are only built by setup.py *after* the sys module is built. I don't want to rebuild the sys module if building an extension failed.

Even if a module is available (listed in sys.module_names and the file is present on disk), it doesn't mean that "it works". For example, "import multiprocessing" fails if there is no working lock implementation. Other modules have similar issues.


> Wouldn't it be sufficient to somehow mark the stdlib entries on sys.path? Although that might give misleading answers with tools like pyinstaller/py2exe/py2app that package an application and its dependencies into a single zipfile.

Having to actually import modules to check if it's a stdlib module or not is not convenient. Many stdlib modules have side effects on import. For example, "import antigravity" opens a web browser. An import can open files, spawn threads, run programs, etc.
msg385252 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2021-01-19 11:18
>> Wouldn't it be sufficient to somehow mark the stdlib entries on sys.path? Although that might give misleading answers with tools like pyinstaller/py2exe/py2app that package an application and its dependencies into a single zipfile.

> Having to actually import modules to check if it's a stdlib module or not is not convenient. Many stdlib modules have side effects on import. For example, "import antigravity" opens a web browser. An import can open files, spawn threads, run programs, etc.

You wouldn't necessarily have to import a module to test, this is something that could be added to importlib.  One (poorly thought out) option is to add sys._stdlib_path with the subsection of sys.path that contains the stdlib, and a function in importlib that returns if a spec is for a stdlib module.

The disadvantage of this, or for the most part anything but your initial proposal, is that it might not be safe to use a function in importlib in Py_FatalError.
msg385253 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-01-19 11:30
Myself:
> Having to actually import modules to check if it's a stdlib module or not is not convenient. Many stdlib modules have side effects on import.

The "trace" use case needs an exhaustive list of all module names. It is even less convenient to have to list all Python modules available on the system only to check modules coming from the stdlib.


Ronald:
> You wouldn't necessarily have to import a module to test, this is something that could be added to importlib.  One (poorly thought out) option is to add sys._stdlib_path with the subsection of sys.path that contains the stdlib, and a function in importlib that returns if a spec is for a stdlib module.

I'm not sure how it would work. I listed different cases which have different constraints:

* From a module name, check if it's part of the stdlib or not
* From a module object, check if it's part of the stdlib or not

For the test on the module name, how would it work with sys._stdlib_path? Should you import the module and then check if its path comes from sys._stdlib_path?


Ronald:
> The disadvantage of this, or for the most part anything but your initial proposal, is that it might not be safe to use a function in importlib in Py_FatalError.

PR 24254 is a working implementation of my use case: only list third party extension modules on a Python fatal error. It relies on the sys.module_names list from PR 24238. The implementation works when called from a signal handler (when faulthandler catches fatal signals like SIGSEGV): it avoids memory allocations on the heap (one of the limits of a signal handler).
msg385254 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2021-01-19 12:24
> On 19 Jan 2021, at 12:30, STINNER Victor <report@bugs.python.org> wrote:
> 
> Ronald:
>> You wouldn't necessarily have to import a module to test, this is something that could be added to importlib.  One (poorly thought out) option is to add sys._stdlib_path with the subsection of sys.path that contains the stdlib, and a function in importlib that returns if a spec is for a stdlib module.
> 
> I'm not sure how it would work. I listed different cases which have different constraints:
> 
> * From a module name, check if it's part of the stdlib or not
> * From a module object, check if it's part of the stdlib or not
> 
> For the test on the module name, how would it work with sys._stdlib_path? Should you import the module and then check if its path comes from sys._stdlib_path?

For a module name use importlib.util.find_spec() to locate the module (or toplevel package if this is a module in package). The new importlib function could then use the spec and sys._stdlib_path to check if the spec is one for a stdlib module.  This is pretty handwavy, but I do something similar in py2app (but completely based on paths calculated outside of the import machinery).

For a module object you can extract the spec from the object and use the same function. 

> 
> 
> Ronald:
>> The disadvantage of this, or for the most part anything but your initial proposal, is that it might not be safe to use a function in importlib in Py_FatalError.
> 
> PR 24254 is a working implementation of my use case: only list third party extension modules on a Python fatal error. It relies on the sys.module_names list from PR 24238. The implementation works when called from a signal handler (when faulthandler catches fatal signals like SIGSEGV): it avoids memory allocations on the heap (one of the limits of a signal handler).

I think we agree on that point: my counter proposal won’t work in the faulthandler scenario, and may be problematic in the Py_FatalError case as well.

Ronald
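
The spec-based check outlined in the message above might look roughly like this. It is a sketch only: neither sys._stdlib_path nor such an importlib function exists, so the stdlib directories are approximated with sysconfig, and `spec_is_stdlib` and `name_is_stdlib` are hypothetical names.

```python
import importlib.util
import os
import sysconfig


def _stdlib_paths():
    # Stand-in for the proposed sys._stdlib_path (an assumption of this sketch).
    paths = {sysconfig.get_paths()["stdlib"]}
    dynload = sysconfig.get_config_var("DESTSHARED")  # None without a Makefile
    if dynload:
        paths.add(dynload)
    return {os.path.normcase(os.path.realpath(p)) for p in paths}


def spec_is_stdlib(spec):
    """Heuristic: True if the module spec points into the stdlib."""
    if spec is None or not spec.origin:
        return False
    if spec.origin in ("built-in", "frozen"):
        return True
    origin = os.path.normcase(os.path.realpath(spec.origin))
    parts = origin.split(os.sep)
    # site-packages is often located inside the stdlib directory.
    if "site-packages" in parts or "dist-packages" in parts:
        return False
    return any(origin.startswith(p + os.sep) for p in _stdlib_paths())


def name_is_stdlib(name):
    """Locate the top-level module without importing it, then check its spec."""
    top_level = name.partition(".")[0]
    try:
        spec = importlib.util.find_spec(top_level)
    except (ImportError, ValueError):
        return False
    return spec_is_stdlib(spec)
```

For a module object, the same `spec_is_stdlib` function could be applied to `module.__spec__`. Note that the answer depends on sys.path, so a third party "asyncio.py" placed before the stdlib would be reported as non-stdlib, and a stdlib module missing on the current platform is reported as non-stdlib as well.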
msg385255 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2021-01-19 12:39
BTW. A list of stdlib module names is not sufficient to determine if a module is in the stdlib: thanks to .pth files, it is possible to have entries on sys.path before the stdlib.
msg385270 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-01-19 16:15
Ronald:
> BTW. A list of stdlib module names is not sufficient to determine if a module is in the stdlib, thanks to .pth files it is possible to have entries on sys.path before the stdlib.

Yeah, I wrote it in my first message. I solved this issue with documentation:

"Note: If a third party module has the same name than a standard library module and it comes before the standard library in sys.path, it overrides the standard library module on import."

I updated sys.module_names documentation in my PR.

IMO it's an acceptable limitation.
msg385272 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-01-19 16:35
Ronald:
> I think we agree on that point: my counter proposal won’t work in the faulthandler scenario, and may be problematic in the Py_FatalError case as well.

The API providing a tuple of str (sys.module_names) works with the 4 use cases that I listed:

* faulthandler/Py_FatalError (dump third party extensions of sys.modules)
* isort (group stdlib imports)
* trace (don't trace stdlib modules)
* pypt (ignore stdlib modules when computing dependencies)


Ronald:
> Although that might give misleading answers with tools like pyinstaller/py2exe/py2app that package an application and its dependencies into a single zipfile.

These tools worked for years without sys.module_names and don't need to be modified to use sys.module_names.

sys.module_names is well defined, extract of its doc:

"The list is the same on all platforms. Modules which are not available on some platforms and modules disabled at Python build are also listed."

These packaging tools may require further checks than just checking if the name is in sys.module_names. These tools are complex anyway ;-)
msg385273 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-01-19 16:37
> One (poorly thought out) option is to add sys._stdlib_path with the subsection of sys.path that contains the stdlib, and a function in importlib that returns if a spec is for a stdlib module.

There is already sysconfig.get_paths()['stdlib']. Maybe we need to add a new key for extension modules.

I don't think that these two options are exclusive. For me, it can even be a similar but different use case.
msg385300 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-01-19 22:04
New changeset cad8020cb83ec6d904f874c0e4f599e651022196 by Victor Stinner in branch 'master':
bpo-42955: Add Python/module_names.h (GH-24258)
https://github.com/python/cpython/commit/cad8020cb83ec6d904f874c0e4f599e651022196
History
Date User Action Args
2021-01-19 22:04:57 vstinner set messages: + msg385300
2021-01-19 21:45:40 vstinner set pull_requests: + pull_request23082
2021-01-19 16:37:42 vstinner set messages: + msg385273
2021-01-19 16:35:33 vstinner set messages: + msg385272
2021-01-19 16:15:13 vstinner set messages: + msg385270
2021-01-19 12:39:23 ronaldoussoren set messages: + msg385255
2021-01-19 12:24:41 ronaldoussoren set messages: + msg385254
2021-01-19 11:30:46 vstinner set messages: + msg385253
2021-01-19 11:18:36 ronaldoussoren set messages: + msg385252
2021-01-19 11:09:56 vstinner set messages: + msg385251
2021-01-19 10:03:19 ronaldoussoren set nosy: + ronaldoussoren; messages: + msg385250
2021-01-19 09:25:49 corona10 set nosy: + corona10
2021-01-18 23:28:49 vstinner set pull_requests: + pull_request23075
2021-01-18 17:41:45 vstinner set messages: + msg385208
2021-01-18 15:06:38 vstinner set messages: + msg385201
2021-01-18 14:54:00 vstinner set messages: + msg385200
2021-01-18 13:54:44 vstinner set messages: + msg385197
2021-01-18 12:36:12 vstinner set messages: + msg385194
2021-01-18 12:18:48 vstinner set keywords: + patch; stage: patch review; pull_requests: + pull_request23060
2021-01-18 10:42:55 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg385186
2021-01-18 10:15:17 vstinner set messages: + msg385181
2021-01-18 10:08:34 vstinner create