Skip to content

filed: fix python plugin crash on python <3.10#1889

Merged
BareosBot merged 4 commits intobareos:masterfrom
sebsura:dev/ssura/master/fix-python-caching-issue
Aug 8, 2024
Merged

filed: fix python plugin crash on python <3.10#1889
BareosBot merged 4 commits intobareos:masterfrom
sebsura:dev/ssura/master/fix-python-caching-issue

Conversation

@sebsura
Copy link
Contributor

@sebsura sebsura commented Jul 11, 2024

Pythons internal cache may contain dangling pointers to objects of already destroyed interpreters. This is thankfully very unlikely but it does still happen. As such we need to make sure to clear the cache whenever we destroy a subinterpreter.

For more details see the commit messages.

Thank you for contributing to the Bareos Project!

Please check

  • Short description and the purpose of this PR is present above this paragraph
  • Your name is present in the AUTHORS file (optional)

If you have any questions or problems, please give a comment in the PR.

Helpful documentation and best practices

Checklist for the reviewer of the PR (will be processed by the Bareos team)

Make sure you check/merge the PR using devtools/pr-tool to have some simple automated checks run and a proper changelog record added.

General
  • Is the PR title usable as CHANGELOG entry?
  • Purpose of the PR is understood
  • Commit descriptions are understandable and well formatted
  • Required backport PRs have been created
  • Correct milestone is set
Source code quality
  • Source code changes are understandable
  • Variable and function names are meaningful
  • Code comments are correct (logically and spelling)

@sebsura sebsura force-pushed the dev/ssura/master/fix-python-caching-issue branch from 78e2c89 to f409fcd Compare July 11, 2024 06:21
@sebsura sebsura requested a review from arogge July 12, 2024 05:46
@sebsura sebsura self-assigned this Jul 15, 2024
@pstorz pstorz requested a review from sduehr August 5, 2024 09:49
@sduehr sduehr added this to the 24.0.0 milestone Aug 7, 2024
Copy link
Member

@sduehr sduehr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good.

@sebsura sebsura force-pushed the dev/ssura/master/fix-python-caching-issue branch from f409fcd to cb8c9f2 Compare August 8, 2024 11:00
sebsura and others added 4 commits August 8, 2024 12:05
I am not sure if this is necessary, but the documentation says:

```
The global interpreter lock need not be held, but may be held if
it is necessary to serialize calls to this function.
```

But it does not elaborate on *when* it is necessary to serialize those
calls.  As such I will pretend that it is also necessary and use the
gil to do so.
While it should technically be no problem to call the bareosfd capi
without holding the gil, we have to keep in mind that we are still
modifying shared global state and as such we need to serialize those
calls.  (Ab-)Using the gil to do so is the easiest way to fix these
problems.

I am not 100% sure how running multiple plugin jobs in parallel never
ran into issues considering that handlePluginEvent was constantly
clobbering the global "plugin context" in bareosfd.  Its possible that
im overlooking something.
Python uses a (Type, Name) -> Object cache for member lookups inside
of types.  This cache was shared globally by all interpreters up until
3.10, where it was made to be per-interpreter instead:

```
// typeobject.c
static struct method_cache_entry method_cache[1 << MCACHE_SIZE_EXP];
```

The cache does sanity checking to make sure to only return the correct
answers:
```
        if (method_cache[h].version == type->tp_version_tag &&
            method_cache[h].name == name) {
            method_cache_hits++;
            return method_cache[h].value;
        }
```

but the problem is that this does not correctly work with static
types:  Their type id (type->tp_version_tag) is the same for all
interpreters since they all reuse the same type.  This means that if
you ask for the same members in two different interpreters and those
member names are allocated *at the same address*, then you will get
the same object pointer back.  This object pointer may not belong to
one of those interpreters though and in fact can be a dangling pointer
if the other interpreter was already killed.

This actually happened at a customer when accessing
`datetime.datetime.max` where the `Lookup(datetime, max)` call
actually returned a cached result from a previous interpreter.  This
may have to do with string interning (which caused the 'max' name to
have the same address), but it could also have been a pure
coincidence.

Thankfully python offers a (stable api) call to clear the cache.  This
bug can be worked around by just calling that function everytime we
destroy an interpreter so that the map never contains dangling
pointers.  This obviously does not fix the problem completely but it
should at least fix it for cases where there are no concurrent python
jobs running.

Its possible that there are more suprises like this in the python code
base.
@BareosBot BareosBot force-pushed the dev/ssura/master/fix-python-caching-issue branch from f6f7650 to a42e3e3 Compare August 8, 2024 12:05
@BareosBot BareosBot merged commit 2184b4a into bareos:master Aug 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants