This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Evil Trashcan and GC interaction
Type: Stage:
Components: Interpreter Core Versions:
process
Status: closed Resolution:
Dependencies: Superseder:
Assigned To: nascheme Nosy List: gvanrossum, nascheme, tim.peters, tismer, zerohero
Priority: high Keywords:

Created on 2002-03-27 20:47 by gvanrossum, last changed 2022-04-10 16:05 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
tup.py tim.peters, 2002-03-27 21:04 Test case to provoke the crash
trashcan_gc.diff nascheme, 2002-03-27 22:18
lis.py gvanrossum, 2002-03-28 01:17
trash-2.1.2.diff gvanrossum, 2002-03-28 02:15 Patch for 2.1.2
ceval.c.diff gvanrossum, 2002-03-28 18:21 Separate fix for SETLOCAL() race in ceval.c
Messages (17)
msg10060 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2002-03-27 20:47
This bug is only a hypothetical bug -- from source 
code inspection it appears clear that it can happen, 
and we have a Zope user who is reporting crashes that 
are consistent with this, but it isn't confirmed that 
this is indeed the cause.  Tim is attempting to 
reproduce it while I'm typing this.

The trashcan code stores NULL or other values in the 
ob_type field of an object, while that object is still 
in the GC doubly-linked list of container objects.  
When the collector is invoked, it dereferences ob_type 
(specifically, in subtract_refs, it retrieves ob_type-
>tp_traverse and calls it).  Obviously, the value 
stored in ob_type by the trashcan code causes this to 
fail.

This bug occurs in all versions of Python that have 
the GC code enabled (i.e. 2.0 and above).  In 2.1 and 
before, the code looks different than in 2.2, but the 
effect is the same: the trashcan code abuses ob_type 
of an object that's in a GC generational chain.

If we fix this, a 2.1.3 release including the fix 
should be issued, and we should hold up the final 
release of 2.2.1 to include a fix as well.
msg10061 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2002-03-27 21:04
Logged In: YES 
user_id=31435

Attaching a brief program (tup.py) that provokes a crash 
quickly, at least on Windows.  The crash goes away if the 
trashcan limit is boosted above the depth of tuple nesting 
this program creates.
msg10062 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2002-03-27 21:55
Logged In: YES 
user_id=35752

The fix should be simple.  Coming soon. :-)
msg10063 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2002-03-27 22:18
Logged In: YES 
user_id=35752

Well, not as simple as I had hoped.  Attached is a rough
patch that fixes the bug (I think, I haven't had time to
really study the interaction between GC and the trashcan
macros).
msg10064 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2002-03-28 01:17
Logged In: YES 
user_id=6380

Alas, that fixes it for tuples, but not for lists. I've
uploaded the list example as lis.py.

We also need a fix for 2.1.2 (to help the Zope user who
originally ran into this) and one for 2.2.1 -- the latter is
needed quickly if we want to make the final release.
msg10065 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2002-03-28 02:15
Logged In: YES 
user_id=6380

Here's a patch for the 2.1 maintenance branch. AFAICT this
fixes the problem for lists, tuples and dicts, which is a
100% fix (the other trashcan objects, frames and tracebacks,
don't use GC).

Note: the dict test case is just like the tuple or list
testcase, but has
  t = {1: t, 2: Ouch()}
as the last line of f().
msg10066 - (view) Author: Matthew T. Kromer (zerohero) Date: 2002-03-28 14:10
Logged In: YES 
user_id=378059

Leo in Brazil (the Zope customer experiencing the problem)
has bumped the trashcan depth limit up to 5 million, and
been stable for the entire overnight and morning production
periods.

He confirms the description and I believe he will be trying
the patch today.
msg10067 - (view) Author: Christian Tismer (old) (tismer) Date: 2002-03-28 14:20
Logged In: YES 
user_id=105700

Ouchh!
I will look into this and find an alternative
implementation. My current abuse of the type field
interacts with that fact that GC still needs it.
We just moved into a new house, I will not be
online often over Easter. Will sove it in the
coming week.
ciao - chris
msg10068 - (view) Author: Christian Tismer (old) (tismer) Date: 2002-03-28 14:20
Logged In: YES 
user_id=105700

Ouchh!
I will look into this and find an alternative
implementation. My current abuse of the type field
interacts with that fact that GC still needs it.
We just moved into a new house, I will not be
online often over Easter. Will sove it in the
coming week.
ciao - chris
msg10069 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2002-03-28 15:58
Logged In: YES 
user_id=35752

I'm not sure the abuse of the ob_type pointer is the only
problem we are facing here.  I've have a patch that changes
trashcan so that it doesn't overwrite the ob_type pointer.
I can still make the interpreter crash.  I'm still trying
to understand this problem completely. :-(
msg10070 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2002-03-28 16:28
Logged In: YES 
user_id=6380

Well, it *also* abuses the ob_refcnt field.

How about this wild idea (which Tim & I had simultaneously
yesterday): change the trashcan code to simply leave the
object in the GC generational list, and let the GC code
special-case objects with zero refcnt so that they are
eventually properly disposed?
msg10071 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2002-03-28 18:21
Logged In: YES 
user_id=6380

Looks like there are at least two independent bugs exposed
by the lis.py example in the CVS trunk/head: on Windows,
traversing the frame object can encounter freed object
because of the way the SETLOCAL() macro first decrefs
fastlocals[i] and then assigns it a new value. I've got a
fix for this (separate, see ceval.c.diff attached). On
Linux, the traceback points to the brand new semaphore code;
haven't researched this further yet.
msg10072 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2002-03-28 19:32
Logged In: YES 
user_id=6380

The new semaphore code is innocent -- it wasn't even used in
this case.

I cannot reproduce the bug with Neil's patch + my ceval.c
patch any more, so I consider this case closed.

I'll check the necessary things in to the 2.1 and 2.2
maintenance branches. Neil, can you check in your patch to
the trunk?
msg10073 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2002-03-28 20:44
Logged In: YES 
user_id=6380

Closing this; fixes have been checked in.

Maybe someone (Tim?) can check in the test case? Only the
trunk seems sufficient.

Christian: no need for a new trashcan IMO.
msg10074 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2002-03-28 22:10
Logged In: YES 
user_id=31435

Just noting that I added a test case to test_gc.py in the 
trunk.
msg10075 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2002-03-28 22:58
Logged In: YES 
user_id=35752

I'm working on the trashcan code fix right now (use gc_prev
instead of ob_type).  I was going to check it in but an
interpreter built --without-cycle-gc and with Py_DEBUG
defined crashes.  I don't know if our recent changes broke
things of if --without-cycle-gc has been broken for a while.
msg10076 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2002-03-28 23:28
Logged In: YES 
user_id=31435

Sorry, Neil, no clue -- PythonLabs likely builds every day 
in release and debug (Py_DEBUG) modes, each using all 
platform defaults, but I don't believe any of us ever 
builds in any other way.
History
Date User Action Args
2022-04-10 16:05:10adminsetgithub: 36341
2002-03-27 20:47:53gvanrossumcreate