A refactor in #27946 introduced "unprotected" (not surrounded by GAP_Enter/GAP_Leave) GAP_ValueGlobalVariable calls. I believe this might be a GC hazard, because after updating to GAP 4.12.1 I started seeing aarch64 crashes on NixOS infrastructure such as:
#0 0x0000fffff79740e8 in wait4 ()
#1 0x0000fffff5dc6b78 in print_enhanced_backtrace ()
#2 0x0000fffff5dc8190 in sigdie ()
#3 0x0000fffff5dcb1c0 in cysigs_signal_handler ()
#4 0x0000fffff7ffb7cc in __kernel_rt_sigreturn ()
#5 0x0000ffff99a0bf28 in ConvString ()
#6 0x0000000000000000 in ?? ()
#7 0x0000000000000000 in ?? ()
#8 0x0000000000000000 in ?? ()
#9 0x0000ffff99989930 in Pr ()
#10 0x0000ffff9998aa18 in CloseOutput ()
#11 0x0000ffff99884828 in capture_stdout () at /build/sage-src-9.7/pkgs/sagemath-standard/sage/libs/gap/element.pyx:154
...
I also see cases where capture_stdout throws errors such as sage.libs.gap.util.GAPError: Error, Length: <list> must be a list (not the integer 255) and then crashes. Both types of errors are fixed by this ticket.
Note that I am nesting GAP_Enter/GAP_Leave calls because I didn't remove the preexisting calls inside capture_stdout. That's because I feared removing the innermost calls might create a new footgun (and I believe nested GAP_Enter/GAP_Leave calls are explicitly supported), but removing them should cause no problem. Removing them might even be preferable for performance reasons, I don't know.
This ticket's branch is based on #34391.
(Edit: I previously thought that holding onto a char* past the GAP_Enter/GAP_Leave blocks, as is done in GapElement's __str__ and _repr_ methods, could also cause GC hazards. It doesn't seem to be a problem in practice, though, and I am not qualified to tell if it's a problem in theory.)
Depends on #34391
CC: @embray @videlec @dimpase
Component: interfaces
Author: Mauricio Collares
Issue created by migration from https://trac.sagemath.org/ticket/34701
Edit: Removed branch. The code changes are now at #35114.
A refactor in #27946 introduced "unprotected" (not surrounded by
GAP_Enter/GAP_Leave)GAP_ValueGlobalVariablecalls. I believe this might be a GC hazard, because after updating to GAP 4.12.1 I started seeing aarch64 crashes on NixOS infrastructure such as:I also see cases where
capture_stdoutthrows errors such assage.libs.gap.util.GAPError: Error, Length: <list> must be a list (not the integer 255)and then crashes. Both types of errors are fixed by this ticket.Note that I am nesting
GAP_Enter/GAP_Leavecalls because I didn't remove the preexisting calls insidecapture_stdout. That's because I feared removing the innermost calls might create a new footgun (and I believe nestedGAP_Enter/GAP_Leavecalls are explicitly supported), but removing them should cause no problem. Removing them might even be preferable for performance reasons, I don't know.This ticket's branch is based on #34391.
(Edit: I previously thought that holding onto a
char*past theGAP_Enter/GAP_Leaveblocks, as is done in GapElement's__str__and_repr_methods, could also cause GC hazards. It doesn't seem to be a problem in practice, though, and I am not qualified to tell if it's a problem in theory.)Depends on #34391
CC: @embray @videlec @dimpase
Component: interfaces
Author: Mauricio Collares
Issue created by migration from https://trac.sagemath.org/ticket/34701
Edit: Removed branch. The code changes are now at #35114.