Skip to content

document and maybe report deadlock in python 3.10's subprocess.Popen.communicate #3697

@sporksmith

Description

@sporksmith

The problem

Without --model-unblocked-syscall-latency, running the following python script (with python 3.10) ends up looping forever in shadow:

import subprocess as sp

p = sp.Popen(['sleep', '1'], stdin=sp.PIPE, stdout=sp.PIPE, stderr=sp.PIPE)
while True:
    try:
        p.communicate(timeout=0.01)
    except sp.TimeoutExpired:
        print("timeout")
        continue
    break
print("done")

Debugging

We can see a single-step through the relevant python code by inserting a ptyon breakpoint() statement to break into pdb...

@@ -3,6 +3,7 @@ import subprocess as sp
 p = sp.Popen(['sleep', '1'], stdin=sp.PIPE, stdout=sp.PIPE, stderr=sp.PIPE)
 while True:
     try:
+        breakpoint()
         p.communicate(timeout=0.01)
     except sp.TimeoutExpired:
         print("timeout")

...and piping in an infinite sequence of s (single-step) commands

$ shadow-exec -- bash -c 'yes s | python3 subprocess_test.py'

We get a lot of output, but I think the main line of interest is at https://github.com/python/cpython/blob/9524203deefb6d4ea6a502661f855961dee1af85/Lib/subprocess.py#L2013

if timeout is not None and timeout < 0:

Basically what happens is:

  • The first time around, a python select statement is executed with the original timeout of 0.01s. (this ends up being a poll syscall on my machine)
  • This returns exactly 0.01s later with no ready events
  • The subprocess module calculates the remaining time from the original timeout and gets exactly 0
  • In the line referenced above, since this means that it instead of raising a timeout exception, it ends up calling select again, now with a timeout of exactly 0
  • The poll syscall returns immediately and we do the loop again

TODO

This can be worked around in shadow with --model-unblocked-syscall-latency

One might be able to weakly argue that this is a bug in Python... if one believes it's unsafe to assume the clock will at some point tick forward. Outside of shadow I guess that's generally a pretty safe assumption though. Still, maybe python can be "fixed" here if we can convince them that such a fix is at least unlikely to do harm.

TODO:

  • check what other versions of python are affected, particularly whether this is still the behavior at HEAD
  • maybe file an issue and/or send a PR to python to fix it
  • add to shadow's compatibility notes

Metadata

Metadata

Assignees

Labels

Component: DocumentationIn-repository documentation, under docs/Type: BugError or flaw producing unexpected results

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions