document and maybe report deadlock in python 3.10's subprocess.Popen.communicate

# The problem

Without `--model-unblocked-syscall-latency`, running the following python script (with python 3.10) ends up looping forever in shadow:

```python
import subprocess as sp

p = sp.Popen(['sleep', '1'], stdin=sp.PIPE, stdout=sp.PIPE, stderr=sp.PIPE)
while True:
    try:
        p.communicate(timeout=0.01)
    except sp.TimeoutExpired:
        print("timeout")
        continue
    break
print("done")
```

# Debugging

We can see a single-step through the relevant python code by inserting a ptyon `breakpoint()` statement  to break into `pdb`... 

```diff
@@ -3,6 +3,7 @@ import subprocess as sp
 p = sp.Popen(['sleep', '1'], stdin=sp.PIPE, stdout=sp.PIPE, stderr=sp.PIPE)
 while True:
     try:
+        breakpoint()
         p.communicate(timeout=0.01)
     except sp.TimeoutExpired:
         print("timeout")
```

...and piping in an infinite sequence of s (single-step) commands

```
$ shadow-exec -- bash -c 'yes s | python3 subprocess_test.py'
```

We get a lot of output, but I think the main line of interest is at https://github.com/python/cpython/blob/9524203deefb6d4ea6a502661f855961dee1af85/Lib/subprocess.py#L2013

```
if timeout is not None and timeout < 0:
```

Basically what happens is:
* The first time around, a python `select` statement is executed with the original timeout of 0.01s. (this ends up being a `poll` syscall on my machine)
* This returns exactly 0.01s later with no ready events
* The subprocess module calculates the remaining time from the original timeout and gets *exactly* 0
* In the line referenced above, since this means that it instead of raising a timeout exception, it ends up calling select again, now with a timeout of exactly 0
* The `poll` syscall returns immediately and we do the loop again

# TODO

This can be worked around in shadow with `--model-unblocked-syscall-latency`

One might be able to weakly argue that this is a bug in Python... if one believes it's unsafe to assume the clock will at some point tick forward. Outside of shadow *I guess* that's generally a pretty safe assumption though. Still, maybe python can be "fixed" here if we can convince them that such a fix is at least unlikely to do harm.

TODO:
* check what other versions of python are affected, particularly whether this is still the behavior at HEAD
* maybe file an issue and/or send a PR to python to fix it
* add to shadow's compatibility notes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

document and maybe report deadlock in python 3.10's subprocess.Popen.communicate #3697

The problem

Debugging

TODO

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

document and maybe report deadlock in python 3.10's subprocess.Popen.communicate #3697

Description

The problem

Debugging

TODO

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions