
torch.distributed: lazy import pdb only when user calls breakpoint() #171818

Closed
kelu-wandb wants to merge 2 commits into pytorch:main from kelu-wandb:distributed-lazy-import-pdb-2

@kelu-wandb
Contributor

@kelu-wandb kelu-wandb commented Jan 6, 2026

Fixes #159645

Makes torch/distributed/__init__.py only import pdb when needed, because we should avoid debugging-specific dependencies in production code.
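
The deferred-import pattern described here can be sketched as follows. This is a minimal illustration, not the actual code in torch/distributed/__init__.py: the helper name `debug_breakpoint` and its parameters are hypothetical, and the real `torch.distributed.breakpoint()` does more than this.

```python
def debug_breakpoint(current_rank: int, target_rank: int = 0) -> bool:
    """Drop into pdb on one rank only (hypothetical helper, for illustration)."""
    if current_rank != target_rank:
        # Non-target ranks return early and never reach the import below.
        return False
    # Deferred import: pdb (and anything it pulls in) is loaded only here,
    # at the moment a breakpoint is actually requested.
    import pdb

    pdb.set_trace()
    return True


if __name__ == "__main__":
    # Rank 1 is not the target, so this returns without entering the debugger.
    print(debug_breakpoint(current_rank=1, target_rank=0))
```

Because the `import pdb` statement sits inside the function body, merely importing the enclosing module does not load pdb; the cost is paid only on the first call that reaches it.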

In Python 3.13.1 through 3.13.7, this also avoids the following chain of imports:
torch -> torch.distributed -> pdb -> rlcompleter -> readline

Importing readline, in turn, attempts to access stdin. This deadlocks when the interpreter runs in a subprocess launched with process_group=0 or preexec_fn=os.setpgrp, because such a subprocess doesn't have access to stdin.
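
For reference, the two launch styles mentioned above both place the child in its own process group. The sketch below only demonstrates that grouping (it does not reproduce the deadlock); the names `CHILD_CODE` and `run_in_new_process_group` are hypothetical, the code is POSIX-only, and the `process_group` argument assumes Python 3.11 or later.

```python
import os
import subprocess
import sys

# Child code: report whether the child is the leader of its own process group.
CHILD_CODE = "import os; print(os.getpgrp() == os.getpid())"


def run_in_new_process_group(code: str, use_preexec_fn: bool = False) -> str:
    """Run `code` in a fresh interpreter placed in a new process group."""
    if use_preexec_fn:
        # Older style: call os.setpgrp() in the child before exec.
        kwargs = {"preexec_fn": os.setpgrp}
    else:
        # Newer style (Python 3.11+): ask Popen to call setpgid(0, 0).
        kwargs = {"process_group": 0}
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=30,
        **kwargs,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(run_in_new_process_group(CHILD_CODE, use_preexec_fn=True))
```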

Python 3.13.8 fixed the pdb -> rlcompleter -> readline dependency, but it's still good to import pdb only when necessary.

Testing

(All tests below on Mac.)

Test script:

deadlock_minimal.py:

import sys
import subprocess

if __name__ == "__main__":
    code = """
print('importing torch...')
import sys
import torch
print('imported torch.')
if "pdb" in sys.modules:
    print("ERROR: pdb imported")
    exit(1)
"""

    kwargs = dict(process_group=0)
    proc = subprocess.Popen([sys.executable, "-c", code], **kwargs)
    try:
        proc.communicate(timeout=20)
        if proc.returncode == 0:
            print("PASS")
        else:
            print("FAIL")
    except subprocess.TimeoutExpired:
        print("FAIL: Process deadlocked after 20 seconds")
        proc.kill()
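
The `"pdb" in sys.modules` check used by the script can be generalized to any pair of modules. The following is a rough sketch under that assumption; the helper name `import_pulls_in` is hypothetical and is not part of the PR or of PyTorch.

```python
import subprocess
import sys


def import_pulls_in(module: str, unwanted: str, timeout: float = 60.0) -> bool:
    """Return True if importing `module` in a fresh interpreter also loads `unwanted`."""
    child_code = (
        "import sys, importlib\n"
        f"importlib.import_module({module!r})\n"
        f"print({unwanted!r} in sys.modules)\n"
    )
    # A fresh interpreter is used so that modules already loaded in the
    # current process cannot influence the answer.
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    # Only the last line is inspected, in case the import itself prints something.
    return result.stdout.strip().splitlines()[-1] == "True"


if __name__ == "__main__":
    print(import_pulls_in("json", "json.decoder"))
```

With a build of PyTorch installed, a call such as `import_pulls_in("torch", "pdb")` would express the same condition the test script checks.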

Failure repro: python 3.13.7, old pytorch

Deadlocks:

% conda create -n "pytorch-pdb-3.13.7" python=3.13.7 numpy pytorch -c conda-forge -y
% conda activate pytorch-pdb-3.13.7
% python deadlock_minimal.py
importing torch...
FAIL: Process deadlocked after 10 seconds

Failure repro: python 3.13.8, old pytorch

Does not deadlock, thanks to the underlying Python fix, but still imports pdb:

% conda create -n "pytorch-pdb-3.13.8" python=3.13.8 numpy pytorch -c conda-forge -y
% conda activate pytorch-pdb-3.13.8
% python deadlock_minimal.py
imported torch.
ERROR: pdb imported
FAIL

Fix confirmation: python 3.13.7, new pytorch

No longer deadlocks, does not import pdb.

% conda create -n "pytorch-3.13.7" python=3.13.7
% conda activate pytorch-3.13.7
% pip install --group dev
% conda install pkg-config libuv
% USE_DISTRIBUTED=1 python -m pip install --no-build-isolation -v -e .
% python deadlock_minimal.py
importing torch...
imported torch.
PASS
% conda create -n "pytorch-3.13.11" python=3.13.11
% conda activate pytorch-3.13.11
% pip install --group dev
% conda install pkg-config libuv
% USE_DISTRIBUTED=1 python -m pip install --no-build-isolation -v -e .
% python deadlock_minimal.py
importing torch...
imported torch.
PASS

Test that torch.distributed.breakpoint() still works:

torch_breakpoint.py:

import sys
import torch.distributed as dist
print(f"is available: {dist.is_available()}")
dist.init_process_group()
dist.breakpoint(rank=0)
print(f"pdb imported after breakpoint: {'pdb' in sys.modules}")

Then I built PyTorch with distributed support on Mac and ran a basic test:

% USE_DISTRIBUTED=1 python setup.py build --cmake
% RANK=0 WORLD_SIZE=1 MASTER_ADDR=127.0.0.1 MASTER_PORT=49999 python torch_breakpoint.py
is available: True
# snipped some errors due to not actually setting up a full scenario
> /Users/kelu/kelu-wandb/pytorch/torch/distributed/__init__.py(121)breakpoint()
-> pdb.set_trace()
(Pdb)

@pytorch-bot

pytorch-bot bot commented Jan 6, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/171818

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 35812da with merge base 7e5e018 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla

linux-foundation-easycla bot commented Jan 6, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

@kelu-wandb kelu-wandb changed the title from "torch.distributed: lazy import pdb only when calling breakpoint() (take 2)" to "torch.distributed: lazy import pdb only when user calls breakpoint()" on Jan 6, 2026
Contributor Author

@kelu-wandb kelu-wandb left a comment

Fixed based on comment.

@kelu-wandb
Contributor Author

(This is take 2 of PR #163000, which expired because I didn't get to the CLA signing in time. Same code, but re-tested.)

@kelu-wandb
Contributor Author

@pytorchbot label "release notes: distributed (c10d)"

@pytorch-bot pytorch-bot bot added the release notes: distributed (c10d) release notes category label Jan 6, 2026
@kelu-wandb kelu-wandb marked this pull request as ready for review January 6, 2026 22:44
@kelu-wandb kelu-wandb requested a review from Skylion007 January 7, 2026 00:02
Contributor

@ezyang ezyang left a comment

thanks, appreciated

@ezyang
Contributor

ezyang commented Jan 7, 2026

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jan 7, 2026
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

krastogi-in pushed a commit to krastogi-in/pytorch that referenced this pull request Jan 9, 2026
…ytorch#171818)

Fixes pytorch#159645

Makes `torch/distributed/__init__.py` only import `pdb` when needed, because we should avoid debugging-specific dependencies in production code.

In Python 3.13.1 through 3.13.7, this also avoids the following chain of imports:
`torch` -> `torch.distributed` -> `pdb` -> `rlcompleter` -> `readline`

Importing `readline`, in turn, attempts to access stdin. This deadlocks when the interpreter runs in a subprocess launched with `process_group=0` or `preexec_fn=os.setpgrp`, because such a subprocess doesn't have access to stdin.

Python 3.13.8 [fixed the `pdb` -> `rlcompleter` -> `readline` dependency](python/cpython#139280), but it's still good to import `pdb` only when necessary.

## Testing

(All tests below on Mac.)

### Test script:

`deadlock_minimal.py`:

```
import sys
import subprocess

if __name__ == "__main__":
    code = """
print('importing torch...')
import sys
import torch
print('imported torch.')
if "pdb" in sys.modules:
    print("ERROR: pdb imported")
    exit(1)
"""

    kwargs = dict(process_group=0)
    proc = subprocess.Popen([sys.executable, "-c", code], **kwargs)
    try:
        proc.communicate(timeout=20)
        if proc.returncode == 0:
            print("PASS")
        else:
            print("FAIL")
    except subprocess.TimeoutExpired:
        print("FAIL: Process deadlocked after 20 seconds")
        proc.kill()
```

### Failure repro: python 3.13.7, old pytorch

Deadlocks:

```
% conda create -n "pytorch-pdb-3.13.7" python=3.13.7 numpy pytorch -c conda-forge -y
% conda activate pytorch-pdb-3.13.7
% python deadlock_minimal.py
importing torch...
FAIL: Process deadlocked after 10 seconds
```

### Failure repro: python 3.13.8, old pytorch

Does not deadlock, thanks to the underlying Python fix, but still imports pdb:

```
% conda create -n "pytorch-pdb-3.13.8" python=3.13.8 numpy pytorch -c conda-forge -y
% conda activate pytorch-pdb-3.13.8
% python deadlock_minimal.py
imported torch.
ERROR: pdb imported
FAIL
```

### Fix confirmation: python 3.13.7, new pytorch

No longer deadlocks, does not import pdb.

```
% conda create -n "pytorch-3.13.7" python=3.13.7
% conda activate pytorch-3.13.7
% pip install --group dev
% conda install pkg-config libuv
% USE_DISTRIBUTED=1 python -m pip install --no-build-isolation -v -e .
% python deadlock_minimal.py
importing torch...
imported torch.
PASS
```

```
% conda create -n "pytorch-3.13.11" python=3.13.11
% conda activate pytorch-3.13.11
% pip install --group dev
% conda install pkg-config libuv
% USE_DISTRIBUTED=1 python -m pip install --no-build-isolation -v -e .
% python deadlock_minimal.py
importing torch...
imported torch.
PASS
```

### Test that `torch.distributed.breakpoint()` still works:

`torch_breakpoint.py`:
```
import sys
import torch.distributed as dist
print(f"is available: {dist.is_available()}")
dist.init_process_group()
dist.breakpoint(rank=0)
print(f"pdb imported after breakpoint: {'pdb' in sys.modules}")
```

Then I built PyTorch with distributed support on Mac and ran a basic test:

```
% USE_DISTRIBUTED=1 python setup.py build --cmake
% RANK=0 WORLD_SIZE=1 MASTER_ADDR=127.0.0.1 MASTER_PORT=49999 python torch_breakpoint.py
is available: True
# snipped some errors due to not actually setting up a full scenario
> /Users/kelu/kelu-wandb/pytorch/torch/distributed/__init__.py(121)breakpoint()
-> pdb.set_trace()
(Pdb)
```

Pull Request resolved: pytorch#171818
Approved by: https://github.com/ezyang
hinriksnaer pushed a commit to hinriksnaer/pytorch that referenced this pull request Jan 12, 2026

Labels

ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
open source
release notes: distributed (c10d) (release notes category)

Development

Successfully merging this pull request may close these issues.

import torch hangs when running in subprocess with preexec_fn=os.setpgrp, python >=3.13, conda env

5 participants