torch.distributed: lazy import pdb only when user calls breakpoint() #171818
Closed
kelu-wandb wants to merge 2 commits into pytorch:main from
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/171818
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 35812da with merge base 7e5e018.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Skylion007 reviewed on Jan 6, 2026.
kelu-wandb (Contributor, Author) commented on Jan 6, 2026:
Fixed based on comment.
(This is take 2 of PR #163000, which expired because I didn't get to the CLA signing in time. Same code, but re-tested.)
@pytorchbot label "release notes: distributed (c10d)"
@pytorchbot merge
Collaborator:
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
krastogi-in pushed a commit to krastogi-in/pytorch that referenced this pull request on Jan 9, 2026:
…ytorch#171818)

Fixes pytorch#159645

Makes `torch/distributed/__init__.py` only import `pdb` when needed, because we should avoid debugging-specific dependencies in production code.

In Python 3.13.1 through 3.13.7, this also avoids the following chain of imports:

`torch` -> `torch.distributed` -> `pdb` -> `rlcompleter` -> `readline`

Importing `readline`, in turn, attempts to access stdin, which deadlocks when run from a subprocess launched with `process_group=0` or `preexec_fn=setpgrp`, because such a subprocess doesn't have access to stdin.

Python 3.13.8 [fixed the `pdb` -> `rlcompleter` -> `readline` dependency](python/cpython#139280), but it's still good to import `pdb` only when necessary.

## Testing

(All tests below on Mac.)

### Test script: `deadlock_minimal.py`

```
import sys
import subprocess

if __name__ == "__main__":
    code = """
print('importing torch...')
import sys
import torch
print('imported torch.')
if "pdb" in sys.modules:
    print("ERROR: pdb imported")
    exit(1)
"""
    kwargs = dict(process_group=0)
    proc = subprocess.Popen([sys.executable, "-c", code], **kwargs)
    try:
        proc.communicate(timeout=20)
        if proc.returncode == 0:
            print("PASS")
        else:
            print("FAIL")
    except subprocess.TimeoutExpired:
        print("FAIL: Process deadlocked after 20 seconds")
        proc.kill()
```

### Failure repro: Python 3.13.7, old PyTorch

Deadlocks:

```
% conda create -n "pytorch-pdb-3.13.7" python=3.13.7 numpy pytorch -c conda-forge -y
% conda activate pytorch-pdb-3.13.7
% python deadlock_minimal.py
importing torch...
FAIL: Process deadlocked after 10 seconds
```

### Failure repro: Python 3.13.8, new PyTorch

Does not deadlock, thanks to the underlying Python fix, but still imports pdb:

```
% conda create -n "pytorch-pdb-3.13.8" python=3.13.8 numpy pytorch -c conda-forge -y
% conda activate pytorch-pdb-3.13.8
% python deadlock_minimal.py
imported torch.
ERROR: pdb imported
FAIL
```

### Fix confirmation: Python 3.13.7, new PyTorch

No longer deadlocks, and does not import pdb.

```
% conda create -n "pytorch-3.13.7" python=3.13.7
% conda activate pytorch-3.13.7
% pip install --group dev
% conda install pkg-config libuv
% USE_DISTRIBUTED=1 python -m pip install --no-build-isolation -v -e .
% python deadlock_minimal.py
importing torch...
imported torch.
PASS
```

Also verified on Python 3.13.11:

```
% conda create -n "pytorch-3.13.11" python=3.13.11
% conda activate pytorch-3.13.11
% pip install --group dev
% conda install pkg-config libuv
% USE_DISTRIBUTED=1 python -m pip install --no-build-isolation -v -e .
% python deadlock_minimal.py
importing torch...
imported torch.
PASS
```

### Test that `torch.distributed.breakpoint()` still works

`torch_breakpoint.py`:

```
import sys
import torch.distributed as dist

print(f"is available: {dist.is_available()}")
dist.init_process_group()
dist.breakpoint(rank=0)
print(f"pdb imported after breakpoint: {'pdb' in sys.modules}")
```

Then built with distributed on Mac and did a basic test:

```
% USE_DISTRIBUTED=1 python setup.py build --cmake
% RANK=0 WORLD_SIZE=1 MASTER_ADDR=127.0.0.1 MASTER_PORT=49999 python torch_breakpoint.py
is available: True
# snipped some errors due to not actually setting up a full scenario
> /Users/kelu/kelu-wandb/pytorch/torch/distributed/__init__.py(121)breakpoint()
-> pdb.set_trace()
(Pdb)
```

Pull Request resolved: pytorch#171818
Approved by: https://github.com/ezyang
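The change itself is the classic lazy-import pattern: move a module-level `import pdb` into the body of the function that actually needs it, so the dependency loads only when a user asks to debug. A minimal sketch of that pattern, runnable on its own (the function name and body here are illustrative, not PyTorch's actual code):

```python
import sys


def breakpoint_sketch() -> None:
    """Illustrative stand-in for a lazily-importing breakpoint().

    pdb is imported inside the function rather than at module scope,
    so merely importing this module never pulls in pdb (and, on
    Python 3.13.1-3.13.7, the pdb -> rlcompleter -> readline chain
    that touches stdin).
    """
    import pdb  # deferred until the user explicitly requests a breakpoint

    # Real code would coordinate ranks and then call pdb.set_trace();
    # we stop short of that here so the sketch runs non-interactively.
    print(f"pdb loaded: {'pdb' in sys.modules}")


if __name__ == "__main__":
    print(f"pdb loaded before call: {'pdb' in sys.modules}")
    breakpoint_sketch()
```

Note that in a long-lived process some other import may already have loaded `pdb`, so `'pdb' in sys.modules` is only a trustworthy signal in a fresh interpreter; that is why the PR's test script performs the check inside a newly spawned subprocess.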
hinriksnaer pushed a commit to hinriksnaer/pytorch that referenced this pull request on Jan 12, 2026 (same commit message as above).