
torch.distributed: lazy import pdb only when calling breakpoint() #163000

Closed
kelu-wandb wants to merge 1 commit into pytorch:main from kelu-wandb:kelu-wandb/distributed-lazy-import-pdb

@kelu-wandb
Contributor

@kelu-wandb kelu-wandb commented Sep 15, 2025

DRAFT NOT YET READY

Fixes #159645

It makes sense to import debugging libraries only when the debugging tools are actually used.

This also avoids the following chain of imports in Python 3.13:
torch -> torch.distributed -> rlcompleter -> readline

Importing readline, in turn, attempts to access stdin. This deadlocks when the import runs in a subprocess launched with process_group=0 or preexec_fn=os.setpgrp, because that subprocess doesn't have access to stdin.
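
A hedged sketch of how the hang might be checked from a parent process: it imports a module in a child that has been moved into its own process group and treats a timeout as a hang. `import_completes` and the 30-second default are assumptions for illustration, the helper is POSIX-only, and whether the hang reproduces depends on the environment (the linked issue mentions Python >= 3.13 in a conda env).

```python
import os
import subprocess
import sys


def import_completes(module: str, timeout: float = 30.0) -> bool:
    """Return True if `import module` finishes in a child with its own process group."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", f"import {module}"],
            preexec_fn=os.setpgrp,  # same launch option named in the linked issue
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left behind.
        return False
    return result.returncode == 0
```

Calling `import_completes("torch")` on an affected setup would be one way to observe the problem before and after the change; on Python 3.11+ `process_group=0` could be passed in place of `preexec_fn=os.setpgrp`.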

Testing

On Mac:

make setup-env PYTHON=python3.12
git submodule update --init --recursive
pip install -r .github/requirements/pip-requirements-macOS.txt
USE_DISTRIBUTED=1 python setup.py develop --cmake

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim @dcci

@pytorch-bot

pytorch-bot bot commented Sep 15, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163000

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Cancelled Job

As of commit cbd3bab with merge base 1247dde:

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed label Sep 15, 2025
@linux-foundation-easycla

CLA Not Signed

@github-actions
Contributor

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Contributor

@ezyang ezyang left a comment

Does the deadlock occur easily (e.g., is it testable)?

@ezyang
Contributor

ezyang commented Sep 15, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Sep 15, 2025
@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Failing merge rule: Core Maintainers

@ezyang
Contributor

ezyang commented Sep 16, 2025

need cla

@github-actions
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@kelu-wandb
Contributor Author

Re-created this PR as PR #171818

Labels

ciflow/trunk, oncall: distributed, open source, Stale

Development

Successfully merging this pull request may close these issues.

import torch hangs when running in subprocess with preexec_fn=os.setpgrp, python >=3.13, conda env

4 participants