[fr] fix OSS broken flight recorder #140973

Closed
c-p-i-o wants to merge 1 commit into pytorch:main from c-p-i-o:export-D66117013

@c-p-i-o c-p-i-o commented Nov 18, 2024

Summary:
The OSS flight recorder analyzer does not work because we renamed `trace_dir` to `folder` in the internal version to reuse the loader.

Fixes item #2 in reported issue:
#140879
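
To illustrate the failure mode described above, here is a minimal sketch, not the actual change in this PR. It assumes a parser whose positional argument is named `trace_dir` and a loader that reads `args.folder`. `build_parser` and `resolve_trace_dir` are hypothetical names introduced only for this example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the OSS CLI: the positional is named `trace_dir`.
    parser = argparse.ArgumentParser(description="flight recorder analyzer (sketch)")
    parser.add_argument("trace_dir", help="directory containing the trace dumps")
    parser.add_argument("--prefix", default=None, help="dump file name prefix")
    return parser


def resolve_trace_dir(args: argparse.Namespace) -> str:
    # Hypothetical helper: accept either attribute name so that one loader can
    # serve a CLI that calls it `folder` and one that calls it `trace_dir`.
    for name in ("folder", "trace_dir"):
        value = getattr(args, name, None)
        if value is not None:
            return value
    raise AttributeError("expected `folder` or `trace_dir` on the parsed arguments")


args = build_parser().parse_args(["/tmp/nccl_trace_logs", "--prefix", "nccl_trace_rank_"])

try:
    # What a loader written against the renamed attribute would do.
    args.folder
except AttributeError as exc:
    print(f"AttributeError: {exc}")

# The tolerant lookup finds the value under the name the parser actually defined.
print(resolve_trace_dir(args))
```

Keeping the parser's argument name and the loader's attribute name in sync is the simpler option; the fallback lookup is only one way to tolerate both spellings.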

Test Plan:
BEFORE:

❯ python ./tools/flight_recorder/fr_trace.py ~/fr/140563/nccl_trace_logs --prefix nccl_trace_rank_container-node1_
tabulate is not installed. Proceeding without it.
Traceback (most recent call last):
  File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 52, in <module>
    main()
  File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 44, in main
    details, version = read_dir(args)
  File "/home/cpio/local/pytorch/tools/flight_recorder/components/loader.py", line 89, in read_dir
    assert len(details) > 0, f"no files loaded from {args.folder} with prefix {prefix}"
AttributeError: 'Namespace' object has no attribute 'folder'

AFTER:

python ./tools/flight_recorder/fr_trace.py ~/fr/140563/nccl_trace_logs --prefix nccl_trace_rank_container-node17_
tabulate is not installed. Proceeding without it.
Traceback (most recent call last):
  File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 52, in <module>
    main()
  File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 45, in main
    db = build_db(details, args, version)
  File "/home/cpio/local/fbsource/fbcode/caffe2/tools/flight_recorder/components/builder.py", line 446, in build_db
    check_no_missing_dump_files(entries, memberships)
  File "/home/cpio/local/fbsource/fbcode/caffe2/tools/flight_recorder/components/utils.py", line 267, in check_no_missing_dump_files
    dumps_ranks == all_ranks
AssertionError: Missing dump files from ranks {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119}

Differential Revision: D66117013

@pytorch-bot pytorch-bot bot added the `topic: not user facing` label Nov 18, 2024

@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D66117013

@pytorch-bot pytorch-bot bot added the `ciflow/trunk` label Nov 18, 2024
@wconstab

@c-p-i-o do we have any actual tests for e2e usage in OSS? It would be good to have some coverage.

@c-p-i-o c-p-i-o force-pushed the export-D66117013 branch 2 times, most recently from 5c6c3e6 to b4b586b, November 19, 2024 19:53
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D66117013

Reviewed By: fduwjj

Differential Revision: D66117013

c-p-i-o commented Nov 20, 2024

> @c-p-i-o do we have any actual tests for e2e usage in OSS? It would be good to have some coverage.

We don't have tests for the analyzer portions in OSS. Let me add some.
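
As a rough idea of what such coverage could look like, here is a small smoke-test sketch. It does not exercise the real analyzer: `build_parser` and `list_dump_files` are hypothetical stand-ins, and the file names are made up. The point is only that a test which parses real arguments and hands them to the loader would catch an attribute rename like the one fixed here.

```python
import argparse
import os
import tempfile
import unittest


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the analyzer's CLI parser.
    parser = argparse.ArgumentParser()
    parser.add_argument("trace_dir")
    parser.add_argument("--prefix", default="")
    return parser


def list_dump_files(args: argparse.Namespace) -> list:
    # Hypothetical loader step: reads the same attributes the parser defines.
    return sorted(
        name for name in os.listdir(args.trace_dir) if name.startswith(args.prefix)
    )


class ParserLoaderSmokeTest(unittest.TestCase):
    def test_loader_accepts_parsed_args(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            # Create two files that match the prefix and one that does not.
            for name in ("rank_0", "rank_1", "unrelated"):
                with open(os.path.join(tmp, name), "w"):
                    pass
            args = build_parser().parse_args([tmp, "--prefix", "rank_"])
            # An AttributeError here would mean parser and loader disagree on names.
            self.assertEqual(list_dump_files(args), ["rank_0", "rank_1"])


def run_suite() -> unittest.TestResult:
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParserLoaderSmokeTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_suite()` runs the single test; a real end-to-end test would call the actual entry point on a small set of recorded dumps instead.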

@facebook-github-bot

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024

Pull Request resolved: pytorch#140973
Approved by: https://github.com/Skylion007, https://github.com/fduwjj

Labels

ciflow/trunk, fb-exported, Merged, topic: not user facing
