Log structured logging overhead to dynamo compile (kinda) by jamesjwu · Pull Request #136142 · pytorch/pytorch

jamesjwu · 2024-09-16T13:26:09Z

This adds structured logging overhead, on a per-compile basis, to compilation metrics.

To do so, we track the frame_id_frame_compile_id that trace_structured uses to categorize compiles, and use that as the key in our timing table.

Implementation notes:

- If trace_structured is ever called without a compile id, that time won't be measured. There's no good way around that today, given that compilation metrics are organized around compile ids. Strobelight is still the best way to measure on a per-job basis.
- We don't actually measure the time it takes to log the compilation metrics themselves. Fundamentally, it's not possible to log this properly if we're storing the logging number in compilation metrics, since there's no way to measure it before we do it (unless we want discrepancies between dynamo_compile and tlparse, which seems suboptimal). Hopefully, for a large job, the cost of structured logging of the compilation metrics itself is small.
- I wanted to use frame_phase_timing here, but there's a bunch of ids to iron out, and I don't really want to deal with that headache. compilation_time_metrics is sort of what I want, but it isn't keyed by frame/compile id, so it's also a bit off. Putting it into torch.logging as a separate thing, so that logging tracks its own overhead, seems fine, though.
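
The second note can be illustrated with a small sketch. It assumes a hypothetical helper `finish_compile` and a plain dict as the timing table; neither is taken from the PR. The overhead value has to be read before the metrics record is emitted, so the cost of that final emit can never appear inside the record it belongs to.

```python
import time
from typing import Any, Callable, Dict


def finish_compile(
    key: str,
    overhead_table: Dict[str, float],
    emit: Callable[[Dict[str, Any]], None],
) -> Dict[str, Any]:
    """Build and emit a metrics record for one compile (illustrative only)."""
    # Snapshot the overhead accumulated so far; this is the value that gets logged.
    record = {
        "compile_id": key,
        "structured_logging_overhead_s": overhead_table.get(key, 0.0),
    }
    start = time.perf_counter()
    emit(record)  # the cost of this call is not part of `record`
    # Time spent emitting lands in the table only after the record has been sent.
    overhead_table[key] = overhead_table.get(key, 0.0) + (time.perf_counter() - start)
    return record
```

After the call, the table entry is at least as large as the value stored in the record, which is the gap the note describes.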

Test Plan:
Run benchmarks/nanogpt and staging logger. See that the new compilation metric is logged to the staged dynamo_compile table:

https://fburl.com/scuba/logger_staging_jjwu_30582a48f1ff9cf5f4ac50a4c40af/xazjg5xq

Note that sum(structured_logging_overhead_s) / sum(entire_frame_compile_time) = 8.387 / 124.278 ≈ 6.7%, which seems reasonable as the overhead for a small compilation like this.
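
The ratio can be re-derived from the two sums quoted above. The variable names below are illustrative; only the two numbers come from the test run.

```python
# Sums reported from the staged dynamo_compile table, in seconds.
structured_logging_overhead_s = 8.387
entire_frame_compile_time_s = 124.278

# Fraction of total frame compile time spent on structured logging.
ratio = structured_logging_overhead_s / entire_frame_compile_time_s
print(f"{ratio:.1%}")  # prints "6.7%"
```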

You can also look at samples for a more detailed log of this.

Reviewed By: oulgen

Differential Revision: D62643611

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @rec

pytorch-bot · 2024-09-16T13:26:14Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136142

❌ 22 Cancelled Jobs, 2 Unrelated Failures

As of commit cda3c95 with merge base 1a86d8a:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

trunk / linux-focal-cuda12.4-py3.10-gcc9-experimental-split-build-test / test (default, 1, 5, lf.linux.4xlarge.nvidia.gpu) (gh)
##[error]The operation was canceled.
trunk / linux-focal-cuda12.4-py3.10-gcc9-experimental-split-build-test / test (default, 2, 5, lf.linux.4xlarge.nvidia.gpu) (gh)
##[error]The operation was canceled.
trunk / linux-focal-cuda12.4-py3.10-gcc9-experimental-split-build-test / test (default, 3, 5, lf.linux.4xlarge.nvidia.gpu) (gh)
##[error]The operation was canceled.
trunk / linux-focal-cuda12.4-py3.10-gcc9-experimental-split-build-test / test (default, 4, 5, lf.linux.4xlarge.nvidia.gpu) (gh)
##[error]The operation was canceled.
trunk / linux-focal-cuda12.4-py3.10-gcc9-experimental-split-build-test / test (default, 5, 5, lf.linux.4xlarge.nvidia.gpu) (gh)
##[error]The operation was canceled.
trunk / linux-focal-cuda12.4-py3.10-gcc9-experimental-split-build-test / test (nogpu_AVX512, 1, 2, lf.linux.2xlarge) (gh)
##[error]The operation was canceled.
trunk / linux-focal-cuda12.4-py3.10-gcc9-experimental-split-build-test / test (nogpu_AVX512, 2, 2, lf.linux.2xlarge) (gh)
##[error]The operation was canceled.
trunk / linux-focal-cuda12.4-py3.10-gcc9-experimental-split-build-test / test (nogpu_NO_AVX2, 1, 2, lf.linux.2xlarge) (gh)
##[error]The operation was canceled.
trunk / linux-focal-cuda12.4-py3.10-gcc9-experimental-split-build-test / test (nogpu_NO_AVX2, 2, 2, lf.linux.2xlarge) (gh)
##[error]The operation was canceled.
trunk / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 1, 5, lf.linux.g5.4xlarge.nvidia.gpu) (gh)
##[error]The operation was canceled.
trunk / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 2, 5, lf.linux.g5.4xlarge.nvidia.gpu) (gh)
##[error]The operation was canceled.
trunk / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 3, 5, lf.linux.g5.4xlarge.nvidia.gpu) (gh)
##[error]The operation was canceled.
trunk / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 4, 5, lf.linux.g5.4xlarge.nvidia.gpu) (gh)
##[error]The operation was canceled.
trunk / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 5, 5, lf.linux.g5.4xlarge.nvidia.gpu) (gh)
##[error]The operation was canceled.
trunk / linux-focal-rocm6.1-py3.10 / test (default, 1, 2, linux.rocm.gpu) (gh)
##[error]The operation was canceled.
trunk / linux-focal-rocm6.1-py3.10 / test (default, 2, 2, linux.rocm.gpu) (gh)
##[error]The operation was canceled.
trunk / linux-focal-rocm6.1-py3.10 / test (distributed, 1, 1, linux.rocm.gpu) (gh)
##[error]The operation was canceled.
trunk / macos-py3-arm64 / test (default, 1, 3, macos-m1-stable) (gh)
##[error]The operation was canceled.
trunk / macos-py3-arm64 / test (default, 2, 3, macos-m1-stable) (gh)
##[error]The operation was canceled.
trunk / macos-py3-arm64 / test (default, 3, 3, macos-m1-stable) (gh)
##[error]The operation was canceled.
trunk / pytorch-linux-focal-py3-clang9-android-ndk-r21e-build / build (default, 1, 1, linux.2xlarge) (gh)
##[error]The operation was canceled.
trunk / win-vs2019-cpu-py3 / test (default, 2, 3, lf.windows.4xlarge.nonephemeral) (gh)

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

inductor-periodic / cuda12.1-py3.10-gcc9-sm80 / test (inductor_torchbench_smoketest_perf, 1, 1, linux.gcp.a100) (gh) (detected as infra flaky with no runner)
trunk / win-vs2019-cpu-py3 / test (default, 1, 3, lf.windows.4xlarge.nonephemeral) (gh) (matched win rule in flaky-rules.json)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2024-09-16T13:26:18Z

This pull request was exported from Phabricator. Differential Revision: D62643611

facebook-github-bot · 2024-09-16T16:56:07Z

This pull request was exported from Phabricator. Differential Revision: D62643611

facebook-github-bot · 2024-09-18T21:31:46Z

This pull request was exported from Phabricator. Differential Revision: D62643611

facebook-github-bot · 2024-09-18T21:38:26Z

This pull request was exported from Phabricator. Differential Revision: D62643611

facebook-github-bot · 2024-09-19T16:09:24Z

@pytorchbot merge -f 'Landed internally'

(Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

pytorchmergebot · 2024-09-19T16:11:06Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

…6142)

Summary:
X-link: pytorch/benchmark#2454

This adds structured logging overhead at a per compile basis to compilation metrics.

To do so, we track the frame_id_frame_compile_id that trace_structured uses to categorize compiles, and use that as the key in our timing table.

Implementation notes:
- If there's times we call trace_structured without a compile id, the time won't be measured. Not really a good way around that today given the compile id framework of compilation metrics. Strobelight is still the best way to measure on a per job basis.
- We don't actually measure the time it takes to log the compilation metrics itself. Fundamentally, it's not possible to log this properly if we're storing the logging number *in* compilation metrics, since there's no way to measure it before we do it (unless we want discrepancies between dynamo_compile and tlparse, which seems suboptimal). Hopefully for a large job, the cost of structured_logging compilation metrics itself is small.
- I wanted to use frame_phase_timing here, but there's a bunch of ids to iron out, and I don't really want to deal with that headache. compilation_time_metrics is sort of what I want, but that isn't by frame/compile id, so it's also a bit off. Putting it into torch.logging as a separate thing so logging tracks its own overhead seems fine, though.

Test Plan:
Run benchmarks/nanogpt and staging logger. See that the new compilation metric is logged to the staged dynamo_compile table:

https://fburl.com/scuba/logger_staging_jjwu_30582a48f1ff9cf5f4ac50a4c40af/xazjg5xq

Note that the sum(structured_logging_overhead_s) / sum(entire_frame_compile_time) = 8.387 / 124.278 = 6%, which seems reasonable as the overhead for a small compilation like this.

You can also look at samples for a more detailed log of this.

Reviewed By: oulgen

Differential Revision: D62643611

Pull Request resolved: pytorch#136142
Approved by: https://github.com/bobrenjc93

pytorch-bot bot added ciflow/inductor module: dynamo labels Sep 16, 2024

facebook-github-bot added the fb-exported label Sep 16, 2024

jamesjwu added the topic: not user facing label Sep 16, 2024

jamesjwu force-pushed the export-D62643611 branch from 020d916 to f8cd161 September 16, 2024 16:56

bobrenjc93 approved these changes Sep 18, 2024

pytorch-bot bot added the ciflow/trunk label Sep 18, 2024

jamesjwu force-pushed the export-D62643611 branch from f8cd161 to b03d9f6 September 18, 2024 21:31

jamesjwu force-pushed the export-D62643611 branch from b03d9f6 to cda3c95 September 18, 2024 21:38

pytorchmergebot added the merging label Sep 19, 2024

pytorchmergebot added the Merged label Sep 19, 2024

pytorchmergebot closed this in 803ce50 Sep 19, 2024

pytorchmergebot removed the merging label Sep 19, 2024

github-actions bot deleted the export-D62643611 branch October 20, 2024 02:10

jamesjwu wants to merge 1 commit into main from export-D62643611
