Log autotune choices and benchmark result to scuba/chrome trace #159496
wychi wants to merge 1 commit into pytorch:main
Conversation
This appears to be a diff that was exported from Phabricator, but the PR author does not have sufficient permissions to run CI. @wychi, please do step 2 of the internal wiki to get write access so you do not need to get CI approvals in the future. If you think this is a mistake, please contact the PyTorch Dev Infra team.
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159496
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (3 Unrelated Failures)
As of commit b39d6c9 with merge base e2ee9cf:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk.
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D79235037
…rch#159496)

Summary: Report the kernel choices and benchmark data to better understand how kernels are selected and the performance gap between the best kernel (likely a CUDA kernel) and Triton kernels.

**Example**

Event: mm_template_autotuning
Column: autotune_choices

```json
{
  "num_choices": 19,
  "num_triton_choices": 18,
  "best_time": 0.03200000151991844,
  "best_triton_pos": 1,
  "best_triton_time": 0.3172159940004349
}
```

Test Plan:

```
TORCHINDUCTOR_REPORT_AUTOTUNE_CHOICES_STATS=1 buck2 run //scripts/wychi:test_autotune_mm 2>&1 > /tmp/mylog.txt
```

Rollback Plan:

Differential Revision: D79235037
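The fields in the `autotune_choices` payload can be derived from per-choice benchmark timings. A minimal sketch of that derivation (a hypothetical helper, not the PR's actual implementation; the `(name, time)` input shape and the `triton_` name prefix are assumptions):

```python
import json


def autotune_choice_stats(timings):
    """Summarize autotune benchmark results.

    timings: list of (kernel_name, time_ms) pairs, one per candidate choice.
    Returns a dict shaped like the `autotune_choices` column above.
    """
    # Rank all choices by benchmark time, fastest first.
    ranked = sorted(timings, key=lambda kv: kv[1])
    stats = {
        "num_choices": len(ranked),
        "num_triton_choices": sum(
            1 for name, _ in ranked if name.startswith("triton_")
        ),
        "best_time": ranked[0][1],
    }
    # Rank (0-based) and time of the fastest Triton choice, if any.
    for pos, (name, time_ms) in enumerate(ranked):
        if name.startswith("triton_"):
            stats["best_triton_pos"] = pos
            stats["best_triton_time"] = time_ms
            break
    return stats


# Tiny made-up timing list for illustration.
print(json.dumps(autotune_choice_stats(
    [("cutlass_abc", 0.032), ("triton_mm_1", 0.317), ("triton_mm_2", 0.402)]
)))
```

With this shape, `best_triton_pos` = 1 in the example above means the fastest Triton candidate ranked second overall, behind a non-Triton kernel.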
@pytorchbot label "topic: not user facing"
Force-pushed: 27e8dee to e24d823
…rch#159496)

Summary: Report the kernel choices and benchmark data to better understand how kernels are selected and the performance gap between the best kernel (likely a CUDA kernel) and Triton kernels.

**Example**

Event: mm_template_autotuning
Column: autotune_choices

```json
{
  "num_choices": 52,
  "num_triton_choices": 19,
  "best_kernel": "cutlass_f6c25cf2",
  "best_kernel_desc": "cutlass3x_sm90_tensorop_gemm_f16_f16_f32_void_f16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8",
  "best_time": 0.6283040046691895,
  "best_triton_pos": 26,
  "best_triton_time": 0.6832960247993469,
  "best_triton_kernel": "triton_mm_17",
  "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0"
}
```

Test Plan:

```
TORCHINDUCTOR_REPORT_AUTOTUNE_CHOICES_STATS=1 buck2 run //scripts/wychi:test_autotune_mm 2>&1 > /tmp/mylog.txt
```

Rollback Plan:

Reviewed By: stashuk-olek

Differential Revision: D79235037
…rch#159496)

Summary: Report the kernel choices and benchmark data to better understand how kernels are selected and the performance gap between the best kernel (likely a CUDA kernel) and Triton kernels.

**Example**

Event: mm_template_autotuning
Column: autotune_choices

```json
{
  "num_choices": 52,
  "num_triton_choices": 19,
  "best_kernel": "cutlass_f6c25cf2",
  "best_kernel_desc": "cutlass3x_sm90_tensorop_gemm_f16_f16_f32_void_f16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8",
  "best_time": 0.6283040046691895,
  "best_triton_pos": 26,
  "best_triton_time": 0.6832960247993469,
  "best_triton_kernel": "triton_mm_17",
  "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0"
}
```

Test Plan:

```
TORCHINDUCTOR_MAX_AUTOTUNE_REPORT_CHOICES_STATS=1 buck2 run //scripts/wychi:test_autotune_mm 2>&1 > /tmp/mylog.txt
```

Rollback Plan:

Reviewed By: masnesral, stashuk-olek

Differential Revision: D79235037
This pull request was exported from Phabricator. Differential Revision: D79235037
@pytorchbot merge this

❌ 🤖 pytorchbot command failed: Try

@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Summary:
Report the kernel choices and benchmark data to better understand how kernels are selected and the performance gap between the best kernel (likely a CUDA kernel) and Triton kernels.
Example

Event: mm_template_autotuning
Column: autotune_choices

```json
{
  "num_choices": 52,
  "num_triton_choices": 19,
  "best_kernel": "cutlass_f6c25cf2",
  "best_kernel_desc": "cutlass3x_sm90_tensorop_gemm_f16_f16_f32_void_f16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8",
  "best_time": 0.6283040046691895,
  "best_triton_pos": 26,
  "best_triton_time": 0.6832960247993469,
  "best_triton_kernel": "triton_mm_17",
  "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0"
}
```

Test Plan:
Rollback Plan:
Differential Revision: D79235037
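For reference, landing such stats in a chrome trace amounts to writing one event object into the trace's `traceEvents` list. A hedged sketch (file layout per Chrome's Trace Event Format; the stats payload and output path here are made up, not what Inductor actually emits):

```python
import json

# Hypothetical stats payload, shaped like the autotune_choices example above.
stats = {"num_choices": 52, "num_triton_choices": 19, "best_time": 0.6283}

# Minimal chrome trace with one instant event ("ph": "i"); "ts" is in
# microseconds. Loadable via chrome://tracing or Perfetto.
trace = {
    "traceEvents": [
        {
            "name": "mm_template_autotuning",
            "ph": "i",
            "ts": 0,
            "pid": 0,
            "tid": 0,
            "args": {"autotune_choices": stats},
        }
    ]
}

with open("/tmp/autotune_trace.json", "w") as f:
    json.dump(trace, f)
```

Putting the stats under `args` is what makes them show up in the trace viewer's detail pane when the event is selected.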
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben