
Fix warp size#256

Merged
amd-sriram merged 9 commits into master from fix_warp_size
Jul 15, 2025
Conversation

Collaborator

amd-sriram commented Jul 15, 2025

Replace uses of C10_WARP_SIZE, the constants WARP_SIZE and THREADS_PER_WARP, and occurrences matching the warp.*32 regex with at::cuda::warp_size(), wherever they are not used inside device or global functions.
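The host/device distinction matters because C10_WARP_SIZE is a compile-time constant (32 on CUDA, 64 on ROCm wavefront-64 hardware), while host code is compiled once and must query the active device at runtime. A minimal sketch of the pattern, assuming a hypothetical kernel and launcher (names are illustrative, not the actual diff):

```cuda
// Sketch only: illustrates the host-vs-device warp-size pattern,
// not the exact code changed in this PR.
#include <ATen/cuda/CUDAContext.h>

__global__ void softmax_kernel(/* ... */) {
  // Device side: device code is compiled per-architecture, so a
  // compile-time warp-size constant (e.g. C10_WARP_SIZE) remains valid here.
}

void launch_softmax_kernel(/* ... */) {
  // Host side: query the warp size of the current device at runtime
  // instead of hard-coding 32, so block shapes are correct on ROCm too.
  const int warp_size = at::cuda::warp_size();
  dim3 block(warp_size, 4);
  // ... grid computation and kernel launch derived from warp_size ...
}
```

Keeping the constant in __device__/__global__ code while switching host-side launch math to at::cuda::warp_size() avoids wrong block geometry on 64-wide AMD wavefronts without adding a runtime query inside kernels.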

Tested with docker
registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16420_ubuntu22.04_py3.10_pytorch_lw_rocm7.0_internal_testing_2d567672

Affected extensions:

  1. fused rope - csrc/megatron/fused_rotary_positional_embedding.h
  2. scaled_masked_softmax_cuda - csrc/megatron/scaled_masked_softmax.h
  3. generic_scaled_masked_softmax_cuda - csrc/megatron/generic_scaled_masked_softmax.h
  4. scaled_upper_triang_masked_softmax_cuda - csrc/megatron/scaled_upper_triang_masked_softmax.h
  5. group batch norm - apex/contrib/csrc/groupbn/batch_norm.h, apex/contrib/csrc/groupbn/batch_norm_add_relu.h
  6. multihead attention - apex/contrib/csrc/multihead_attn/softmax.cuh
  7. transducer - apex/contrib/csrc/transducer/transducer_joint_kernel.cu
  8. xentropy - apex/contrib/csrc/xentropy/xentropy_kernel.cu
  9. sync batch norm - csrc/welford.cu

The following UTs pass:

  1. tests/L0/run_transformer/test_fused_rope.py
  2. tests/L0/run_transformer/test_fused_softmax.py
  3. apex/contrib/test/groupbn/test_groupbn.py
  4. apex/contrib/test/groupbn/test_groupbn_channel_last.py
  5. apex/contrib/test/multihead_attn/test_mha_fused_softmax.py
  6. apex/contrib/test/transducer/test_transducer_joint.py
  7. apex/contrib/test/transducer/test_transducer_loss.py
  8. apex/contrib/test/test_label_smoothing.py
  9. tests/distributed/synced_batchnorm/python_single_gpu_unit_test.py
  10. tests/distributed/synced_batchnorm/single_gpu_unit_test.py
  11. tests/distributed/synced_batchnorm/test_batchnorm1d.py

Cherry-picked to release/1.4.0 branch via #257

Cherry-picked to release/1.5.0 branch via #258

Cherry-picked to release/1.6.0 branch via #259

Cherry-picked to release/1.7.0 branch via #260

@amd-sriram amd-sriram self-assigned this Jul 15, 2025
@amd-sriram amd-sriram merged commit 051cba7 into master Jul 15, 2025
@amd-sriram amd-sriram deleted the fix_warp_size branch July 15, 2025 15:37
amd-sriram (Collaborator, Author) commented:

! cherry-pick --onto release/1.4.0 release/1.5.0 release/1.6.0 release/1.7.0

okakarpa pushed a commit that referenced this pull request Jul 15, 2025
* replace c10_warp_size in fused rope

* replace c10_warp_size in fused softmax

* replace c10_warp_size in group batch norm

* replace c10_warp_size in multiheadattention

* replace c10_warp_size in transducer

* replace c10_warp_size in xentropy

* replace c10_warp_size in sync batch normalization

* replace c10_warp_size in group batch norm

* replace warp_size in multihead attention
okakarpa pushed a commit that referenced this pull request Jul 15, 2025 (same commit message as above)
okakarpa pushed a commit that referenced this pull request Jul 15, 2025 (same commit message as above)
okakarpa (Collaborator) commented:

Created branch autogenerated/release/1.4.0_cherry-pick_pr-256 and #257. It contains a merge conflict; please resolve it.

Created branch autogenerated/release/1.5.0_cherry-pick_pr-256 and #258

Created branch autogenerated/release/1.6.0_cherry-pick_pr-256 and #259

Created branch autogenerated/release/1.7.0_cherry-pick_pr-256 and #260

amd-sriram added a commit that referenced this pull request Jul 15, 2025
* replace c10_warp_size in fused rope

* replace c10_warp_size in fused softmax

* replace c10_warp_size in group batch norm

* replace c10_warp_size in multiheadattention

* replace c10_warp_size in transducer

* replace c10_warp_size in xentropy

* replace c10_warp_size in sync batch normalization

* replace c10_warp_size in group batch norm

* replace warp_size in multihead attention

Co-authored-by: Sriram Kumar <skishore@amd.com>
amd-sriram added a commit that referenced this pull request Jul 15, 2025 (same commit message and co-author as above)
amd-sriram added a commit that referenced this pull request Jul 15, 2025 (same commit message and co-author as above)