[PyTorch] Migrate bf16 gemv fast path kernel from intrinsics to vec::Vectorized#139159
Closed
swolchok wants to merge 7 commits into gh/swolchok/682/base
Conversation
…Vectorized

Very similar to #137912, but for bf16. (This is building toward enabling this fast path on non-ARM architectures, and in particular on x86 for machines without AVX512BF16.)

Differential Revision: [D65120325](https://our.internmc.facebook.com/intern/diff/D65120325/)

[ghstack-poisoned]
This was referenced Oct 29, 2024
Contributor
This pull request was exported from Phabricator. Differential Revision: D65120325
This was referenced Oct 28, 2024
swolchok added a commit that referenced this pull request on Oct 29, 2024:

…Vectorized (ghstack-source-id: 250611665; Pull Request resolved: #139159)
Contributor (Author)
Folding this one into #139081 because I need increasingly large parts of it there anyway.
Stack from ghstack (oldest at bottom):
Very similar to #137912, but for bf16. (This is building toward enabling this fast path on non-ARM architectures, and in particular on x86 for machines without AVX512BF16).
Testing: checked for regression with llm_experiments' benchmarks/benchmark_torch_mm.py llm on M1 Mac and it appeared to be neutral. Supported this assessment by inspecting assembly for the bf16 dot kernel (`objdump -d --no-leading-addr --no-show-raw-insn build/caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp.DEFAULT.cpp.o | c++filt` from the pytorch root directory after `python setup.py develop`); observed minor instruction scheduling changes but nothing more.

Differential Revision: D65120325
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10