Conversation
@jeffra jeffra commented Sep 9, 2020

No description provided.

jeffra and others added 30 commits September 2, 2020 10:59
* update DSE to point to ZeRO-Offload staging

* ZeRO-2 enable CPU offload (#313)

* cpu-offload

* update

* deleted:    deepspeed/pt/deepspeed_zero_optimizer_cpuoffload.py
	modified:   deepspeed/pt/fp16_unfused_optimizer.py
	new file:   install_output.txt
	modified:   tests/unit/test_dynamic_loss_scale.py

* modified:   deepspeed/pt/deepspeed_zero_optimizer.py

* update

* modified:   deepspeed/pt/deepspeed_cpu_adam.py
	modified:   deepspeed/pt/deepspeed_zero_optimizer.py
	modified:   tests/unit/test_checkpointing.py
	modified:   tests/unit/test_fp16.py

* deleted:    install_output.txt

* modified:   deepspeed/pt/fp16_unfused_optimizer.py
	modified:   tests/unit/test_dynamic_loss_scale.py

* modified:   deepspeed/pt/deepspeed_cpu_adam.py

* modified:   deepspeed/pt/deepspeed_zero_optimizer.py

* modified:   deepspeed/pt/deepspeed_cpu_adam.py
	modified:   deepspeed/pt/deepspeed_zero_optimizer.py

* deleted:    deepspeed_cpu_adam.py
	modified:   deepspeed_light.py
	modified:   deepspeed_zero_optimizer.py
	../../deepspeed_zero_optimizer_cpu_offload.py

* modified:   deepspeed/pt/deepspeed_light.py

* modified:   deepspeed/pt/deepspeed_light.py
	modified:   deepspeed/pt/deepspeed_zero_optimizer.py
	modified:   deepspeed/pt/deepspeed_zero_utils.py
	modified:   tests/unit/test_fp16.py

* modified:   deepspeed/pt/deepspeed_config.py
	modified:   deepspeed/pt/deepspeed_light.py
	modified:   deepspeed/pt/deepspeed_zero_optimizer.py
	modified:   tests/unit/test_checkpointing.py
	modified:   tests/unit/test_fp16.py

* modified:   deepspeed/pt/deepspeed_checkpointing.py

* update DSE to ZeRO-Offload commit

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Enable ZeRO checkpointing for ZeRO-Offload (#337)

* Enable ZeRO checkpointing for ZeRO-Offload
Fix unit tests
Bump DSE to 33b9fb77c8cecdb49118188890f662526d8e9397

* Fix accidental revert

* Add ZeRO-Offload checkpointing model tests (#344)

* Enable ZeRO checkpointing for ZeRO-Offload
Fix unit tests
Bump DSE to 33b9fb77c8cecdb49118188890f662526d8e9397

* Fix accidental revert

* Fix ZeRO-Offload checkpointing bug when changing GPU count
Add checkpointing model tests for ZeRO-Offload
Remove optimizer key from Megatron model tests
Use different deepspeed master port for Megatron model tests

Co-authored-by: Jie <37380896+jren73@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
* adding link to Sparse Attention in Navigation page
* Update test_sparse_attention.py

* jren changes

* Merge with correctness/perf fixes

* Formatting fixes

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* add cpu adam optimizer

* run precommit

* clean adam_test

* add accuracy test for adam
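The CPU Adam commits above implement the Adam optimizer update on the host. As a reference for what the vectorized (AVX) kernel computes per element, here is a minimal scalar sketch; the function name and defaults are illustrative, not the DeepSpeedCPUAdam API:

```python
import math

def adam_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.

    Returns the new (param, m, v). A CPU Adam kernel applies this
    element-wise over flat parameter arrays, typically vectorized
    with AVX2/AVX512 intrinsics.
    """
    m = beta1 * m + (1 - beta1) * g          # first moment estimate
    v = beta2 * v + (1 - beta2) * g * g      # second moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v
```

An accuracy test like the one referenced above would compare this reference against the kernel's output over random params and grads for many steps.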
* fixing gradient accumulation for zero offload

* Bug fixes. ZeRO Stage 1,2 and Offload all produce the same loss with gradient accumulation step of 2
* use relative imports and add support for conditional op imports

* formatting and llvm command check change

* fix remaining absolute import

* hide the installed ops var

* fix unit tests

Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
…PU (#360)

* Allocating CPU memory directly on CPU without transferring it from GPU

* formatting fixes
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
* Improve test for ZeRO supported optimizers

* Rename test function

* Format fixes

* Add model tests that wraps client FusedAdam with fused fp16 optimizer

* Format fixes
* fixing the cpu_adam API and add deepspeed_adam flag in config.py

* run precommit
* cpu_offload enables overlap_comm and contiguous_gradients
Remove non-portable tensor.mul_()

* Model functionality tests now passing

* Move to perf tests folder
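The commit above ties `cpu_offload` to `overlap_comm` and `contiguous_gradients`. A sketch of what the corresponding DeepSpeed config could look like; the values are illustrative and the schema should be checked against the DeepSpeed configuration docs:

```python
# Illustrative ZeRO stage-2 config with CPU offload enabled.
# Per the commit above, enabling cpu_offload implies overlap_comm
# and contiguous_gradients, so setting them explicitly is redundant.
ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "cpu_offload": True,           # keep optimizer states on CPU
        "overlap_comm": True,          # forced on by cpu_offload
        "contiguous_gradients": True,  # forced on by cpu_offload
    },
}
```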
…367)

* fixing adam copy fp16-param-add more compile flags for cpu_adam

* run precommit

* fix variance indexes

* fix array-sizes

* move adam_test

* rename perf test
tjruwase and others added 10 commits September 5, 2020 22:46
* Various correctness fixes

* Format fixes
* adding Bing SQuAD e2e test

* updating the draft test; bring final step under try section

* finalizing test for base deepspeed and deepspeed with ZeRO

* applying the comment (thanks Jeff); fixed formatting

* update Sparse Attention Tutorial

* fixed a few issues and applied comments for better organization and readability

* updated the sparse attention tutorial, making the "how to use" section incremental; applied more comments

Co-authored-by: arashashari <arashashari@ArashMSLaptop.redmond.corp.microsoft.com>
* fixing corner cases

* revert to the previous perf for adam

* adam high performance

* run precommit
* Add ZeRO-Offload model tests
Restrict optimizer update+copy to DeepSpeedCPUAdam

* Format fixes

* Increase bucket size scaler
* fixing the compilation error for AVX2 architecture

* running precommit

* adding cpufeature to requirements

* Update install.sh

* Update install.sh

* include cpu-adam in the features

* update features

* update features

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
jeffra and others added 2 commits September 9, 2020 10:22
* add DS_BUILD_AVX512 flag and update the feature part accordingly

* run precommit
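The `DS_BUILD_AVX512` commits select SIMD compile flags for the CPU Adam extension based on detected CPU features (hence the `cpufeature` requirement added earlier). A minimal sketch of the selection logic; the exact flag strings a build script passes to the compiler are illustrative:

```python
import os

def simd_flags(has_avx512: bool, has_avx2: bool) -> list:
    """Pick SIMD compile flags for a CPU optimizer extension.

    Mirrors the idea behind the DS_BUILD_AVX512 switch: prefer AVX512
    when forced via env or detected, fall back to AVX2, else scalar.
    """
    if os.environ.get("DS_BUILD_AVX512") == "1" or has_avx512:
        return ["-march=skylake-avx512"]
    if has_avx2:
        return ["-mavx2", "-mfma"]
    return []
```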
@jeffra jeffra changed the base branch from master to staging-zero-dual-v5 September 9, 2020 21:44
@jeffra jeffra changed the title from "ZeRO-Offload" to "ZeRO-Offload (squash)" Sep 9, 2020
@jeffra jeffra merged this pull request into staging-zero-dual-v5 Sep 9, 2020
jeffra added a commit that referenced this pull request Sep 9, 2020
* ZeRO-Offload v1 (squash) (#345)

* update DSE to point to ZeRO-Offload staging

* ZeRO-2 enable CPU offload (#313)

* cpu-offload

* update

* deleted:    deepspeed/pt/deepspeed_zero_optimizer_cpuoffload.py
	modified:   deepspeed/pt/fp16_unfused_optimizer.py
	new file:   install_output.txt
	modified:   tests/unit/test_dynamic_loss_scale.py

* modified:   deepspeed/pt/deepspeed_zero_optimizer.py

* update

* modified:   deepspeed/pt/deepspeed_cpu_adam.py
	modified:   deepspeed/pt/deepspeed_zero_optimizer.py
	modified:   tests/unit/test_checkpointing.py
	modified:   tests/unit/test_fp16.py

* deleted:    install_output.txt

* modified:   deepspeed/pt/fp16_unfused_optimizer.py
	modified:   tests/unit/test_dynamic_loss_scale.py

* modified:   deepspeed/pt/deepspeed_cpu_adam.py

* modified:   deepspeed/pt/deepspeed_zero_optimizer.py

* modified:   deepspeed/pt/deepspeed_cpu_adam.py
	modified:   deepspeed/pt/deepspeed_zero_optimizer.py

* deleted:    deepspeed_cpu_adam.py
	modified:   deepspeed_light.py
	modified:   deepspeed_zero_optimizer.py
	../../deepspeed_zero_optimizer_cpu_offload.py

* modified:   deepspeed/pt/deepspeed_light.py

* modified:   deepspeed/pt/deepspeed_light.py
	modified:   deepspeed/pt/deepspeed_zero_optimizer.py
	modified:   deepspeed/pt/deepspeed_zero_utils.py
	modified:   tests/unit/test_fp16.py

* modified:   deepspeed/pt/deepspeed_config.py
	modified:   deepspeed/pt/deepspeed_light.py
	modified:   deepspeed/pt/deepspeed_zero_optimizer.py
	modified:   tests/unit/test_checkpointing.py
	modified:   tests/unit/test_fp16.py

* modified:   deepspeed/pt/deepspeed_checkpointing.py

* update DSE to ZeRO-Offload commit

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Enable ZeRO checkpointing for ZeRO-Offload (#337)

* Enable ZeRO checkpointing for ZeRO-Offload
Fix unit tests
Bump DSE to 33b9fb77c8cecdb49118188890f662526d8e9397

* Fix accidental revert

* Add ZeRO-Offload checkpointing model tests (#344)

* Enable ZeRO checkpointing for ZeRO-Offload
Fix unit tests
Bump DSE to 33b9fb77c8cecdb49118188890f662526d8e9397

* Fix accidental revert

* Fix ZeRO-Offload checkpointing bug when changing GPU count
Add checkpointing model tests for ZeRO-Offload
Remove optimizer key from Megatron model tests
Use different deepspeed master port for Megatron model tests

Co-authored-by: Jie <37380896+jren73@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* update DSE to staging for zero-dual

* Update test_sparse_attention.py

* Assert ZeRO-Offload+gradient accumulation (#347)

* Adding link to Sparse Attention in Navigation page (#355)

* adding link to Sparse Attention in Navigation page

* Correctness and perf fixes (#354)

* Update test_sparse_attention.py

* jren changes

* Merge with correctness/perf fixes

* Formatting fixes

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* add cpu adam optimizer (#356)

* add cpu adam optimizer

* run precommit

* clean adam_test

* add accuracy test for adam

* make the adam unit test work with random params and grads and for more steps

* Samyamr/zero offload correctness (#359)

* fixing gradient accumulation for zero offload

* Bug fixes. ZeRO Stage 1,2 and Offload all produce the same loss with gradient accumulation step of 2

* Import path fixes + conditional imports (#358)

* use relative imports and add support for conditional op imports

* formatting and llvm command check change

* fix remaining absolute import

* hide the installed ops var

* fix unit tests

Co-authored-by: Reza Yazdani <reyazda@microsoft.com>

* Enable contiguous gradients for cpu_offload

* Allocating CPU memory directly on CPU without transferring it from GPU (#360)

* Allocating CPU memory directly on CPU without transferring it from GPU

* formatting fixes

* change gpt2 pretrain to have DeepSpeed adam (#361)

Co-authored-by: Reza Yazdani <reyazda@microsoft.com>

* Jekyll installation instructions (#351)

* Generalize detection of ZeRO supported optimizers (#349)

* Improve test for ZeRO supported optimizers

* Rename test function

* Format fixes

* Add model tests that wraps client FusedAdam with fused fp16 optimizer

* Format fixes

* everything is working

* fixing the cpu_adam API and add deepspeed_adam flag in config.py (#365)

* fixing the cpu_adam API and add deepspeed_adam flag in config.py

* run precommit

* fixing adam copy fp16-param-add more compile flags for cpu_adam

* run precommit

* fix variance indexes

* fix array-sizes

* ZeRO-Offload passing model functionality tests (#366)

* cpu_offload enables overlap_comm and contiguous_gradients
Remove non-portable tensor.mul_()

* Model functionality tests now passing

* Move to perf tests folder

* move adam_test

* rename perf test

* fixing adam copy fp16-param and add more compile flags for cpu_adam (#367)

* fixing adam copy fp16-param-add more compile flags for cpu_adam

* run precommit

* fix variance indexes

* fix array-sizes

* move adam_test

* rename perf test

* Perf tests

* Bump DSE

* fixed a typo; this was fixed before but seems like it has been lost in the refactor (#364)

* Move code quality tests to Azure-hosted agents. (#368)

* add casting kernel

* run precommit

* revert changes

* revert changes

* ZeRO-Offload: Integration code fixes (#370)

* Various correctness fixes

* Format fixes

* Update installation instructions (#362)

* Update Sparse Attention Tutorial (#357)

* adding Bing SQuAD e2e test

* updating the draft test; bring final step under try section

* finalizing test for base deepspeed and deepspeed with ZeRO

* applying the comment (thanks Jeff); fixed formatting

* update Sparse Attention Tutorial

* fixed a few issues and applied comments for better organization and readability

* updated the sparse attention tutorial, making the "how to use" section incremental; applied more comments

Co-authored-by: arashashari <arashashari@ArashMSLaptop.redmond.corp.microsoft.com>

* fixing corner cases (#371)

* fix adam performance (#372)

* fixing corner cases

* revert to the previous perf for adam

* adam high performance

* run precommit

* ZeRO-Offload passing model tests (#374)

* Add ZeRO-Offload model tests
Restrict optimizer update+copy to DeepSpeedCPUAdam

* Format fixes

* Increase bucket size scaler

* fix cpu adam compilation for AVX2 (#378)

* fixing the compilation error for AVX2 architecture

* running precommit

* adding cpufeature to requirements

* Update install.sh

* Update install.sh

* include cpu-adam in the features

* update features

* update features

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Move code quality tests to Azure-hosted agents. (#368)

* Bump DSE

* adding sparse attention to feature index page (#377)

* support avx2 by default (#383)

* add DS_BUILD_AVX512 flag and update the feature part accordingly

* run precommit

Co-authored-by: Jie <37380896+jren73@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
Co-authored-by: RezaYazdaniAminabadi <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: arashashari <arashashari@ArashMSLaptop.redmond.corp.microsoft.com>
jeffra added a commit that referenced this pull request Sep 10, 2020
* ZeRO-Offload (squash) (#381)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Jie <37380896+jren73@users.noreply.github.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: arashashari <arashashari@ArashMSLaptop.redmond.corp.microsoft.com>
Co-authored-by: RezaYazdaniAminabadi <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
@jeffra jeffra deleted the staging-zero-dual-v4 branch November 19, 2020 23:27
stephen-youn added a commit that referenced this pull request Jun 14, 2023
* Add residual_add triton op

* add support of gptj style models to triton residual_add kernel

* fix the residual_add tests

* Add support of end-to-end run for residual_add triton kernels

* Fix the MLP output tensor's shape

* Fix the output tensor of residual_add_func python call

* triton matmul kernels with python wrapper class added with pytests

* clean-up and make it read autotune table when importing

* fixed import problems with the naming

* enable update_autotune_table for every forward in matmul

* an int4-into-int8 weight packing function added
test parameters with alignment only (i.e., an integer multiple of block_size in the matmul kernel); this will be further investigated
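The packing idea referenced above fits two 4-bit values into each stored byte, halving weight memory; a matmul kernel then unpacks nibbles before dequantizing. A minimal sketch (function names are hypothetical, not the PR's API):

```python
def pack_int4_pairs(vals):
    """Pack pairs of unsigned 4-bit values into single bytes.

    Element 2i goes in the low nibble, element 2i+1 in the high
    nibble. Assumes len(vals) is even and each value fits in 4 bits.
    """
    assert len(vals) % 2 == 0
    out = []
    for lo, hi in zip(vals[0::2], vals[1::2]):
        assert 0 <= lo < 16 and 0 <= hi < 16
        out.append(lo | (hi << 4))
    return out

def unpack_int4_pairs(packed):
    """Inverse of pack_int4_pairs: split each byte back into nibbles."""
    out = []
    for b in packed:
        out.append(b & 0xF)
        out.append(b >> 4)
    return out
```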

* lint

* quantization added
int8-packed-int4-fp16 matmul-block-deq added
illegal cuda mem access bug in triton matmul kernel fixed (i.e. a mem boundary problem)

* add torch block quantization
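The block quantization used throughout these commits assigns each contiguous block of values its own scale, which bounds the per-element error by the block's scale. A plain-Python sketch of the symmetric scheme (names are illustrative, not the DeepSpeed or torch API):

```python
def block_quantize(x, block_size=4, bits=8):
    """Symmetric per-block quantization to signed integers.

    Each block gets scale = absmax / qmax; values are rounded to
    integers in [-qmax, qmax]. Returns (quantized values, scales).
    """
    qmax = 2 ** (bits - 1) - 1
    q, scales = [], []
    for i in range(0, len(x), block_size):
        block = x[i:i + block_size]
        scale = max(abs(v) for v in block) / qmax or 1.0  # avoid div-by-zero
        scales.append(scale)
        q.extend(round(v / scale) for v in block)
    return q, scales

def block_dequantize(q, scales, block_size=4):
    """Map quantized integers back to floats using each block's scale."""
    return [v * scales[i // block_size] for i, v in enumerate(q)]
```

An SNR check like the ones in these commits compares the dequantized output against the original tensor.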

* dual quantization matmul added

* cleanup, fix for lint

* documentation
lint fix

* README added

* typo

* updated the kernel to have fused bias addition and activation too

* Add residual_add triton op

* modified quantization to take additional bits, more than int8

* enable triton residual_add kernel in DS MLP

* Add flash attention kernel and glue code

* additional scale-norm added for weight

* a temporary example for quantization added

* comments

* use the exact same ds quantizer as reference

* added scale-norm (i.e. scale-of-scale) to both triton/torch version

* snr check with fused-deq-gemm for block_deq and dual_block_deq

* makes matmul kernels work for a6000 with smaller mem
w8a8/w4a8 with sym block quantization on activation and row- (or col-)wise quantization on weight works (snr test added)

* Add layer norm triton kernel

* Add gelu triton kernel

* Add softmax triton kernel

* Rename flash attn api

* add triton gemm kernels

* fix formatting of triton kernels

* Add matmul triton kernels

* Updated Triton Gelu to use non-approx computation

* Updated Triton Gemm for f16 bias-add parity

* Add DS triton encoder layer

* Updated Softmax to work around block size 1

* fix the issue caused by merge conflict

* Add triton layer norm unittests

* dual-qblock snr verified too

* Add triton gelu kernel unittests

* Add triton softmax kernel unittests

* fix flash kernels formatting (#382)

* Add triton dependency to unittests workflow (#381)

* w8a8 and w8a4 matmul with block quantization verified

* Allow Gemm & MatMul to take arbitrary dimensions

* Add triton matmul kernel unittests

* fix triton dependency in github CI workflows

* Fix matmul launching grid

* fix formatting

* Add triton gemm kernel unittests

* modified dual-qblock to support wider scale_bits with int64 acc and vec-ops, which caused perf degradation
workaround is to use "v2" kernel added with internal shift ops but not enabled yet

* fix residual in gemm_3d kernel

* Add flash attention triton kernels unit tests

* test_matmul and test_gemm pass (but with smaller coverage as mentioned in the code)
float32 can be supported later

* added 'triton_gemm_eval.py'
it is a temporary script to evaluate the accuracy of the triton matmul against the torch matmul

* typo

* typo

* root-caused the parity error with fused_gelu. it is not with gelu but with residual-addition.
disabled residual-addition and it still needs debugging

* location of residual addition in reference modified to be after the activation

* fixed index typo in the snr plot

* Fix triton attention kernel unit tests

* fix formatting

* added batch support in matmul
row/col-wise quantization matmul debugged

* fixed bugs in the unit tests after the batch support change and so on
test_int8_int8_fp_matmul_dual_block_deq still fails and needs further debugging though

* weight-only quantizatioin example and test are added to check_snr

* matmul_ext basic check added as unit test under tests/unit

* move triton ops under inference/triton

* restore triton_ops.py

* import path correction

* restore ds_mlp and ds_attention

* shaping bug with batching in matmul_ext fixed
changed the gelu computation to use libdevice.erf instead of approx with sigmoid
(otherwise, roberta unit test fails)
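The GELU change above swaps a sigmoid approximation for the exact erf-based definition (via `libdevice.erf` in the triton kernel), since the approximation error was enough to fail the roberta test. A sketch of the two formulas, with 1.702 being the commonly used sigmoid-approximation coefficient:

```python
import math

def gelu_exact(x):
    """Exact GELU: x * Phi(x), computed via erf.

    This is what libdevice.erf gives the kernel on GPU.
    """
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_sigmoid_approx(x):
    """The cheaper sigmoid approximation the commit moved away from."""
    return x / (1.0 + math.exp(-1.702 * x))
```

The two agree to within a few thousandths near the origin, but that gap can compound across layers.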

* triton ops added with an option in config to use it with op_binding and config option

* Triton transformer added: InferenceTransformerFactory, TritonTransformer, TritonSelfAttention, TritonMLP and so forth

* Triton wrapper classes added

* added simple triton eval scripts

* rename the new benchmark script for triton-bert

* added triton attention, triton layer-norm/softmax

* adds tests to measure attention perf in triton and others

* changed triton flash attn function name

* attention set to use triton non-flash by default

* enable triton for bert

* made update_autotable false by default because it degrades the perf

* temp commit with debugging/profiling codes

* temporary debugging/profiling code lines added, need to be cleaned up later

* clean-up

* unit tests for triton inference ops are now passing

* removed unnecessary triton kernels

* test_inference passes

* removed debugging/profiling codes

* triton==2.0.0.dev20221202

* clean-up for formatting check pass
added layer_norm test without residual-add

* set triton version requirement

* further clean-up

* removed redundant files

* readme for triton matmul

* clean-up and add more test for triton-matmul

* typo

* removed another obsolete triton kernels and tests

* removed unnecessary TransformerInferenceFactory class

* removed obsolete test

* formatting check, cleanup

* formatting fix: added copyright to the head

* formatting: missing license added

* add pytest skip condition to test_matmul_ext

* formatting fix

* formatting

* added --forked option to inference_ops unit pytests

* Revert "added --forked option to inference_ops unit pytests"

This reverts commit 743b86d354b041172b06e4a8505f43ddd4c2544a.

* changed the pytest mark for softmax to be inference_ops

* formatting fix

* cleanup comments

* add missing import

* keep only fp16 matmuls because it's out of this PR's scope
int8-based gemm kernels will be added later

* removed the previous matmul_ext test

* triton quantization kernel removed too

* clean up comments

* added comments for license

* triton matmul always read the autotune table when imported and write the final table when closing

* modified triton kernels to have a new transposed_model arg

* added license note to files

* set default mlp kernel to be cuda as it's better than triton kernel with bert

* adds changes missed from the prev commit

* added license notes
increased DEEPSPEED_TEST_TIMEOUT from 600 to 900 for triton compilation

* added unit test for triton attention

* moved tests in layer_norm.py to test_layer_norm.py

* removed commented code lines

* removed triton from the main requirement as commented in PR

* follow PascalCase convention in class naming as suggested from pr review

* changes to make deepspeed work without triton
specifically, resolves error with importing any triton ops
added code lines that check the availability of triton and skip the tests if it's not available

* added a feature to run triton autotune at initialization, i.e., at op-building phase

* fix for the lint/formatting
added " # noqa: F401"

* move triton-bert-benchmark.py to microsoft/DeepSpeedExamples

* modify the code as suggested from PR

* make DEEPSPEED_TEST_TIMEOUT in unit test back to 600s

* made an option to skip triton-autotune in config

* lint fix for formatting

* removed repeated has_triton when importing triton
also the change for pr comment

* removed duplicated triton_autotune arg passing

* upgrade to triton 2.0
pydantic.validator for use_triton

* move triton specific op mapping into model_implementation as commented from PR

* removed commented lines

* need to cite where the file came from, as commented from the PR review

* change for the recent merge with the master

* qkv-gemm change to make distilbert work after the merge with the master

* format fix

* fix triton attention for qkv passing for non-pre-norm
requirements all use triton2.0.0

* skip autotune in test_matmul and test_attention with triton

* formatting with pre-commit

* add config for v100 test in matmul_4d kernel (small shared mem requirement)

* inject triton kernels only in bert and report it through log_dist
set triton to be the latest from requirements

* reduced the config and added mem check for matmul_4d

* added README.md tutorial page for triton-deepspeed

* typo in README

* refine README

* refine readme

* refine readme

* refine readme

* "Fix apex install bugs #3741"

---------

Co-authored-by: Arash Bakhtiari <arash@bakhtiari.org>
Co-authored-by: Stephen Youn <styoun@microsoft.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Ethan Doe <yidoe@microsoft.com>
Co-authored-by: yidoe <68296935+yidoe@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
6 participants