
Conversation

@jeffra commented Sep 1, 2020

  1. Sparse attention
  2. Refactor codebase into ops/runtime/etc.
  3. Tag for v0.3.0
  4. Conditional builds to allow picking and choosing which ops to build
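
Item 4 describes opt-in compilation of individual ops. A minimal sketch of the env-var pattern such a build system can use to decide whether an op gets compiled; the `DS_BUILD_*` variable names here are illustrative, not necessarily the exact ones this PR introduces:

```python
import os

def op_enabled(op_name: str, default: bool = False) -> bool:
    """Check an opt-in build flag of the form DS_BUILD_<OP_NAME>=0/1."""
    value = os.environ.get(f"DS_BUILD_{op_name.upper()}")
    if value is None:
        return default  # op not mentioned: fall back to its default
    return value == "1"

# Sparse attention is off by default in this PR and must be opted into:
os.environ["DS_BUILD_SPARSE_ATTN"] = "1"
print(op_enabled("sparse_attn"))  # True
```
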

arashashari and others added 16 commits August 28, 2020 19:29
* Sparse Transformer: adding codes related to ST

* updating dependency version of Triton

* applying comments

* updating Triton dependency to new version

* applied comments

* small change
* adding/updating sparsity config patterns

* adding random to Variable sparsity

* fixing a typo

* applying comment adding missing argument docstring
* adding unit test/s for sparse transformer

* file-name change update

* updated tests based on new list of sparsity configs

* Adding/updating sparsity config (#68)

* skipping a test if it is run on gpu with compute capability < 7; minimum V100
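
The V100 gate above boils down to comparing a (major, minor) compute-capability tuple against (7, 0); in the real test the tuple would come from `torch.cuda.get_device_capability()` and feed a pytest skip marker. A dependency-free sketch of the comparison:

```python
def meets_min_compute_capability(capability, minimum=(7, 0)):
    # capability is a (major, minor) tuple; (7, 0) corresponds to V100.
    # Python compares tuples lexicographically, which matches the
    # major-then-minor ordering of compute capabilities.
    return tuple(capability) >= tuple(minimum)

print(meets_min_compute_capability((6, 1)))  # False: pre-Volta, test skipped
print(meets_min_compute_capability((7, 0)))  # True: V100, test runs
```
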
* updating deepspeed config for Sparse Transformer

* adding unit test/s for sparse transformer (#60)

* fix a naming issue in utils file: bert_mode -> bert (#69)

* updating deepspeed config for Sparse Transformer

* updating sparsity config for DeepSpeed parameter list
* updating sparsityconfig and layout creation to enable variable sequence length per batch (#71)

* updating sparsityconfig and layout creation to enable variable sequence length per batch

* added utility functions to help with un/padding of input ids/embedding for ST

* added utility function to module list and updated unit tests accordingly; add module availability unit tests
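
Block-sparse kernels operate on sequence lengths that are multiples of the block size, which is what the padding/unpadding utilities above are for. A rough sketch of the arithmetic involved; function names are illustrative, not the module's actual API:

```python
def pad_length(seq_len: int, block_size: int) -> int:
    # Number of pad tokens needed to reach the next block-size multiple.
    return (block_size - seq_len % block_size) % block_size

def pad_ids(input_ids: list, block_size: int, pad_token_id: int = 0) -> list:
    # Right-pad token ids; un-padding is slicing back to the original length.
    return input_ids + [pad_token_id] * pad_length(len(input_ids), block_size)

padded = pad_ids([101, 2023, 102], block_size=16)
print(len(padded))  # 16
print(padded[:3])   # [101, 2023, 102] -- original ids recoverable by slicing
```
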
* Adding Sparse Transformer Tutorial Documentation
* adding documentation for Sparse Transformer and current result
* major refactor to separate main ds components
…e attention" (#76)

* sparse attention name change

* updated config, setup, and tests
* update sparse attention post doc

* added json config doc for sparse attention and fixed few typos

* updated tutorial

* updated the post based on the blog post text and image sizings

* ran formatter

* renamed a figure in the post; sa_backward_pass

* updated the Triton version to the latest; this version resolves a synchronization issue that was happening during compilation

* few figure size and caption updates

* fixed a bullet ordering issue

* fixed another bullet ordering issue

* added warning notes regarding incompatibility of Transformer Kernels and SA

* adding a note for V100 and CUDA requirement
* add fake pt module to expose old deepspeed_utils and config

* switch to sys.modules instead of import to make it more explicit what we're doing
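
Registering the compatibility shim through `sys.modules` makes the aliasing explicit: any later import of the old path resolves to the stand-in module. A self-contained illustration of the technique; the module names here are invented for the demo:

```python
import sys
import types

# Fabricate a stand-in module and register it (and its parent package)
# under a legacy dotted path.
pkg = types.ModuleType("legacy_pkg")
shim = types.ModuleType("legacy_pkg.old_utils")
shim.answer = 42
pkg.old_utils = shim
sys.modules["legacy_pkg"] = pkg
sys.modules["legacy_pkg.old_utils"] = shim

# Code written against the old path now transparently gets the shim:
import legacy_pkg.old_utils
print(legacy_pkg.old_utils.answer)  # 42
```
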
* conditional builds and updated version info

* formatting

* add mask for conditional builds, address other comments

* update to use shaden's updated test env

* log install requires list

* force local only build

* update torch 1.5+cuda10.1

* fix torch version

* turn off sparse-attn build by default, must opt-in for now

* turn off -I on python, maybe breaking with conda?

* turn off basic test in pipeline, just use in install.sh

* fail unit tests fast

* switch back to torch 1.2

* remove torch install link

* skip sparse attention tests for now
tjruwase added a commit that referenced this pull request Apr 12, 2025
* Integrate NVIDIA GPUDirect Storage into nvme library

* 1) Remove debug prints
2) Create write file with random data
3) Delete target file before new writes

* Workaround gds perf issue by leaking buffers

* DGX2 mount/unmount utilities

* Formatting

* Add torch save/load

* Add torch save/load

* Remove gds

* Add torch legacy save

* Update to new cli

* Add function signatures
Add file_offset arg to read/write apis
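
A `file_offset` argument turns the read/write calls into pread/pwrite-style operations, letting multiple writers target disjoint regions of one file without sharing a file position. A small illustration using Python's stdlib equivalents (`os.pwrite`/`os.pread`, POSIX-only), not the aio library's actual API:

```python
import os
import tempfile

# Two logical writers target disjoint byte ranges of one file
# via explicit offsets instead of a shared file cursor.
fd, path = tempfile.mkstemp()
try:
    os.pwrite(fd, b"AAAA", 0)  # writer 0 owns bytes [0, 4)
    os.pwrite(fd, b"BBBB", 4)  # writer 1 owns bytes [4, 8)
    data = os.pread(fd, 8, 0)
    print(data)  # b'AAAABBBB'
finally:
    os.close(fd)
    os.remove(path)
```
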

* Remove redundant asserts

* Add DeepSpeedFileWriter

* Add mock and python file writers

* Format fixes

* More perf counters

* Fix pinned_offset bug; Show as not real python file object

* Buffer copy speed

* Add torch_fastio option

* Format fixes

* Measure torch_fastio perf

* Force flush

* Formatting

* Renamings

* Fix device bug

* Disable torch.distributed requirement

* Renaming

* Integrate fast model checkpointing

* Double I/O buffer optimization
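
Double buffering overlaps filling one buffer with flushing the other: while buffer A drains to storage, new data lands in buffer B, then the roles swap. A simplified single-threaded sketch of the ping-pong pattern (real code would flush asynchronously; names are illustrative):

```python
def double_buffered_write(chunks, flush, buffer_size):
    # Two fixed buffers alternate roles: one fills while the other
    # is (conceptually) in flight to storage.
    buffers = [bytearray(), bytearray()]
    active = 0
    for chunk in chunks:
        buffers[active] += chunk
        if len(buffers[active]) >= buffer_size:
            flush(bytes(buffers[active]))
            buffers[active].clear()
            active = 1 - active  # swap roles
    if buffers[active]:
        flush(bytes(buffers[active]))  # drain the partial tail

written = []
double_buffered_write([b"ab", b"cd", b"ef"], written.append, buffer_size=4)
print(written)  # [b'abcd', b'ef']
```
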

* Support larger sizes

* Refactoring; save_storage api

* Cast to byte tensor

* Handle storage object saves

* Remove mysterious import

* Api to save storage object list; refactor stats

* add pytorch optimization

* fixed some syntax errors

* comment out save_storage for mock

* uncomment save storage for mock

* fixed indentation

* Yangli2/fastio double buffer pytorch optimized (#291)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>

* making deepspeed/runtime/fp16/loss_scaler/dynamiclossscale serializable

* Dump fast_writer stats only on rank 0

* Configuration option for fused fp16 optimizer

* Update to new API

* Format fixes

* Update to master (#340)

Co-authored-by: jerryyangli <jerryyangli@gmail.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>

* Support torch* optimization for version 1.12

* Formatting

* Versioned torch* optimization

* Versioned torch* optimizations (#341)

Co-authored-by: jerryyangli <jerryyangli@gmail.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>

* fp16 fused mode

* fp16 fused mode  (#342)

Co-authored-by: jerryyangli <jerryyangli@gmail.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>

* Support serialization versions

* Support serialization of different torch versions (#343)

Co-authored-by: jerryyangli <jerryyangli@gmail.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>

* distributed ckpt draft (#349)

* inject parallel write

* Support serialization of different torch versions (#343) (#345)

Co-authored-by: jerryyangli <jerryyangli@gmail.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: jerryyangli <jerryyangli@gmail.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>

* finish split distributed write

* split based-on num_bytes
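
Splitting "based on num_bytes" amounts to carving the serialized buffer into near-equal contiguous byte ranges, one per writer rank. A minimal standalone sketch of that partitioning (not the PR's actual code):

```python
def split_by_bytes(num_bytes, num_ranks):
    # Assign each rank a contiguous (offset, length) byte range;
    # the first (num_bytes % num_ranks) ranks get one extra byte.
    base, extra = divmod(num_bytes, num_ranks)
    ranges, offset = [], 0
    for rank in range(num_ranks):
        length = base + (1 if rank < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges

print(split_by_bytes(10, 3))  # [(0, 4), (4, 3), (7, 3)]
```
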

* resolving single node python test

* remove irrelevant prints

* format

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: jerryyangli <jerryyangli@gmail.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>

* torch serialization options

* Configurable torch serialization (#350)

Co-authored-by: jerryyangli <jerryyangli@gmail.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>

* Distributed writer slicing on byte boundary

* Fix typo

* FastFileWriter Config; Parallel writer nodes

* Minor fix

* remove warning from fast-io-ckpt (#354)

* Relocate debug print

* Parallel writing through byte boundary slicing (#351)

Co-authored-by: jerryyangli <jerryyangli@gmail.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>
Co-authored-by: Guanhua Wang <alexwgh333@gmail.com>

* fix broken mock_file_writer (#357)

* Report write speed

* DP writing

* DP MoE checkpoints
Generalize DP dense checkpoints for socket/machine options

* Various improvements (#376)

Co-authored-by: jerryyangli <jerryyangli@gmail.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>
Co-authored-by: Guanhua Wang <alexwgh333@gmail.com>

* Decoupled checkpointing

* New MP slicing algorithm

* Format fixes

* Decoupled checkpointing support (#384)

Co-authored-by: jerryyangli <jerryyangli@gmail.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>
Co-authored-by: Guanhua Wang <alexwgh333@gmail.com>

* add io multiplier for larger scale simulation (#411)

* add io multiplier config for simulation

* remove prints and test correctness

* format

* Merge with master

* Format fixes

* Guanhua/fast io clean v5 (#435)

* Add environment variable to make nvcc compilation more verbose (#2759)

* Bing/formatting correction (#2764)

* modify engine.py for formatting

* commit formatting changes on engine.py

* Add links to new azureML examples (#2756)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Fix hardcoded instances to fp16 in optimizer creation log messages to the correct dtype. (#2743)

* Remove hardcoded instances to fp16 in log messages.

* Add model_dtype to print the correct format

* Respond to PR feedback

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Refactor/Pydantify monitoring config (#2640)

* pydantify monitoring configs

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Pin minimum `packaging` requirement (#2771)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Fix for diffusers v0.12.0 (#2753)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* update copyright in aio

* type fix in ds_py_aio_handle

* update year in aio/py_test

* fix description in util pybind

* update and remove prints in fast_file_writer

* remove del print

* remove dist barrier in engine.py

* update year in runtime/model_ckpt

* add todo in runtime/model_ckpt/util.py

* update year

* reverse pip3

* update opbuilder

* format

* modify print for python

* fix print capability

* fix print

* some fix in flops_profiler (#2068)

* bugs in profiler:
1. Tensor.bmm missed in _patch_tensor_methods function
2. missed functions in _reload_functionals and _reload_tensor_methods functions
3. torch.mm and torch.Tensor.mm will have the same __name__ in wrapFunc; my suggestion is to use __str__ instead.
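
Point 3 is why patching by `__name__` collides: `torch.mm` and `torch.Tensor.mm` both report `__name__ == 'mm'`, so a wrapper table keyed on `__name__` keeps only one of them, while their `str()` forms stay distinct. The collision is reproducible without torch:

```python
class Tensor:
    def mm(self, other):  # stand-in for the method torch.Tensor.mm
        pass

def mm(a, b):  # stand-in for the module-level torch.mm
    pass

# Both callables share __name__, so a dict keyed on it collides ...
timing_table = {}
for func in (mm, Tensor.mm):
    timing_table[func.__name__] = func
print(len(timing_table))  # 1: the second entry overwrote the first

# ... while their str() forms differ and make unique keys.
keys = {str(f) for f in (mm, Tensor.mm)}
print(len(keys))  # 2
```
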

* formatting

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>

* fix upsample flops compute by skipping unused kargs (#2773)

* fix upsample flops compute by skipping unused kargs

* fix format

* format

* Fix broken kernel inject bug (#2776)

* format

* remove zero change

* fix engine issue

---------

Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
Co-authored-by: Bing Xie <67908712+xiexbing@users.noreply.github.com>
Co-authored-by: cassieesvelt <73311224+cassieesvelt@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: swli <47371259+lucasleesw@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Molly Smith <112220543+molly-smith@users.noreply.github.com>

* Formatting

* Formatting

* Debug file delete slowdown

* Investigate write perf

* Investigate write perf

* Fix missing args

* Fix microbenchmark and unit tests (#450)

* Formatting

* Rebase attempts

* updates for running with newest dependencies

* Pydantic fixes

* Rebase fixes

* Fix rebase bugs

* Add DS utils for tensor casting

* Format fixes

* Fix GDS

* Update with io_engine API

* Continued rebase

* Integrate GDS into writer factory

* Add --venv_script option

* Formatting fix

Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>

---------

Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: jerryyangli <jerryyangli@gmail.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>
Co-authored-by: Guanhua Wang <alexwgh333@gmail.com>
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
Co-authored-by: Bing Xie <67908712+xiexbing@users.noreply.github.com>
Co-authored-by: cassieesvelt <73311224+cassieesvelt@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: swli <47371259+lucasleesw@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Molly Smith <112220543+molly-smith@users.noreply.github.com>
Co-authored-by: Ubuntu <jomayeri@microsoft.com>