
Conversation

@jeffra commented Sep 1, 2020

  1. Sparse attention
  2. Refactor codebase into ops/runtime/etc.
  3. Tag for v0.3.0
  4. Conditional builds to allow picking and choosing which ops to build
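
Item 4 describes opt-in compilation of individual ops. A minimal sketch of the env-var pattern such a build system can use to decide whether an op gets compiled; the `DS_BUILD_*` variable names here are illustrative, not necessarily the exact ones this PR introduces:

```python
import os

def op_enabled(op_name: str, default: bool = False) -> bool:
    """Check an opt-in build flag of the form DS_BUILD_<OP_NAME>=0/1."""
    value = os.environ.get(f"DS_BUILD_{op_name.upper()}")
    if value is None:
        return default  # op not mentioned: fall back to its default
    return value == "1"

# Sparse attention is off by default in this PR and must be opted into:
os.environ["DS_BUILD_SPARSE_ATTN"] = "1"
print(op_enabled("sparse_attn"))  # True
```
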

arashashari and others added 16 commits August 28, 2020 19:29
* Sparse Transformer: adding codes related to ST

* updating dependency version of Triton

* applying comments

* updating Triton dependency to new version

* applied comments

* small change
* adding/updating sparsity config patterns

* adding random to Variable sparsity

* fixing a typo

* applying comment adding missing argument docstring
* adding unit test/s for sparse transformer

* file-name change update

* updated tests based on new list of sparsity configs

* Adding/updating sparsity config (#68)

* skipping a test if it is run on gpu with compute capability < 7; minimum V100
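
The V100 gate above boils down to comparing a (major, minor) compute-capability tuple against (7, 0); in the real test the tuple would come from `torch.cuda.get_device_capability()` and feed a pytest skip marker. A dependency-free sketch of the comparison:

```python
def meets_min_compute_capability(capability, minimum=(7, 0)):
    # capability is a (major, minor) tuple; (7, 0) corresponds to V100.
    # Python compares tuples lexicographically, which matches the
    # major-then-minor ordering of compute capabilities.
    return tuple(capability) >= tuple(minimum)

print(meets_min_compute_capability((6, 1)))  # False: pre-Volta, test skipped
print(meets_min_compute_capability((7, 0)))  # True: V100, test runs
```
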
* updating deepspeed config for Sparse Transformer

* adding unit test/s for sparse transformer (#60)

* fix a naming issue in utils file: bert_mode -> bert (#69)

* updating deepspeed config for Sparse Transformer

* updating sparsity config for DeepSpeed parameter list
* updating sparsityconfig and layout creation to enable variable sequence length per batch (#71)

* updating sparsityconfig and layout creation to enable variable sequence length per batch

* added utility functions to help with un/padding of input ids/embedding for ST

* added utility function to module list and updated unit tests accordingly; add module availability unit tests
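
Block-sparse kernels operate on sequence lengths that are multiples of the block size, which is what the padding/unpadding utilities above are for. A rough sketch of the arithmetic involved; function names are illustrative, not the module's actual API:

```python
def pad_length(seq_len: int, block_size: int) -> int:
    # Number of pad tokens needed to reach the next block-size multiple.
    return (block_size - seq_len % block_size) % block_size

def pad_ids(input_ids: list, block_size: int, pad_token_id: int = 0) -> list:
    # Right-pad token ids; un-padding is slicing back to the original length.
    return input_ids + [pad_token_id] * pad_length(len(input_ids), block_size)

padded = pad_ids([101, 2023, 102], block_size=16)
print(len(padded))  # 16
print(padded[:3])   # [101, 2023, 102] -- original ids recoverable by slicing
```
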
* Adding Sparse Transformer Tutorial Documentation
* adding documentation for Sparse Transformer and current result
* major refactor to separate main ds components
…e attention" (#76)

* sparse attention name change

* updated config, setup, and tests
* update sparse attention post doc

* added json config doc for sparse attention and fixed few typos

* updated tutorial

* updated the post based on the blog post text and image sizings

* ran formatter

* renamed a figure in the post; sa_backward_pass

* updated the Triton version to the latest; this version resolves a synchronization issue that was happening during compilation

* few figure size and caption updates

* fixed a bullet ordering issue

* fixed another bullet ordering issue

* added warning notes regarding incompatibility of Transformer Kernels and SA

* adding a note for V100 and CUDA requirement
* add fake pt module to expose old deepspeed_utils and config

* switch to sys.modules instead of import to make it more explicit what we're doing
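
Registering the compatibility shim through `sys.modules` makes the aliasing explicit: any later import of the old path resolves to the stand-in module. A self-contained illustration of the technique; the module names here are invented for the demo:

```python
import sys
import types

# Fabricate a stand-in module and register it (and its parent package)
# under a legacy dotted path.
pkg = types.ModuleType("legacy_pkg")
shim = types.ModuleType("legacy_pkg.old_utils")
shim.answer = 42
pkg.old_utils = shim
sys.modules["legacy_pkg"] = pkg
sys.modules["legacy_pkg.old_utils"] = shim

# Code written against the old path now transparently gets the shim:
import legacy_pkg.old_utils
print(legacy_pkg.old_utils.answer)  # 42
```
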
* conditional builds and updated version info

* formatting

* add mask for conditional builds, address other comments

* update to use shaden's updated test env

* log install requires list

* force local only build

* update torch 1.5+cuda10.1

* fix torch version

* turn off sparse-attn build by default, must opt-in for now

* turn off -I on python, maybe breaking with conda?

* turn off basic test in pipeline, just use in install.sh

* fail unit tests fast

* switch back to torch 1.2

* remove torch install link

* skip sparse attention tests for now
tjruwase added a commit that referenced this pull request Apr 12, 2025
* Integrate NVIDIA GPUDirect Storage into nvme library

* 1) Remove debug prints
2) Create write file with random data
3) Delete target file before new writes

* Workaround gds perf issue by leaking buffers

* DGX2 mount/unmount utilities

* Formatting

* Add torch save/load

* Add torch save/load

* Remove gds

* Add torch legacy save

* Update to new cli

* Add function signatures
Add file_offset arg to read/write apis
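
A `file_offset` argument turns the read/write calls into pread/pwrite-style operations, letting multiple writers target disjoint regions of one file without sharing a file position. A small illustration using Python's stdlib equivalents (`os.pwrite`/`os.pread`, POSIX-only), not the aio library's actual API:

```python
import os
import tempfile

# Two logical writers target disjoint byte ranges of one file
# via explicit offsets instead of a shared file cursor.
fd, path = tempfile.mkstemp()
try:
    os.pwrite(fd, b"AAAA", 0)  # writer 0 owns bytes [0, 4)
    os.pwrite(fd, b"BBBB", 4)  # writer 1 owns bytes [4, 8)
    data = os.pread(fd, 8, 0)
    print(data)  # b'AAAABBBB'
finally:
    os.close(fd)
    os.remove(path)
```
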

* Remove redundant asserts

* Add DeepSpeedFileWriter

* Add mock and python file writers

* Format fixes

* More perf counters

* Fix pinned_offset bug; Show as not real python file object

* Buffer copy speed

* Add torch_fastio option

* Format fixes

* Measure torch_fastio perf

* Force flush

* Formatting

* Renamings

* Fix device bug

* Disable torch.distributed requirement

* Renaming

* Integrate fast model checkpointing

* Double I/O buffer optimization
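
Double buffering overlaps filling one buffer with flushing the other: while buffer A drains to storage, new data lands in buffer B, then the roles swap. A simplified single-threaded sketch of the ping-pong pattern (real code would flush asynchronously; names are illustrative):

```python
def double_buffered_write(chunks, flush, buffer_size):
    # Two fixed buffers alternate roles: one fills while the other
    # is (conceptually) in flight to storage.
    buffers = [bytearray(), bytearray()]
    active = 0
    for chunk in chunks:
        buffers[active] += chunk
        if len(buffers[active]) >= buffer_size:
            flush(bytes(buffers[active]))
            buffers[active].clear()
            active = 1 - active  # swap roles
    if buffers[active]:
        flush(bytes(buffers[active]))  # drain the partial tail

written = []
double_buffered_write([b"ab", b"cd", b"ef"], written.append, buffer_size=4)
print(written)  # [b'abcd', b'ef']
```
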

* Support larger sizes

* Refactoring; save_storage api

* Cast to byte tensor

* Handle storage object saves

* Remove mysterious import

* Api to save storage object list; refactor stats

* add pytorch optimization

* fixed some syntax errors

* comment out save_storage for mock

* uncomment save storage for mock

* fixed indentation

* Yangli2/fastio double buffer pytorch optimized (#291)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>

* making deepspeed/runtime/fp16/loss_scaler/dynamiclossscale serializable

* Dump fast_writer stats only on rank 0

* Configuration option for fused fp16 optimizer

* Update to new API

* Format fixes

* Update to master (#340)

Co-authored-by: jerryyangli <jerryyangli@gmail.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>

* Support torch* optimization for version 1.12

* Formatting

* Versioned torch* optimization

* Versioned torch* optimizations (#341)

Co-authored-by: jerryyangli <jerryyangli@gmail.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>

* fp16 fused mode

* fp16 fused mode  (#342)

Co-authored-by: jerryyangli <jerryyangli@gmail.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>

* Support serialization versions

* Support serialization of different torch versions (#343)

Co-authored-by: jerryyangli <jerryyangli@gmail.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>

* distributed ckpt draft (#349)

* inject parallel write

* Support serialization of different torch versions (#343) (#345)

Co-authored-by: jerryyangli <jerryyangli@gmail.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: jerryyangli <jerryyangli@gmail.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>

* finish split distributed write

* split based-on num_bytes
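
Splitting "based on num_bytes" amounts to carving the serialized buffer into near-equal contiguous byte ranges, one per writer rank. A minimal standalone sketch of that partitioning (not the PR's actual code):

```python
def split_by_bytes(num_bytes, num_ranks):
    # Assign each rank a contiguous (offset, length) byte range;
    # the first (num_bytes % num_ranks) ranks get one extra byte.
    base, extra = divmod(num_bytes, num_ranks)
    ranges, offset = [], 0
    for rank in range(num_ranks):
        length = base + (1 if rank < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges

print(split_by_bytes(10, 3))  # [(0, 4), (4, 3), (7, 3)]
```
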

* resolving single node python test

* remove irrelevant prints

* format

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: jerryyangli <jerryyangli@gmail.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>

* torch serialization options

* Configurable torch serialization (#350)

Co-authored-by: jerryyangli <jerryyangli@gmail.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>

* Distributed writer slicing on byte boundary

* Fix typo

* FastFileWriter Config; Parallel writer nodes

* Minor fix

* remove warning from fast-io-ckpt (#354)

* Relocate debug print

* Parallel writing through byte boundary slicing (#351)

Co-authored-by: jerryyangli <jerryyangli@gmail.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>
Co-authored-by: Guanhua Wang <alexwgh333@gmail.com>

* fix broken mock_file_writer (#357)

* Report write speed

* DP writing

* DP MoE checkpoints
Generalize DP dense checkpoints for socket/machine options

* Various improvements (#376)

Co-authored-by: jerryyangli <jerryyangli@gmail.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>
Co-authored-by: Guanhua Wang <alexwgh333@gmail.com>

* Decoupled checkpointing

* New MP slicing algorithm

* Format fixes

* Decoupled checkpointing support (#384)

Co-authored-by: jerryyangli <jerryyangli@gmail.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>
Co-authored-by: Guanhua Wang <alexwgh333@gmail.com>

* add io multiplier for larger scale simulation (#411)

* add io multiplier config for simulation

* remove prints and test correctness

* format

* Merge with master

* Format fixes

* Guanhua/fast io clean v5 (#435)

* Add environment variable to make nvcc compilation more verbose (#2759)

* Bing/formatting correction (#2764)

* modify engine.py for formatting

* commit formatting changes on engine.py

* Add links to new azureML examples (#2756)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Fix hardcoded instances to fp16 in optimizer creation log messages to the correct dtype. (#2743)

* Remove hardcoded instances to fp16 in log messages.

* Add model_dtype to print the correct format

* Respond to PR feedback

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Refactor/Pydantify monitoring config (#2640)

* pydantify monitoring configs

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Pin minimum `packaging` requirement (#2771)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Fix for diffusers v0.12.0 (#2753)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* update copyright in aio

* type fix in ds_py_aio_handle

* update year in aio/py_test

* fix description in util pybind

* update and remove prints in fast_file_writer

* remove del print

* remove dist barrier in engine.py

* update year in runtime/model_ckpt

* add todo in runtime/model_ckpt/util.py

* update year

* reverse pip3

* update opbuilder

* format

* modify print for python

* fix print capability

* fix print

* some fix in flops_profiler (#2068)

* bugs in profiler:
1. Tensor.bmm missed in _patch_tensor_methods function
2. missed functions in _reload_functionals and _reload_tensor_methods functions
3. torch.mm and torch.Tensor.mm will have the same __name__ in wrapFunc; my suggestion is to use __str__ instead.
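
Point 3 is why patching by `__name__` collides: `torch.mm` and `torch.Tensor.mm` both report `__name__ == 'mm'`, so a wrapper table keyed on `__name__` keeps only one of them, while their `str()` forms stay distinct. The collision is reproducible without torch:

```python
class Tensor:
    def mm(self, other):  # stand-in for the method torch.Tensor.mm
        pass

def mm(a, b):  # stand-in for the module-level torch.mm
    pass

# Both callables share __name__, so a dict keyed on it collides ...
timing_table = {}
for func in (mm, Tensor.mm):
    timing_table[func.__name__] = func
print(len(timing_table))  # 1: the second entry overwrote the first

# ... while their str() forms differ and make unique keys.
keys = {str(f) for f in (mm, Tensor.mm)}
print(len(keys))  # 2
```
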

* formatting

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>

* fix upsample flops compute by skipping unused kargs (#2773)

* fix upsample flops compute by skipping unused kargs

* fix format

* format

* Fix broken kernel inject bug (#2776)

* format

* remove zero change

* fix engine issue

---------

Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
Co-authored-by: Bing Xie <67908712+xiexbing@users.noreply.github.com>
Co-authored-by: cassieesvelt <73311224+cassieesvelt@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: swli <47371259+lucasleesw@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Molly Smith <112220543+molly-smith@users.noreply.github.com>

* Formatting

* Formatting

* Debug file delete slowdown

* Investigate write perf

* Investigate write perf

* Fix missing args

* Fix microbenchmark and unit tests (#450)

* Formatting

* Rebase attempts

* updates for running with newest dependencies

* Pydantic fixes

* Rebase fixes

* Fix rebase bugs

* Add DS utils for tensor casting

* Format fixes

* Fix GDS

* Update with io_engine API

* Continued rebase

* Integrate GDS into writer factory

* Add --venv_script option

* Formatting fix

Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>

---------

Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: jerryyangli <jerryyangli@gmail.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>
Co-authored-by: Guanhua Wang <alexwgh333@gmail.com>
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
Co-authored-by: Bing Xie <67908712+xiexbing@users.noreply.github.com>
Co-authored-by: cassieesvelt <73311224+cassieesvelt@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: swli <47371259+lucasleesw@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Molly Smith <112220543+molly-smith@users.noreply.github.com>
Co-authored-by: Ubuntu <jomayeri@microsoft.com>