
Conversation

@arashashari (Owner)

No description provided.

arashashari and others added 30 commits May 18, 2020 09:33
* adding BingSquad e2e test

* updating the draft test; bringing the final step under the try section

* finalizing test for base DeepSpeed and DeepSpeed with ZeRO

* addressing the review comment (thanks Jeff); fixed formatting
Updates for ZeRO stage 2 + ZeRO stage 1 with reduce-scatter (RS)

Co-authored-by: Tunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: Elton Zheng <eltonz@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: yuxionghe <yuxhe@microsoft.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
* BERT title
* updates to support fp32 grad clipping and disable max_grad_norm
* Fix for the CPU memory bloating issue caused by PyTorch backward-graph creation in allgather; fixed by calling detach on the tensors before calling all_gather (sketched below)
Contiguous gradients should be set to false by default; it's not useful unless the model is very large
* add support for predivide as a flag
* add predivide json config, remove allgather_disable (as it's no longer used)
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
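
A minimal sketch of the detach-before-all_gather pattern described above, assuming plain torch.distributed rather than DeepSpeed's internal partitioning code; the function and tensor names are illustrative:

```python
import torch
import torch.distributed as dist

def gather_partition(partition: torch.Tensor, world_size: int):
    """Gather a gradient partition from all ranks.

    Detaching first keeps autograd from recording the all_gather in the
    backward graph, which is what caused the CPU memory bloat noted above.
    """
    detached = partition.detach()  # drop autograd history before communicating
    gathered = [torch.empty_like(detached) for _ in range(world_size)]
    dist.all_gather(gathered, detached)
    return gathered
```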
* fix: typo in code docs

* more pythonic code
* Transformer kernels release

Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: Elton Zheng <eltonz@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: RezaYazdaniAminabadi <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Tunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
jeffra and others added 29 commits July 24, 2020 10:21
* fix nv_peer_mem version in dockerfile

* fix security issue, remove pillow dependency (this is only needed for cifar example which has its own requirements.txt)
The mpu object is bound to the class instance.

The if statement checks `self.mpu`, but the following lines reference just `mpu`.

This raises a NameError.
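
A hedged illustration of the bug being reported, with a made-up class standing in for the engine; only the `self.mpu` vs. `mpu` mismatch is the point:

```python
class Engine:
    def __init__(self, mpu=None):
        self.mpu = mpu  # the mpu object is bound to the instance

    def is_model_parallel(self):
        if self.mpu is not None:
            # BUG: `mpu` is neither a local nor a global name here, only
            # `self.mpu` is, so this line raises NameError when the branch
            # is taken.
            return mpu.get_model_parallel_world_size() > 1
        return False
```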
The parentheses alter the evaluation of the assert(), so it always evaluates to True.
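
For reference, the Python pitfall being pointed out: wrapping an assert's condition and message in parentheses makes the statement assert a two-element tuple, and any non-empty tuple is truthy, so the check never fails. A minimal example:

```python
def check(x):
    # BUG: the parentheses turn this into `assert <non-empty tuple>`,
    # which is always truthy, so this never raises.
    assert (x >= 0, "x must be non-negative")

def check_fixed(x):
    # Correct: condition and message are separate operands of the assert.
    assert x >= 0, "x must be non-negative"

check(-1)          # passes silently despite the negative value
# check_fixed(-1)  # would raise AssertionError, as intended
```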
Add webinar on-demand links and update readme
* add fix and tests for get_lr from lr_scheduler before training starts
* update fan out flag for pdsh
* turn off multi-node launch if only 1 node
* Update deepspeed_checkpointing.py

* formatting

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* Adding gradient accumulation support for ZeRO Stage 2. Changing all Megatron-LM tests to also test gradient accumulation (an illustrative config sketch follows this commit's trailer)

* Gradient Accumulation support for Stage 2. Model tests added to test the feature

* formatting

* Update deepspeed_light.py

removing comment

* Update ds_config_func_bs8_zero1.json

reverting this file; it's not needed for this PR

* defining baseline prefix

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
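
A minimal sketch of the kind of configuration the new Stage 2 gradient accumulation tests exercise, assuming standard DeepSpeed config keys; the values are placeholders, not taken from this PR:

```python
# Illustrative DeepSpeed configuration (these keys normally live in a
# ds_config.json file); values are placeholders, not from this PR.
ds_config = {
    "train_batch_size": 32,              # micro_batch * grad_accum * world_size
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,    # the feature exercised by the new tests
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": False,   # per the earlier note, off by default
    },
}
```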
Renaming config files to gas3 (gradient accumulation steps = 3)
* Sparse attn + ops/runtime refactor + v0.3.0

Co-authored-by: Arash Ashari <arashari@microsoft.com>
Remove llvm/cmake install for now, as it's causing pyyaml issues
@arashashari arashashari merged commit a2984d0 into arashashari:master Sep 2, 2020