debugging for zero-dual #350

jren73 · 2020-09-03T18:05:47Z

No description provided.

modified: deepspeed/pt/fp16_unfused_optimizer.py new file: install_output.txt modified: tests/unit/test_dynamic_loss_scale.py

modified: deepspeed/pt/deepspeed_zero_optimizer.py modified: tests/unit/test_checkpointing.py modified: tests/unit/test_fp16.py

modified: tests/unit/test_dynamic_loss_scale.py

ZeRO-2 CPU offload update

modified: deepspeed/pt/deepspeed_cpu_adam.py

ZeRO-2 CPU_offload

modified: deepspeed/pt/deepspeed_zero_optimizer.py

CPU offload

modified: deepspeed_light.py modified: deepspeed_zero_optimizer.py ../../deepspeed_zero_optimizer_cpu_offload.py

deleted: deepspeed_cpu_adam.py

modified: deepspeed/pt/deepspeed_light.py

modified: deepspeed/pt/deepspeed_zero_optimizer.py modified: deepspeed/pt/deepspeed_zero_utils.py modified: tests/unit/test_fp16.py

modified: deepspeed/pt/deepspeed_light.py

modified: deepspeed/pt/deepspeed_light.py modified: deepspeed/pt/deepspeed_zero_optimizer.py modified: tests/unit/test_checkpointing.py modified: tests/unit/test_fp16.py

modified: deepspeed/pt/deepspeed_config.py

modified: deepspeed/pt/deepspeed_checkpointing.py

modified: deepspeed/pt/deepspeed_light.py modified: deepspeed/pt/deepspeed_lr_schedules.py modified: deepspeed/pt/deepspeed_run.py modified: deepspeed/pt/deepspeed_zero_optimizer.py modified: deepspeed/pt/deepspeed_config.py modified: deepspeed/pt/deepspeed_zero_optimizer.py

…el running reasonably well. 47 TFlops without optimizer step, 26 TFlops with optimizer step. Grad acc of 2 on a single GPU with micro batch size of 24, hidden dim 4096, and 6 layers. Not a practical model, but just testing offload overheads. Ran with overlap true and reduce_bucket_size of 50M
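The throughput figures above allow a rough estimate of how much of each iteration goes to the offloaded optimizer step. The sketch below is not taken from the PR. It assumes both TFLOPS numbers count the same floating-point work per iteration, so the drop in throughput is entirely extra wall-clock time. The config dict is a hypothetical reconstruction of the described run: `cpu_offload`, `overlap_comm` and `reduce_bucket_size` are DeepSpeed ZeRO config keys, "50M" is read as 50,000,000 elements, and the `fp16` entry is an assumption.

```python
import json


def optimizer_step_share(tflops_without_step: float, tflops_with_step: float) -> float:
    """Fraction of iteration time spent in the optimizer step.

    Assumes the same FLOP count F in both measurements, so time is F / throughput
    and the share is (F/with - F/without) / (F/with) = 1 - with / without.
    """
    if tflops_without_step <= 0 or tflops_with_step <= 0:
        raise ValueError("throughput values must be positive")
    if tflops_with_step > tflops_without_step:
        raise ValueError("throughput with the optimizer step cannot exceed throughput without it")
    return 1.0 - tflops_with_step / tflops_without_step


# Hypothetical config mirroring the run described in the commit message.
ds_config = {
    "train_micro_batch_size_per_gpu": 24,   # micro batch size of 24
    "gradient_accumulation_steps": 2,       # grad acc of 2
    "fp16": {"enabled": True},              # assumption: not stated in the message
    "zero_optimization": {
        "stage": 2,
        "cpu_offload": True,
        "overlap_comm": True,               # "overlap true"
        "reduce_bucket_size": 50_000_000,   # "50M", read as element count
    },
}

share = optimizer_step_share(47.0, 26.0)
print(json.dumps(ds_config, indent=2))
print(f"optimizer step share of iteration time: {share:.1%}")
```

Under these assumptions the optimizer step accounts for about 45% of each iteration (1 - 26/47), which is consistent with the "slow optimizer step" noted in the commit log.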

tjruwase · 2020-09-03T22:12:17Z

Replaced by #354

* Integrate NVIDIA GPUDirect Storage into nvme library * 1) Remove debug prints 2) Create write file with random data 3) Delete target file before new writes * Workaround gds perf issue by leaking buffers * DGX2 mount/unmount utililties * Formatting * Add torch save/load * Add torch save/load * Remove gds * Add torch legacy save * Update to new cli * Add function signatures Add file_offset arg to read/write apis * Remove redundant asserts * Add DeepSpeedFileWriter * Add mock and python file writers * Format fixes * More perf counters * Fix pinned_offset bug; Show as not real python file object * Buffer copy speed * Add torch_fastio option * Format fixes * Measure torch_fastio perf * Force flush * Formatting * Renamings * Fix device bug * Disable torch.distributed requirement * Renaming * Integrate fast model checkpointing * Double I/O buffer optimization * Support larger sizes * Refactoring; save_storage api * Cast to byte tensor * Handle storage object saves * Remove mysterious import * Api to save storage object list; refactor stats * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation * Yangli2/fastio double buffer pytorch optimized (#291) * Double I/O buffer optimization * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * making deepspeed/runtime/fp16/loss_scaler/dynamiclossscale serializable * Dump fast_writer stats only on rank 0 * Configuration option for fused fp16 optimizer * Update to new API * Format fixes * Update to master (#340) * Integrate NVIDIA GPUDirect Storage into nvme library * 1) Remove debug prints 2) Create write file with random data 3) Delete target file before new writes * Workaround gds perf issue by leaking buffers * DGX2 mount/unmount utililties * 
Formatting * Add torch save/load * Add torch save/load * Remove gds * Add torch legacy save * Update to new cli * Add function signatures Add file_offset arg to read/write apis * Remove redundant asserts * Add DeepSpeedFileWriter * Add mock and python file writers * Format fixes * More perf counters * Fix pinned_offset bug; Show as not real python file object * Buffer copy speed * Add torch_fastio option * Format fixes * Measure torch_fastio perf * Force flush * Formatting * Renamings * Fix device bug * Disable torch.distributed requirement * Renaming * Integrate fast model checkpointing * Double I/O buffer optimization * Support larger sizes * Refactoring; save_storage api * Cast to byte tensor * Handle storage object saves * Remove mysterious import * Api to save storage object list; refactor stats * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation * Yangli2/fastio double buffer pytorch optimized (#291) * Double I/O buffer optimization * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * making deepspeed/runtime/fp16/loss_scaler/dynamiclossscale serializable * Dump fast_writer stats only on rank 0 * Configuration option for fused fp16 optimizer * Update to new API * Format fixes Co-authored-by: jerryyangli <jerryyangli@gmail.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * Support torch* optimization for version 1.12 * Formatting * Versioned torch* optimization * Versioned torch* optimizations (#341) * Integrate NVIDIA GPUDirect Storage into nvme library * 1) Remove debug prints 2) Create write file with random data 3) Delete target file before new writes * Workaround gds perf issue by leaking buffers * DGX2 mount/unmount utililties * Formatting * Add torch save/load * 
Add torch save/load * Remove gds * Add torch legacy save * Update to new cli * Add function signatures Add file_offset arg to read/write apis * Remove redundant asserts * Add DeepSpeedFileWriter * Add mock and python file writers * Format fixes * More perf counters * Fix pinned_offset bug; Show as not real python file object * Buffer copy speed * Add torch_fastio option * Format fixes * Measure torch_fastio perf * Force flush * Formatting * Renamings * Fix device bug * Disable torch.distributed requirement * Renaming * Integrate fast model checkpointing * Double I/O buffer optimization * Support larger sizes * Refactoring; save_storage api * Cast to byte tensor * Handle storage object saves * Remove mysterious import * Api to save storage object list; refactor stats * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation * Yangli2/fastio double buffer pytorch optimized (#291) * Double I/O buffer optimization * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * making deepspeed/runtime/fp16/loss_scaler/dynamiclossscale serializable * Dump fast_writer stats only on rank 0 * Configuration option for fused fp16 optimizer * Update to new API * Format fixes * Support torch* optimization for version 1.12 * Formatting * Versioned torch* optimization Co-authored-by: jerryyangli <jerryyangli@gmail.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * fp16 fused mode * fp16 fused mode (#342) * Integrate NVIDIA GPUDirect Storage into nvme library * 1) Remove debug prints 2) Create write file with random data 3) Delete target file before new writes * Workaround gds perf issue by leaking buffers * DGX2 mount/unmount utililties * Formatting * Add torch save/load * Add torch save/load * Remove gds * 
Add torch legacy save * Update to new cli * Add function signatures Add file_offset arg to read/write apis * Remove redundant asserts * Add DeepSpeedFileWriter * Add mock and python file writers * Format fixes * More perf counters * Fix pinned_offset bug; Show as not real python file object * Buffer copy speed * Add torch_fastio option * Format fixes * Measure torch_fastio perf * Force flush * Formatting * Renamings * Fix device bug * Disable torch.distributed requirement * Renaming * Integrate fast model checkpointing * Double I/O buffer optimization * Support larger sizes * Refactoring; save_storage api * Cast to byte tensor * Handle storage object saves * Remove mysterious import * Api to save storage object list; refactor stats * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation * Yangli2/fastio double buffer pytorch optimized (#291) * Double I/O buffer optimization * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * making deepspeed/runtime/fp16/loss_scaler/dynamiclossscale serializable * Dump fast_writer stats only on rank 0 * Configuration option for fused fp16 optimizer * Update to new API * Format fixes * Support torch* optimization for version 1.12 * Formatting * Versioned torch* optimization * fp16 fused mode Co-authored-by: jerryyangli <jerryyangli@gmail.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * Support serialization versions * Support serialization of different torch versions (#343) * Integrate NVIDIA GPUDirect Storage into nvme library * 1) Remove debug prints 2) Create write file with random data 3) Delete target file before new writes * Workaround gds perf issue by leaking buffers * DGX2 mount/unmount utililties * Formatting * Add torch save/load * 
Add torch save/load * Remove gds * Add torch legacy save * Update to new cli * Add function signatures Add file_offset arg to read/write apis * Remove redundant asserts * Add DeepSpeedFileWriter * Add mock and python file writers * Format fixes * More perf counters * Fix pinned_offset bug; Show as not real python file object * Buffer copy speed * Add torch_fastio option * Format fixes * Measure torch_fastio perf * Force flush * Formatting * Renamings * Fix device bug * Disable torch.distributed requirement * Renaming * Integrate fast model checkpointing * Double I/O buffer optimization * Support larger sizes * Refactoring; save_storage api * Cast to byte tensor * Handle storage object saves * Remove mysterious import * Api to save storage object list; refactor stats * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation * Yangli2/fastio double buffer pytorch optimized (#291) * Double I/O buffer optimization * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * making deepspeed/runtime/fp16/loss_scaler/dynamiclossscale serializable * Dump fast_writer stats only on rank 0 * Configuration option for fused fp16 optimizer * Update to new API * Format fixes * Support torch* optimization for version 1.12 * Formatting * Versioned torch* optimization * fp16 fused mode * Support serialization versions Co-authored-by: jerryyangli <jerryyangli@gmail.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * distributed ckpt draft (#349) * inject parallel write * Support serialization of different torch versions (#343) (#345) * Integrate NVIDIA GPUDirect Storage into nvme library * 1) Remove debug prints 2) Create write file with random data 3) Delete target file before new writes * Workaround gds 
perf issue by leaking buffers * DGX2 mount/unmount utililties * Formatting * Add torch save/load * Add torch save/load * Remove gds * Add torch legacy save * Update to new cli * Add function signatures Add file_offset arg to read/write apis * Remove redundant asserts * Add DeepSpeedFileWriter * Add mock and python file writers * Format fixes * More perf counters * Fix pinned_offset bug; Show as not real python file object * Buffer copy speed * Add torch_fastio option * Format fixes * Measure torch_fastio perf * Force flush * Formatting * Renamings * Fix device bug * Disable torch.distributed requirement * Renaming * Integrate fast model checkpointing * Double I/O buffer optimization * Support larger sizes * Refactoring; save_storage api * Cast to byte tensor * Handle storage object saves * Remove mysterious import * Api to save storage object list; refactor stats * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation * Yangli2/fastio double buffer pytorch optimized (#291) * Double I/O buffer optimization * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * making deepspeed/runtime/fp16/loss_scaler/dynamiclossscale serializable * Dump fast_writer stats only on rank 0 * Configuration option for fused fp16 optimizer * Update to new API * Format fixes * Support torch* optimization for version 1.12 * Formatting * Versioned torch* optimization * fp16 fused mode * Support serialization versions Co-authored-by: jerryyangli <jerryyangli@gmail.com> Co-authored-by: Yang Li <yangli2@microsoft.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: jerryyangli <jerryyangli@gmail.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * finish split distributed write * split 
based-on num_bytes * resolving single node python test * remove irrelavent prints * format Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: jerryyangli <jerryyangli@gmail.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * torch serialization options * Configurable torch serialization (#350) * Integrate NVIDIA GPUDirect Storage into nvme library * 1) Remove debug prints 2) Create write file with random data 3) Delete target file before new writes * Workaround gds perf issue by leaking buffers * DGX2 mount/unmount utililties * Formatting * Add torch save/load * Add torch save/load * Remove gds * Add torch legacy save * Update to new cli * Add function signatures Add file_offset arg to read/write apis * Remove redundant asserts * Add DeepSpeedFileWriter * Add mock and python file writers * Format fixes * More perf counters * Fix pinned_offset bug; Show as not real python file object * Buffer copy speed * Add torch_fastio option * Format fixes * Measure torch_fastio perf * Force flush * Formatting * Renamings * Fix device bug * Disable torch.distributed requirement * Renaming * Integrate fast model checkpointing * Double I/O buffer optimization * Support larger sizes * Refactoring; save_storage api * Cast to byte tensor * Handle storage object saves * Remove mysterious import * Api to save storage object list; refactor stats * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation * Yangli2/fastio double buffer pytorch optimized (#291) * Double I/O buffer optimization * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * making deepspeed/runtime/fp16/loss_scaler/dynamiclossscale serializable * Dump fast_writer stats only on rank 0 * Configuration option for fused 
fp16 optimizer * Update to new API * Format fixes * Support torch* optimization for version 1.12 * Formatting * Versioned torch* optimization * fp16 fused mode * Support serialization versions * torch serialization options Co-authored-by: jerryyangli <jerryyangli@gmail.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * Distributed writer slicing on byte boundary * Fix typo * FastFileWriter Config; Parallel writer nodes * Minor fix * remove warning from fast-io-ckpt (#354) * Relocate debug print * Parallel writing through byte boundary slicing (#351) * Integrate NVIDIA GPUDirect Storage into nvme library * 1) Remove debug prints 2) Create write file with random data 3) Delete target file before new writes * Workaround gds perf issue by leaking buffers * DGX2 mount/unmount utililties * Formatting * Add torch save/load * Add torch save/load * Remove gds * Add torch legacy save * Update to new cli * Add function signatures Add file_offset arg to read/write apis * Remove redundant asserts * Add DeepSpeedFileWriter * Add mock and python file writers * Format fixes * More perf counters * Fix pinned_offset bug; Show as not real python file object * Buffer copy speed * Add torch_fastio option * Format fixes * Measure torch_fastio perf * Force flush * Formatting * Renamings * Fix device bug * Disable torch.distributed requirement * Renaming * Integrate fast model checkpointing * Double I/O buffer optimization * Support larger sizes * Refactoring; save_storage api * Cast to byte tensor * Handle storage object saves * Remove mysterious import * Api to save storage object list; refactor stats * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation * Yangli2/fastio double buffer pytorch optimized (#291) * Double I/O buffer optimization * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation 
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * making deepspeed/runtime/fp16/loss_scaler/dynamiclossscale serializable * Dump fast_writer stats only on rank 0 * Configuration option for fused fp16 optimizer * Update to new API * Format fixes * Support torch* optimization for version 1.12 * Formatting * Versioned torch* optimization * fp16 fused mode * Support serialization versions * torch serialization options * Distributed writer slicing on byte boundary * Fix typo * FastFileWriter Config; Parallel writer nodes * Minor fix * remove warning from fast-io-ckpt (#354) * Relocate debug print Co-authored-by: jerryyangli <jerryyangli@gmail.com> Co-authored-by: Yang Li <yangli2@microsoft.com> Co-authored-by: Guanhua Wang <alexwgh333@gmail.com> * fix broken mock_file_writer (#357) * Report write speed * DP writing * DP MoE checkpoints Generalize DP dense checkpoints for socket/machine options * Various improvements (#376) * Integrate NVIDIA GPUDirect Storage into nvme library * 1) Remove debug prints 2) Create write file with random data 3) Delete target file before new writes * Workaround gds perf issue by leaking buffers * DGX2 mount/unmount utililties * Formatting * Add torch save/load * Add torch save/load * Remove gds * Add torch legacy save * Update to new cli * Add function signatures Add file_offset arg to read/write apis * Remove redundant asserts * Add DeepSpeedFileWriter * Add mock and python file writers * Format fixes * More perf counters * Fix pinned_offset bug; Show as not real python file object * Buffer copy speed * Add torch_fastio option * Format fixes * Measure torch_fastio perf * Force flush * Formatting * Renamings * Fix device bug * Disable torch.distributed requirement * Renaming * Integrate fast model checkpointing * Double I/O buffer optimization * Support larger sizes * Refactoring; save_storage api * Cast to byte tensor * Handle storage object saves * Remove mysterious import * Api to 
save storage object list; refactor stats * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation * Yangli2/fastio double buffer pytorch optimized (#291) * Double I/O buffer optimization * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * making deepspeed/runtime/fp16/loss_scaler/dynamiclossscale serializable * Dump fast_writer stats only on rank 0 * Configuration option for fused fp16 optimizer * Update to new API * Format fixes * Support torch* optimization for version 1.12 * Formatting * Versioned torch* optimization * fp16 fused mode * Support serialization versions * torch serialization options * Distributed writer slicing on byte boundary * Fix typo * FastFileWriter Config; Parallel writer nodes * Minor fix * remove warning from fast-io-ckpt (#354) * Relocate debug print * Report write speed * DP writing * DP MoE checkpoints Generalize DP dense checkpoints for socket/machine options Co-authored-by: jerryyangli <jerryyangli@gmail.com> Co-authored-by: Yang Li <yangli2@microsoft.com> Co-authored-by: Guanhua Wang <alexwgh333@gmail.com> * Decoupled checkpointing * New MP slicing algorithm * Format fixes * Decoupled checkpointing support (#384) * Integrate NVIDIA GPUDirect Storage into nvme library * 1) Remove debug prints 2) Create write file with random data 3) Delete target file before new writes * Workaround gds perf issue by leaking buffers * DGX2 mount/unmount utililties * Formatting * Add torch save/load * Add torch save/load * Remove gds * Add torch legacy save * Update to new cli * Add function signatures Add file_offset arg to read/write apis * Remove redundant asserts * Add DeepSpeedFileWriter * Add mock and python file writers * Format fixes * More perf counters * Fix 
pinned_offset bug; Show as not real python file object * Buffer copy speed * Add torch_fastio option * Format fixes * Measure torch_fastio perf * Force flush * Formatting * Renamings * Fix device bug * Disable torch.distributed requirement * Renaming * Integrate fast model checkpointing * Double I/O buffer optimization * Support larger sizes * Refactoring; save_storage api * Cast to byte tensor * Handle storage object saves * Remove mysterious import * Api to save storage object list; refactor stats * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation * Yangli2/fastio double buffer pytorch optimized (#291) * Double I/O buffer optimization * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * making deepspeed/runtime/fp16/loss_scaler/dynamiclossscale serializable * Dump fast_writer stats only on rank 0 * Configuration option for fused fp16 optimizer * Update to new API * Format fixes * Support torch* optimization for version 1.12 * Formatting * Versioned torch* optimization * fp16 fused mode * Support serialization versions * torch serialization options * Distributed writer slicing on byte boundary * Fix typo * FastFileWriter Config; Parallel writer nodes * Minor fix * remove warning from fast-io-ckpt (#354) * Relocate debug print * Report write speed * DP writing * DP MoE checkpoints Generalize DP dense checkpoints for socket/machine options * Decoupled checkpointing * New MP slicing algorithm * Format fixes Co-authored-by: jerryyangli <jerryyangli@gmail.com> Co-authored-by: Yang Li <yangli2@microsoft.com> Co-authored-by: Guanhua Wang <alexwgh333@gmail.com> * add io multiplier for larger scale simulation (#411) * add io multiplier config for simulation * remove prints and test 
correctness * format * Merge with master * Format fixes * Guanhua/fast io clean v5 (#435) * Add environment variable to make nvcc compilation more verbose (#2759) * Bing/formatting correction (#2764) * modify engine.py for formatting * commit formatting changes on engine.py * Add links to new azureML examples (#2756) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * Fix hardcoded instances to fp16 in optimizer creation log messages to the correct dtype. (#2743) * Remove hardcoded instances to fp16 in log messages. * Add model_dtype to print the correct format * Respond to PR feedback --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Refactor/Pydantify monitoring config (#2640) * pydantify monitoring configs --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Pin minimum `packaging` requirement (#2771) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * Fix for diffusers v0.12.0 (#2753) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * update copy right in aio * type fix in ds_py_aio_handle * update year in aio/py_test * fix description in util pybind * update and remove prints in fast_file_writer * remove del print * remove dist barrier in engine.py * update year in runtime/model_ckpt * add todo in runtime/model_ckpt/util.py * update year * reverse pip3 * update opbuilder * format * modify print for python * fix print capability * fix print * some fix in flops_profiler (#2068) * bugs in profiler: 1. Tensor.bmm missed in _patch_tensor_methods function 2. missed funtions in _reload_functionals and _reload_tensor_methods functions 3. torch.mm and torch.Tensor.mm will have same __name__ in wrapFunc, my suggustion is use __str__ instead. 
* formatting --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Cheng Li <pistasable@gmail.com> * fix upsample flops compute by skipping unused kargs (#2773) * fix upsample flops compute by skipping unused kargs * fix format * format * Fix broken kernel inject bug (#2776) * format * remove zero change * fix engine issue --------- Co-authored-by: Connor Holmes <connorholmes@microsoft.com> Co-authored-by: Bing Xie <67908712+xiexbing@users.noreply.github.com> Co-authored-by: cassieesvelt <73311224+cassieesvelt@users.noreply.github.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: swli <47371259+lucasleesw@users.noreply.github.com> Co-authored-by: Cheng Li <pistasable@gmail.com> Co-authored-by: Molly Smith <112220543+molly-smith@users.noreply.github.com> * Formatting * Formatting * Debug file delete slowdown * Investigate write perf * Investigate write perf * Fix mising args * Fix microbenchmark and unit tests (#450) * Debug file delete slowdown * Investigate write perf * Investigate write perf * Fix mising args * Formatting * Rebase attempts * updates for running with newest dependencies * Pydantic fixes * Rebase fixes * Fix rebase bugs * Add DS utils for tensor casting * Fomat fixes * Fix GDS * Update with io_engine API * Continued rebase * Integrate GDS into writer factory * Add --venv_script option * Formatting fix Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com> --------- Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: jerryyangli <jerryyangli@gmail.com> Co-authored-by: Yang Li <yangli2@microsoft.com> Co-authored-by: Guanhua Wang <alexwgh333@gmail.com> Co-authored-by: Connor Holmes <connorholmes@microsoft.com> Co-authored-by: Bing Xie 
<67908712+xiexbing@users.noreply.github.com> Co-authored-by: cassieesvelt <73311224+cassieesvelt@users.noreply.github.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: swli <47371259+lucasleesw@users.noreply.github.com> Co-authored-by: Cheng Li <pistasable@gmail.com> Co-authored-by: Molly Smith <112220543+molly-smith@users.noreply.github.com> Co-authored-by: Ubuntu <jomayeri@microsoft.com>

jren73 and others added 30 commits August 4, 2020 01:00

cpu-offload

5f63fc8

update

e01238b

update

73b956b

deleted: deepspeed/pt/deepspeed_zero_optimizer_cpuoffload.py

98deb70

modified: deepspeed/pt/fp16_unfused_optimizer.py new file: install_output.txt modified: tests/unit/test_dynamic_loss_scale.py

modified: deepspeed/pt/deepspeed_zero_optimizer.py

e3b2a42

update

f832a2e

modified: deepspeed/pt/deepspeed_cpu_adam.py

004884b

modified: deepspeed/pt/deepspeed_zero_optimizer.py modified: tests/unit/test_checkpointing.py modified: tests/unit/test_fp16.py

deleted: install_output.txt

0effd77

modified: deepspeed/pt/fp16_unfused_optimizer.py

af3b834

modified: tests/unit/test_dynamic_loss_scale.py

Merge pull request #2 from jren73/ZeRO-2-cpu_offload

e2d936d

ZeRO-2 CPU offload update

modified: deepspeed/pt/deepspeed_cpu_adam.py

ef5c785

Merge pull request #3 from jren73/ZeRO-2-cpu_offload

7f0a856

modified: deepspeed/pt/deepspeed_cpu_adam.py

modified: deepspeed/pt/deepspeed_zero_optimizer.py

6e45e8b

Merge pull request #4 from jren73/ZeRO-2-cpu_offload

e930604

ZeRO-2 CPU_offload

Merge branch 'master' into master

d2cc800

modified: deepspeed/pt/deepspeed_cpu_adam.py

f8812b9

modified: deepspeed/pt/deepspeed_zero_optimizer.py

Merge pull request #5 from jren73/ZeRO-2-cpu_offload

d1a435c

CPU offload

deleted: deepspeed_cpu_adam.py

6415738

modified: deepspeed_light.py modified: deepspeed_zero_optimizer.py ../../deepspeed_zero_optimizer_cpu_offload.py

Merge pull request #6 from jren73/ZeRO-2-cpu_offload

7eb6041

deleted: deepspeed_cpu_adam.py

modified: deepspeed/pt/deepspeed_light.py

fbd79c6

Merge pull request #7 from jren73/ZeRO-2-cpu_offload

f1a180f

modified: deepspeed/pt/deepspeed_light.py

modified: deepspeed/pt/deepspeed_light.py

5181c60

modified: deepspeed/pt/deepspeed_zero_optimizer.py modified: deepspeed/pt/deepspeed_zero_utils.py modified: tests/unit/test_fp16.py

Merge pull request #8 from jren73/ZeRO-2-cpu_offload

a2b7433

modified: deepspeed/pt/deepspeed_light.py

modified: deepspeed/pt/deepspeed_config.py

41f18d1

modified: deepspeed/pt/deepspeed_light.py modified: deepspeed/pt/deepspeed_zero_optimizer.py modified: tests/unit/test_checkpointing.py modified: tests/unit/test_fp16.py

Merge pull request #9 from jren73/ZeRO-2-cpu_offload

3835b22

modified: deepspeed/pt/deepspeed_config.py

modified: deepspeed/pt/deepspeed_checkpointing.py

ffad985

Merge pull request #12 from jren73/ZeRO-2-cpu_offload

1e46c91

modified: deepspeed/pt/deepspeed_checkpointing.py

update DSE to ZeRO-Offload commit

646f709

offload bug fix but slow optimizer step

fcafe59

samyam and others added 4 commits September 2, 2020 23:51

Seems to be working and performant for grad_acc=1

b743bc1

modified: deepspeed/pt/deepspeed_zero_optimizer.py

7a6ddd9

reset cpu buffers only when cpu_offload flag is turned on in ZeRO-Stage2

93a331a

jren73 requested review from RezaYazdaniAminabadi, ShadenSmith, arashashari, awan-10, cli99, conglongli, eltonzheng, jeffra, minjiaz, niumanar, samyam and tjruwase as code owners September 3, 2020 18:05

tjruwase closed this Sep 3, 2020
