🚀 Feature
PowerSGD can potentially be used for gradient compression: https://arxiv.org/abs/1905.13727. Investigate this algorithm in the context of the DDP communication hook.
Motivation
PowerSGD preserves the associativity/linearity of gradient aggregation after compression: the compressed representations of per-worker gradients can be summed directly, so the algorithm can still be implemented efficiently on top of a native communication library like NCCL via allreduce. It compresses each M x N tensor representing a variable into two smaller tensors of sizes M x rank and N x rank for communication, so the compression ratio is roughly (M * N) / ((M + N) * rank). Note that 3D and higher-rank tensors can also be supported; their compression ratio is computed by viewing the higher-rank tensor as a 2D tensor.
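A minimal sketch of one rank-r PowerSGD round may make the linearity point concrete. This is illustrative only, not a proposal for the hook API: the function name and the externally supplied `q` matrix are assumptions, and error feedback and warm start from the paper are omitted. It assumes a 2D gradient and an already-initialized `torch.distributed` process group.

```python
import torch
import torch.distributed as dist

def powersgd_allreduce(grad: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """One rank-r PowerSGD round for a 2D gradient `grad` of shape M x N.

    `q` (shape N x r) must be identical on all workers, e.g. generated from a
    shared random seed or warm-started from the previous iteration. Higher-rank
    tensors would first be viewed as 2D.
    """
    world_size = dist.get_world_size()

    p = grad @ q                # compress: M x r
    # Because p is linear in grad, allreduce-ing p is equivalent to
    # compressing the sum of the per-worker gradients -- this is why
    # PowerSGD composes with NCCL allreduce.
    dist.all_reduce(p)
    p /= world_size
    p, _ = torch.linalg.qr(p)   # orthogonalize P before the second projection

    q_new = grad.t() @ p        # compress: N x r
    dist.all_reduce(q_new)
    q_new /= world_size

    # Decompress: low-rank approximation of the averaged gradient.
    return p @ q_new.t()
```

Only the M x r and N x r matrices travel over the wire, which is where the (M * N) / ((M + N) * rank) compression ratio above comes from.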
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @xush6528