I am a Senior Researcher at Microsoft Azure OpenAI, where I focus on building high-performance inference infrastructure for large-scale foundation models. My work involves optimizing inference engines, implementing advanced distributed inference logic, and supporting next-generation flagship models on both NVIDIA and AMD hardware.
Previously, I was a core member of the DeepSpeed team, contributing to industry-leading scalable training and inference systems including ZeRO++, ZeRO-3, and DeepSpeed-FastGen. I earned my Ph.D. from the University of Nevada, Reno in 2022. My career is dedicated to bridging the gap between massive model scale and hardware efficiency.
Developing next-generation inference infrastructure for flagship foundation models, including the GPT-4 and GPT-5 series.
ZeRO++: Extremely Efficient Large Scale Training Based on ZeRO Optimizer
ZeRO++ is a set of communication optimization strategies built on top of DeepSpeed ZeRO-3. It combines quantized weight communication, hierarchical weight partitioning, and quantized gradient communication to attack the communication bottlenecks of large-scale LLM training. This work enables efficient training of trillion-parameter models even on clusters with limited cross-node bandwidth.
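As a minimal sketch of how these strategies are exposed in practice, the flags below follow the public DeepSpeed ZeRO++ tutorial; the values are illustrative examples, not tuned recommendations, and the hpZ partition size is typically set to the number of GPUs per node:

```python
# Illustrative DeepSpeed configuration enabling the three ZeRO++ strategies
# on top of ZeRO-3 (example values only).
ds_config = {
    "train_batch_size": 1024,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                        # ZeRO-3: partition params, grads, optimizer states
        "zero_quantized_weights": True,    # qwZ: quantized weight all-gather
        "zero_hpz_partition_size": 16,     # hpZ: secondary weight partition, e.g. GPUs per node
        "zero_quantized_gradients": True,  # qgZ: quantized gradient reduce-scatter
    },
}

# Passed to deepspeed.initialize(...) in a normal training script:
# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```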
SimiGrad: Fine-Grained Adaptive Batching for Large Scale Training
Large-batch data parallelism is key to large-scale training, but overly large batches often hurt generalization. We propose SimiGrad, a fully automated and lightweight adaptive batching method that uses the similarity between gradients computed on different micro-batches as a cheap proxy for gradient noise, growing or shrinking the batch size accordingly. With SimiGrad we achieved a record-breaking batch size of 78k in BERT-Large pretraining while maintaining state-of-the-art model performance.
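A minimal PyTorch sketch of the core signal (the helper names, thresholds, and update rule below are hypothetical; the paper derives the precise adaptation criterion): the cosine similarity between the gradients of two halves of a batch estimates how noisy the gradient is, and the batch size adapts to that estimate.

```python
import torch
import torch.nn.functional as F

def gradient_similarity(model, loss_a, loss_b):
    """Cosine similarity between the gradients of two half micro-batches.
    High similarity suggests low gradient noise, so a larger batch is safe."""
    grads_a = torch.autograd.grad(loss_a, model.parameters(), retain_graph=True)
    grads_b = torch.autograd.grad(loss_b, model.parameters())
    flat_a = torch.cat([g.reshape(-1) for g in grads_a])
    flat_b = torch.cat([g.reshape(-1) for g in grads_b])
    return F.cosine_similarity(flat_a, flat_b, dim=0).item()

def adapt_batch_size(batch_size, similarity, target=0.5, factor=1.2,
                     min_bs=256, max_bs=78_000):
    """Hypothetical update rule: grow the batch while gradients agree more
    than the target similarity, shrink it when they disagree."""
    if similarity > target:
        return min(int(batch_size * factor), max_bs)
    return max(int(batch_size / factor), min_bs)
```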
Region Based Reinforcement Learning (RRL) Scheduling for MLaaS
Parallelism settings in Machine Learning as a Service (MLaaS) have a critical impact on serving performance. We propose a region-based reinforcement learning (RRL) approach that converges to near-optimal configurations orders of magnitude faster than traditional RL. We later extended this into RRL Plus, which uses Bayesian optimization to adjust region sizes automatically for optimal serving efficiency.
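The gist can be sketched in a few lines of Python; everything below (the toy cost model, the region-splitting rule, and the sample counts) is illustrative rather than the paper's actual reward design:

```python
import random

def profile(config):
    """Stand-in for profiling one serving configuration (e.g. batch size x
    intra-op parallelism) on the live system; here just a toy cost surface."""
    batch, threads = config
    return batch / threads + 0.05 * threads

def sample(region, n=4):
    """Draw n random configurations from a rectangular region."""
    (b_lo, b_hi), (t_lo, t_hi) = region
    return [(random.randint(b_lo, b_hi), random.randint(t_lo, t_hi))
            for _ in range(n)]

def split(region):
    """Halve a region along its wider dimension."""
    (b_lo, b_hi), (t_lo, t_hi) = region
    if b_hi - b_lo >= t_hi - t_lo:
        mid = (b_lo + b_hi) // 2
        return [((b_lo, mid), (t_lo, t_hi)), ((mid + 1, b_hi), (t_lo, t_hi))]
    mid = (t_lo + t_hi) // 2
    return [((b_lo, b_hi), (t_lo, mid)), ((b_lo, b_hi), (mid + 1, t_hi))]

def region_search(region, rounds=6):
    """Score coarse regions with a handful of probes, then recurse into the
    most promising one -- far fewer profiled configs than point-by-point search."""
    for _ in range(rounds):
        region = min(split(region),
                     key=lambda r: min(profile(c) for c in sample(r)))
    return min(sample(region, n=8), key=profile)

print(region_search(((1, 256), (1, 64))))  # -> a near-optimal (batch, threads)
```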
Heyang Qin*, Guanhua Wang*, Sam Ade Jacobs, Connor Holmes, Samyam Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, Yuxiong He, ZeRO++: Extremely Efficient Collective Communication for Large Model Training, in The Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria, May 2024.
Heyang Qin, Samyam Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, Yuxiong He, SimiGrad: Fine-Grained Adaptive Batching for Large Scale Training using Gradient Similarity Measurement, in Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Virtual, December 2021 (Acceptance rate: 2371/9122 = 26%). [Slides]
Heyang Qin, Syed Zawad, Yanqi Zhou, Sanjay Padhi, Lei Yang, and Feng Yan, Reinforcement Learning Empowered MLaaS Scheduling for Serving Intelligent Internet of Things, IEEE Internet of Things Journal, 2020 (Impact factor: 9.515).
Heyang Qin, Syed Zawad, Yanqi Zhou, Lei Yang, Dongfang Zhao, Feng Yan, Swift Machine Learning Model Serving Scheduling: A Region Based Reinforcement Learning Approach, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2019), Denver, CO, USA, November 2019 (Acceptance rate: 78/344 = 22.7%). [Slides]
GPA: 4.00 | Advisors: Dr. Feng Yan & Dr. Lei Yang
I am an active contributor to the open-source AI community. Outside of research, I enjoy table tennis and volleyball. I also have a strong passion for classical Chinese literature and detective fiction.