I am a Senior Researcher at Microsoft Azure OpenAI, where I focus on building high-performance inference infrastructure for large-scale foundation models. My work involves optimizing inference engines, implementing advanced distributed inference logic, and supporting next-generation flagship models on both NVIDIA and AMD hardware.
Previously, I was a core member of the DeepSpeed team, contributing to industry-leading scalable training and inference systems including ZeRO++, ZeRO-3, and DeepSpeed-FastGen. I earned my Ph.D. from the University of Nevada, Reno in 2022. My career is dedicated to bridging the gap between massive model scale and hardware efficiency.
Developing next-generation inference infrastructure for flagship foundation models, including the GPT-4 and GPT-5 series.
ZeRO++: Extremely Efficient Large Scale Training Based on ZeRO Optimizer
ZeRO++ is a set of communication optimization strategies built on top of DeepSpeed ZeRO-3. It combines quantized weight communication, hierarchical weight partitioning, and quantized gradient communication to attack the communication bottlenecks of large-scale LLM training. This work enables efficient training of trillion-parameter models even on clusters with limited cross-node bandwidth.
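As a minimal sketch of how these strategies are exposed in practice, the flags below follow the public DeepSpeed ZeRO++ tutorial; the values are illustrative examples, not tuned recommendations, and the hpZ partition size is typically set to the number of GPUs per node:

```python
# Illustrative DeepSpeed configuration enabling the three ZeRO++ strategies
# on top of ZeRO-3 (example values only).
ds_config = {
    "train_batch_size": 1024,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                        # ZeRO-3: partition params, grads, optimizer states
        "zero_quantized_weights": True,    # qwZ: quantized weight all-gather
        "zero_hpz_partition_size": 16,     # hpZ: secondary weight partition, e.g. GPUs per node
        "zero_quantized_gradients": True,  # qgZ: quantized gradient reduce-scatter
    },
}

# Passed to deepspeed.initialize(...) in a normal training script:
# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```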
SimiGrad: Fine-Grained Adaptive Batching for Large Scale Training
Large-batch data parallelism is key to large-scale training, but overly large batches often hurt generalization. We propose SimiGrad, a fully automated and lightweight adaptive batching method that uses the similarity between gradients computed on different micro-batches as a cheap proxy for gradient noise, growing or shrinking the batch size accordingly. With SimiGrad we achieved a record-breaking batch size of 78k in BERT-Large pretraining while maintaining state-of-the-art model performance.
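A minimal PyTorch sketch of the core signal (the helper names, thresholds, and update rule below are hypothetical; the paper derives the precise adaptation criterion): the cosine similarity between the gradients of two halves of a batch estimates how noisy the gradient is, and the batch size adapts to that estimate.

```python
import torch
import torch.nn.functional as F

def gradient_similarity(model, loss_a, loss_b):
    """Cosine similarity between the gradients of two half micro-batches.
    High similarity suggests low gradient noise, so a larger batch is safe."""
    grads_a = torch.autograd.grad(loss_a, model.parameters(), retain_graph=True)
    grads_b = torch.autograd.grad(loss_b, model.parameters())
    flat_a = torch.cat([g.reshape(-1) for g in grads_a])
    flat_b = torch.cat([g.reshape(-1) for g in grads_b])
    return F.cosine_similarity(flat_a, flat_b, dim=0).item()

def adapt_batch_size(batch_size, similarity, target=0.5, factor=1.2,
                     min_bs=256, max_bs=78_000):
    """Hypothetical update rule: grow the batch while gradients agree more
    than the target similarity, shrink it when they disagree."""
    if similarity > target:
        return min(int(batch_size * factor), max_bs)
    return max(int(batch_size / factor), min_bs)
```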
Region Based Reinforcement Learning (RRL) Scheduling for MLaaS
Parallelism settings in Machine Learning as a Service (MLaaS) have a critical impact on serving performance. We propose a region-based reinforcement learning (RRL) approach that converges to near-optimal configurations orders of magnitude faster than traditional RL. We later extended this into RRL Plus, which uses Bayesian optimization to adjust region sizes automatically for optimal serving efficiency.
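The gist can be sketched in a few lines of Python; everything below (the toy cost model, the region-splitting rule, and the sample counts) is illustrative rather than the paper's actual reward design:

```python
import random

def profile(config):
    """Stand-in for profiling one serving configuration (e.g. batch size x
    intra-op parallelism) on the live system; here just a toy cost surface."""
    batch, threads = config
    return batch / threads + 0.05 * threads

def sample(region, n=4):
    """Draw n random configurations from a rectangular region."""
    (b_lo, b_hi), (t_lo, t_hi) = region
    return [(random.randint(b_lo, b_hi), random.randint(t_lo, t_hi))
            for _ in range(n)]

def split(region):
    """Halve a region along its wider dimension."""
    (b_lo, b_hi), (t_lo, t_hi) = region
    if b_hi - b_lo >= t_hi - t_lo:
        mid = (b_lo + b_hi) // 2
        return [((b_lo, mid), (t_lo, t_hi)), ((mid + 1, b_hi), (t_lo, t_hi))]
    mid = (t_lo + t_hi) // 2
    return [((b_lo, b_hi), (t_lo, mid)), ((b_lo, b_hi), (mid + 1, t_hi))]

def region_search(region, rounds=6):
    """Score coarse regions with a handful of probes, then recurse into the
    most promising one -- far fewer profiled configs than point-by-point search."""
    for _ in range(rounds):
        region = min(split(region),
                     key=lambda r: min(profile(c) for c in sample(r)))
    return min(sample(region, n=8), key=profile)

print(region_search(((1, 256), (1, 64))))  # -> a near-optimal (batch, threads)
```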
Heyang Qin*, Guanhua Wang*, Sam Ade Jacobs, Connor Holmes, Samyam Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, Yuxiong He, ZeRO++: Extremely Efficient Collective Communication for Large Model Training, in The Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria, May 2024.
Heyang Qin, Samyam Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, Yuxiong He, SimiGrad: Fine-Grained Adaptive Batching for Large Scale Training using Gradient Similarity Measurement, in Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Virtual, December 2021 (Acceptance rate: 2371/9122 = 26%). [Slides]
Heyang Qin, Syed Zawad, Yanqi Zhou, Sanjay Padhi, Lei Yang, and Feng Yan, Reinforcement Learning Empowered MLaaS Scheduling for Serving Intelligent Internet of Things, IEEE Internet of Things Journal, 2020 (Impact factor: 9.515).
Heyang Qin, Syed Zawad, Yanqi Zhou, Lei Yang, Dongfang Zhao, Feng Yan, Swift Machine Learning Model Serving Scheduling: A Region Based Reinforcement Learning Approach, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2019), Denver, CO, USA, November 2019 (Acceptance rate: 78/344 = 22.7%). [Slides]
GPA: 4.00 | Advisors: Dr. Feng Yan & Dr. Lei Yang
I am an active contributor to the open-source AI community. Outside of research, I enjoy table tennis and volleyball. I also have a strong passion for classical Chinese literature and detective fiction.