Description
Summary
Optimize the Arktos scheduler to exploit the unique characteristics of AI workloads. The goal is to reduce AI job queuing time and completion time. The work should be delivered as Arktos scheduler plugins.
Background
AI workloads are moving to the cloud rapidly as models and datasets keep growing. These workloads have unique characteristics, e.g. long-running and parallel execution, heterogeneous resource usage, and topology awareness, and the Kubernetes scheduler cannot meet all of the resulting requirements. There is therefore a need to extend the Kubernetes scheduler for AI workloads.
In addition, as cloud operation automation advances, AI algorithms are being built in to provide the cloud's self-driving capability. A next-generation scheduler needs to use AI algorithms to learn from past scheduling decisions and continuously optimize resource allocation.
Requirements
- Investigate existing features of current open source schedulers for AI workloads
- Design a learning-based scheduler, i.e. one whose scheduling decisions improve dynamically based on system feedback such as resource utilization. Consider factors specific to AI workloads.
- Design a simulation environment and select machine learning algorithms to train the learning-based scheduler
- Demonstrate a reduction in average AI workload completion time compared to the native Kubernetes scheduler
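To make the simulation and evaluation requirements concrete, here is a minimal sketch of what a training environment could look like. It is not part of Arktos or Kubernetes: `Job`, `ToySchedulingEnv`, and both policies are hypothetical names, the `reset`/`step` shape is only loosely modeled on the OpenAI Gym interface (no Gym import is used), and the cluster model (GPU counts only, all jobs queued at time 0, strict FIFO placement) is a simplifying assumption.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Job:
    gpus: int      # GPUs requested (hypothetical resource model)
    duration: int  # time steps the job runs once placed; assumed >= 1


class ToySchedulingEnv:
    """Hypothetical Gym-style simulator: the action is a node index for the head job."""

    def __init__(self, jobs, node_gpus):
        self.jobs = list(jobs)
        self.node_gpus = list(node_gpus)

    def reset(self):
        self.time = 0
        self.free = list(self.node_gpus)  # free GPUs per node
        self.queue = list(self.jobs)      # FIFO queue, all jobs arrive at t=0
        self.running = []                 # min-heap of (finish_time, node, gpus)
        self.completions = []             # completion time of each placed job
        return self._observe()

    def _observe(self):
        head = self.queue[0].gpus if self.queue else 0
        return tuple(self.free), head

    def _release_finished(self):
        # Return GPUs of every job that has finished by the current time.
        while self.running and self.running[0][0] <= self.time:
            _, node, gpus = heapq.heappop(self.running)
            self.free[node] += gpus

    def step(self, node):
        job = self.queue[0]
        if job.gpus > self.node_gpus[node]:
            raise ValueError("job can never fit on the chosen node")
        # If the chosen node is full, jump to the next completion and retry.
        while self.free[node] < job.gpus:
            self.time = self.running[0][0]
            self._release_finished()
        self.queue.pop(0)
        self.free[node] -= job.gpus
        finish = self.time + job.duration
        heapq.heappush(self.running, (finish, node, job.gpus))
        self.completions.append(finish)
        reward = -float(finish)  # feedback signal: earlier completion is better
        return self._observe(), reward, not self.queue, {}


def always_first_node(obs):
    """Naive baseline policy: always pick node 0."""
    return 0


def most_free_node(obs):
    """Utilization-aware policy: pick the node with the most free GPUs."""
    free, _ = obs
    return max(range(len(free)), key=lambda i: free[i])


def average_completion_time(env, policy):
    obs, done = env.reset(), False
    while not done:
        obs, _, done, _ = env.step(policy(obs))
    return sum(env.completions) / len(env.completions)


jobs = [Job(gpus=2, duration=4) for _ in range(4)]
env = ToySchedulingEnv(jobs, node_gpus=[2, 2])
print("always node 0:", average_completion_time(env, always_first_node))
print("most free GPUs:", average_completion_time(env, most_free_node))
```

In this toy setup the utilization-aware policy finishes the four jobs with a lower average completion time than the naive baseline, which is the kind of comparison the last requirement asks for. A learned policy would replace the hand-written functions and be trained on the reward returned by `step`.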
Deliverables
- Learning-based Kubernetes scheduler extensions
- Scheduler training environment
Advisor(s): @Fizzbb (Zhaobo Zhang)
Reference
- Kubernetes scheduling framework
- Volcano, Kubernetes native batch system
- OpenAI Gym
- Reinforcement learning for job scheduling