Scheduler optimization for AI workloads

**Summary**

Optimize the Arktos scheduler to take advantage of the unique characteristics of AI workloads. The goal is to reduce AI job queuing time and completion time. The work should be delivered as Arktos scheduler plugins.


**Background**

AI workloads are moving to the cloud rapidly as models and datasets keep growing. AI workloads have unique characteristics, e.g. long-running and parallel execution, heterogeneous resource usage, and topology awareness, and the default Kubernetes scheduler cannot meet all of these requirements. Therefore, the Kubernetes scheduler needs to be extended for AI workloads.

In addition, as cloud operation automation advances, AI algorithms are being built into cloud providers' self-driving capabilities. A next-generation scheduler needs to use AI algorithms to learn from past scheduling decisions and continuously optimize resource allocation.

**Requirements**
- Investigate the AI workload features of existing open source schedulers
- Design a learning-based scheduler, i.e. one whose scheduling decisions improve dynamically based on system feedback such as resource utilization. Consider factors specific to AI workloads.
- Design a simulation environment and select machine learning algorithms to train the learning-based scheduler
- Demonstrate a reduction in average AI workload completion time compared to the native Kubernetes scheduler
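
To make the simulation and evaluation requirements above more concrete, here is a minimal sketch of what a training environment could look like. It is only an illustration under stated assumptions: `ToyClusterEnv`, `Job`, the single shared GPU pool, and the reward definition are all hypothetical and are not part of Arktos, Kubernetes, or OpenAI Gym. The class mimics the Gym-style `reset`/`step` interface without importing Gym, and a shortest-job-first heuristic stands in for a learned policy so the comparison against a FIFO baseline can be run end to end.

```python
from dataclasses import dataclass


@dataclass
class Job:
    arrival: int    # tick at which the job enters the queue
    duration: int   # ticks the job runs once started
    gpus: int       # GPUs held while running


class ToyClusterEnv:
    """Hypothetical Gym-style simulator of one shared GPU pool."""

    def __init__(self, jobs, total_gpus=8):
        # Assumption: every job fits in the pool, otherwise it could never run.
        assert all(j.gpus <= total_gpus for j in jobs)
        self.jobs = sorted(jobs, key=lambda j: j.arrival)
        self.total_gpus = total_gpus

    def reset(self):
        self.time = 0
        self.pending = list(self.jobs)   # not yet arrived
        self.queue = []                  # arrived, waiting to start
        self.running = []                # (finish_time, job) pairs
        self.completion_times = []       # finish - arrival for finished jobs
        self._admit()
        return self._observe()

    def _admit(self):
        while self.pending and self.pending[0].arrival <= self.time:
            self.queue.append(self.pending.pop(0))

    def _free_gpus(self):
        return self.total_gpus - sum(job.gpus for _, job in self.running)

    def _observe(self):
        return {"time": self.time, "free_gpus": self._free_gpus(),
                "queue": list(self.queue)}

    def step(self, action):
        """action: index of the queued job to start, or None to wait one tick."""
        reward = 0.0
        can_start = (action is not None
                     and 0 <= action < len(self.queue)
                     and self.queue[action].gpus <= self._free_gpus())
        if can_start:
            job = self.queue.pop(action)
            self.running.append((self.time + job.duration, job))
        else:
            # Waiting (or an invalid action) advances the clock by one tick.
            self.time += 1
            still_running = []
            for finish, job in self.running:
                if finish <= self.time:
                    self.completion_times.append(finish - job.arrival)
                else:
                    still_running.append((finish, job))
            self.running = still_running
            self._admit()
            # Penalize every job that is still waiting or running.
            reward = -float(len(self.queue) + len(self.running))
        done = not (self.pending or self.queue or self.running)
        return self._observe(), reward, done, {}


def fifo_policy(obs):
    # Baseline: only the head of the queue may start.
    queue = obs["queue"]
    return 0 if queue and queue[0].gpus <= obs["free_gpus"] else None


def shortest_first_policy(obs):
    # Stand-in for a learned policy: start the shortest job that fits.
    fits = [i for i, j in enumerate(obs["queue"]) if j.gpus <= obs["free_gpus"]]
    return min(fits, key=lambda i: obs["queue"][i].duration) if fits else None


def mean_completion_time(env, policy):
    # Runs one episode; assumes the environment holds at least one job.
    obs, done = env.reset(), False
    while not done:
        obs, _, done, _ = env.step(policy(obs))
    return sum(env.completion_times) / len(env.completion_times)
```

As a usage example, take three jobs that each need 8 GPUs, all arriving at tick 0 with durations 10, 1, and 1, on an 8-GPU pool. `mean_completion_time` returns 11.0 for `fifo_policy` and 5.0 for `shortest_first_policy`. A real evaluation would replace the heuristic with a trained policy and the toy job list with traces of actual AI workloads.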


**Deliverables**
- Learning-based Kubernetes scheduler extensions
- Scheduler training environment 

**Advisor(s)**: @Fizzbb (Zhaobo Zhang)

**References**

1. [Kubernetes scheduling framework](https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/)
2. [Volcano, Kubernetes native batch system](https://github.com/volcano-sh/volcano)
3. [OpenAI Gym](https://gym.openai.com/)
4. Reinforcement learning for job scheduling 

