Scheduler optimization for AI workloads #1088

@Fizzbb

Summary

Optimize the Arktos scheduler to exploit the unique characteristics of AI workloads. The goal is to reduce AI job queuing time and completion time. The work should be delivered as Arktos scheduler plugins.

Background

AI workloads are moving to the cloud rapidly as models and datasets keep growing. These workloads have distinctive characteristics, e.g. long-running and parallel execution, heterogeneous resource usage, and topology awareness, and the default Kubernetes scheduler cannot meet all of the resulting requirements. Therefore, the Kubernetes scheduler needs to be extended for AI workloads.

In addition, as cloud operation automation advances, AI algorithms are being built in to provide the cloud's self-driving capability. A next-generation scheduler needs to use AI algorithms to learn from past scheduling decisions and to continuously optimize resource allocation.

Requirements

  • Investigate the existing AI-workload features of current open-source schedulers
  • Design a learning-based scheduler, i.e. one whose scheduling decisions improve dynamically according to system feedback such as resource utilization. Consider factors specific to AI workloads.
  • Design a simulation environment and select machine learning algorithms to train the learning-based scheduler
  • Demonstrate a reduction in average AI workload completion time compared to the native Kubernetes scheduler
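
As a rough illustration of the simulation-environment and evaluation requirements, here is a minimal sketch of a toy cluster simulator with a Gym-style `reset`/`step` interface. Everything in it is an assumption made for illustration, not part of the proposal: the class `ClusterSimEnv`, the node speed model, the reward (negative completion time), and the two hand-written policies. It uses only the Python standard library rather than OpenAI Gym itself, and a learned policy would plug in as another policy function.

```python
import random


class ClusterSimEnv:
    """Toy cluster simulator with a Gym-style reset/step interface.

    Hypothetical model: every job is queued at time 0, each node runs its
    jobs one after another, and node i finishes a job of length d in
    d / speeds[i] time units.
    """

    def __init__(self, speeds=(1.0, 1.0, 2.0), num_jobs=50, seed=0):
        self.speeds = speeds
        self.num_jobs = num_jobs
        self.seed = seed

    def reset(self):
        rng = random.Random(self.seed)  # fixed seed gives a reproducible job trace
        self.durations = [rng.uniform(1.0, 10.0) for _ in range(self.num_jobs)]
        self.free_at = [0.0] * len(self.speeds)  # time at which each node becomes idle
        self.next_job = 0
        return self._observe()

    def _observe(self):
        # Observation: node idle times plus the length of the job to place.
        return tuple(self.free_at), self.durations[self.next_job]

    def step(self, action):
        """Place the current job on node `action`; reward is -completion time."""
        duration = self.durations[self.next_job]
        finish = self.free_at[action] + duration / self.speeds[action]
        self.free_at[action] = finish
        self.next_job += 1
        done = self.next_job == self.num_jobs
        obs = None if done else self._observe()
        return obs, -finish, done


def run_episode(env, policy):
    """Return the average job completion time achieved by `policy`."""
    obs, done, total = env.reset(), False, 0.0
    while not done:
        obs, reward, done = env.step(policy(env, obs))
        total += -reward
    return total / env.num_jobs


def first_node_policy(env, obs):
    return 0  # naive baseline: always use node 0


def earliest_finish_policy(env, obs):
    free_at, duration = obs
    # Greedy: pick the node on which this job would finish soonest.
    return min(range(len(free_at)),
               key=lambda n: free_at[n] + duration / env.speeds[n])


if __name__ == "__main__":
    env = ClusterSimEnv()
    print("baseline:", run_episode(env, first_node_policy))
    print("greedy:  ", run_episode(env, earliest_finish_policy))
```

Because both policies are replayed on the same seeded job trace, their average completion times are directly comparable, which is the kind of measurement the last requirement asks for. A real evaluation would replace the toy speed model with resource requests, GPU topology, and job arrival times.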

Deliverables

  • Learning-based Kubernetes scheduler extensions
  • Scheduler training environment

Advisor(s): @Fizzbb (Zhaobo Zhang)

Reference

  1. Kubernetes scheduling framework
  2. Volcano, Kubernetes native batch system
  3. OpenAI Gym
  4. Reinforcement learning for job scheduling
