Hi, all
#23917 introduced nvidia-docker, which provides a convenient way to use the Nvidia GPU accelerator in containers.
To support more hardware accelerators (e.g. AMD GPU, FPGA, QAT) in containers,
we propose a common way to provide this feature through docker plugins.
Abstract
Accelerators can be considered a special kind of resource, like cpu and memory,
though they may be backed by other basic resources (e.g.,
volumes and devices in nvidia-docker).
The implementation of an accelerator should be transparent to containers:
a container should not need to know which devices or runtime libraries are required,
or whether they match.
It is the accelerator driver's responsibility to handle all these things.
An accelerator request is made up of accelerator runtime descriptions, and descriptions only.
A docker image does not need to bundle accelerator runtime libraries,
so it is not bound to any specific device.
Accelerator requests can be expressed as image labels or docker run arguments.
Accelerator support should be extensible through the docker plugin mechanism.
Vendors can build their own accelerator plugins to express:
- which accelerator runtimes (e.g. cuda, opencl, rsa) it supports
- how to allocate/release accelerator resources
- how to prepare the required environment, e.g. devices, libs, etc.
- how to reset/reuse accelerator resources
- how to collect or preprocess status data
- ...
When the docker engine receives a request to run a container with an accelerator, it will:
- select the right accelerator plugin
- allocate accelerator resources from the plugin
- update the container configuration according to information provided by the plugin
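The three steps above can be sketched as follows. This is a minimal illustration of the proposed engine-side flow, not a real Docker API; the driver class, its methods, and the config field names are all assumptions.

```python
# Hypothetical sketch of the engine-side flow: select a plugin, allocate
# resources from it, and fold the result into the container configuration.

class FakeCudaDriver:
    """Stands in for an accelerator driver plugin (illustrative only)."""
    def query(self, runtime):
        # Report whether this driver supports the requested runtime.
        return runtime.startswith("cuda")

    def allocate(self, runtime):
        # Return the resources the container needs for this accelerator.
        return {"devices": ["/dev/nvidia0"],
                "volumes": ["nvidia_driver:/usr/local/nvidia:ro"],
                "env": {"CUDA_VERSION": runtime.split(":")[1]}}

def run_with_accel(drivers, runtime, container_config):
    # 1. select the right accelerator plugin
    driver = next(d for d in drivers if d.query(runtime))
    # 2. allocate accelerator resources from the plugin
    accel = driver.allocate(runtime)
    # 3. update the container configuration with the plugin's answer
    container_config["Devices"] += accel["devices"]
    container_config["Binds"] += accel["volumes"]
    container_config["Env"].update(accel["env"])
    return container_config

config = run_with_accel([FakeCudaDriver()], "cuda:8.0",
                        {"Devices": [], "Binds": [], "Env": {}})
```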
The lifecycle of an accelerator is hard to determine:
some may die with the container while others may survive (being used by other containers),
and accelerator sharing may be needed,
so a docker subcommand may be necessary to maintain accelerator status.
How to request accelerators in a container
Apps may request accelerators through a runtime description.
An app is implemented against a specific runtime like 'cuda:8.0' or 'opencl',
and that runtime description is the accelerator request.
It can be expressed in one of the following ways:
-
docker image label
Runtime requests are determined when the docker image is built,
so users can store this information in the image through a label:
LABEL runtime="gpu0=cuda:8.0;gpu1=opencl:2.2"    # means nothing, just for example
LABEL runtime="fpga0=com.company/fpga/sha256:1.0"
In the first example, the runtime label requests two accelerators:
one with cuda:8.0 and the other with opencl:2.2.
The second example will allocate one fpga accelerator that supports sha256:1.0
from com.company.
-
docker run arguments
Users can also specify accelerators when creating/running a container, like this:
# docker create --accel cuda:8.0 nvidia/digits:4.0
# docker create --accel fpga0=com.company/fpga/compress:1.0 --accel fpga1=com.company/fpga/decomp:1.0 XXXX
or overwrite an accelerator runtime LABEL defined in the image:
# docker create --accel gpu0=cuda:8.0 nvidia/digits:4.0
Accelerators explicitly allocated by --accel options are persistent,
which means they are reserved for the container until the container is removed.
Implicit accelerators defined by an image LABEL are non-persistent,
and are only reserved for the container while it is running.
This distinction prevents wasting resources on unused, stopped containers.
If accelerators should be reserved even while a container is stopped,
they should be declared as persistent using the --accel option.
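Both the LABEL value and the --accel options use the same `name=runtime` syntax, with `;` separating entries in a label. A minimal parser for that syntax might look like the sketch below; it is purely illustrative of the proposed format, and the auto-generated names for bare runtimes (as in `--accel cuda:8.0`) are an assumption.

```python
# Parse the proposed request syntax, e.g. "gpu0=cuda:8.0;gpu1=opencl:2.2".

def parse_accel_requests(spec):
    """Parse 'name=runtime' pairs separated by ';' into a dict.
    A bare runtime (no 'name=') gets an auto-generated name."""
    requests = {}
    for i, item in enumerate(s for s in spec.split(";") if s):
        if "=" in item:
            name, runtime = item.split("=", 1)
        else:
            # Hypothetical naming scheme for anonymous requests.
            name, runtime = "accel%d" % i, item
        requests[name] = runtime
    return requests
```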
How to parse requests into accelerators
Docker engine can query all the accelerator driver plugins with the requested runtime description.
Drivers scan its capacity(all runtimes it supports), and check whether it can provide specific runtimes or not.
Then engine will request driver to allocate accelerators.
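The capacity check could work along these lines. The capacity format (a list of "runtime:version" strings) and the rule that a versionless request matches any version are assumptions made for illustration.

```python
# Sketch of driver-side capacity matching against a requested runtime.

def driver_supports(capacity, requested):
    """Check whether a requested runtime appears in the driver's capacity.
    A request without a version matches any version of that runtime."""
    req_name, _, req_ver = requested.partition(":")
    for cap in capacity:
        cap_name, _, cap_ver = cap.partition(":")
        if cap_name == req_name and (not req_ver or cap_ver == req_ver):
            return True
    return False
```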
Accelerator abstraction in docker
An accelerator in a container has three attributes:
volume (to provide runtime libraries), device, and env (maybe not necessary, but it provides flexibility).
Drivers provide this information about an accelerator to the docker engine,
and the container configuration for volumes, devices and env is updated accordingly.
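One way to model this three-attribute abstraction is a small record type; the class and field names here are illustrative, not part of the proposal.

```python
# Hypothetical model of the volume/device/env triple a driver hands back.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class AccelResources:
    volumes: List[str] = field(default_factory=list)   # runtime libraries
    devices: List[str] = field(default_factory=list)   # device nodes
    env: Dict[str, str] = field(default_factory=dict)  # extra flexibility

res = AccelResources(devices=["/dev/nvidia0"])
```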
Interfaces provided by accelerator (driver) plugin
The accelerator drivers need to support basic accelerator operations, including:
- Query          // is the runtime supported by this driver?
- AllocateAccel  // create a new accelerator; corresponds to docker create
- PrepareAccel   // get the accelerator ready (e.g. program a bitstream into fpga), returning volume, device and env; corresponds to docker start
- ResetAccel     // reset the accelerator; corresponds to docker stop
- ReleaseAccel   // remove the accelerator; corresponds to docker rm
Accelerator drivers should implement all these functions to fulfil accelerator
management across the container lifecycle.
The docker engine will call these functions at the right points to use accelerators
in a container.
These functions are easy to expose through docker plugins.
An accelerator vendor can provide support for its device by implementing
all the accelerator plugin interfaces.
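As a sketch, the interface a vendor would implement could look like this. The method names follow the proposal; the signatures, return types, and the dummy fpga driver are assumptions for illustration.

```python
# Hypothetical rendering of the proposed driver-plugin interface.
import abc

class AccelDriver(abc.ABC):
    @abc.abstractmethod
    def Query(self, runtime):
        """Is this runtime supported by the driver?"""

    @abc.abstractmethod
    def AllocateAccel(self, runtime):
        """Create a new accelerator (docker create); returns an accelerator id."""

    @abc.abstractmethod
    def PrepareAccel(self, accel_id):
        """Get the accelerator ready, e.g. program a bitstream into fpga
        (docker start); returns volume, device and env."""

    @abc.abstractmethod
    def ResetAccel(self, accel_id):
        """Reset the accelerator (docker stop)."""

    @abc.abstractmethod
    def ReleaseAccel(self, accel_id):
        """Remove the accelerator (docker rm)."""

class DummyFpgaDriver(AccelDriver):
    """Minimal stub showing how a vendor might fill in the interface."""
    def Query(self, runtime):
        return runtime.startswith("com.company/fpga/")
    def AllocateAccel(self, runtime):
        return "fpga0"
    def PrepareAccel(self, accel_id):
        return {"volumes": [], "devices": ["/dev/fpga0"], "env": {}}
    def ResetAccel(self, accel_id):
        pass
    def ReleaseAccel(self, accel_id):
        pass
```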
Accelerator management in docker
As mentioned above, we may need an accelerator subcommand to maintain
accelerator status and to create/list/remove accelerators,
just like network and volume (not sure if this is really necessary).
ping @justincormack @flx42 @3XX0 , please take a look, thanks 😃
cc @forever043