Description
We'd like to try to retrofit the current pod/container resource assignment functionality of CRI Resource Manager (CRI-RM) as an NRI plugin. Our goals are
- primary: implement a reasonable default hardware-topology aware assignment policy, and
- secondary: provide a way for plugging in special application-/vertical-specific policies
The resources of interest are
- the tangible HW resources the kernel lets us arbitrate
  - native/compute resources (CPU, memory, huge pages)
  - devices
  - LLC cache and memory bandwidth
- a few other things the kernel lets us control, for instance
  - block I/O throttling
  - RT-scheduling/arbitration of time slices set aside for RT processes
To get our full policy scope working we'd need a way to
- track information about running pods and containers
- tap into container creation/deletion/update requests and let the runtime know about resource allocation decisions
- have an assortment of pre- and post-variants/hooks related to container life-cycle events (create/delete/start/stop)
- make changes to existing/running containers, not just the ones being created
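The requirements above can be sketched as a small Go interface plus a tracking cache. This is only an illustration under stated assumptions: `Event`, `Container`, `Update`, `Policy`, and `Tracker` are hypothetical names invented here, not part of NRI or CRI-RM.

```go
package main

import (
	"fmt"
	"sort"
)

// Event names a lifecycle point a policy wants to hook (hypothetical).
type Event string

const (
	PreCreate  Event = "pre-create"
	PreStart   Event = "pre-start"
	PostStop   Event = "post-stop"
	PostDelete Event = "post-delete"
)

// Container is the minimal per-container state tracked in this sketch.
type Container struct {
	ID, PodID string
	CPUs      string // cpuset.cpus, for example "0-3"
}

// Update asks the runtime to change an existing container, which may be
// a container other than the one that triggered the event.
type Update struct {
	ContainerID string
	CPUs        string
}

// Policy is a hypothetical plugin interface: it sees the event, the
// container involved, and the other containers tracked in the same pod.
type Policy interface {
	Handle(ev Event, c *Container, peers []*Container) ([]Update, error)
}

// Tracker keeps the container cache current and dispatches events.
type Tracker struct {
	containers map[string]*Container
	policy     Policy
}

func NewTracker(p Policy) *Tracker {
	return &Tracker{containers: map[string]*Container{}, policy: p}
}

// PodContainers returns the tracked containers of a pod, sorted by ID.
func (t *Tracker) PodContainers(podID string) []*Container {
	var out []*Container
	for _, c := range t.containers {
		if c.PodID == podID {
			out = append(out, c)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	return out
}

// Dispatch updates the cache, then lets the policy decide.
func (t *Tracker) Dispatch(ev Event, c *Container) ([]Update, error) {
	switch ev {
	case PreCreate:
		t.containers[c.ID] = c
	case PostDelete:
		delete(t.containers, c.ID)
	}
	return t.policy.Handle(ev, c, t.PodContainers(c.PodID))
}

// logPolicy is a placeholder policy that only reports what it sees.
type logPolicy struct{}

func (logPolicy) Handle(ev Event, c *Container, peers []*Container) ([]Update, error) {
	fmt.Printf("%s %s (pod %s, %d tracked in pod)\n", ev, c.ID, c.PodID, len(peers))
	return nil, nil
}

func main() {
	t := NewTracker(logPolicy{})
	t.Dispatch(PreCreate, &Container{ID: "c1", PodID: "p1"})
	t.Dispatch(PreCreate, &Container{ID: "c2", PodID: "p1"})
	t.Dispatch(PostDelete, &Container{ID: "c1", PodID: "p1"})
}
```

Returning a list of `Update` values from every hook is what would let a policy touch running containers, not just the one named in the request.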
These bits of functionality are necessary for the following reasons.
- We need to track what resources are allocated to containers vs. what is free. Likewise, we need to be able to figure out the relationship between pods and containers, so things like intra-pod affinity can be implemented, where containers within a pod are put close to each other in the hardware topology sense.
- Resources are assigned to containers in connection with certain lifecycle events. A resource allocation policy plugin needs to tap into these events, make decisions, and let the rest of the container runtime know about/apply the decisions.
- Enforcing some of the policy decisions requires more accurate alignment with the lifecycle of the container than the CRI requests can provide. For instance, assigning a container to an LLC cache clos/class happens by writing the container's process pids to a special pseudo-filesystem entry. Therefore enforcing a container's LLC class is best done in connection with the CRI start request, once a process to run the container command has been forked but before it has actually exec'd the eventual container command. Tapping into the related basic CRI requests does not provide the necessary resolution for achieving this.
- Sometimes even simple resource policy decisions related to processing a CRI request have further resource-related consequences for containers other than the one directly involved in the CRI request. For instance, when a container running on a set of exclusively allocated CPUs is deleted, all containers without exclusive CPU allocations and running in the same 'HW topology CPU pool' should be updated to allow them to run on the newly freed cpuset.
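The last point, where deleting one container triggers updates to others, can be illustrated with a minimal Go sketch. The `Pool` type and its fields are hypothetical and greatly simplified. A real policy would also account for topology, memory nodes, and so on.

```go
package main

import (
	"fmt"
	"sort"
)

// Pool models one 'HW topology CPU pool' (hypothetical, simplified).
type Pool struct {
	Shared     map[int]bool     // CPUs usable by non-exclusive containers
	Exclusive  map[string][]int // container ID -> exclusively owned CPUs
	SharedCtrs map[string]bool  // containers running on the shared set
}

// Release returns the exclusive CPUs of a deleted container to the
// shared set and reports the new cpuset for every shared container.
// These are the "unsolicited" updates the policy must push out.
func (p *Pool) Release(id string) map[string][]int {
	for _, cpu := range p.Exclusive[id] {
		p.Shared[cpu] = true
	}
	delete(p.Exclusive, id)

	// Build the new shared cpuset in ascending order.
	cpus := make([]int, 0, len(p.Shared))
	for cpu := range p.Shared {
		cpus = append(cpus, cpu)
	}
	sort.Ints(cpus)

	// Every shared container gets the same widened cpuset.
	updates := make(map[string][]int, len(p.SharedCtrs))
	for ctr := range p.SharedCtrs {
		updates[ctr] = cpus
	}
	return updates
}

func main() {
	pool := &Pool{
		Shared:     map[int]bool{0: true, 1: true},
		Exclusive:  map[string][]int{"c1": {2, 3}},
		SharedCtrs: map[string]bool{"c2": true, "c3": true},
	}
	// Deleting c1 frees CPUs 2 and 3 for c2 and c3.
	for ctr, cpus := range pool.Release("c1") {
		fmt.Println(ctr, cpus)
	}
}
```

Here a single delete request for `c1` yields two updates for containers the request never mentioned, which is why a plugin needs a way to update containers beyond the one being operated on.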
In its current incarnation, CRI-RM sits as a CRI proxy between clients (really only the kubelet) and the runtime. It is non-transparent: according to policy, it may modify key CRI requests related to the container lifecycle (creation, update) before forwarding them, and it may also generate unsolicited requests to update otherwise unrelated containers that a policy decision had a resource-related effect on. After an initial pod and container discovery, CRI-RM keeps its internal cache up to date from the intercepted/modified CRI requests and responses, so its 'NRI-like plugins' (the active policy running inside CRI-RM) have access to nearly all information about pods and containers.
Although the proxy-based setup lets the current implementation modify virtually any aspect of a container, the things we actually do, and therefore currently think NRI plugins should be able to do (for our purposes), are the following:
- alter container resources (cpuset.{cpus,mems}, CFS {shares,quota,period}, memory limit, hugepage limits)
- alter devices
- alter environment variables
- add extra mounts (used for exposing/updating extra, resource-related information to containers)
- probably/maybe alter annotations and labels
All of these except container resources can be altered only during container creation.
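As a rough illustration of that split between creation-time and update-time changes, the list above could be modeled along the following lines. All type and field names here are hypothetical, chosen only to mirror the list. They are not an existing NRI or CRI-RM API.

```go
package main

import (
	"errors"
	"fmt"
)

// Resources mirrors the resource knobs listed above (hypothetical type).
type Resources struct {
	CpusetCpus, CpusetMems string
	CPUShares, CPUPeriod   uint64
	CPUQuota               int64
	MemoryLimit            int64
	HugepageLimits         map[string]uint64 // page size -> limit in bytes
}

// Adjustment collects everything a policy may want to change.
type Adjustment struct {
	Resources   *Resources
	Devices     []string
	Env         map[string]string
	Mounts      []string
	Annotations map[string]string
	Labels      map[string]string
}

// ErrCreateOnly signals a change that is only possible at creation.
var ErrCreateOnly = errors.New("only resources can be altered after creation")

// Validate rejects adjustments that touch anything but resources once
// the container already exists.
func Validate(a Adjustment, creating bool) error {
	if creating {
		return nil
	}
	if len(a.Devices) > 0 || len(a.Env) > 0 || len(a.Mounts) > 0 ||
		len(a.Annotations) > 0 || len(a.Labels) > 0 {
		return ErrCreateOnly
	}
	return nil
}

func main() {
	resOnly := Adjustment{Resources: &Resources{CpusetCpus: "0-3"}}
	withEnv := Adjustment{Env: map[string]string{"POOL": "shared"}}

	fmt.Println(Validate(resOnly, false)) // resource update on a running container
	fmt.Println(Validate(withEnv, true))  // environment change at creation
	fmt.Println(Validate(withEnv, false)) // environment change after creation
}
```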
Currently it is unclear to me how/if the following things could be achieved with the current NRI architecture/infra:
- tracking/querying all pods/sandboxes and containers (other than just the one the current request is directly operating on)
- some of the alterations mentioned above (environment variables, mounts)
- altering resources of containers other than the one being directly operated on by the current request