
AWS ENI support#8347

Merged
tgraf merged 28 commits into master from pr/tgraf/aws-eni-ipam
Jun 26, 2019


@tgraf tgraf commented Jun 19, 2019

This PR adds AWS ENI IP allocation and datapath support. The details can be found in the individual commit messages and in the documentation commit.

Fixes: #6430



@tgraf tgraf added the wip label Jun 19, 2019
@tgraf tgraf requested a review from a team June 19, 2019 08:34
@tgraf tgraf requested a review from a team as a code owner June 19, 2019 08:34
@tgraf tgraf requested a review from a team June 19, 2019 08:34
@tgraf tgraf requested review from a team as code owners June 19, 2019 08:34
@tgraf tgraf requested a review from a team June 19, 2019 08:34
@tgraf tgraf force-pushed the pr/tgraf/aws-eni-ipam branch 3 times, most recently from f18d66e to d8b78f1 Compare June 19, 2019 11:34

coveralls commented Jun 19, 2019

Coverage Status

Coverage increased (+0.2%) to 44.655% when pulling 5a3a9f91daa78cf2bfea3b29f129f8cf40ca8c55 on pr/tgraf/aws-eni-ipam into f4955ec on master.


tgraf commented Jun 19, 2019

test-me-please

00:02:11.510      runtime: go build -ldflags '-X "github.com/cilium/cilium/pkg/version.Version=1.5.90 80ac62534 2019-06-19T11:34:05+00:00 go version go1.12.5 linux/amd64" -X "github.com/cilium/cilium/pkg/envoy.RequiredEnvoyVersionSHA=18ed0eab0eb161b21e25c50b8d360ba6507b9a4b" -X "github.com/cilium/cilium/pkg/datapath/loader.DatapathSHA=" -extldflags -Wl,-soname,libcilium.so.1' -o libcilium.so.1 -buildmode=c-shared
00:32:10.994  Sending interrupt signal to process


tgraf commented Jun 19, 2019

test-me-please

01:12:01.770      k8s2-1.14: Hit:5 http://archive.ubuntu.com/ubuntu bionic-updates InRelease
01:12:01.770      k8s2-1.14: Hit:6 http://archive.ubuntu.com/ubuntu bionic-backports InRelease
01:12:01.770      k8s2-1.14: Hit:7 https://download.docker.com/linux/ubuntu bionic InRelease
01:12:01.770      k8s2-1.14: Hit:1 https://packages.cloud.google.com/apt kubernetes-xenial InRelease
01:12:04.838      k8s2-1.14: Reading package lists...
01:12:05.866      k8s2-1.14: E: Could not get lock /var/lib/dpkg/lock-frontend - open (11: Resource temporarily unavailable)
01:12:05.866      k8s2-1.14: E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
01:12:05.866  The SSH command responded with a non-zero exit status. Vagrant
01:12:05.866  assumes that this means the command failed. The output for this command
01:12:05.866  should be in the log above. Please read the output to determine what
01:12:05.866  went wrong.


tgraf commented Jun 19, 2019

test-me-please

⚠️  Found a "POTENTIAL DEADLOCK:" in logs

Tracked down to option.Config.ConfigPatchMutex being held while compileBase() is being called.

@tgraf tgraf added pending-review release-note/major and removed wip labels Jun 19, 2019
@tgraf tgraf changed the title from "[WIP] AWS ENI support" to "AWS ENI support" Jun 19, 2019
@tgraf tgraf force-pushed the pr/tgraf/aws-eni-ipam branch from d8b78f1 to 8f98f47 Compare June 19, 2019 21:15
@tgraf tgraf requested a review from a team as a code owner June 19, 2019 21:15
@tgraf tgraf force-pushed the pr/tgraf/aws-eni-ipam branch from 8f98f47 to 6f1108c Compare June 19, 2019 23:34

tgraf commented Jun 19, 2019

test-me-please

The following tests failed:

Suite-k8s-1.10.K8sDatapathConfig Encapsulation Check connectivity with transparent encryption and VXLAN encapsulation
Suite-k8s-1.10.K8sDatapathConfig Transparent encryption DirectRouting Check connectivity with automatic direct nodes routes
Suite-k8s-1.14.K8sDatapathConfig Transparent encryption DirectRouting Check connectivity with automatic direct nodes routes

Failing because #8322 is not merged yet


@ianvernon ianvernon left a comment

I think overall, the architecture of this is solid and well-documented. I'd probably have to immerse myself in the code by running it in a cluster and playing around to understand the nitty-gritty details given the size of the PR.

Most of my comments are documentation / clarification about certain behaviors & operations.

Are we marking this as GA for v1.6? Or beta?


tgraf commented Jun 22, 2019

> Are we marking this as GA for v1.6? Or beta?

That question is still open. There is a large demand for this feature and several users are already testing this and planning to use it in critical environments as soon as possible. I think we should make a call on how to label it before we release based on the experiences gained until then.

@tgraf tgraf force-pushed the pr/tgraf/aws-eni-ipam branch 2 times, most recently from 3928c1f to 65a2477 Compare June 24, 2019 14:31

tgraf commented Jun 24, 2019

test-me-please

tgraf added 22 commits June 25, 2019 16:13
Signed-off-by: Thomas Graf <thomas@cilium.io>
We will have code during Daemon initialization rely on being able to access
custom resources.

Signed-off-by: Thomas Graf <thomas@cilium.io>
The following additional information is exposed via the API:
 * IPAM mode
 * Gateway IP (optional)
 * Additional CIDRs to which the IP has access (optional)
 * MAC of master interface (optional)

This is in preparation for the upcoming ENI mode.

Signed-off-by: Thomas Graf <thomas@cilium.io>
Building on the previous commit, this allows an IPAM plugin to provide the
additional allocation information.

Signed-off-by: Thomas Graf <thomas@cilium.io>
This adds an initial, minimal version of a CiliumNode CRD-backed IP allocator.
It is enabled by setting --ipam=crd and will hand out IPs based on the IPs made
available via the custom resource.

An outside operator or user can provide IPs via the map spec.ipam.available.
The agent will use those IPs for allocation as they become available and update
the map status.ipam.used to list all used IPs.

This is a minimal version targeted for use by the operator for ENI support.
Further versions of this could become more sophisticated to also target manual
configuration use cases or integration into other IPAM systems.
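
The allocation flow described above can be illustrated with a minimal Go
sketch. It is not the code from this commit: the type nodeIPAM, the function
allocateNext, and the choice of the lexicographically lowest free IP are
assumptions made for illustration. Only the two map names come from the text.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// nodeIPAM is a hypothetical stand-in for the IPAM part of a CiliumNode
// resource. Only the two maps named in the commit message are modeled.
type nodeIPAM struct {
	Available map[string]struct{} // mirrors spec.ipam.available
	Used      map[string]string   // mirrors status.ipam.used (IP -> owner)
}

var errNoIPs = errors.New("no available IPs")

// allocateNext picks an available IP that is not yet used, records it in
// Used, and returns it. Sorting makes the choice deterministic here; the
// real allocator may pick differently.
func (n *nodeIPAM) allocateNext(owner string) (string, error) {
	candidates := make([]string, 0, len(n.Available))
	for ip := range n.Available {
		if _, inUse := n.Used[ip]; !inUse {
			candidates = append(candidates, ip)
		}
	}
	if len(candidates) == 0 {
		return "", errNoIPs
	}
	sort.Strings(candidates)
	ip := candidates[0]
	n.Used[ip] = owner
	return ip, nil
}

func main() {
	n := &nodeIPAM{
		Available: map[string]struct{}{"10.0.0.5": {}, "10.0.0.6": {}},
		Used:      map[string]string{},
	}
	ip, err := n.allocateNext("default/pod-a")
	fmt.Println(ip, err)
}
```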

Signed-off-by: Thomas Graf <thomas@cilium.io>
The AWS ENI allocator builds on top of the CRD-backed allocator. Each node
creates a ciliumnodes.cilium.io custom resource matching the node name when
Cilium starts up for the first time on that node. It contacts the EC2 metadata
API to retrieve instance ID, instance type, and VPC information and populates
the custom resource with this information. ENI allocation parameters are
provided as agent configuration options and are passed into the custom resource
as well.

The architecture ensures that only a single operator communicates with the EC2
service API to avoid rate-limiting issues in bigger clusters. A pre-allocation
watermark keeps IP addresses available for use on nodes at all times, so the
EC2 API does not need to be contacted when a new pod is scheduled in the
cluster.

The Cilium operator listens for new ciliumnodes.cilium.io custom resources and
starts managing the IPAM aspect automatically. It scans the EC2 instances for
existing ENIs with associated IPs and makes them available via the
spec.ipam.available field. It will then constantly monitor the used IP
addresses in the status.ipam.used field and automatically create ENIs and
allocate more IPs as needed to meet the IP pre-allocation watermark. This
ensures that there are always IPs available.

The selection of subnets to use for allocation as well as attachment of
security groups to new ENIs can be controlled separately for each node. This
makes it possible to hand out pod IPs with differing security groups on
individual nodes.

Configuration
-------------

* The Cilium agent and operator must be run with the option --ipam=eni or the
  option ipam: eni must be set in the ConfigMap. This will enable ENI
  allocation in both the node agent and operator.

Cache of ENIs and Subnets
-------------------------

The operator maintains a list of all EC2 ENIs and subnets associated with the
AWS account in a cache. For this purpose, the operator performs the following
two EC2 API operations:

* DescribeNetworkInterfaces
* DescribeSubnets

The cache is updated once per minute or after an IP allocation or ENI creation
has been performed. When triggered based on an allocation or creation, the
operation is performed at most once every 15 seconds.

Publication of available ENI IPs
--------------------------------

Following the update of the cache, all CiliumNode custom resources representing
nodes are updated to publish any new IPs that have become available.

In this process, all ENIs with an interface index greater than
spec.eni.first-interface-index are scanned for all available IPs. All IPs found
are added to spec.ipam.available. Each ENI meeting this criterion is also added
to status.eni.enis.

If this update caused the custom resource to change, the custom resource is
updated using the Kubernetes API methods Update() and/or UpdateStatus() if
available.

Determination of ENI IP deficits
--------------------------------

The operator constantly monitors all nodes and detects deficits in available
ENI IP addresses. The check to recognize a deficit is performed on two
occasions:

* When a CiliumNode custom resource is updated
* When all nodes are scanned at a regular interval (once per minute)

When determining whether a node has a deficit in IP addresses, the following
calculation is performed:

spec.eni.preallocate - (len(spec.ipam.available) - len(status.ipam.used))

Upon detection of a deficit, the node is added to the list of nodes which
require IP address allocation. When a deficit is detected using the interval
based scan, the allocation order of nodes is determined based on the severity
of the deficit, i.e. the node with the biggest deficit will be at the front of
the allocation queue.

The allocation queue is handled on demand but at most every 5 seconds.
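
As a rough illustration of the deficit calculation and the ordering of the
allocation queue, consider the following Go sketch. The names nodeState,
deficit, and allocationQueue are hypothetical, and clamping a negative result
to zero is an assumption of the sketch rather than something the text states.

```go
package main

import (
	"fmt"
	"sort"
)

// nodeState is a hypothetical summary of one CiliumNode resource.
type nodeState struct {
	Name        string
	PreAllocate int // spec.eni.preallocate
	Available   int // len(spec.ipam.available)
	Used        int // len(status.ipam.used)
}

// deficit applies the formula from the text. Clamping at zero is an
// assumption of this sketch: a surplus is treated as "no deficit".
func deficit(n nodeState) int {
	d := n.PreAllocate - (n.Available - n.Used)
	if d < 0 {
		return 0
	}
	return d
}

// allocationQueue returns only the nodes with a deficit, ordered so that
// the node with the biggest deficit comes first.
func allocationQueue(nodes []nodeState) []nodeState {
	queue := make([]nodeState, 0, len(nodes))
	for _, n := range nodes {
		if deficit(n) > 0 {
			queue = append(queue, n)
		}
	}
	sort.SliceStable(queue, func(i, j int) bool {
		return deficit(queue[i]) > deficit(queue[j])
	})
	return queue
}

func main() {
	nodes := []nodeState{
		{Name: "node-a", PreAllocate: 8, Available: 10, Used: 5},
		{Name: "node-b", PreAllocate: 8, Available: 20, Used: 2},
		{Name: "node-c", PreAllocate: 8, Available: 8, Used: 8},
	}
	for _, n := range allocationQueue(nodes) {
		fmt.Println(n.Name, deficit(n))
	}
}
```

With these sample values, node-c (all 8 available IPs in use) is queued ahead
of node-a (5 of 10 in use), and node-b is not queued at all.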

IP Allocation
-------------

When performing IP allocation for a node with an address deficit, the operator
first looks at the ENIs which are already attached to the instance represented
by the CiliumNode resource. All ENIs with an interface index greater than
CiliumNode.Spec.ENI.FirstInterfaceIndex are considered for use.

The operator will then pick the first already allocated ENI which meets the
following criteria:

* The ENI has associated addresses which are not yet used, or the number of
  addresses associated with the ENI is less than the instance type specific
  limit.

* The subnet associated with the ENI has IPs available for allocation

The following formula is used to determine how many IPs are allocated on the ENI:

min(AvailableOnSubnet, min(AvailableOnENI, NeededAddresses + spec.eni.max-above-watermark))

This means that the number of IPs allocated in a single allocation cycle can be
less than what is required to fulfill spec.eni.preallocate.
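
A small Go sketch of this formula shows how a single cycle can fall short of
the requested amount. The function and parameter names are hypothetical; only
the shape of the calculation is taken from the text.

```go
package main

import "fmt"

// minInt returns the smaller of two ints.
func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// ipsToAllocate mirrors the formula from the text:
// min(AvailableOnSubnet, min(AvailableOnENI, NeededAddresses + max-above-watermark)).
func ipsToAllocate(availableOnSubnet, availableOnENI, needed, maxAboveWatermark int) int {
	return minInt(availableOnSubnet, minInt(availableOnENI, needed+maxAboveWatermark))
}

func main() {
	// The node needs 6 addresses, but the ENI only has room for 4 more,
	// so this cycle allocates fewer addresses than required.
	fmt.Println(ipsToAllocate(100, 4, 6, 2))
	// Plenty of room everywhere: allocate the need plus the extra margin.
	fmt.Println(ipsToAllocate(100, 14, 6, 2))
}
```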

In order to allocate the IPs, the method AssignPrivateIpAddresses of the EC2
service API is called. When no more ENIs are available meeting the above
criteria, a new ENI is created.

ENI Creation
------------

As long as an instance type is capable of allocating additional ENIs, ENIs are
allocated automatically based on demand.

When allocating an ENI, the first operation performed is to identify the best
subnet. This is done by searching through all subnets and finding a subnet that
matches the following criteria:

* The VPC ID of the subnet matches spec.eni.vpc-id
* The Availability Zone of the subnet matches spec.eni.availability-zone
* The subnet contains all tags as specified by spec.eni.subnet-tags

If multiple subnets match, the subnet with the most available addresses is
selected.
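
The selection criteria above can be sketched in Go as follows. The subnet
struct and the functions hasAllTags and bestSubnet are hypothetical and only
model the fields mentioned in the text. They are not the types used by the
operator or by the AWS SDK.

```go
package main

import "fmt"

// subnet is a hypothetical, reduced view of an EC2 subnet.
type subnet struct {
	ID                 string
	VPCID              string
	AvailabilityZone   string
	AvailableAddresses int
	Tags               map[string]string
}

// hasAllTags reports whether every key/value pair in want is present in have.
func hasAllTags(have, want map[string]string) bool {
	for k, v := range want {
		if got, ok := have[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// bestSubnet returns the matching subnet with the most available addresses.
// The boolean is false if no subnet matches all three criteria.
func bestSubnet(subnets []subnet, vpcID, az string, tags map[string]string) (subnet, bool) {
	var best subnet
	found := false
	for _, s := range subnets {
		if s.VPCID != vpcID || s.AvailabilityZone != az || !hasAllTags(s.Tags, tags) {
			continue
		}
		if !found || s.AvailableAddresses > best.AvailableAddresses {
			best, found = s, true
		}
	}
	return best, found
}

func main() {
	subnets := []subnet{
		{ID: "subnet-1", VPCID: "vpc-1", AvailabilityZone: "us-west-2a", AvailableAddresses: 10, Tags: map[string]string{"env": "prod"}},
		{ID: "subnet-2", VPCID: "vpc-1", AvailabilityZone: "us-west-2a", AvailableAddresses: 50, Tags: map[string]string{"env": "prod"}},
		{ID: "subnet-3", VPCID: "vpc-1", AvailabilityZone: "us-west-2b", AvailableAddresses: 90, Tags: map[string]string{"env": "prod"}},
	}
	s, ok := bestSubnet(subnets, "vpc-1", "us-west-2a", map[string]string{"env": "prod"})
	fmt.Println(s.ID, ok)
}
```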

After selecting the subnet, the interface index is determined. For this purpose,
all existing ENIs are scanned and the first unused index greater than
spec.eni.first-interface-index is selected.
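
A minimal sketch of the index search, assuming the text is read literally
("greater than", so the first candidate is first-interface-index + 1). The
function name nextInterfaceIndex is hypothetical, and whether the bound is
inclusive in the actual implementation is not confirmed by this description.

```go
package main

import "fmt"

// nextInterfaceIndex returns the first index greater than firstIndex that is
// not present in usedIndexes. The strict "greater than" follows the wording
// of the commit message and is an assumption of this sketch.
func nextInterfaceIndex(usedIndexes []int, firstIndex int) int {
	inUse := make(map[int]bool, len(usedIndexes))
	for _, i := range usedIndexes {
		inUse[i] = true
	}
	idx := firstIndex + 1
	for inUse[idx] {
		idx++
	}
	return idx
}

func main() {
	// Indexes 0, 1, and 2 are taken; with first-interface-index 0 the next
	// free index above it is 3.
	fmt.Println(nextInterfaceIndex([]int{0, 1, 2}, 0))
	// A gap is reused: with indexes 0, 1, and 3 taken, index 2 is selected.
	fmt.Println(nextInterfaceIndex([]int{0, 1, 3}, 0))
}
```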

After determining the subnet and interface index, the ENI is created and
attached to the EC2 instance using the methods CreateNetworkInterface and
AttachNetworkInterface of the EC2 API.

The security groups attached to the ENI will be equivalent to
spec.eni.security-groups. The description will be in the following format:

"Cilium-CNI (<EC2 instance ID>)"

ENI Deletion Policy
-------------------

ENIs can be marked for deletion when the EC2 instance to which the ENI is
attached is terminated. To enable this, set the option
CiliumNode.Spec.ENI.DeleteOnTermination. If enabled, the ENI is modified after
creation using ModifyNetworkInterface to specify this deletion policy.

Node Termination
----------------

When a node or instance terminates, the Kubernetes apiserver will send a node
deletion event. This event will be picked up by the operator and the operator
will delete the corresponding ciliumnodes.cilium.io custom resource.

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Thomas Graf <thomas@cilium.io>
This adds an option --auto-create-ciliumnode-resource to automatically create a
CiliumNode custom resource on startup with the ENI parameters automatically
derived from the AWS metadata API.

The created custom resource is bound to the lifecycle of the Kubernetes node
resource so it gets automatically deleted when the node is removed.

Signed-off-by: Thomas Graf <thomas@cilium.io>
This adds operator support for ENI allocation. See the previous commits for the
ENI details. The operator will respect the `--ipam=eni` just like the agent. In
general, it makes sense to also enable --enable-metrics to get ENI specific
allocation metrics.

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Thomas Graf <thomas@cilium.io>
With the introduction of the ENI allocation capability, the operator must be
running in host networking mode as otherwise it would depend on its own
allocation capability in order to get scheduled.

Signed-off-by: Thomas Graf <thomas@cilium.io>
Support specification of all ENI parameters in the CNI configuration file,
which is read via the option `--read-cni-conf`.

Also add support for a CNI chaining configuration, in which case the
configuration is automatically derived from the Cilium specific section of the
chaining configuration.

Signed-off-by: Thomas Graf <thomas@cilium.io>
This extends support to specify:
 - Priority
 - Mark mask
 - Source and destination filters

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Thomas Graf <thomas@cilium.io>
The ENI interface is automatically created and detected but needs to be brought
up and the MTU must be specified. The MTU is derived from the primary interface
by the agent and then inherited by all secondary ENIs.

Routing rules are set up to ensure that all egress traffic from an endpoint
uses the ENI which corresponds to the IP address assigned to the endpoint.

Signed-off-by: Thomas Graf <thomas@cilium.io>
This adds ENI support to the masquerading logic. For external SNAT, the
destination address exclusion is derived from the VPC's primary CIDR.

For now, the list of interfaces to masquerade on must be specified with
--egress-masquerade-interfaces with an interface selector. This is required
because filtering all local ENIs with a static rule is not possible yet. This
will be resolved once we switch to BPF based masquerading.

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Thomas Graf <thomas@cilium.io>
Both are needed to perform CRD metric accounting for the IPAM CRD plugin.

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Thomas Graf <thomas@cilium.io>

tgraf commented Jun 26, 2019

CI passed, rebasing to merge.

@tgraf tgraf force-pushed the pr/tgraf/aws-eni-ipam branch from 5a3a9f9 to 7f45853 Compare June 26, 2019 01:17
@tgraf tgraf merged commit efa588e into master Jun 26, 2019
@tgraf tgraf deleted the pr/tgraf/aws-eni-ipam branch June 26, 2019 01:18

Labels

kind/enhancement This would improve or streamline existing functionality. release-note/major This PR introduces major new functionality to Cilium.


8 participants