
http: add ip src transparency #5255

Closed

klarose wants to merge 44 commits into envoyproxy:master from klarose:src_transparency

Conversation

@klarose
Contributor

@klarose klarose commented Dec 10, 2018

Description:
This PR contains the bulk of the IP Src Transparency work for #4128. It is missing one piece to tie it together so that it may be enabled. I'll integrate it into this PR, or submit a new PR, when #5035 is merged.

This PR can be viewed as three major chunks of work:

  1. Some test code refactor to ease TDD of this feature. (Mainly splitting mocks out of big headers into their own).
  2. The core of the feature -- the WrappedConnectionPool and the SrcIpTransparentMapper.
  3. Integration of the core into Envoy's codebase.

The PR as it stands is very large. I'm willing to break it up if the reviewers so desire. Just let me know how you'd like me to go about doing it. We'd likely need to do the merges in the order I've provided above, though number 2 won't stand alone as described, since it itself requires a change or two to some public interfaces within envoy.

For the general approach, see Google Docs. While this does not 100% follow that document, the general approach matches.

In a nutshell:

We create a new type of ConnectionPool which wraps the HTTP connection pools. This connection pool uses abstract mapping logic to map a new stream request to one of many connection pools. These pools are told at mapping time how to connect to their upstream, using an interface added to the high-level connection pool interface.

Connection pools are recycled into an "idle" list when all upstream connections on the pool have been drained. When a new stream is assigned to a pool, we try to reuse an idle pool first. If there are none, we allocate a new one, up to a limit. Multiple streams may be assigned to a pool if they are mapped to it concurrently.
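The mapping and recycling just described can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: `SrcIpMapperSketch`, `SubPool`, and the string-keyed map are all invented stand-ins for the WrappedConnectionPool/SrcIpTransparentMapper pair.

```cpp
#include <deque>
#include <map>
#include <memory>
#include <string>

// Invented stand-in for a per-source-IP sub-pool; not Envoy's real type.
struct SubPool {
  std::string assigned_src_ip; // upstream source address this pool is married to
  int active_streams = 0;
};

// Sketch of the mapper described above: reuse a pool already assigned to the
// source IP, otherwise recycle one from the idle list, otherwise allocate a
// new one up to a fixed limit.
class SrcIpMapperSketch {
public:
  explicit SrcIpMapperSketch(size_t max_pools) : max_pools_(max_pools) {}

  SubPool* assignPool(const std::string& src_ip) {
    auto it = active_.find(src_ip);
    if (it != active_.end()) { // concurrent streams share the mapped pool
      it->second->active_streams++;
      return it->second.get();
    }
    std::unique_ptr<SubPool> pool;
    if (!idle_.empty()) {
      pool = std::move(idle_.back()); // recycle a fully drained pool
      idle_.pop_back();
    } else if (active_.size() + idle_.size() < max_pools_) {
      pool = std::make_unique<SubPool>();
    } else {
      return nullptr; // at the limit: caller must queue the request as "pending"
    }
    pool->assigned_src_ip = src_ip;
    pool->active_streams = 1;
    return (active_[src_ip] = std::move(pool)).get();
  }

  // Called when the last upstream connection on a pool drains: the pool moves
  // to the idle list and can be reused for a different source address later.
  void poolDrained(const std::string& src_ip) {
    auto it = active_.find(src_ip);
    if (it == active_.end()) {
      return;
    }
    idle_.push_back(std::move(it->second));
    active_.erase(it);
  }

private:
  size_t max_pools_;
  std::map<std::string, std::unique_ptr<SubPool>> active_;
  std::deque<std::unique_ptr<SubPool>> idle_;
};
```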

General TODOs and notes:

  1. The integration tests are mostly complete. I need to do some infrastructure work (Add build scripts to run tests under a privileged container #5246) first before they'll run in circle.
  2. We cannot map HTTP/2 connections with the current approach. This is due to how I detect whether a pool is idle. I plan on reworking this part of the algorithm to distinguish between a pool being drained and a pool being idle. The new approach will give performance benefits as well. But, I want to get this PR reviewed and merged before tackling that; it's already too large.
  3. I plan on implementing source port transparency in the future by either extending or abstracting the SrcIpTransparentMapper.
  4. I also plan on refactoring the connection pool interface so that it applies to both http and tcp connections. This will allow us to reuse all of this logic for tcp just as well as http.
  5. The limit on the number of concurrent connection pools needs to be configurable. Similarly, we need to expose it via stats.

Risk Level: Medium
This is a large change, but it cannot yet be enabled.

Testing:

  • Fairly extensive unit tests of the new code.
  • A dedicated integration test of the new module is almost complete. It just needs some infrastructure work before a PR makes sense.
  • Some manual testing using envoy in a container talking to an http1 upstream in another container. I used a Proxy Protocol listener to control the fake source IP used by the upstream connection.

Docs Changes:
I need to document the feature:

  • How it works.
  • How to configure it.
  • How to monitor it.
    I'll do this in a future PR.

Release Notes:
This will need a release note so people are aware of the feature. Doesn't make sense just yet though.

This adds a wrapped connection pool whose purpose is to control how
connection pools are allocated and how connections are mapped to them.
These allocations and mappings depend on various criteria of the
request, chosen by an abstracted mapping, when a stream is first
handled.

Signed-off-by: Kyle Larose <kyle@agilicus.com>
If src transparency is enabled on a cluster, we need to ensure that
requests from the same IP go to the same connection pool. We do this to
simplify the stream matching criteria. The existing ConnectionPools
maintain a set of open connections to their upstream hosts. When a new
request comes in, one of those is chosen without the connection being
reestablished.

This is normally fine. However, when src transparency is enabled, we
expect the upstream connection's source IP, and maybe port, to match the
source IP/port of the downstream connection. This means that we need
logic to choose the upstream connection based on the request. This does
not exist in the connection pools. Rather than add that complexity to
the pools, we'll build selection logic into the wrapper so that pools
are married to specific connection information as long as necessary.
Then, the wrapper can handle the mapping logic the same way for all
types of connection pools. We can even change the mapping logic without
changing the pools! Magic!

Signed-off-by: Kyle Larose <kyle@agilicus.com>
This class will eventually be responsible for mapping requests to
connection pools. Currently it is just a stub until we flesh out the
interface a bit more.

Signed-off-by: Kyle Larose <kyle@agilicus.com>
We want to abstract creation of the mapper itself for a few reasons:
 1) The complexity of creating it based off config/etc doesn't belong at
    its creation point
 2) It helps with test introspection.

Signed-off-by: Kyle Larose <kyle@agilicus.com>
Signed-off-by: Kyle Larose <kyle@agilicus.com>
This tests the connection pool factory by showing that it correctly
creates the wrapped connection pool when a transparent pool is desired.

Signed-off-by: Kyle Larose <kyle@agilicus.com>
We want to be able to pass the LoadBalancerContext to the connection
pools in case they use it to partition the connection. So, create a new
version of newStream that takes it as a parameter. We do this rather
than changing the existing newStream for two reasons:

1) We may know that we don't want to use the LoadBalancerContext when
calling newStream, so keep the old one around for that purpose
2) We don't need to change everywhere that uses the old method, since
for all intents and purposes, the old calls to newStream wouldn't use
LoadBalancerContext anyway!

Signed-off-by: Kyle Larose <kyle@agilicus.com>
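A minimal sketch of the overload this commit describes, using invented stand-in types (`PoolSketch`, `StreamDecoder`, `Callbacks`) rather than Envoy's actual interfaces: the legacy newStream() stays untouched, and a new overload forwards the LoadBalancerContext for callers that have one.

```cpp
#include <string>

// Invented stand-ins; the real Envoy interfaces differ.
struct LoadBalancerContext {
  std::string downstream_src_ip;
};
struct StreamDecoder {};
struct Callbacks {};

// Sketch: keep the original newStream() and add an overload that forwards a
// LoadBalancerContext, so existing callers are untouched while the wrapper
// can pass partitioning context down to the pool.
class PoolSketch {
public:
  std::string newStream(StreamDecoder&, Callbacks&) {
    return "no-context"; // legacy path: no partitioning information
  }
  std::string newStream(StreamDecoder& d, Callbacks& c, const LoadBalancerContext& ctx) {
    last_context_ip_ = ctx.downstream_src_ip; // the wrapper uses this to pick a sub-pool
    return newStream(d, c);                   // then delegates to the legacy path
  }
  const std::string& lastContextIp() const { return last_context_ip_; }

private:
  std::string last_context_ip_;
};
```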
We need access to the connection_mapper mock from multiple locations.
Consequently, it makes sense for connection_mapper to be in include, not
source. This refactors it out and updates the various locations that use
it.

Signed-off-by: Kyle Larose <kyle@agilicus.com>
This should speed up compilation for tests that use only these mocks.

Signed-off-by: Kyle Larose <kyle@agilicus.com>
Starting to add concrete functionality using mock to test via the UT
infrastructure. Lots more to do!

Signed-off-by: Kyle Larose <kyle@agilicus.com>
Since the connection mapper knows when a pool has been newly assigned,
it can register a single callback to let it know when the pool is fully
drained. This allows us to tell anything that may have pending requests
that they can try again. Contrast this with the WrappedConnectionPool.
It doesn't know if a pool is newly assigned. So, it would have to
register with every stream, which isn't really efficient.

Signed-off-by: Kyle Larose <kyle@agilicus.com>
The WrappedConnectionPool needs to store the load balancer context used
for a request when pushing a pending request since it needs to use it
later when trying to assign the request again. Extend the
ConnectionPoolImplBase to optionally track this information.

Signed-off-by: Kyle Larose <kyle@agilicus.com>
Progress: we now have pending requests! They are processed when an idle
callback is triggered from the mapper. We still need to do a bit more
work here, because the sub-pool itself could also make the connection
idle. In that case, we still need to propagate the cancellation
callback.

Signed-off-by: Kyle Larose <kyle@agilicus.com>
Handle the case where the stream was pending, then it became pending
again when allocated to a pool.

Signed-off-by: Kyle Larose <kyle@agilicus.com>
More work to minimize the size of headers in UTs to hopefully speed them
up.

Signed-off-by: Kyle Larose <kyle@agilicus.com>
We need to be able to handle failures to allocate pools/etc from the
wrapped pool. We also need to be able to pass that information back up
to our owner. Consequently, we need to wrap the ConnectionPool
callbacks. This change adds that logic.

Signed-off-by: Kyle Larose <kyle@agilicus.com>
If the initial call to newStream results in a pool, but calling
newStream on it returns a Cancellable we need to make sure we wrap it so
that we can properly cleanup on a call to cancel from the parent.

Signed-off-by: Kyle Larose <kyle@agilicus.com>
If there are waiting connections, we shouldn't drain. Don't!

Signed-off-by: Kyle Larose <kyle@agilicus.com>
The connection pool could immediately invoke a callback as part of
creating a new stream. We need to make sure we properly clean up and
pass forward the callback in both cases where we create the new stream:
 1) When a pool is allocated on the first attempt
 2) When a pool is allocated after first failing.

Signed-off-by: Kyle Larose <kyle@agilicus.com>
PendingWrapper no longer just tracks pending requests. It tracks the
entire 'stream' setup lifetime, so rename it to more appropriately
capture this.

Signed-off-by: Kyle Larose <kyle@agilicus.com>
When we are told to drain, we need to in turn tell each of the
partitioned pools we control to drain. Do this through the mapper, which
actually knows about all the pools. Further, use the mapper to determine
whether or not all pools are idle so we can empty them out when checking
for drained as part of the idle callback.

Signed-off-by: Kyle Larose <kyle@agilicus.com>
We're going to be adding more wrapper code. Makes sense to put it into
its own subdirectory so we can easily view it all.

At the same time, fix up a UT that was failing due to some mock changes.

Signed-off-by: Kyle Larose <kyle@agilicus.com>
Currently just very simple stuff. More to come!

Signed-off-by: Kyle Larose <kyle@agilicus.com>
The network mocks header is quite large. I only need access to the
connection mocks for my UT. So, split out the network mocks into their
own header to decrease test compile times.

Signed-off-by: Kyle Larose <kyle@agilicus.com>
We now map IPv4 addresses to pools. Next up, IPv6.

Signed-off-by: Kyle Larose <kyle@agilicus.com>
Now with extra bits!

Signed-off-by: Kyle Larose <kyle@agilicus.com>
We can now idle pools so that they will be reused for future requests,
even if the address differs.

Signed-off-by: Kyle Larose <kyle@agilicus.com>
A pool going idle is interesting, since it means that a request which
previously failed may now be assigned. Issue callbacks to any interested
parties when this happens.

Signed-off-by: Kyle Larose <kyle@agilicus.com>
We can't transparently map the IP if there isn't a valid IP! Catch these
cases and throw an exception when they occur.

This may not be the best solution in the long term, but it's good enough
for now.

Signed-off-by: Kyle Larose <kyle@agilicus.com>
Forgot to clear out the optional v4/v6 address in the PoolTracker. This
could lead to us accidentally assigning a new pool for an IP that was
already active.

Signed-off-by: Kyle Larose <kyle@agilicus.com>
Minor cleanup for the PR

Signed-off-by: Kyle Larose <kyle@agilicus.com>
The signature for SrcIpTransparentMapper's constructor's definition
differed from its declaration. Make them match.

Signed-off-by: Kyle Larose <kyle@agilicus.com>
@zuercher
Member

@lizan can you take a first pass at this?

@lizan
Member

lizan commented Dec 12, 2018

Sure I will take first pass.

@klarose definitely want to see 1. as a separate PR so it can be merged sooner. The gdoc in the description is 404, can you update?

@klarose
Contributor Author

klarose commented Dec 12, 2018

@lizan thanks for taking a look. I've fixed the link. Let me know if you still have issues. It's a relatively long document, covering more than the work I've done here. So, if you want to focus on this, just look at the design section near the end. If anything is unclear, or you'd like more of an explanation, let me know. I'll update the document/we can chat about it.

As for 1. I'll try to get that up in the next few hours.

@klarose klarose mentioned this pull request Dec 12, 2018
@jrajahalme
Contributor

jrajahalme commented Dec 12, 2018

@klarose @lizan Have you looked at the hashKey() method of Envoy::Network::Socket::Option? It is an existing method in Envoy that allows downstream connection metadata (options) to essentially divide upstream connection pools into sub-pools that each have a separate key (as accumulated from the hashKey() methods of Socket::Options). This feature is an extension of the method by which Envoy keeps different connection pools for each priority class, and as such added minimal additional logic to the upstream implementation.

This functionality is currently used by Cilium to separate the upstream connection pools by the SO_MARK option we set on the upstream connections, but would be similarly applicable to ensure that upstream connections are not shared across different source addresses.

I have recently prototyped multiplexed transport on top of a normal TCP listener, where the original 5-tuple is carried in the beginning of each frame. I used a Socket::Option to reset the upstream source address to be the original downstream address and to shard the upstream pools essentially per source address.

IMO this could be used to implement source transparency so that a Socket::Option instance would set the upstream Socket's localAddress (using setLocalAddress()), set the SO_TRANSPARENT on the socket, and return the bytes needed in hashKey() to keep the upstream connections separated by the source address. A listener filter can be created to add these options to the downstream connections, from where they are already carried over to the upstream connection constructors (e.g., dispatcher's createClientConnection() method).

So, it seems to me that a single-proxy version of source transparency could be implemented entirely using the existing extension mechanisms already built into Envoy. It could be one more listener filter in source/extensions/filters/listener/ and be exposed to users as a well known filter name they add to their listener configuration.
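As a rough illustration of the approach jrajahalme describes, here is a minimal sketch of an option that binds the upstream source address and contributes it to the hash key so pools are sharded per source address. `FakeSocket` and `SrcAddressOptionSketch` are invented stand-ins under stated assumptions; Envoy's real `Network::Socket::Option` interface differs.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Invented stand-in; Envoy's real Network::Socket is richer than this.
struct FakeSocket {
  std::string local_address; // upstream source address to bind
  bool transparent = false;  // stands in for the SO_TRANSPARENT/IP_TRANSPARENT setsockopt
};

// Sketch of the option described above: set the upstream socket's local
// address to the downstream source address, mark it transparent, and
// contribute the address to the hash key so that upstream connections are
// never shared across different source addresses.
class SrcAddressOptionSketch {
public:
  explicit SrcAddressOptionSketch(std::string downstream_src)
      : downstream_src_(std::move(downstream_src)) {}

  bool setOption(FakeSocket& socket) const {
    socket.local_address = downstream_src_; // analogous to setLocalAddress()
    socket.transparent = true;              // analogous to the transparency sockopt
    return true;
  }

  // Append bytes that distinguish this option. Identical addresses produce
  // identical keys, so their streams share one upstream sub-pool.
  void hashKey(std::vector<uint8_t>& key) const {
    key.insert(key.end(), downstream_src_.begin(), downstream_src_.end());
  }

private:
  std::string downstream_src_;
};
```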

@klarose
Contributor Author

klarose commented Dec 12, 2018

@jrajahalme That's certainly an interesting approach. I think it could work, as long as the pools are cleaned up fairly regularly. However, I have a few concerns with it that I think are addressed by the wrapped pool I have proposed:

  1. We'd end up creating an extra connection pool for every new external connection. There would never be any reuse of the pools, meaning the overhead of pool creation and destruction is guaranteed.
  2. There are no limits on the number of pools created this way. I worry that the map could bloat quite quickly. Obviously we could put some constraints on it, but then I'm not sure how we would work in the logic to give "pending" requests a chance without re-implementing much of the pooling logic where it doesn't really belong.
  3. It wouldn't necessarily help with other "partitioning" approaches such as SNI. @mattklein123 asked that I develop a solution which could be extended to allow pooling based on the SNI. I don't think it's appropriate to put the SNI into a socket option, though others could disagree with me. :)

At the surface, I think your proposed approach is nicer. It's simple, and it uses the existing mechanisms as you've suggested. However, I worry that it won't scale as well as we'd like, and that we may need to build something similar in the future anyway.

What do the others think? I'm willing to revisit the design.

@klarose
Contributor Author

klarose commented Dec 12, 2018

@jrajahalme Out of curiosity, what are you using for the multiplexed transport? PROXY protocol?

@lizan
Member

lizan commented Dec 13, 2018

It wouldn't necessarily help with other "partitioning" approaches such as SNI. @mattklein123 asked that I develop a solution which could be extended to allow pooling based on the SNI. I don't think it's appropriate to put the SNI into a socket option, though others could disagree with me. :)

#4973 adds TransportSocketOptions for SNI with hashKey approach to allow TcpConnectionPool connections to be partitioned by SNI, it left HTTP as a todo but similar approach based on that should work.

@klarose
Contributor Author

klarose commented Dec 13, 2018

@lizan @jrajahalme Alright. You have me convinced. It's worth taking a look at. I'll see what I can do today to prove it out.

@klarose
Contributor Author

klarose commented Dec 13, 2018

Hmm. I'm not sure this will cover all the cases I'm interested in, at least as a listener filter. I want to be able to use the downstream remote address set by x-forwarded-for, which is set in void ConnectionManagerImpl::ActiveStream::decodeHeaders. This is a filter-chain filter, so it is invoked after the listener filters. Maybe I can make it work there...

EDIT: Looks like a network read filter may be the answer.

@jrajahalme
Contributor

@klarose Http ConnPoolImpl already takes in Network::ConnectionSocket::OptionsSharedPtr& as a parameter, you should be able to use that for this purpose.

@jrajahalme
Contributor

jrajahalme commented Dec 13, 2018

@klarose I've prototyped a multiplexed transport for kTLS, for which kernel-side eBPF program creates custom shim headers, basically just the 5-tuple and frame length in a binary representation. So, very similar to proxy protocol, but different :-)

@klarose
Contributor Author

klarose commented Dec 13, 2018

@jrajahalme Thanks for the info. My current plan is to create a network filter which uses the ReadFilterCallbacks to get access to the socket option list. It'll add the option which sets the transparency/mark/etc as well as the source IP of the downstream.

Being a network filter, it can come after listener filters such as the proxy protocol filter, and the http connection manager (which sets the downstream from x-forwarded-for if it's trusted).

@jrajahalme
Contributor

@klarose Interesting, I never realized you can have network filters after the http connection manager, I've kind of assumed that it will consume the payload, but maybe it does not matter for setting metadata like this.

zuercher pushed a commit that referenced this pull request Dec 13, 2018
While developing the source transparency feature, I became frustrated with both the compile time of some of the tests and the fact that changes in seemingly unrelated headers caused many tests to recompile. Much of this was caused by the monolithic mocks headers we use. Since they include everything for a high-level component, a change to any of the many headers that component depends on can trigger recompilation. Simply including that much also slows down compilation.

As an example, I found that splitting out the connection pool mocks decreased my unit test compile time from 13 seconds to 8 seconds. While not a drastic speedup, 5 seconds per compile is still quite a bit if you're doing TDD. You want feedback as quickly as possible!

This PR was split out of #5255 with some minor changes to remove code that only applied in that PR.

Risk Level: Low. This is just testing headers.
Testing: Ran the tests multiple times.
Docs Changes: N/A
Release Notes: N/A

Signed-off-by: Kyle Larose <kyle@agilicus.com>
@klarose
Contributor Author

klarose commented Dec 13, 2018

@jrajahalme Yeah, I think the filters work by doing some work up front when the connection is established, on all filters, then they proceed to do work with the data. I'll make these changes at connection establishment time. We'll see if it works. :)

@klarose
Contributor Author

klarose commented Dec 13, 2018

@jrajahalme Looking at adding the new socket option to do this... Unfortunately, Socket::Option only has access to a Socket, which does not itself capture the concept of a local address -- a subclass of it (ConnectionSocket) does.

Ways around this I can think of offhand?

  1. Extend the visitor with another setOption taking a ConnectionSocket. Unfortunately, applyOptions is static, so we can't use inheritance to use the more concrete version of setOption. I guess I could make it a virtual method. I'd only have to change a few places to call socket_.applyOptions(options, ...) rather than Network::Socket::applyOptions(socket_, options, ...)
  2. Use dynamic_cast (ick!)
  3. Add a parallel set of socket options called ConnectionSocketOptions. I'd rather avoid doing this, since it would involve a bunch of integration work, which I think we're trying to avoid here.

What do you think? I prefer 1, even though it's a little gross to pollute an interface with knowledge of its subclasses... That's the visitor pattern for you, though... I guess it's not really the interface we're polluting, since Option is a separate class. It just happens to be nested in Socket.
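Option 1 above could look roughly like this minimal sketch. The types here are invented stand-ins under stated assumptions, not Envoy's actual hierarchy; it only illustrates keeping a base setOption(Socket&) alongside a more concrete setOption(ConnectionSocket&) overload, so the option can set a local address only where one exists.

```cpp
#include <string>

// Invented minimal hierarchy mirroring the discussion: the base Socket has no
// local-address setter; ConnectionSocket adds one.
struct Socket {
  virtual ~Socket() = default;
};
struct ConnectionSocket : Socket {
  std::string local_address;
  void setLocalAddress(const std::string& addr) { local_address = addr; }
};

// Sketch of option 1: a second setOption overload takes the more concrete
// socket type. Callers holding a ConnectionSocket get the address-setting
// behavior; callers holding a bare Socket get the no-op base overload.
struct OptionSketch {
  std::string src_addr;
  bool setOption(Socket&) const { return false; } // nothing to do on a bare Socket
  bool setOption(ConnectionSocket& s) const {     // concrete overload
    s.setLocalAddress(src_addr);
    return true;
  }
};
```

Note that this resolves the overload statically; making applyOptions virtual, as suggested above, would be what lets the concrete overload be reached through the existing call sites.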

@klarose
Contributor Author

klarose commented Dec 13, 2018

It looks like createClientConnection doesn't use the sourceAddress of the connection. Should be simple to change, though.

    if (source_address != nullptr) {
      const Api::SysCallIntResult result = source_address->bind(fd());
      if (result.rc_ < 0) {
        ENVOY_LOG_MISC(debug, "Bind failure. Failed to bind to {}: {}", source_address->asString(),
                       strerror(result.errno_));
        bind_error_ = true;
        // Set a special error state to ensure asynchronous close to give the owner of the
        // ConnectionImpl a chance to add callbacks and detect the "disconnect".
        immediate_error_event_ = ConnectionEvent::LocalClose;

        // Trigger a write event to close this connection out-of-band.
        file_event_->activate(Event::FileReadyType::Write);
      }
    }

sourceAddress comes from the clusterInfo. The pre-bind options are set just before this. So, I could check whether the socket's local address has been set. If it has, I'll use it to override the cluster address.
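The precedence being proposed might be sketched as below. This uses an invented `AddressPtr` alias and a hypothetical `chooseBindAddress` helper rather than Envoy's actual Address::InstanceConstSharedPtr plumbing; it only illustrates the override rule.

```cpp
#include <memory>
#include <string>

// Invented stand-in for Address::InstanceConstSharedPtr.
using AddressPtr = std::shared_ptr<const std::string>;

// Sketch of the rule discussed above: after the pre-bind options run, a local
// address they set on the socket wins; otherwise fall back to the cluster's
// configured source address (which may itself be null, meaning no bind).
AddressPtr chooseBindAddress(const AddressPtr& socket_local_address,
                             const AddressPtr& cluster_source_address) {
  return socket_local_address != nullptr ? socket_local_address : cluster_source_address;
}
```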

@jrajahalme
Contributor

jrajahalme commented Dec 14, 2018

@klarose Network::Socket has the local address, but not the method to set it. IMO we should add a setLocalAddress(const Address::InstanceConstSharedPtr& local_address) method to Network::Socket.

I had used a dynamic_cast in the prototype, but did not remember that fact, sorry.

I'd also want to see a non-null value returned by Socket::localAddress() override the cluster's source address setting.

@mattklein123 mattklein123 self-assigned this Dec 14, 2018
@klarose
Contributor Author

klarose commented Dec 14, 2018

I have it working for PROXY protocol. However, it doesn't work when I want to consume the XFF. This makes sense, now that I think about it -- the XFF is only set once all the http headers have been parsed, and that's too late. So, I think I need to make two filters:

  1. Listener filter, gets the address from remoteAddress of the connection
  2. An http decoder filter, gets the address from StreamInfo::downstreamRemoteAddress.

The listener filter can be used with ProxyProtocol, for tcp or http. The http decoder can only be used for http.

Sad, because it's more work than I anticipated, and I need to do some refactoring, but that's life I guess. :/

I'll submit a PR for the listener one, so we can take a look, then get on to implementing the http one.

@klarose
Contributor Author

klarose commented Dec 20, 2018

Cancelling this as it has been subsumed by the work for another approach (#5337)

@klarose klarose closed this Dec 20, 2018
fredlas pushed a commit to fredlas/envoy that referenced this pull request Mar 5, 2019
While developing the source transparency feature, I became frustrated with both the compile time of some of the tests and the fact that changes in seemingly unrelated headers caused many tests to recompile. Much of this was caused by the monolithic mocks headers we use. Since they include everything for a high-level component, a change to any of the many headers that component depends on can trigger recompilation. Simply including that much also slows down compilation.

As an example, I found that splitting out the connection pool mocks decreased my unit test compile time from 13 seconds to 8 seconds. While not a drastic speedup, 5 seconds per compile is still quite a bit if you're doing TDD. You want feedback as quickly as possible!

This PR was split out of envoyproxy#5255 with some minor changes to remove code that only applied in that PR.

Risk Level: Low. This is just testing headers.
Testing: Ran the tests multiple times.
Docs Changes: N/A
Release Notes: N/A

Signed-off-by: Kyle Larose <kyle@agilicus.com>
Signed-off-by: Fred Douglas <fredlas@google.com>