filter: implemented gzip http filter#2087
Conversation
|
Awesome! @dnoe can you own the review on this one? |
dnoe
left a comment
There was a problem hiding this comment.
Looks good! Just a few comments to pick through.
There was a problem hiding this comment.
Use skip_compression_ to match the other member name style.
There was a problem hiding this comment.
I think it can be a const method.
There was a problem hiding this comment.
Any particular reason for making memory_level configurable but not window_bits?
There was a problem hiding this comment.
window_bits changes the encoding type. I'm not sure when user will want to change it in the case of GZIP.
There was a problem hiding this comment.
It might make more sense to do the string->enum resolution at construction time and then store the enum here instead. You'd still retain the above functions but make them static functions that take the string and return enum, and call them in the ctor initializer list.
There was a problem hiding this comment.
'qvalue' is described in the RFC, e.g Accept-Encoding: gzip;q=1.0, identity; q=0.5, *;q=0
There was a problem hiding this comment.
This looks like it'll return true if there isn't an exact match but a subset of the content type is contained in the white list. Is that definitely what you want?
There was a problem hiding this comment.
Not really, I fixed it by providing a prefixed list of content-types that the user can select it from in the config.
|
Thanks @dnoe for reviewing it. I will take look at that over the weekend. |
|
@dnoe I did the initial changes that you suggested. I also included the part of the RFC2616 that was missing plus some more robust configuration options. Please take another look when you have a chance. Thanks! |
dnoe
left a comment
There was a problem hiding this comment.
Mostly looks good, just a few nits and comments. Still needs documentation for the filter config but it looks like that's meant to come in a follow up PR.
There was a problem hiding this comment.
My understanding is that std::regex::optimize flag makes matching faster at the expense of slower construction. Since this uses a temporary it'll still be doing construction for each call so this trade off doesn't seem to make sense. Thoughts?
There was a problem hiding this comment.
(I think ideally you could have this std::regex live the same lifespan as Envoy itself, but watch out for the "static initialization order fiasco")
There was a problem hiding this comment.
Today I learned that const auto and auto const are equivalent! I think everywhere else in Envoy you'll see it written as const auto, so please consider changing it as a nit.
dnoe
left a comment
There was a problem hiding this comment.
Awesome, CONSTRUCT_ON_FIRST_USE is perfect. Just a few things.
There was a problem hiding this comment.
Please keep this namespace comment in.
There was a problem hiding this comment.
It looks like that namespace Http was in a wrong place. It should be in the line 128 only.
There was a problem hiding this comment.
What's the purpose of this namespace?
There was a problem hiding this comment.
I just realized that this doesn't apply to c++11 or up anymore. I will remove it.
|
Looks good to me. Did you see @mattklein123 comment about making it V2 ready? Let me know here or on Slack if you have questions about that and what needs to be done in the data-plane-api repository. |
|
@dnoe I started working on proto yesterday. I reach you on Slack if I have any question. Thank you again for reviewing it! |
5229c58 to
cd5e964
Compare
|
@dnoe Please take another look whenever you can. Thanks! |
|
Will take a look at this today (sorry I didn't managed to get to it before the weekend!) |
|
No worries! I also had a busy week and couldn't send this before Thursday night. Take your time! |
dnoe
left a comment
There was a problem hiding this comment.
Looks good! A few remaining things to look at.
There was a problem hiding this comment.
I think there's a simpler way to initialize these which will eliminate the constructor body and allow you to make all the members of the GzipFilterConfig const. See the part about RepeatedPtrField in https://developers.google.com/protocol-buffers/docs/reference/cpp-generated
This should work: cache_control_values_(gzip.cache_control().cbegin(), gzip.cache_control.cend())
There was a problem hiding this comment.
I knew that repeated provides STL-like iterators, but I didn't realized that it could be initialized like that. Cool, thanks! 👍
There was a problem hiding this comment.
I think this can be done with a switch statement since what comes from the proto is a true enum. It'll probably be a bit more readable code than the if statements.
There was a problem hiding this comment.
I've tried switch here, but for some reason didn't work. Let me try it again.
There was a problem hiding this comment.
Same for this, if a switch statement works prefer it to repeated if statements.
There was a problem hiding this comment.
Can you add a comment about where these values come from?
There was a problem hiding this comment.
I added this to get rid of the magic number, but good call about the comment.
|
@mattklein123 It's ready. Would be great to have yours or @htuch feedback on this. |
There was a problem hiding this comment.
It looks like you are not using a fast lookup here. Why not pull the content type up and use config_->contentTypeValues().find()?
Maybe it's because content-type could be something like "text/html; charset=utf8", but can't you parse out the substring of the content-type before the ";" and then do the unordered_set find?
Moreover, if you had a compressible type "foobar" but you had content-type:foo that would match this impl, if I'm reading it correctly. In practice that might not be an issue.
There was a problem hiding this comment.
why would cache-control influence the compressibility of something? Except maybe cache-control:no-transform.
There was a problem hiding this comment.
It does not affect compression, but the necessity for such. Since the response is going to be cached, there are situations where skip compression might be desirable. Less common I believe, envoy could be behind another proxy that might cache certain content and then serve it compressed.
There was a problem hiding this comment.
RE "response is going to be cached", I think that's what Vary:Accept-Encoding is for. HTTP caches should be able to hold a gzipped response and serve it to a non-gzip-accepting client by decompressing it on the fly, if that header is in the cached response. https://blog.stackpath.com/accept-encoding-vary-important is worth a read.
RE "behind another proxy" -- I think in that case the site admin would just disable gzipping in Envoy then, right? Why would it need to be based on cache-control headers?
I'm not objecting to this setting, I just want to better understand what the usage model is and its relationship to caches.
There was a problem hiding this comment.
Thanks for pointing this out Joshua.
I think that Vary should be also available in the configuration. In regards to cache-control, I was thinking more in in terms of proxied requests, like Nginx does on the presence of Via request header. See gzip_proxied at http://nginx.org/en/docs/http/ngx_http_gzip_module.html
Thoughts?
There was a problem hiding this comment.
OK, sounds like you are going for a drop-in replacement for nginx gzip support? Fair enough. I don't know why someone would configure cache-control's effect on gzipping in an nginx proxy either, but it must have been put in for a reason. In that case, let's just do it efficiently. In the case of cache-control, I think it means splitting any number of attributes on ",", and doing a hash lookup in your set on each of them. E.g. cache-control: no-cache, no-store, must-revalidate, max-age=0 should (IMO) result in 4 exact hash lookups.
The most relevant setting (AFAICT), cache-control:no-transform, is not mentioned in the nginx doc you referenced, and I don't see it in your PR either. By default, IMO, you should honor cache-control:no-transform and avoid gzipping any response with that header.
I think it's very important for proxies to get the defaults right, especially Vary:accept-encoding, as messing these up can result in hard-to-diagnose downstream failures.
There was a problem hiding this comment.
Good point on having cache-control:no-transform by default. I tried to make the feature as least opinionated as possible, so the users could opt for skipping compression on whatever cache-control value/s they need to, including no-transform.
I will work on these changes, thanks!
There was a problem hiding this comment.
I'd defer to @htuch and @mattklein123 and @alyssawilk on the design philosophy here, as I am still new to Envoy development. A few options are:
- If nginx's gzip module does it, then Envoy's should too.
- Provide mechanism not policy about how the proxy should behave, and let the config figure out what to do (ie "least opinionated as possible)
- Provide a paved path toward a few different sensible scenarios
There is some subtlety here that I think is worth getting right in the C++ code, so I kind of favor option 3, and these scenarios come to mind:
a. Enable gzip with some level of compression aggressiveness, respecting cache-control:no-transform from origin, and adding vary:accept-encoding to the response to the client.
b. Disable gzip completely, e.g. because the Envoy being configured is sitting in a high-bandwidth rack and some other proxy will gzip en route to client.
But I totally understand if the market for Envoy wants it to be a clean drop-in replacement for nginx and whatever features that has in its gzip setup.
The challenge with #2 above (least opinionated) is that you then need the person configuring Envoy to know the HTTP semantics and best practices, and enforce them via config.
There was a problem hiding this comment.
why this complicated constant rather than just "{}"?
There was a problem hiding this comment.
Thanks, this looks a lot better.
I'm not super-intuitive about regexes. Will this strip any extra whitespace too?
Note could also write this with str::find and some sort of string-view-trimming helper function (if such exists in absl::). That might be faster.
In any case this function deserves some direct unit-testing wihout waking up the rest of the gzip filter.
There was a problem hiding this comment.
use .empty() to check whether it's empty, rather than taking the size(), which might in theory be a non-trivial calculation, depending on the STL impl.
There was a problem hiding this comment.
probably worth putting TransferEncoding() in a temp as I think it's actually a lookup.
There was a problem hiding this comment.
use !empty() rather than size()
There was a problem hiding this comment.
probably worth doing a pass over all these methods and storing header lookups in temps rather than doing them twice.
There was a problem hiding this comment.
There is a great chance that the second call to TransferEncoding() might never happen, reason why this is also the second last checking in the if statement. Do you think that it's worth doing all these assignments to temps?
There was a problem hiding this comment.
IMO yes...HeaderMap's impl currently instantiates a discrete member variable per common header so the lookup is fast, but I'm wondering if that might be something that gets re-tuned later as that policy might be a drag on header construction speed as the number of potentially interesting headers grows. And in any case the temp is free, so I don't see a downside to a temp in readability or performance.
There was a problem hiding this comment.
note there is also "deflate"(which isn't used much anymore and "br" (brotli) which is "up and coming", though I'm not sure how fast :)
There was a problem hiding this comment.
Sorry, comment should be more specific -> "Since gzip is the only available content-encoding in Envoy at the moment".
I haven't look at Brotli specification yet, but I saw that it's supposed to be superior to gzip which is quite impressive.
There was a problem hiding this comment.
use anon namespace rather than 'static' (from google style guide).
There was a problem hiding this comment.
That's what I initially did, but it got caught in previous review. So, I thought it wasn't relevant anymore and ended up moving back to static.
There was a problem hiding this comment.
re-reading https://google.github.io/styleguide/cppguide.html#Unnamed_Namespaces_and_Static_Variables it does not forbid 'static', but within the context of this style guide, everyone I know that writes C++ based on this guide interprets anonymous namespace to be the preferred mechanism here, even though they have the same effect. I think this interpretation might have been due to an earlier version of the styleguide or because the only example given is with an anonymous namespace.
mattklein123
left a comment
There was a problem hiding this comment.
This is super sweet. Thanks for the high quality PR. A couple of random comments. I'm going to defer to @jmarantz for logic review as I have very little familiarity with high level web stuff so can't help that much with the various headers and when to use them.
@gsagula do you mind finalizing the docs PR in data-plane-api? (unhide filter docs, clean them up, etc.) Then we can do a final pass on everything together.
Thank you!
source/common/config/filter_json.cc
Outdated
There was a problem hiding this comment.
technically this is taking a reference to a temporary. I would lose the & on the getString calls. Same below.
There was a problem hiding this comment.
nit: const GzipFilterConfigSharedPtr&
There was a problem hiding this comment.
nit: I would just define this in the header file.
There was a problem hiding this comment.
nit: The initialization here is redundant and can be removed.
|
@mattklein123 Working on all the above. @jmarantz I'm happy to go with the design changes that you suggested earlier. I will open another PR in data-plane to unlock docs, so we can work on the config changes from there. Thanks for reviewing it. |
jmarantz
left a comment
There was a problem hiding this comment.
This is looking better, but I'm still looking for the Vary:Accept-Encoding addition.
source/common/common/utility.cc
Outdated
There was a problem hiding this comment.
If you use absl::string_view you will not have to copy/allocate any string data for this operation.
https://github.com/abseil/abseil-cpp/blob/master/absl/strings/string_view.h
absl is already incorporated into the Envoy build.
I admit that you may need to construct a temp std::string later to do the hash lookup though, but in one case it would at least save you one intermediate copy. You could also avoid that intermediate copy by using std::unordered_setabsl::string_view but you'll need to keep the backing store for the strings separately then. Not sure if that's worth it, but it would save allocations/string-copying per request in the server.
There was a problem hiding this comment.
don't forget to trim the tokens. If you just split on "," you'll wind up with tokens like " max-age=0" and " public". For some reason, it's common to have "public, max-age=300" in headers.
I'm not really sure if splitting off the max-age value is really a concern. E.g. would someone want to say "I'm willing to gzip, but only if it's cached less than 5 minutes"?
But you need to strip the whitespace so you can find "no-transform".
There was a problem hiding this comment.
IMO yes...HeaderMap's impl currently instantiates a discrete member variable per common header so the lookup is fast, but I'm wondering if that might be something that gets re-tuned later as that policy might be a drag on header construction speed as the number of potentially interesting headers grows. And in any case the temp is free, so I don't see a downside to a temp in readability or performance.
There was a problem hiding this comment.
re-reading https://google.github.io/styleguide/cppguide.html#Unnamed_Namespaces_and_Static_Variables it does not forbid 'static', but within the context of this style guide, everyone I know that writes C++ based on this guide interprets anonymous namespace to be the preferred mechanism here, even though they have the same effect. I think this interpretation might have been due to an earlier version of the styleguide or because the only example given is with an anonymous namespace.
|
@jmarantz Thanks again for the detailed review. I have my eyes on the comments above. I'm back to work tomorrow, so I won't be able to look at them till evening. I did some modifications in gzip.proto (not in use yet) based on our previous discussions:
Note that I will also include verification for "Cache-Control: no-transform". Please take a look at whenever you have time: |
Signed-off-by: Gabriel <gsagula@gmail.com>
Signed-off-by: Gabriel <gsagula@gmail.com>
Signed-off-by: Gabriel <gsagula@gmail.com>
Signed-off-by: Gabriel <gsagula@gmail.com>
…ted api SHA Signed-off-by: Gabriel <gsagula@gmail.com>
|
Manual system-testing with real browser & envoy as a forward-proxy @jmarantz I think we are done here, but feel free to give it another pass. |
mattklein123
left a comment
There was a problem hiding this comment.
Epic PR! LGTM from a quick skim. @jmarantz do you mind doing the final pass on this one?
source/common/config/filter_json.cc
Outdated
| *proto_config.mutable_ip_white_list()); | ||
| } | ||
|
|
||
| void FilterJson::translateGzipFilter(const Json::Object& json_config, |
There was a problem hiding this comment.
Did we decide to keep v1 support or are we dropping it? If we are keeping it we should add docs here: https://www.envoyproxy.io/docs/envoy/latest/api-v1/http_filters/http_filters. My preference is to drop it though and encourage people to upgrade to v2. Either way is fine.
There was a problem hiding this comment.
We decided to drop V1 as per Harvey's recommendation, but I forgot to remove it from Envoy code. I'll do that tomorrow 👀
|
Oh also needs a master merge @gsagula |
Signed-off-by: Gabriel <gsagula@gmail.com>
|
Thanks @mattklein123 |
|
@jmarantz Lastest commit addresses all the places where code needs to handle case-insensitive header values. I tried to extend xxHash instead of defining a new one, but I ran out of time. I'll get back to it at some point. |
|
xxHash vs a new one can be a TODO. RE rebasing: this is a bit of the blind leading the blind here, but I don't think you need to rebase. You need to merge. But I might have that wrong. I attempted to do that flow for one of my own branches and I think it it did the right thing. |
|
In any case, please go ahead & merge so I can see the final state before my walkthrough. I'd recommend saving a simple tree of all your actual code somewhere outside of git's control :) |
Signed-off-by: Gabriel <gsagula@gmail.com>
|
I think it still needs to merge upstream -> git merge upstream/master. Sounds good. Resolving last conflicts. |
Signed-off-by: Gabriel <gsagula@gmail.com>
|
@jmarantz It's ready for another pass. Whenever you get a chance. Thanks! |
Signed-off-by: Gabriel <gsagula@gmail.com>
mattklein123
left a comment
There was a problem hiding this comment.
Few more v1 reverts needed I think?
| : v1_converter_({BUFFER, CORS, DYNAMO, FAULT, GRPC_HTTP1_BRIDGE, GRPC_JSON_TRANSCODER, | ||
| GRPC_WEB, HEALTH_CHECK, IP_TAGGING, RATE_LIMIT, ROUTER, LUA, | ||
| EXT_AUTHORIZATION}) {} | ||
| : v1_converter_({BUFFER, CORS, DYNAMO, ENVOY_GZIP, FAULT, GRPC_HTTP1_BRIDGE, |
There was a problem hiding this comment.
We should revert this change.
source/common/json/config_schemas.cc
Outdated
| } | ||
| )EOF"); | ||
|
|
||
| const char Json::Schema::GZIP_HTTP_FILTER_SCHEMA[] = R"EOF( |
source/common/json/config_schemas.h
Outdated
| static const std::string BUFFER_HTTP_FILTER_SCHEMA; | ||
| static const std::string FAULT_HTTP_FILTER_SCHEMA; | ||
| static const std::string GRPC_JSON_TRANSCODER_FILTER_SCHEMA; | ||
| static const char GZIP_HTTP_FILTER_SCHEMA[]; |
source/server/config/http/gzip.cc
Outdated
|
|
||
| HttpFilterFactoryCb GzipFilterConfig::createFilterFactory(const Json::Object& json_config, | ||
| const std::string&, FactoryContext&) { | ||
| envoy::config::filter::http::gzip::v2::Gzip proto_config; |
There was a problem hiding this comment.
I think this version can never happen if we drop v1? NOT_IMPLEMENTED?
Signed-off-by: Gabriel <gsagula@gmail.com>
Signed-off-by: Gabriel <gsagula@gmail.com>
|
@mattklein123 Done with V1 cleanup. Sorry, I tried to do that very quickly this morning so that you or Joshua could take another look, but I missed some spots. |
jmarantz
left a comment
There was a problem hiding this comment.
this looks great, thanks for pushing this all the way through!
mattklein123
left a comment
There was a problem hiding this comment.
Awesome work! Thanks for the great contribution.
* use route directive regardless of rpc status Signed-off-by: Kuat Yessenov <kuat@google.com> * log response code Signed-off-by: Kuat Yessenov <kuat@google.com>
| static uint64_t djb2CaseInsensitiveHash(absl::string_view input) { | ||
| uint64_t hash = 5381; | ||
| for (unsigned char c : input) { | ||
| hash += ((hash << 5) + hash) + absl::ascii_tolower(c); |
There was a problem hiding this comment.
Seems like an unintended deviation from the original DJB2 hash having:
hash += ((hash << 5) + hash) + absl::ascii_tolower(c);
instead of:
hash = ((hash << 5) + hash) + absl::ascii_tolower(c);
While at it, a typo in ingnoring.
There was a problem hiding this comment.
@andre-rosa can i suggest opening a PR with your suggested change - seems like the typo is fixed already so i would suggest looking at current main
Description:
This PR implements http filter that compresses data dispatched from an upstream service upon client request. To enable this filter, create an object literal under filters with name field “gzip” and an empty config object for minimum configuration. Envoy will deliver compressed content to any http request that contains gzip in accept-encoding header.
#269
Risk Level: Medium
Testing: unit testing, integration testing and manual testing thru envoy front-proxy.
Note: envoy::api::v2::filter will be included in this PR.