perf: provide custom implementation for creating UUIDs.#1401
perf: provide custom implementation for creating UUIDs.#1401RomanDzhabarov merged 9 commits intoenvoyproxy:masterfrom
Conversation
This change yield ~5x performance improvement for UUID generation, which results in ~30% performance improvement when proxying HTTP/2 traffic on localhost: Before: UUIDUtilsTest.benchmark: 2003 ms h2load -c 100 -n 100000: 26467 req/s After: UUIDUtilsTest.benchmark: 408 ms h2load -c 100 -n 100000: 34099 req/s As a bonus, this implementation is fully portable and depends only on BoringSSL's crypto. Signed-off-by: Piotr Sikora <piotrsikora@google.com>
|
Similar results on CI: Before (current After (this PR): @mattklein123 @htuch PTAL. Please note that ASAN test failed to build 1/146 tests because of |
htuch
left a comment
There was a problem hiding this comment.
Awesome, this is a great improvement. When we do performance work, we've typically been disabling the UUID header due to this known issue.
| // Create UUID from Truly Random or Pseudo-Random Numbers. | ||
| // See: https://tools.ietf.org/html/rfc4122#section-4.4 | ||
| uint8_t rand[16]; | ||
| RAND_bytes(rand, 16); |
There was a problem hiding this comment.
Would it make more sense to use RAND_pseudo_bytes() here? Presumably it's faster and we don't need something cryptographically strong, UUIDs don't need to be unpredictable in the strong crypto sense (the RFC explicitly calls out pseudo random).
There was a problem hiding this comment.
My thinking on this is to keep using real random, but actually do the following:
- Store the random bytes in a
static thread_localvariable along with an iteration counter. - Select 2 bytes that we will sequence through in order.
- Basically fill in 2 bytes with the sequence, meaning we will only need to refresh the random every ~64K UUIDs. (This is arbitrary, feel free to use less bits and make it like 16K or something).
This should further improve performance by several orders of magnitude and for the purposes that Envoy uses random UUIDs for I think will be just fine from a collision standpoint.
There was a problem hiding this comment.
@htuch: RAND_pseduo_bytes() in just an alias to RAND_bytes() in BoringSSL (see: crypto/fipsmodule/rand/rand.c#269).
@mattklein123: I agree with the prefetching idea (but perhaps we could do it in subsequent PR?). However, I'm afraid I don't follow the 2 bytes idea... you need 16 bytes of randomness (sans 6 constant bits) for UUID version 4.
There was a problem hiding this comment.
Fine with me to make the optimizations a separate PR. I think the idea here is that since the original UUID is reasonably random, incrementing it is unlikely to produce a conflict with any other UUID in existence (and none produced locally) until you overflow, at which point you regenerate with fresh entropy.
There was a problem hiding this comment.
Yes that's what I meant re: further optimization. Fine to do in a follow up PR.
There was a problem hiding this comment.
Thanks, I'll do that in a separate PR. I have a funny feeling about 2 vs 16 bytes, so I'll see if the extra gains are worth it.
There was a problem hiding this comment.
I think we might be talking past each other a bit. I'm suggesting allocating ~14 bytes of entropy, and then filling in the last 2 bytes with an incrementing integer. (or some other combination lengths). Anyway we can discuss in a follow up.
| // Create UUID from Truly Random or Pseudo-Random Numbers. | ||
| // See: https://tools.ietf.org/html/rfc4122#section-4.4 | ||
| uint8_t rand[16]; | ||
| RAND_bytes(rand, 16); |
There was a problem hiding this comment.
Can you check the return code from this as well?
There was a problem hiding this comment.
RAND_bytes() in BoringSSL always returns 1 (see: crypto/fipsmodule/rand/rand.c#265), so there is no point in checking that.
There was a problem hiding this comment.
I'd suggest an ASSERT or RELEASE_ASSERT on the value, since it feels like we're relying on implementation detail that might conceivably change. Up to you which one.
There was a problem hiding this comment.
The only reason why RAND_bytes() returns anything and isn't declared as void is for backward compatibility with OpenSSL. If you look at other functions, i.e. new void RAND_bytes_with_additional_data() which RAND_bytes() calls internally (crypto/fipsmodule/rand/rand.c#264), it doesn't return anything, so adding check here will just waste CPU cycles, and it won't be optimized out by the compiler, because we don't do LTO.
I'd agree with you if we linked against system's OpenSSL library, but we're statically linking against known BoringSSL version, so this won't be checking anything that can fail.
Having said that, I can add ASSERT() if you feel strongly about it.
There was a problem hiding this comment.
Can we use these new interfaces instead?
There was a problem hiding this comment.
RAND_bytes_with_additional_data() is not exported, and RAND_bytes() is the only non-deprecated interface available.
There was a problem hiding this comment.
Done (per offline conversation).
|
|
||
| uuid.push_back('-'); | ||
|
|
||
| for (unsigned i = 8; i < 10; i++) { |
There was a problem hiding this comment.
Nit: prefer concrete fixed size type, e.g. uint32_t.
| uuid.push_back('-'); | ||
|
|
||
| for (unsigned i = 8; i < 10; i++) { | ||
| uint8_t d = rand[i]; |
mattklein123
left a comment
There was a problem hiding this comment.
This is awesome, thanks for doing this. Some comments. Also, this will fix #249.
| // See: https://tools.ietf.org/html/rfc4122#section-4.4 | ||
| uint8_t rand[16]; | ||
| RAND_bytes(rand, 16); | ||
| rand[6] = (rand[6] & 0x0f) | 0x40; |
There was a problem hiding this comment.
Can you add a comment here that this is a version 4,variant 1 UUID. Also, can you add tests for these constants in the distribution test or another test?
| EXPECT_EQ(7295, result); | ||
| } | ||
|
|
||
| TEST(UUIDUtilsTest, benchmark) { |
There was a problem hiding this comment.
IMO I would make this disabled by default.
There was a problem hiding this comment.
But if it's disabled, then it's never run...
There was a problem hiding this comment.
My point mainly was that this "test" doesn't actually test anything. There are no assertions, etc. It will just burn CPU and always pass. (And from a perf perspective it will always return a different result depending on where it is run). So, typically for this type of thing I would make it a DISABLED test so that it can be run for local testing, but doesn't waste time in general.
| // Create UUID from Truly Random or Pseudo-Random Numbers. | ||
| // See: https://tools.ietf.org/html/rfc4122#section-4.4 | ||
| uint8_t rand[16]; | ||
| RAND_bytes(rand, 16); |
There was a problem hiding this comment.
My thinking on this is to keep using real random, but actually do the following:
- Store the random bytes in a
static thread_localvariable along with an iteration counter. - Select 2 bytes that we will sequence through in order.
- Basically fill in 2 bytes with the sequence, meaning we will only need to refresh the random every ~64K UUIDs. (This is arbitrary, feel free to use less bits and make it like 16K or something).
This should further improve performance by several orders of magnitude and for the purposes that Envoy uses random UUIDs for I think will be just fine from a collision standpoint.
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
This way benchmark uses already initialized PRNG. Signed-off-by: Piotr Sikora <piotrsikora@google.com>
|
(Also, if you merge with master, you should be able to pickup the |
|
LGTM other than @htuch comments and my comment about disabling the benchmark test. |
RomanDzhabarov
left a comment
There was a problem hiding this comment.
LGTM, nice improvement!
| for (uint8_t i = 0; i < 4; i++) { | ||
| const uint8_t d = rand[i]; | ||
| uuid.push_back(hex[d >> 4]); | ||
| uuid.push_back(hex[d & 0xf]); |
There was a problem hiding this comment.
nit: d & 0x0f, this might be slightly more clear (clear 4 most significant bits and keep last 4 bits in a byte).
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
|
Bummer, the commit message was removed during merge. Could we please not do that in the future (or more precisely, could the person merging PR make sure that the commit message is retained)? Thanks. |
|
@RomanDzhabarov please see the nodes on cleaning up the commit message in https://github.com/lyft/envoy/blob/master/CONTRIBUTING.md. The goal is to clean it up but not delete it. Thanks. |
|
@PiotrSikora yeah, sorry that's my bad. |
This makes the test test/java/integration:android_engine_start_test to pass when running on Linux Signed-off-by: Charles Le Borgne <cleborgne@google.com> Risk Level: none Testing: N/A Docs Changes: N/A Release Notes: N/A Signed-off-by: JP Simard <jp@jpsim.com>
This makes the test test/java/integration:android_engine_start_test to pass when running on Linux Signed-off-by: Charles Le Borgne <cleborgne@google.com> Risk Level: none Testing: N/A Docs Changes: N/A Release Notes: N/A Signed-off-by: JP Simard <jp@jpsim.com>
Some compiler warnings are caused by this lack of information to compiler. Refers to envoyproxy#1401
This fixes couple gcc warnings (array bounds) when accessing page map radix tree. Refers to envoyproxy#1401
In heap checker we actually use test_str address after deletion, to verify at runtime that heap checker is indeed tracking deletions. But gcc barks. So lets hide allocation from it by calling tc_newarray/tc_deletearray. Refers to envoyproxy#1401
Our tests do explicitly trigger certain "bad" cases. And compilers are getting increasingly smarter, so they start giving us warnings. People dislike warnings, so lets try to disable them specifically for unit tests. Refers to issue envoyproxy#1401
This change yield ~5x performance improvement for UUID generation,
which results in ~30% performance improvement when proxying HTTP/2
traffic on localhost:
Before:
UUIDUtilsTest.benchmark: 2003 ms
h2load -c 100 -n 100000: 26467 req/s
After:
UUIDUtilsTest.benchmark: 408 ms
h2load -c 100 -n 100000: 34099 req/s
As a bonus, this implementation is fully portable and depends only
on BoringSSL's crypto.
Partially fixes #249.