perf: optimize random number generation and populate all 64 bits. #1812
htuch merged 3 commits into envoyproxy:master
Conversation
Before: Random.benchmarkRandom: 55,713 ms
After: Random.benchmarkRandom: 800 ms
Fixes envoyproxy#1798.
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
@mattklein123 @htuch @akonradi this supersedes #1799.
Cool, works for me. |
// Consume sizeof(uint64) bytes from the buffer.
ASSERT(buffered_idx + sizeof(uint64_t) <= sizeof(buffered));
uint64_t rand;
std::memcpy(reinterpret_cast<void*>(&rand), &buffered[buffered_idx], sizeof(uint64_t));
Nit: is this really more readable or faster than a simple assignment?
This is the correct way to do it, iirc, doing otherwise is flirting with undefined behavior. The compiler will optimize out the memcpy.
If you make buffered an array of uint64_t, there is absolutely no undefined behavior.
I agree with @htuch. Why can't we totally avoid the copy, just have a single integer to track the array consumption, and return buffered[idx]? Is there some compiler issue at play here? And if so (another uninitiated question), is it bad to have a global, dynamically allocated array of these integer indices and arrays, one per thread, and use the thread-specific slot? Cache misses won't be bad as long as the thread stays on the same core.
@rshriram a global array of indices could be less efficient since the indices for different threads would be on the same cache line (at least on modern x86) so the cache line would bounce between cores unnecessarily.
Yes, please just keep it simple. Use a thread-local array of uint64_t, which will be properly aligned, then just return the current entry and advance without a local copy.
Ah shoot yes.. the index array would end up causing lots of false sharing. But the question of 2k vs 4k vs 8k byte array size remains..
Duh! Array of uint64_t is indeed a better choice. Fixed.
buffered_idx += sizeof(uint64_t);
return rand;
}
Can we update uuid to use this function (i.e. factor out the RAND_bytes pool)?
+1. While we are here, it's possible there is a slight perf bump if you used a struct for both thread-local variables. I'm not sure if the compiler optimizes this or goes through the thread-local register multiple times. Not a big deal.
This is moot now, because of the migration to array of uint64_t.
Isn't 16 bytes just 2 uint64_t via random()?
Yes, it is, but refactoring uuid() to use a random() that returns uint64_t, and stitching the source of randomness together from 2x uint64_t, feels more like a hack than a clean interface.
Originally, I considered extracting the common code and the shared pool of prefetched randomness into a random_bytes() that returns an array of random bytes (see 1d76e26c), but that's moot with the change to an array of uint64_t in random().
There is also a ~5% performance hit in both cases, due to the extra copies.
FWIW, I'm fine with a small performance loss in exchange for a cleaner interface, but the need to stitch the source of randomness from 2x uint64_t, in addition to the small performance loss, feels like a bad trade-off.
I'm fairly surprised by the performance loss, since the compiler should be able to turn these into register level transfers, but sure, let's just go with what is there now then.
static thread_local uint8_t buffered[2048];
static thread_local size_t buffered_idx = sizeof(buffered);

// Prefetch 2048 bytes of randomness. buffered_idx is initialized to sizeof(buffered),
question: can we prefetch more? Any reason for choosing 2k vs an entire page size (4kb) ?
In other words, does the performance improve if we prefetch more? In which case, it might make sense to have this configurable [though it might be overkill] or cache a much bigger chunk (8kb).
I think 2KiB per thread is probably fine but feel free to experiment. Let's definitely not make it configurable right now.
@rshriram I used 2KB based on my experiments with UUID, where there were diminishing returns with using more memory.
As for this PR, generating 1,000,000,000 uint64_t numbers takes:
prefetch   | time | improvement
(uint64_t) | (ms) | (%)
-----------|------|--------------------------------
       128 | 9653 |
       256 | 6930 | 28% faster <-- this PR right now
       512 | 5571 | 19% faster
      1024 | 4888 | 12% faster
      2048 | 4594 |  6% faster
      4096 | 4424 |  4% faster
      8192 | 4386 |  1% faster
So yes, prefetching more bytes would make this even faster, but keep in mind that this is a microbenchmark and the effect on requests/s would be negligible (...and this is already 90x faster than what we have now).
Sweet! Would you mind sticking this into a comment so that a future Shriram doesn’t come around trying to optimize this?
results.insert(random.random());
}

EXPECT_EQ(num_of_results, results.size());
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Would you mind sticking the diminishing perf table into a comment? Above that function.
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
#1812 fixed the perf issues with random number generation. We no longer have to rely on monotonically increasing stream IDs for perf reasons. Switch to always using randomly generated stream IDs. Signed-off-by: Shriram Rajagopalan <shriram@us.ibm.com>