perf: optimize random number generation and populate all 64 bits. #1812
htuch merged 3 commits into envoyproxy:master
Conversation
Before: Random.benchmarkRandom: 55,713 ms
After: Random.benchmarkRandom: 800 ms
Fixes envoyproxy#1798.
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
@mattklein123 @htuch @akonradi this supersedes #1799.
Cool, works for me. |
// Consume sizeof(uint64) bytes from the buffer.
ASSERT(buffered_idx + sizeof(uint64_t) <= sizeof(buffered));
uint64_t rand;
std::memcpy(reinterpret_cast<void*>(&rand), &buffered[buffered_idx], sizeof(uint64_t));
Nit: is this really more readable or faster than a simple assignment?
This is the correct way to do it, iirc, doing otherwise is flirting with undefined behavior. The compiler will optimize out the memcpy.
If you make buffered an array of uint64_t, there is absolutely no undefined behavior.
I agree with @htuch. Why can't we totally avoid the copy, just have a single integer to track the array consumption, and return buffered[idx]? Is there some compiler issue at play here? And if so (another uninitiated question), is it bad to have a global, dynamically allocated array of these integer indices and arrays, one per thread, and use the thread-specific slot? Cache misses won't be bad as long as the thread stays on the same core.
@rshriram a global array of indices could be less efficient since the indices for different threads would be on the same cache line (at least on modern x86) so the cache line would bounce between cores unnecessarily.
Yes, please just keep it simple. Use a thread-local array of uint64_t, which will be properly aligned, then just return the current entry and advance without a local copy.
Ah shoot yes.. the index array would end up causing lots of false sharing. But the question of 2k vs 4k vs 8k byte array size remains..
Duh! Array of uint64_t is indeed a better choice. Fixed.
buffered_idx += sizeof(uint64_t);
return rand;
}
Can we update uuid to use this function (i.e. factor out the RAND_bytes pool)?
+1. While we are here, it's possible there is a slight perf bump if you used a struct for both thread-local variables. I'm not sure if the compiler optimizes this or goes through the thread-local register multiple times. Not a big deal.
This is moot now, because of the migration to array of uint64_t.
Isn't 16 bytes just 2 uint64_t via random()?
Yes, it is, but refactoring uuid() to use a random() that returns uint64_t, and stitching the source of randomness together from 2x uint64_t, feels more like a hack than a clean interface.
Originally, I considered extracting the common code and the shared pool of prefetched randomness into a random_bytes() that returns an array of random bytes (see 1d76e26c), but that's moot with the change to an array of uint64_t in random().
There is also a ~5% performance hit in both cases, due to the extra copies.
FWIW, I'm fine with a small performance loss in exchange for a cleaner interface, but the need to stitch the source of randomness from 2x uint64_t, in addition to the small performance loss, feels like a bad trade-off.
I'm fairly surprised by the performance loss, since the compiler should be able to turn these into register level transfers, but sure, let's just go with what is there now then.
static thread_local uint8_t buffered[2048];
static thread_local size_t buffered_idx = sizeof(buffered);

// Prefetch 2048 bytes of randomness. buffered_idx is initialized to sizeof(buffered),
question: can we prefetch more? Any reason for choosing 2k vs an entire page size (4kb) ?
In other words, does the performance improve if we prefetch more? In which case, it might make sense to have this configurable [though it might be overkill] or cache a much bigger chunk (8kb).
I think 2KiB per thread is probably fine but feel free to experiment. Let's definitely not make it configurable right now.
@rshriram I used 2KB based on my experiments with UUID, where there were diminishing returns with using more memory.
As for this PR, generating 1,000,000,000 uint64_t numbers takes:
prefetch   | time | improvement
(uint64_t) | (ms) | (%)
-----------|------|--------------------------------
       128 | 9653 |
       256 | 6930 | 28% faster <-- this PR right now
       512 | 5571 | 19% faster
      1024 | 4888 | 12% faster
      2048 | 4594 |  6% faster
      4096 | 4424 |  4% faster
      8192 | 4386 |  1% faster
So yes, prefetching more bytes would make this even faster, but keep in mind that this is a microbenchmark and the effect on requests/s would be negligible (...and this is already 90x faster than what we have now).
Sweet! Would you mind sticking this into a comment so that a future Shriram doesn’t come around trying to optimize this?
results.insert(random.random());
}

EXPECT_EQ(num_of_results, results.size());
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Would you mind sticking the diminishing perf table into a comment? Above that function.
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
#1812 fixed the perf issues with random number generation. We no longer have to rely on monotonically increasing stream IDs for perf reasons. Switch to always using randomly generated stream IDs. Signed-off-by: Shriram Rajagopalan <shriram@us.ibm.com>