
perf: optimize random number generation and populate all 64 bits.#1812

Merged
htuch merged 3 commits into envoyproxy:master from PiotrSikora:random
Oct 9, 2017

Conversation

@PiotrSikora
Contributor

Before:
Random.benchmarkRandom: 55,713 ms

After:
Random.benchmarkRandom: 800 ms

Fixes #1798.


Signed-off-by: Piotr Sikora <piotrsikora@google.com>
@PiotrSikora
Contributor Author

@mattklein123 @htuch @akonradi this supersedes #1799.

@akonradi
Contributor

akonradi commented Oct 5, 2017

Cool, works for me.

// Consume sizeof(uint64_t) bytes from the buffer.
ASSERT(buffered_idx + sizeof(uint64_t) <= sizeof(buffered));
uint64_t rand;
std::memcpy(reinterpret_cast<void*>(&rand), &buffered[buffered_idx], sizeof(uint64_t));
Member

Nit: is this really more readable or faster than a simple assignment?

Contributor

This is the correct way to do it, iirc, doing otherwise is flirting with undefined behavior. The compiler will optimize out the memcpy.
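For future readers, a minimal sketch of the memcpy idiom under discussion (names here are illustrative, not Envoy's): unlike casting the byte pointer and dereferencing it, copying into a local via `std::memcpy` never violates strict aliasing, and optimizing compilers lower it to a single 8-byte load.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret 8 raw bytes as a uint64_t without undefined behavior:
// memcpy is the standards-blessed form of type punning, and compilers
// typically compile it down to one 8-byte load (no actual copy).
uint64_t load_u64(const uint8_t* bytes) {
  uint64_t value;
  std::memcpy(&value, bytes, sizeof(value));
  return value;
}
```

A round trip (store a uint64_t into a byte buffer, then load it back) recovers the original value regardless of endianness, since both directions use the same byte order.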

Member

If you make buffered an array of uint64_t, there is absolutely no undefined behavior.

Member

I agree with @htuch. Why can't we avoid the copy entirely, keep a single integer to track array consumption, and just return buffered[idx]? Is there some compiler issue at play here? And if so (another uninitiated question), would it be bad to have a global, dynamically allocated array of these indices and buffers, one per thread, and use the thread-specific slot? Cache misses won't be bad as long as the thread stays on the same core.

Contributor

@rshriram a global array of indices could be less efficient since the indices for different threads would be on the same cache line (at least on modern x86) so the cache line would bounce between cores unnecessarily.
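A sketch of the cache-line concern (illustrative, not Envoy code): on modern x86 a cache line is 64 bytes, so eight packed 8-byte per-thread counters share one line, and writes from different cores bounce that line between caches.

```cpp
#include <cstdint>

// Eight of these fit on a single 64-byte cache line, so per-thread
// writes to adjacent slots cause false sharing.
struct PackedSlot {
  uint64_t idx;
};

// Padding each slot to a full cache line (or, as this PR does, using
// thread_local storage) gives every thread its own line.
struct alignas(64) PaddedSlot {
  uint64_t idx;
};
```

The `alignas(64)` both aligns each slot to a line boundary and pads `sizeof(PaddedSlot)` up to 64 bytes, so no two slots can share a line.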

Member

Yes, please just keep it simple. Use a thread-local array of uint64_t, which will be properly aligned, then just return the current entry and advance without a local copy.

Member

Ah shoot, yes: the index array would end up causing lots of false sharing. But the question of 2 KB vs 4 KB vs 8 KB array size remains.

Contributor Author

Duh! Array of uint64_t is indeed a better choice. Fixed.
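For readers following along, a self-contained sketch of the fixed shape (names are illustrative; the real code refills from OpenSSL's RAND_bytes, and std::mt19937_64 stands in here only so the example runs standalone):

```cpp
#include <cstddef>
#include <cstdint>
#include <random>

namespace {
constexpr size_t kBufferedWords = 256; // 2 KiB of randomness per thread.

// Stand-in entropy source; the PR fills the buffer with RAND_bytes().
void fillEntropy(uint64_t* buf, size_t count) {
  static thread_local std::mt19937_64 engine{std::random_device{}()};
  for (size_t i = 0; i < count; i++) {
    buf[i] = engine();
  }
}
} // namespace

// Buffering uint64_t instead of uint8_t: entries are naturally aligned,
// each call returns the current word directly, and no memcpy is needed.
uint64_t randomU64() {
  static thread_local uint64_t buffered[kBufferedWords];
  static thread_local size_t buffered_idx = kBufferedWords; // Empty; refill on first call.
  if (buffered_idx >= kBufferedWords) {
    fillEntropy(buffered, kBufferedWords);
    buffered_idx = 0;
  }
  return buffered[buffered_idx++];
}
```

Because the buffer is an array of uint64_t, indexing it is a plain aligned load; the strict-aliasing question from the byte-buffer version disappears.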

buffered_idx += sizeof(uint64_t);

return rand;
}
Member

Can we update uuid to use this function (i.e. factor out the RAND_bytes pool)?

Member

+1. While we are here, it's possible there is a slight perf bump if you used a struct for both thread-local variables; I'm not sure if the compiler optimizes this or goes through the thread-local register multiple times. Not a big deal.

Contributor Author

This is moot now, because of the migration to array of uint64_t.

Member

Isn't 16 bytes just 2 uint64_t via random()?

Contributor Author

Yes, it is, but refactoring uuid() to use a random() that returns uint64_t, and stitching the source of randomness together from two uint64_t values, feels more like a hack than a clean interface.

Originally, I considered extracting the common code and the shared pool of prefetched randomness into a random_bytes() that returns an array of random bytes (see 1d76e26c), but that's moot with the change to an array of uint64_t in random().

There is also a ~5% performance hit in both cases, due to the extra copies.

FWIW, I'm fine with a small performance loss in exchange for a cleaner interface, but the need to stitch the source of randomness from two uint64_t values, on top of the small performance loss, feels like a bad trade-off.

Member

I'm fairly surprised by the performance loss, since the compiler should be able to turn these into register level transfers, but sure, let's just go with what is there now then.

static thread_local uint8_t buffered[2048];
static thread_local size_t buffered_idx = sizeof(buffered);

// Prefetch 2048 bytes of randomness. buffered_idx is initialized to sizeof(buffered),
Member

Question: can we prefetch more? Any reason for choosing 2 KB vs an entire page (4 KB)? In other words, does performance improve if we prefetch more? If so, it might make sense to make this configurable [though that might be overkill], or to cache a much bigger chunk (8 KB).

Member

I think 2KiB per thread is probably fine but feel free to experiment. Let's definitely not make it configurable right now.

Contributor Author

PiotrSikora commented Oct 6, 2017

@rshriram I used 2KB based on my experiments with UUID, where there were diminishing returns with using more memory.

As for this PR, generating 1,000,000,000 uint64_t numbers takes:

 prefetch  | time | improvement
(uint64_t) | (ms) | (vs. previous row)
-----------+------+--------------------
       128 | 9653 |
       256 | 6930 | 28% faster  <-- this PR right now
       512 | 5571 | 19% faster
      1024 | 4888 | 12% faster
      2048 | 4594 |  6% faster
      4096 | 4424 |  4% faster
      8192 | 4386 |  1% faster

So yes, prefetching more bytes would make this even faster, but keep in mind that this is a microbenchmark, and the effect on requests/s would be negligible (...and this is already 90x faster than what we have now).

Member

Sweet! Would you mind sticking this into a comment so that a future Shriram doesn’t come around trying to optimize this?

Contributor Author

Done.

results.insert(random.random());
}

EXPECT_EQ(num_of_results, results.size());
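The snippet above checks that N draws produce N distinct values. A standalone sketch of the same pattern (names are illustrative; the actual test uses gtest's EXPECT_EQ and Envoy's RandomGeneratorImpl):

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Draw n values from a generator and count how many are distinct.
// A healthy 64-bit generator should yield n distinct values for any
// practical n, since collision probability is roughly n^2 / 2^65.
template <typename Rng>
size_t countDistinct(Rng& rng, size_t n) {
  std::set<uint64_t> results;
  for (size_t i = 0; i < n; i++) {
    results.insert(rng());
  }
  return results.size();
}
```

This catches the original bug (only 32 of 64 bits populated would not fail this check, but a stuck or repeating generator would collapse the set).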
Member

Nice! This is cool

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
@rshriram
Member

rshriram commented Oct 6, 2017

Would you mind sticking the diminishing perf table into a comment? Above that function.

Member

mattklein123 left a comment

This LGTM pending action on any remaining comments by @htuch / @rshriram. (I'm ambivalent on the code duplication in this case).

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
@htuch htuch merged commit 8ddc010 into envoyproxy:master Oct 9, 2017
htuch pushed a commit that referenced this pull request Oct 17, 2017
#1812 fixed the perf issues with random number generation. We no longer have to rely on monotonically increasing stream IDs for perf reasons. Switch to always using randomly generated stream IDs.

Signed-off-by: Shriram Rajagopalan <shriram@us.ibm.com>
rshriram pushed a commit to rshriram/envoy that referenced this pull request Oct 30, 2018
* Improve debug logging for authn filter

* Lint
jpsim pushed a commit that referenced this pull request Nov 28, 2022
Description: Leverages Envoy's AddrAwareSocketOptionImpl to supply v6 socket options where needed.
Risk Level: Low
Testing: CI

Signed-off-by: Mike Schore <mike.schore@gmail.com>
Signed-off-by: JP Simard <jp@jpsim.com>
mathetake pushed a commit that referenced this pull request Mar 3, 2026
…ranslation (#1812)

**Description**

Discovered a bug when sending a `streaming` request to AWS Bedrock with a body mutation set with `serviceTier:default` at the route level. The request translation for the AWS Bedrock request is incorrect due to the bug, resulting in an error from AWS Bedrock.

Malformed message structure for AWS Bedrock
```
{
  ...
  "messages" : [ 
    {
      "role" : "user",
      "content" : "Explain what AWS Bedrock is in one sentence."
    } 
  ],
  "serviceTier" : {
    "type" : "default"
  }
  ....
}
```

AWS Bedrock expects message content structure [1]
```
{
 ...
  "messages" : [ {
    "role" : "user",
    "content" : [ {
      "text" : "Explain what AWS Bedrock is in one sentence."
    } ],
  } ],
  "serviceTier" : {
    "type" : "default"
  }
}
```

Error response due to malformed request body for AWS Bedrock
```
'Response code: 400\n'
{'Date': 'Sat, 24 Jan 2026 20:59:22 GMT', 'Content-Type': 'application/json', 'Content-Length': '35', 'x-amzn-RequestId': '1140c7eb-bedd-42f7-aeed-657bd6fdfc46', 'x-amzn-ErrorType': 'SerializationException:http://internal.amazon.com/coral/com.amazon.coral.service/', 'Cache-Control': 'proxy-revalidate', 'Connection': 'keep-alive'}

{"Message":"Unexpected field type"}
```

**Root Cause**

After the OpenAI -> Bedrock request format translation is done, the current logic tells BodyMutator to mutate the original request instead of the translated one, with the forceBodyMutation flag set to true for the initial request; this causes the undesired behavior.

**My Debugging and Analysis**:

In `endpointspec.go`, the ParseBody method returns a non-nil mutatedBody when it's a streaming request and cost is configured, which results in a non-nil `mutatedBody` at [2].

In processor_impl.go, the ProcessRequestBody method sets `forceBodyMutation` to true for the initial request at [3]; this later causes a problem in
```
if onRetry && b.originalBody != nil {
	// On retry, restore the original body first
	requestBody = b.originalBody
}
```
which resets the translated body (supposed to be in Bedrock format) back to the original OpenAI format.

**My proposed solution**:
Remove request body restoration in body mutation.

**Disclaimer**:
Debug analysis and the code change in processor_impl.go were done by me; test cases were added with Claude Code and tweaked manually.

**Related Issues/PRs (if applicable)**
Related PR: #1600

1:
https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_ConverseStream.html
2:
https://github.com/envoyproxy/ai-gateway/blob/main/internal/endpointspec/endpointspec.go#L90
3:
https://github.com/envoyproxy/ai-gateway/blob/main/internal/extproc/processor_impl.go#L174-L176

---------

Signed-off-by: Xiaolin Lin <xlin158@bloomberg.net>
Signed-off-by: achoo30 <achoo30@bloomberg.net>
Co-authored-by: Aaron Choo <achoo30@bloomberg.net>
Co-authored-by: Dan Sun <dsun20@bloomberg.net>


5 participants