[serve] Fix buffered logging reusing request context (Fixes #55851)#56094
[serve] Fix buffered logging reusing request context (Fixes #55851)#56094zcin merged 9 commits intoray-project:masterfrom
Conversation
…ct#55851) Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
There was a problem hiding this comment.
Code Review
This pull request effectively addresses the issue of buffered logs reusing request contexts by capturing the context at emit time. The approach of wrapping logger methods for buffered logs is sound, and making ServeContextFilter idempotent is a necessary change. My feedback includes a couple of suggestions to improve the robustness of the new wrap_logger_for_buffering function to prevent potential side effects and make its signature more explicit.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Vaishnavi Panchavati <38342947+vaishdho1@users.noreply.github.com>
|
Thanks for the analysis on this. When i compare:
I can conclude that it's better to remove MemoryHandler than to apply this fix. The reason why MemoryHandler was added was to improve performance. Can you think of a more optimal solution, if not I suggest we drop the memory handle. @akyang-anyscale, any other ideas? |
|
I tried running these tests 4-5 times more. I found the following stats.
For different runs the latency slightly differs so these were the best figures after running it 4-5 times.
I also have a couple of questions here
|
|
I have analyzed adding I have removed adding the filter to the file handler I feel this method is more robust since we are not explicitly adding wrappers around specific logging levels. The latency comparison is shown below: |
Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
|
Fixed the code with the changes and benchmarked the results here. |
|
Any explanation for why Let's add a test in |
|
The performance of buffer+original and buffer+fix is almost similar. Sometimes one performs better than the other but they are very close. I don't see a concrete reason behind the difference because they are effectively doing the same thing but in a different order. For the test case, I am thinking of implementing a small deployment with logging and sending requests(>buffer_Size) with buffering enabled. I will then count the occurrences of replica_ids in the logs(system and application) generated? |
Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
|
Added a test which checks reuse of request ids for buffering case |
Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
|
Added (request_id, message) pairs for counting. This takes care of uniqueness as there are only three unique messages which are different , if request ids repeat this will catch any duplicates. |
added `wait_for_condition` before checking logs Co-authored-by: Abrar Sheikh <abrar2002as@gmail.com> Signed-off-by: Vaishnavi Panchavati <38342947+vaishdho1@users.noreply.github.com>
|
|
||
| logs_dir = get_serve_logs_dir() | ||
|
|
||
| def check_logs(): |
There was a problem hiding this comment.
Yes, re committing this
Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
abrarsheikh
left a comment
There was a problem hiding this comment.
thank you. You just fixed a high priority bug for us.
…ct#55851) (ray-project#56094) ## Why are these changes needed? Currently, when Serve file logs are buffered via a `MemoryHandler`, `ServeContextFilter` fetches the serve request context at flush time instead of when the log record is emitted. As a result, many log records flushed together can share the same request context, breaking per request tracing. This PR captures the request context at emit time when buffering is enabled and makes the filter idempotent so it won’t overwrite pre populated fields. This preserves correct per record context for buffered file logs without changing non buffered behavior. <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number Closes ray-project#55851 ## Performance Testing Manual Verification - Benchmarked both buffered and non buffered cases with and without the fix. Performance- Used Locust with 100 users for a duration of 3-4 mins Without buffering: With fix: `Avg: 396.69(ms), P99: 580(ms), RPS: 228.4` Without fix: `391.29(ms), P99: 560(ms), RPS: 239` With buffering: set `RAY_SERVE_REQUEST_PATH_LOG_BUFFER_SIZE` = 1000 With fix: `Avg(ms): 400.83, P99(ms): 620, RPS: 230.5` Without fix: `Avg(ms): 373.25, P99(ms): 610, RPS: 249.4` <!-- For example: "Closes ray-project#1234" --> ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [x] I've included any doc changes needed for https://docs.ray.io/en/master/. - [x] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com> Signed-off-by: Vaishnavi Panchavati <38342947+vaishdho1@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Abrar Sheikh <abrar2002as@gmail.com> Signed-off-by: sampan <sampan@anyscale.com>
…ct#55851) (ray-project#56094) ## Why are these changes needed? Currently, when Serve file logs are buffered via a `MemoryHandler`, `ServeContextFilter` fetches the serve request context at flush time instead of when the log record is emitted. As a result, many log records flushed together can share the same request context, breaking per request tracing. This PR captures the request context at emit time when buffering is enabled and makes the filter idempotent so it won’t overwrite pre populated fields. This preserves correct per record context for buffered file logs without changing non buffered behavior. <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number Closes ray-project#55851 ## Performance Testing Manual Verification - Benchmarked both buffered and non buffered cases with and without the fix. Performance- Used Locust with 100 users for a duration of 3-4 mins Without buffering: With fix: `Avg: 396.69(ms), P99: 580(ms), RPS: 228.4` Without fix: `391.29(ms), P99: 560(ms), RPS: 239` With buffering: set `RAY_SERVE_REQUEST_PATH_LOG_BUFFER_SIZE` = 1000 With fix: `Avg(ms): 400.83, P99(ms): 620, RPS: 230.5` Without fix: `Avg(ms): 373.25, P99(ms): 610, RPS: 249.4` <!-- For example: "Closes ray-project#1234" --> ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [x] I've included any doc changes needed for https://docs.ray.io/en/master/. - [x] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com> Signed-off-by: Vaishnavi Panchavati <38342947+vaishdho1@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Abrar Sheikh <abrar2002as@gmail.com> Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
…ct#55851) (ray-project#56094) ## Why are these changes needed? Currently, when Serve file logs are buffered via a `MemoryHandler`, `ServeContextFilter` fetches the serve request context at flush time instead of when the log record is emitted. As a result, many log records flushed together can share the same request context, breaking per request tracing. This PR captures the request context at emit time when buffering is enabled and makes the filter idempotent so it won’t overwrite pre populated fields. This preserves correct per record context for buffered file logs without changing non buffered behavior. <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number Closes ray-project#55851 ## Performance Testing Manual Verification - Benchmarked both buffered and non buffered cases with and without the fix. Performance- Used Locust with 100 users for a duration of 3-4 mins Without buffering: With fix: `Avg: 396.69(ms), P99: 580(ms), RPS: 228.4` Without fix: `391.29(ms), P99: 560(ms), RPS: 239` With buffering: set `RAY_SERVE_REQUEST_PATH_LOG_BUFFER_SIZE` = 1000 With fix: `Avg(ms): 400.83, P99(ms): 620, RPS: 230.5` Without fix: `Avg(ms): 373.25, P99(ms): 610, RPS: 249.4` <!-- For example: "Closes ray-project#1234" --> ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [x] I've included any doc changes needed for https://docs.ray.io/en/master/. - [x] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com> Signed-off-by: Vaishnavi Panchavati <38342947+vaishdho1@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Abrar Sheikh <abrar2002as@gmail.com> Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
…ct#55851) (ray-project#56094) ## Why are these changes needed? Currently, when Serve file logs are buffered via a `MemoryHandler`, `ServeContextFilter` fetches the serve request context at flush time instead of when the log record is emitted. As a result, many log records flushed together can share the same request context, breaking per request tracing. This PR captures the request context at emit time when buffering is enabled and makes the filter idempotent so it won’t overwrite pre populated fields. This preserves correct per record context for buffered file logs without changing non buffered behavior. <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number Closes ray-project#55851 ## Performance Testing Manual Verification - Benchmarked both buffered and non buffered cases with and without the fix. Performance- Used Locust with 100 users for a duration of 3-4 mins Without buffering: With fix: `Avg: 396.69(ms), P99: 580(ms), RPS: 228.4` Without fix: `391.29(ms), P99: 560(ms), RPS: 239` With buffering: set `RAY_SERVE_REQUEST_PATH_LOG_BUFFER_SIZE` = 1000 With fix: `Avg(ms): 400.83, P99(ms): 620, RPS: 230.5` Without fix: `Avg(ms): 373.25, P99(ms): 610, RPS: 249.4` <!-- For example: "Closes ray-project#1234" --> ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [x] I've included any doc changes needed for https://docs.ray.io/en/master/. - [x] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com> Signed-off-by: Vaishnavi Panchavati <38342947+vaishdho1@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Abrar Sheikh <abrar2002as@gmail.com> Signed-off-by: zac <zac@anyscale.com>
…56094) ## Why are these changes needed? Currently, when Serve file logs are buffered via a `MemoryHandler`, `ServeContextFilter` fetches the serve request context at flush time instead of when the log record is emitted. As a result, many log records flushed together can share the same request context, breaking per request tracing. This PR captures the request context at emit time when buffering is enabled and makes the filter idempotent so it won’t overwrite pre populated fields. This preserves correct per record context for buffered file logs without changing non buffered behavior. <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number Closes #55851 ## Performance Testing Manual Verification - Benchmarked both buffered and non buffered cases with and without the fix. Performance- Used Locust with 100 users for a duration of 3-4 mins Without buffering: With fix: `Avg: 396.69(ms), P99: 580(ms), RPS: 228.4` Without fix: `391.29(ms), P99: 560(ms), RPS: 239` With buffering: set `RAY_SERVE_REQUEST_PATH_LOG_BUFFER_SIZE` = 1000 With fix: `Avg(ms): 400.83, P99(ms): 620, RPS: 230.5` Without fix: `Avg(ms): 373.25, P99(ms): 610, RPS: 249.4` <!-- For example: "Closes #1234" --> ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [x] I've included any doc changes needed for https://docs.ray.io/en/master/. - [x] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com> Signed-off-by: Vaishnavi Panchavati <38342947+vaishdho1@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Abrar Sheikh <abrar2002as@gmail.com> Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
…ct#55851) (ray-project#56094) ## Why are these changes needed? Currently, when Serve file logs are buffered via a `MemoryHandler`, `ServeContextFilter` fetches the serve request context at flush time instead of when the log record is emitted. As a result, many log records flushed together can share the same request context, breaking per request tracing. This PR captures the request context at emit time when buffering is enabled and makes the filter idempotent so it won’t overwrite pre populated fields. This preserves correct per record context for buffered file logs without changing non buffered behavior. <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number Closes ray-project#55851 ## Performance Testing Manual Verification - Benchmarked both buffered and non buffered cases with and without the fix. Performance- Used Locust with 100 users for a duration of 3-4 mins Without buffering: With fix: `Avg: 396.69(ms), P99: 580(ms), RPS: 228.4` Without fix: `391.29(ms), P99: 560(ms), RPS: 239` With buffering: set `RAY_SERVE_REQUEST_PATH_LOG_BUFFER_SIZE` = 1000 With fix: `Avg(ms): 400.83, P99(ms): 620, RPS: 230.5` Without fix: `Avg(ms): 373.25, P99(ms): 610, RPS: 249.4` <!-- For example: "Closes ray-project#1234" --> ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [x] I've included any doc changes needed for https://docs.ray.io/en/master/. - [x] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com> Signed-off-by: Vaishnavi Panchavati <38342947+vaishdho1@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Abrar Sheikh <abrar2002as@gmail.com>
…ct#55851) (ray-project#56094) ## Why are these changes needed? Currently, when Serve file logs are buffered via a `MemoryHandler`, `ServeContextFilter` fetches the serve request context at flush time instead of when the log record is emitted. As a result, many log records flushed together can share the same request context, breaking per request tracing. This PR captures the request context at emit time when buffering is enabled and makes the filter idempotent so it won’t overwrite pre populated fields. This preserves correct per record context for buffered file logs without changing non buffered behavior. <!-- Please give a short summary of the change and the problem this solves. --> ## Related issue number Closes ray-project#55851 ## Performance Testing Manual Verification - Benchmarked both buffered and non buffered cases with and without the fix. Performance- Used Locust with 100 users for a duration of 3-4 mins Without buffering: With fix: `Avg: 396.69(ms), P99: 580(ms), RPS: 228.4` Without fix: `391.29(ms), P99: 560(ms), RPS: 239` With buffering: set `RAY_SERVE_REQUEST_PATH_LOG_BUFFER_SIZE` = 1000 With fix: `Avg(ms): 400.83, P99(ms): 620, RPS: 230.5` Without fix: `Avg(ms): 373.25, P99(ms): 610, RPS: 249.4` <!-- For example: "Closes ray-project#1234" --> ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [x] I've included any doc changes needed for https://docs.ray.io/en/master/. - [x] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com> Signed-off-by: Vaishnavi Panchavati <38342947+vaishdho1@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Abrar Sheikh <abrar2002as@gmail.com>
Why are these changes needed?
Currently, when Serve file logs are buffered via a
MemoryHandler,ServeContextFilterfetches the serve request context at flush time instead of when the log record is emitted. As a result, many log records flushed together can share the same request context, breaking per request tracing.This PR captures the request context at emit time when buffering is enabled and makes the filter idempotent so it won’t overwrite pre populated fields. This preserves correct per record context for buffered file logs without changing non buffered behavior.
Related issue number
Closes #55851
Performance Testing
Manual Verification - Benchmarked both buffered and non buffered cases with and without the fix.
Performance- Used Locust with 100 users for a duration of 3-4 mins
Without buffering:
With fix:
Avg: 396.69(ms), P99: 580(ms), RPS: 228.4Without fix:
391.29(ms), P99: 560(ms), RPS: 239With buffering:
set
RAY_SERVE_REQUEST_PATH_LOG_BUFFER_SIZE= 1000With fix:
Avg(ms): 400.83, P99(ms): 620, RPS: 230.5Without fix:
Avg(ms): 373.25, P99(ms): 610, RPS: 249.4Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.