[RemoteModels] OpenAI batch Support #9156
Conversation
# Conflicts: # tests/serving/test_async_flow.py
Update test to use v2modelserver as model_mode instead of None
Add simpler test: test_monitoring_with_model_runner_batch_infer
royischoss left a comment:
Hey, looks good. Some comments below.
Force-pushed from 1d3c600 to 33ef902
Looks good so far, just need to go over the tests.
One thing we should still think about is how the naive execution mechanism interacts with the asyncio event loop.
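To make that concern concrete, here is a minimal sketch (the pool size and function names are hypothetical, not from the PR) of keeping a blocking sync batch call off the event loop via run_in_executor:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# A blocking (sync) batch call executed directly inside a coroutine stalls
# the entire event loop; dispatching it to a thread pool avoids that.
_executor = ThreadPoolExecutor(max_workers=4)  # assumed pool size

async def invoke_batch_nonblocking(sync_batch_fn, inputs):
    loop = asyncio.get_running_loop()
    # Offload the blocking call so other coroutines keep making progress.
    return await loop.run_in_executor(_executor, sync_batch_fn, inputs)
```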
davesh0812 left a comment:
LGTM, one comment for a follow-up PR.
```python
try:
    # gather() stops on first exception - fast fail
    return await asyncio.gather(*tasks)
except:
```
Please add a relevant list of exceptions here.
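For example, assuming the OpenAI client and timeouts are the expected failure sources, the bare except could be narrowed to something like the following (the exact exception list is an assumption, not the PR's final choice):

```python
import asyncio

import openai

async def run_batch(tasks):
    try:
        # gather() stops on first exception - fast fail
        return await asyncio.gather(*tasks)
    except (openai.OpenAIError, asyncio.TimeoutError) as exc:
        # Catch only the failures expected from the client; let genuine bugs
        # (e.g. TypeError) propagate unmasked.
        raise RuntimeError("batch inference failed") from exc
```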
📝 Description
Added support for OpenAI batch processing, including asynchronous batch execution.
Added integration and unit tests.
Depends on #9117
🛠️ Changes Made
Added batch invocation support (sync via ThreadPoolExecutor, async via asyncio.gather) with two-level concurrency control: (1) an instance-level thread pool (_executor) for sync batches, with a per-batch semaphore limiting concurrent threads, and (2) a class-level global async semaphore (_global_async_semaphore) shared across all instances to enforce the total async request limit, while per-batch semaphores ensure fair distribution, preventing API rate-limit violations and resource monopolization. A sketch of this mechanism appears under Additional Notes below.

✅ Checklist
🧪 Testing
🔗 References
🚨 Breaking Changes?
🔍️ Additional Notes
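As a reference for the two-level concurrency control described above, here is a minimal sketch; the semaphore names follow the description, but the class name, limits, and the client call are illustrative assumptions rather than the PR's actual code:

```python
import asyncio

class OpenAIBatchInvoker:
    # Class-level semaphore shared by all instances: caps total in-flight
    # async requests across every batch (assumed limit of 32).
    _global_async_semaphore = asyncio.Semaphore(32)

    def __init__(self, per_batch_limit: int = 8):
        # The per-batch limit keeps one large batch from monopolizing the
        # global budget, so concurrent batches share capacity fairly.
        self._per_batch_limit = per_batch_limit

    async def invoke_batch(self, client, requests):
        per_batch = asyncio.Semaphore(self._per_batch_limit)

        async def _one(request):
            # Acquire both levels: batch-local fairness, then the global cap.
            async with per_batch, self._global_async_semaphore:
                return await client.chat.completions.create(**request)

        # gather() fails fast on the first exception (as in the PR snippet).
        return await asyncio.gather(*(_one(r) for r in requests))
```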