[smg][ci] Add thread safety to ModelPool and GPUAllocator #16674
Conversation
Summary of Changes: Hello @slin1237, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the robustness of the system by implementing comprehensive thread safety mechanisms within the ModelPool and GPUAllocator classes.
Code Review
This pull request introduces thread safety to ModelPool and GPUAllocator by using threading.RLock to protect shared state. The approach of refactoring public methods into a locked wrapper and an unlocked internal method is sound and consistently applied. The changes are generally well-implemented. However, I've identified a potential race condition in get() and get_workers_by_type() concerning instance acquisition. An instance returned by these methods could be evicted by another thread before the caller can acquire it. I have provided specific comments and suggestions to address this by ensuring instances are acquired while the pool lock is held, which is crucial for robust thread safety.
GPUAllocator:
- Add RLock to protect _used_gpus and slots
- Wrap allocate_slots, release_gpus, available_gpus, get_slot_for_model, summary

ModelPool:
- Add RLock to protect instances dict
- Wrap startup, get, get_workers_by_type, launch_workers, shutdown
- Split public methods into wrapper + unlocked internal methods

This enables safe parallel test execution where multiple tests can concurrently access the model pool without race conditions.
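A minimal sketch of the wrapper-plus-unlocked-internal pattern, with simplified signatures; `model_name` and the dict lookup are illustrative, not the real method bodies:

```python
import threading

class ModelPool:
    def __init__(self):
        # RLock rather than Lock: a locked public method can call
        # another method's _unlocked internal without deadlocking.
        self._lock = threading.RLock()
        self.instances = {}

    def get(self, model_name):
        # Public wrapper: take the lock, then delegate.
        with self._lock:
            return self._get_unlocked(model_name)

    def _get_unlocked(self, model_name):
        # Internal method: caller must already hold self._lock.
        return self.instances.get(model_name)
```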
When pytest-parallel runs multiple test threads, they would all simultaneously try to create their own model_pool because the check `if _model_pool is not None` was not thread-safe. This caused CUDA OOM errors as multiple workers tried to allocate the same GPUs.

Fix: Add threading.Lock() around the entire fixture initialization to ensure only one thread creates the pool while the others wait.

Fix pytest-parallel compatibility: add py dependency

pytest-parallel requires the 'py' package for py.log.Producer, but newer pytest versions (7.2.0+) no longer include it as a dependency. References:
- pytest-dev/pytest#10424
- kevlened/pytest-parallel#118

Add parallel test execution support with pytest-parallel

Enable thread-based parallelism for e2e tests using pytest-parallel. Tests run concurrently as threads within a single process, sharing the session-scoped model_pool fixture.

Infrastructure changes:
- Add pytest-parallel dependency to e2e_test/pyproject.toml
- Add thread_unsafe marker for incompatible tests
- Add is_parallel_execution() helper and pytest_runtest_setup hook to auto-skip thread_unsafe tests in parallel mode
- Update conftest.py with parallel execution documentation

Workflow changes:
- Add parallel_opts matrix variable to pr-test-rust.yml
- Enable --workers 1 --tests-per-worker 4 for the e2e test suite
- Keep benchmarks and legacy tests sequential

Bugfix:
- Fix NameError in _setup_pd_backend: use 'prefills + decodes' instead of undefined 'all_workers' in the finally block
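A minimal sketch of the locked fixture initialization from the first commit message above, assuming a module-level `_model_pool` global; `create_model_pool()` is a hypothetical stand-in for the real pool construction:

```python
import threading

_model_pool = None
_model_pool_lock = threading.Lock()

def _get_or_create_model_pool():
    global _model_pool
    # Without the lock, parallel test threads all observe None and
    # each launches its own pool, oversubscribing the GPUs (CUDA OOM).
    with _model_pool_lock:
        if _model_pool is None:
            _model_pool = create_model_pool()  # hypothetical factory
        return _model_pool
```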
With pytest-parallel running multiple threads, request.addfinalizer was firing prematurely when one thread's test finished, shutting down the model pool while other threads were still using it. Fix: Use atexit.register() instead of request.addfinalizer() to ensure the pool shutdown only happens at process exit, after all tests complete.
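A sketch of the fixture teardown change, reusing the `_get_or_create_model_pool` helper sketched above; the fixture body is illustrative:

```python
import atexit
import pytest

@pytest.fixture(scope="session")
def model_pool():
    pool = _get_or_create_model_pool()
    # atexit callbacks run once, at interpreter shutdown, after all
    # test threads have finished; request.addfinalizer would fire as
    # soon as the first thread's session teardown ran.
    atexit.register(pool.shutdown)
    return pool
```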
The pytest-parallel plugin forks a worker process after test collection. If CUDA is initialized before the fork (during collection), all CUDA operations in the forked worker fail with "Cannot re-initialize CUDA". Fix: Use nvidia-smi instead of torch.cuda for GPU counting in validate_gpu_requirements(). This avoids CUDA initialization during the collection phase.
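A sketch of fork-safe GPU counting; the helper name is hypothetical, but `nvidia-smi -L` (one line per GPU) is a real interface that never touches the CUDA runtime:

```python
import subprocess

def count_gpus_via_nvidia_smi() -> int:
    # Safe to call before pytest-parallel forks, because no
    # torch.cuda call ever initializes CUDA in the collecting process.
    try:
        out = subprocess.run(
            ["nvidia-smi", "-L"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return 0
    return sum(1 for line in out.splitlines() if line.strip())
```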
The class-scoped fixture was causing "Cannot copy out of meta tensor" errors when multiple threads tried to load SentenceTransformer simultaneously. Changed to session-scoped with thread-safe initialization using a lock and global cache, matching the pattern used for model_pool.
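A sketch of the session-scoped, lock-guarded fixture, mirroring the model_pool pattern above; the model name is an assumption:

```python
import threading
import pytest

_st_model = None
_st_lock = threading.Lock()

@pytest.fixture(scope="session")
def hf_reference_embeddings():
    global _st_model
    # Concurrent first loads tripped "Cannot copy out of meta tensor";
    # the lock guarantees SentenceTransformer is constructed once.
    with _st_lock:
        if _st_model is None:
            from sentence_transformers import SentenceTransformer
            _st_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model name
    return _st_model
```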
When tests run in parallel and all GPUs are occupied by workers that are in use by other tests, launch_workers now waits (up to 5 min) for GPUs to become available instead of failing immediately. The key changes:
- _launch_workers_unlocked now returns None when GPUs are unavailable after eviction (all workers in use), signaling that a retry is needed
- launch_workers has a polling loop that releases the lock while waiting, allowing other tests to release their workers

This fixes PD tests being skipped when run in parallel with other tests that temporarily hold all available GPUs.
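A sketch of the polling loop, continuing the ModelPool sketch above; `spec` and the poll interval are illustrative, and `_launch_workers_unlocked` is assumed to follow the None-means-retry contract described here:

```python
import time

class ModelPool:  # continuing the sketch above
    def launch_workers(self, spec, timeout=300.0, poll_interval=2.0):
        deadline = time.monotonic() + timeout
        while True:
            with self._lock:
                workers = self._launch_workers_unlocked(spec)
            if workers is not None:
                return workers
            if time.monotonic() > deadline:
                raise TimeoutError("no GPUs became available in time")
            # The lock is NOT held while sleeping, so other tests can
            # finish and release their workers in the meantime.
            time.sleep(poll_interval)
```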
/gemini review
Code Review
This pull request introduces support for parallel test execution using pytest-parallel by making the ModelPool and GPUAllocator classes thread-safe. Key changes include adding threading.Lock or threading.RLock to ModelPool and GPUAllocator methods to protect shared state, modifying fixtures like hf_reference_embeddings and model_pool to be session-scoped and thread-safe, and updating GPU detection to use nvidia-smi to avoid CUDA initialization issues in parallel environments. A new thread_unsafe pytest marker was added to skip tests incompatible with parallel execution, and model_pool.get() and get_workers_by_type() now auto-acquire model instances. The reviewer suggested refactoring the is_parallel_execution function to use the pytest.Config object instead of parsing sys.argv for robustness, and updating its call site accordingly.
- Add logging at the start of _setup_pd_backend to track execution
- Add pytest.fail() when launch_workers returns an empty list
- Add a guard for incomplete PD setup (missing prefill or decode workers)
- Better error messages for debugging GPU allocation failures
Add logging to every early check in _setup_pd_backend to diagnose why PD tests are being skipped. Now logs:
- sgl_kernel import success/failure
- torch import success/failure
- CUDA availability
- GPU count vs. requirements

This will show exactly which check is causing the skip.
PD workers are no different from embedding workers: they just need to be launched. The early checks (sgl_kernel, torch, CUDA, GPU count) were unnecessary and were causing skips. Embedding workers work without these checks: they get launched on demand and fail properly if something is wrong. PD should work the same way.
Use pytest's config.option instead of parsing sys.argv, which is fragile and doesn't handle options from pytest.ini/pyproject.toml.
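A sketch of the config-based check; the assumption here is that pytest-parallel registers its options under the dest names `workers` and `tests_per_worker`:

```python
import pytest

def is_parallel_execution(config: pytest.Config) -> bool:
    # config.option reflects values from the CLI and from
    # pytest.ini/pyproject.toml alike, unlike a raw sys.argv scan.
    workers = getattr(config.option, "workers", None)
    per_worker = getattr(config.option, "tests_per_worker", None)
    return bool(workers or per_worker)
```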
Make get() wait for GPUs to become available instead of failing immediately when all GPUs are in use by other tests:
- Add wait_for_gpus and gpu_wait_timeout parameters to get()
- Change _get_unlocked() to return None when GPUs are unavailable (signals retry)
- Change _ensure_gpu_available() to return bool instead of raising
- Add a polling loop in get() that releases the lock while waiting

This matches the existing wait mechanism in launch_workers() and allows parallel tests to share GPUs by waiting for each other.
Include %(threadName)s in log format so logs from different parallel test threads can be distinguished. Shows 'MainThread' for sequential execution and 'Thread-N' for parallel workers.
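A sketch of the format change; the surrounding fields are illustrative:

```python
import logging

logging.basicConfig(
    # %(threadName)s renders as "MainThread" in sequential runs and
    # "Thread-N" under pytest-parallel, so interleaved log lines
    # can be attributed to their test thread.
    format="%(asctime)s [%(threadName)s] %(levelname)s %(message)s",
    level=logging.INFO,
)
```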