[smg][ci] Add thread safety to ModelPool and GPUAllocator #16674

Merged
slin1237 merged 12 commits into main from smg-ci-n/20 on Jan 7, 2026

Conversation

slin1237 (Collaborator) commented on Jan 7, 2026

GPUAllocator:

  • Add RLock to protect _used_gpus and slots
  • Wrap allocate_slots, release_gpus, available_gpus, get_slot_for_model, summary

ModelPool:

  • Add RLock to protect instances dict
  • Wrap startup, get, get_workers_by_type, launch_workers, shutdown
  • Split public methods into wrapper + unlocked internal methods

This enables safe parallel test execution where multiple tests can concurrently access the model pool without race conditions.
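
For concreteness, a minimal sketch of the pattern described above. Class, method, and lock names follow this description; the bodies are illustrative, not the actual implementation:

```python
import threading

class GPUAllocator:
    """Sketch: an RLock guards _used_gpus; public methods take the lock."""

    def __init__(self, num_gpus: int) -> None:
        self._lock = threading.RLock()  # reentrant, so locked methods may call each other
        self._all_gpus = set(range(num_gpus))
        self._used_gpus: set = set()

    def available_gpus(self) -> set:
        with self._lock:
            return self._all_gpus - self._used_gpus

    def allocate_slots(self, count: int) -> list:
        with self._lock:
            free = sorted(self._all_gpus - self._used_gpus)
            if len(free) < count:
                raise RuntimeError(f"need {count} GPUs, only {len(free)} free")
            chosen = free[:count]
            self._used_gpus.update(chosen)
            return chosen

    def release_gpus(self, gpus) -> None:
        with self._lock:
            self._used_gpus.difference_update(gpus)

class ModelPool:
    """Sketch of the wrapper + unlocked-internal split for public methods."""

    def __init__(self) -> None:
        self._lock = threading.RLock()
        self.instances = {}

    def get(self, model_id):
        with self._lock:                # public wrapper acquires the lock...
            return self._get_unlocked(model_id)

    def _get_unlocked(self, model_id):  # ...internal method assumes it is held
        return self.instances.get(model_id)
```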

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments (/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci) or contact authorized users to do so.
  4. After green CI and required approvals, ask Merge Oncalls to merge.

gemini-code-assist (Contributor) commented

Summary of Changes

Hello @slin1237, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness of the system by implementing comprehensive thread safety mechanisms within the GPUAllocator and ModelPool components. The primary goal is to eliminate potential race conditions that could arise from concurrent access, particularly in multi-threaded environments like parallel test execution. By protecting critical shared data structures and operations with reentrant locks, the changes ensure data integrity and predictable behavior, paving the way for more reliable and efficient testing workflows.

Highlights

  • GPUAllocator Thread Safety: Introduced a threading.RLock to the GPUAllocator class to protect shared resources like _used_gpus and slots. Key methods such as allocate_slots, release_gpus, available_gpus, get_slot_for_model, and summary are now wrapped with this lock to ensure thread-safe operations.
  • ModelPool Thread Safety: Added a threading.RLock to the ModelPool class to safeguard the instances dictionary. Public methods including startup, get, get_workers_by_type, launch_workers, and shutdown have been updated to acquire this lock, with their core logic moved into new _unlocked internal methods.
  • Enabled Parallel Test Execution: The implemented thread safety measures in both GPUAllocator and ModelPool are designed to prevent race conditions, thereby enabling safe and concurrent access to the model pool during parallel test execution.

slin1237 added the run-ci label on Jan 7, 2026
gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces thread safety to ModelPool and GPUAllocator by using threading.RLock to protect shared state. The approach of refactoring public methods into a locked wrapper and an unlocked internal method is sound and consistently applied. The changes are generally well-implemented. However, I've identified a potential race condition in get() and get_workers_by_type() concerning instance acquisition. An instance returned by these methods could be evicted by another thread before the caller can acquire it. I have provided specific comments and suggestions to address this by ensuring instances are acquired while the pool lock is held, which is crucial for robust thread safety.
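
As a sketch of the suggested shape, continuing the ModelPool sketch above (the acquire() call is hypothetical, standing in for whatever in-use marking the pool performs):

```python
    def get(self, model_id):
        with self._lock:
            inst = self._get_unlocked(model_id)
            if inst is not None:
                # Mark the instance in-use while the pool lock is still held,
                # so a concurrent evictor cannot reclaim it first.
                inst.acquire()  # hypothetical refcount/flag, not from the PR
            return inst
```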

Review comment threads on sgl-model-gateway/e2e_test/infra/model_pool.py (one marked Outdated).

github-actions (Bot) added the dependencies label (Pull requests that update a dependency file) on Jan 7, 2026

When pytest-parallel runs multiple test threads, they would all
simultaneously try to create their own model_pool because the check
`if _model_pool is not None` was not thread-safe.

This caused CUDA OOM errors as multiple workers tried to allocate
the same GPUs.

Fix: Add threading.Lock() around the entire fixture initialization
to ensure only one thread creates the pool while others wait.
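
A minimal sketch of that fix, assuming module-level globals _model_pool and _model_pool_lock (names illustrative) and a ModelPool class like the sketch above:

```python
import threading
import pytest

_model_pool = None
_model_pool_lock = threading.Lock()

@pytest.fixture(scope="session")
def model_pool():
    global _model_pool
    with _model_pool_lock:        # the first thread builds the pool; the rest block here
        if _model_pool is None:   # re-checked under the lock: the race was check-then-create
            pool = ModelPool()
            pool.startup()
            _model_pool = pool
    return _model_pool
```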

Fix pytest-parallel compatibility: add py dependency

pytest-parallel requires the 'py' package for py.log.Producer, but
newer pytest versions (7.2.0+) no longer include it as a dependency.

References:
- pytest-dev/pytest#10424
- kevlened/pytest-parallel#118

Add parallel test execution support with pytest-parallel

Enable thread-based parallelism for e2e tests using pytest-parallel.
Tests run concurrently as threads within a single process, sharing
the session-scoped model_pool fixture.

Infrastructure changes:
- Add pytest-parallel dependency to e2e_test/pyproject.toml
- Add thread_unsafe marker for incompatible tests
- Add is_parallel_execution() helper and pytest_runtest_setup hook
  to auto-skip thread_unsafe tests in parallel mode (sketched below)
- Update conftest.py with parallel execution documentation

Workflow changes:
- Add parallel_opts matrix variable to pr-test-rust.yml
- Enable --workers 1 --tests-per-worker 4 for e2e test suite
- Keep benchmarks and legacy tests sequential

Bugfix:
- Fix NameError in _setup_pd_backend: use 'prefills + decodes'
  instead of undefined 'all_workers' in finally block
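
The marker/hook pair from the infrastructure changes above, as a minimal sketch; the sys.argv scan reflects this initial version (a later commit replaces it with config.option, see below):

```python
# conftest.py (sketch)
import sys
import pytest

def pytest_configure(config):
    config.addinivalue_line(
        "markers", "thread_unsafe: test is not safe under pytest-parallel"
    )

def is_parallel_execution() -> bool:
    # Initial version: scan sys.argv for pytest-parallel's flags.
    # Fragile: misses values set in pytest.ini/pyproject.toml.
    return any(a.startswith(("--workers", "--tests-per-worker")) for a in sys.argv)

def pytest_runtest_setup(item):
    if item.get_closest_marker("thread_unsafe") and is_parallel_execution():
        pytest.skip("marked thread_unsafe; skipped during parallel execution")
```
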
With pytest-parallel running multiple threads, request.addfinalizer
was firing prematurely when one thread's test finished, shutting down
the model pool while other threads were still using it.

Fix: Use atexit.register() instead of request.addfinalizer() to ensure
the pool shutdown only happens at process exit, after all tests complete.
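
Revisiting the fixture sketch above with that change (still a sketch):

```python
import atexit
import threading
import pytest

_model_pool = None
_model_pool_lock = threading.Lock()

@pytest.fixture(scope="session")
def model_pool():
    global _model_pool
    with _model_pool_lock:
        if _model_pool is None:
            pool = ModelPool()
            pool.startup()
            # Previously: request.addfinalizer(pool.shutdown), which fired when one
            # thread's tests finished and tore the pool down under the other threads.
            atexit.register(pool.shutdown)  # now deferred to process exit
            _model_pool = pool
    return _model_pool
```
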
The pytest-parallel plugin forks a worker process after test collection.
If CUDA is initialized before the fork (during collection), all CUDA
operations in the forked worker fail with "Cannot re-initialize CUDA".

Fix: Use nvidia-smi instead of torch.cuda for GPU counting in
validate_gpu_requirements(). This avoids CUDA initialization during
the collection phase.
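
A sketch of the fork-safe count (the helper name is illustrative; nvidia-smi --list-gpus prints one line per device):

```python
import subprocess

def gpu_count_without_cuda() -> int:
    """Count GPUs without importing torch or initializing CUDA."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--list-gpus"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return 0  # no driver installed or no visible GPUs
    return len([line for line in result.stdout.splitlines() if line.strip()])
```
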
The class-scoped fixture was causing "Cannot copy out of meta tensor"
errors when multiple threads tried to load SentenceTransformer
simultaneously. Changed to session-scoped with thread-safe initialization
using a lock and global cache, matching the pattern used for model_pool.

When tests run in parallel and all GPUs are occupied by workers that are
in use by other tests, launch_workers now waits (up to 5 min) for GPUs
to become available instead of failing immediately.

The key changes:
- _launch_workers_unlocked now returns None when GPUs unavailable after
  eviction (all workers in use), signaling retry is needed
- launch_workers has a polling loop that releases the lock while waiting,
  allowing other tests to release their workers

This fixes PD tests being skipped when run in parallel with other tests
that temporarily hold all available GPUs.
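
The retry loop, as a sketch continuing the ModelPool sketch above (the 5-minute cap comes from the commit message; the poll interval and exact signature are assumptions):

```python
import time

# Added to the ModelPool sketch above:
def launch_workers(self, spec, timeout: float = 300.0, poll_interval: float = 2.0):
    deadline = time.monotonic() + timeout
    while True:
        with self._lock:
            workers = self._launch_workers_unlocked(spec)
        if workers is not None:            # None signals "all GPUs in use, retry"
            return workers
        if time.monotonic() >= deadline:
            raise TimeoutError("GPUs did not become available within the wait window")
        time.sleep(poll_interval)          # lock released here so other tests can free GPUs
```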
slin1237 (Collaborator, Author) commented on Jan 7, 2026

/gemini review

gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces support for parallel test execution using pytest-parallel by making the ModelPool and GPUAllocator classes thread-safe. Key changes include adding threading.Lock or threading.RLock to ModelPool and GPUAllocator methods to protect shared state, making fixtures such as hf_reference_embeddings and model_pool session-scoped and thread-safe, and switching GPU detection to nvidia-smi to avoid CUDA initialization issues in parallel environments. A new thread_unsafe pytest marker skips tests that are incompatible with parallel execution, and model_pool.get() and get_workers_by_type() now auto-acquire model instances. The review suggests refactoring the is_parallel_execution function to use the pytest.Config object instead of parsing sys.argv, which is more robust, and updating its call site accordingly.

Review comment threads on sgl-model-gateway/e2e_test/fixtures/hooks.py (two, both marked Outdated).

- Add logging at start of _setup_pd_backend to track execution
- Add pytest.fail() when launch_workers returns empty list
- Add guard for incomplete PD setup (missing prefill or decode workers)
- Better error messages for debugging GPU allocation failures

Add logging to every early check in _setup_pd_backend to diagnose
why PD tests are being skipped. Now logs:
- sgl_kernel import success/failure
- torch import success/failure
- CUDA availability
- GPU count vs requirements

This will show exactly which check is causing the skip.

PD workers are no different from embedding workers - they just need
to be launched. The early checks (sgl_kernel, torch, cuda, gpu count)
were unnecessary and causing skips.

Embedding workers work without these checks - they just get launched
on-demand and fail properly if something's wrong. PD should work the
same way.

Use pytest's config.option instead of parsing sys.argv, which is
fragile and doesn't handle options from pytest.ini/pyproject.toml.
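
The revised helper, as a sketch (the attribute names on config.option mirror pytest-parallel's --workers and --tests-per-worker flags and are assumptions):

```python
def is_parallel_execution(config) -> bool:
    # Parsed options include values from pytest.ini/pyproject.toml, unlike sys.argv.
    workers = getattr(config.option, "workers", None)
    tests_per_worker = getattr(config.option, "tests_per_worker", None)
    return bool(workers) or bool(tests_per_worker)
```
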
Make get() wait for GPUs to become available instead of failing
immediately when all GPUs are in use by other tests:

- Add wait_for_gpus and gpu_wait_timeout parameters to get()
- Change _get_unlocked() to return None when GPUs unavailable (signals retry)
- Change _ensure_gpu_available() to return bool instead of raising
- Add polling loop in get() that releases lock while waiting

This matches the existing wait mechanism in launch_workers() and allows
parallel tests to share GPUs by waiting for each other.
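
A sketch of get() with the new parameters, continuing the ModelPool sketch above (the poll interval is an assumption; the None-means-retry contract matches the bullets):

```python
import time

# Added to the ModelPool sketch above:
def get(self, model_id, wait_for_gpus: bool = True, gpu_wait_timeout: float = 300.0):
    deadline = time.monotonic() + gpu_wait_timeout
    while True:
        with self._lock:
            inst = self._get_unlocked(model_id)  # None => GPUs unavailable right now
        if inst is not None or not wait_for_gpus:
            return inst
        if time.monotonic() >= deadline:
            raise TimeoutError(f"timed out waiting for GPUs to serve {model_id}")
        time.sleep(2.0)                          # sleep without holding the pool lock
```
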
Include %(threadName)s in log format so logs from different parallel
test threads can be distinguished. Shows 'MainThread' for sequential
execution and 'Thread-N' for parallel workers.
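
The corresponding logging configuration, as a sketch:

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(threadName)s] %(name)s %(levelname)s: %(message)s",
)
# Sequential runs tag lines with [MainThread]; pytest-parallel workers
# show [Thread-1], [Thread-2], and so on.
```
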
slin1237 merged commit 6037267 into main on Jan 7, 2026
88 of 90 checks passed
slin1237 deleted the smg-ci-n/20 branch on January 7, 2026 at 21:25

Labels

dependencies (Pull requests that update a dependency file), model-gateway, run-ci
