Skip to content

fix: thread bloat and hanging/freeze on windows#3912

Closed
melund wants to merge 2 commits intomainfrom
all-loop
Closed

fix: thread bloat and hanging/freeze on windows#3912
melund wants to merge 2 commits intomainfrom
all-loop

Conversation

@melund
Copy link
Copy Markdown
Contributor

@melund melund commented Jan 8, 2026

This extends the fix in #3909 to also include linux/mac.

I've applied the fix globally (Windows & Linux) and added max_workers=1 to the reused event loop.

  1. It reuses a thread-local event loop on all platforms.
  2. It limits the default executor of that loop to 1 thread.

This solves both the Port Exhaustion on Windows (by reusing the loop/socket) and the Thread Bloat on Linux (by preventing the creation of hundreds of default ThreadPools).

I have no way to really test this on Linux, so I would think this PR should only be merged once someone has used it for some real-word applications. I am simply not familiar enough with event-loops and asyncio to understand if there are any problems

QC

  • The PR contains a test case for the changes or the changes are already covered by an existing test case.
  • The documentation (docs/) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).

Summary by CodeRabbit

  • Refactor
    • Enhanced async operation handling to reduce resource consumption and improve system performance on Windows and Linux platforms.

✏️ Tip: You can customize this high-level summary in your review settings.

melund added 2 commits January 7, 2026 21:04
Repeated calls to asyncio.run() on Windows create new ProactorEventLoops,
which I think establishes new TCP socket every time. I think this uses up all the ports and and hangs long execution runs.
This patch replaces asyncio.run() with a persistent thread-local event loop
on Windows.
…s=1 to the reuse loop.

It reuses a thread-local event loop on all platforms.
It limits the default executor of that loop to 1 thread.
This solves both the Port Exhaustion on Windows (by reusing the loop/socket) and the Thread Bloat on Linux (by preventing the creation of hundreds of default ThreadPools).
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Jan 8, 2026

📝 Walkthrough

Walkthrough

Modified the async event loop handling in async_run to use thread-local storage, ensuring each thread maintains its own persistent event loop. On first invocation or loop closure, a new event loop is created and configured with a single-worker ThreadPoolExecutor as the default executor.

Changes

Cohort / File(s) Summary
Thread-local Event Loop Management
src/snakemake/common/__init__.py
Introduces thread-local storage for event loop persistence; replaces asyncio.run() with loop.run_until_complete(); adds automatic loop creation on first use or after closure; configures single-worker ThreadPoolExecutor as default executor.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly identifies the primary issue being fixed: thread bloat and hanging/freeze on Windows, which aligns with the changeset's main objective of reusing event loops and limiting default thread pool usage.
Description check ✅ Passed The description provides good context about extending the previous fix to all platforms, explains the solution (thread-local loop reuse and max_workers=1), and addresses the motivation. However, both QC checklist items remain unchecked with no explanation of test coverage or documentation updates.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/snakemake/common/__init__.py (1)

102-107: Update docstring to reflect new behavior.

The docstring states "Attaches to running event loop or creates a new one" but the implementation now always uses a thread-local managed loop rather than attaching to an existing running loop. Consider updating to accurately describe the reuse-via-thread-local approach.

📝 Suggested docstring update
 def async_run(coroutine):
-    """Attaches to running event loop or creates a new one to execute a
-    coroutine.
+    """Executes a coroutine using a reusable thread-local event loop.
+
+    Reuses an existing thread-local loop or creates a new one if none exists
+    or the previous one was closed. This avoids port exhaustion on Windows
+    and thread bloat on Linux from repeatedly creating new event loops.
+
     .. seealso::
          https://github.com/snakemake/snakemake/issues/1105
          https://stackoverflow.com/a/65696398
     """
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0130fc8 and 3a5f6f2.

📒 Files selected for processing (1)
  • src/snakemake/common/__init__.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

⚙️ CodeRabbit configuration file

**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

Files:

  • src/snakemake/common/__init__.py
🧠 Learnings (1)
📚 Learning: 2026-01-08T11:13:20.523Z
Learnt from: johanneskoester
Repo: snakemake/snakemake PR: 3643
File: src/snakemake/sourcecache.py:474-492
Timestamp: 2026-01-08T11:13:20.523Z
Learning: In Python files under src/snakemake, ensure that the per-run cache_path for source files does not change within a single workflow run, so that HostingProviderFile._hosted_repos can be keyed reliably by repository name. This invariant should be documented and enforced (e.g., by freezing cache_path after initialization, avoiding mutations during a run, and ensuring any caching uses a run-scoped key). This pattern applies to all modules in the Snakemake source cache area where per-run caching and repository hosting are implemented.

Applied to files:

  • src/snakemake/common/__init__.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (46)
  • GitHub Check: tests (5, ubuntu-latest, py312)
  • GitHub Check: tests (10, ubuntu-latest, py313)
  • GitHub Check: tests (9, windows-2022, py313)
  • GitHub Check: tests (10, ubuntu-latest, py312)
  • GitHub Check: tests (8, ubuntu-latest, py313)
  • GitHub Check: tests (10, windows-2022, py313)
  • GitHub Check: tests (9, ubuntu-latest, py313)
  • GitHub Check: tests (10, ubuntu-latest, py311)
  • GitHub Check: tests (7, macos-latest, py313)
  • GitHub Check: tests (7, windows-2022, py313)
  • GitHub Check: tests (8, ubuntu-latest, py311)
  • GitHub Check: tests (7, ubuntu-latest, py313)
  • GitHub Check: tests (8, ubuntu-latest, py312)
  • GitHub Check: tests (8, windows-2022, py313)
  • GitHub Check: tests (9, ubuntu-latest, py312)
  • GitHub Check: tests (9, ubuntu-latest, py311)
  • GitHub Check: tests (7, ubuntu-latest, py311)
  • GitHub Check: tests (6, ubuntu-latest, py312)
  • GitHub Check: tests (6, ubuntu-latest, py313)
  • GitHub Check: tests (3, ubuntu-latest, py313)
  • GitHub Check: tests (1, ubuntu-latest, py312)
  • GitHub Check: tests (7, ubuntu-latest, py312)
  • GitHub Check: tests (6, ubuntu-latest, py311)
  • GitHub Check: tests (3, ubuntu-latest, py312)
  • GitHub Check: tests (6, windows-2022, py313)
  • GitHub Check: tests (5, ubuntu-latest, py313)
  • GitHub Check: tests (5, windows-2022, py313)
  • GitHub Check: tests (4, ubuntu-latest, py311)
  • GitHub Check: tests (3, windows-2022, py313)
  • GitHub Check: tests (2, macos-latest, py313)
  • GitHub Check: tests (4, windows-2022, py313)
  • GitHub Check: tests (4, ubuntu-latest, py313)
  • GitHub Check: tests (5, ubuntu-latest, py311)
  • GitHub Check: tests (4, macos-latest, py313)
  • GitHub Check: tests (2, ubuntu-latest, py311)
  • GitHub Check: tests (4, ubuntu-latest, py312)
  • GitHub Check: tests (2, ubuntu-latest, py312)
  • GitHub Check: tests (2, ubuntu-latest, py313)
  • GitHub Check: tests (1, windows-2022, py313)
  • GitHub Check: tests (3, macos-latest, py313)
  • GitHub Check: tests (3, ubuntu-latest, py311)
  • GitHub Check: tests (1, macos-latest, py313)
  • GitHub Check: tests (1, ubuntu-latest, py311)
  • GitHub Check: tests (2, windows-2022, py313)
  • GitHub Check: tests (1, ubuntu-latest, py313)
  • GitHub Check: apidocs
🔇 Additional comments (4)
src/snakemake/common/__init__.py (4)

21-22: LGTM!

Both imports are necessary for the new thread-local event loop pattern.


98-99: LGTM!

Thread-local storage is the appropriate pattern for per-thread event loop isolation, preventing cross-thread interference.


114-117: Verify single-threaded executor doesn't bottleneck async workloads on Linux.

The max_workers=1 executor limits concurrency for any loop.run_in_executor(None, ...) calls. While this addresses thread bloat, it could serialize I/O-bound operations that previously ran in parallel via the default executor. Given the PR author noted they couldn't fully test on Linux, verify this doesn't introduce performance regressions for workflows with heavy async I/O.


118-128: LGTM!

Error handling correctly catches RuntimeError from running the loop inside another event loop (e.g., Jupyter) and re-raises with a helpful WorkflowError message. The coroutine.close() call properly cleans up the unawaited coroutine.

@johanneskoester
Copy link
Copy Markdown
Contributor

Closing in favor of #3911. But thanks a lot for exploring all these different solutions! That helped a lot to understand the issue and getting a feeling for the best path.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants