
fix: Windows hang/freeze due to asyncio port exhaustion#3909

Closed
melund wants to merge 1 commit intomainfrom
win-loop

Conversation


@melund melund commented Jan 7, 2026

This PR tries to fix an issue on Windows where Snakemake hangs or freezes after running for an extended period (typically after 1-6 hours, or with many jobs).

The Issue:
Snakemake's scheduler frequently calls async_run (wrapping asyncio.run) to check job status.

Here is a py-spy dump from one of my runs that stalled after many hours.

```
Python v3.13.9 (.pixi\envs\default\python.exe)

Thread 135856 (idle): "MainThread"
    accept (socket.py:295)
    _fallback_socketpair (socket.py:627)
    _make_self_pipe (asyncio\proactor_events.py:786)
    __init__ (asyncio\proactor_events.py:639)
    __init__ (asyncio\windows_events.py:316)
    new_event_loop (asyncio\events.py:734)
    new_event_loop (asyncio\events.py:837)
    _lazy_init (asyncio\runners.py:137)
    __enter__ (asyncio\runners.py:58)
    run (asyncio\runners.py:194)
    async_run (snakemake\common\__init__.py:104)
    job_selector (snakemake\scheduling\job_scheduler.py:582)
   .....
Thread 137516 (idle): "ThreadPoolExecutor-1_0"
    _worker (concurrent\futures\thread.py:90)
    run (threading.py:994)
    _bootstrap_inner (threading.py:1043)
    _bootstrap (threading.py:1014)
Thread 45508 (idle): "ThreadPoolExecutor-1_1"
    _worker (concurrent\futures\thread.py:90)
    run (threading.py:994)
    _bootstrap_inner (threading.py:1043)
    _bootstrap (threading.py:1014)
....
```

I guess that creating and destroying thousands of these loops exhausts the available TCP ports, or something similar. Once ports are exhausted, socket.socketpair() hangs indefinitely inside accept(), freezing the Snakemake process.
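The per-call loop setup/teardown behind that dump is easy to observe in plain Python (a platform-neutral sketch; on Windows the ProactorEventLoop's internal self-pipe is a real TCP socket pair, which is why the cost shows up as port exhaustion there):

```python
import asyncio

async def current_loop():
    return asyncio.get_running_loop()

# Each asyncio.run() call builds a brand-new event loop and tears it down again.
# Holding references shows the three loop objects are all distinct.
loops = [asyncio.run(current_loop()) for _ in range(3)]
print(len({id(loop) for loop in loops}))  # 3
```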

The Fix:
I modified async_run in __init__.py to reuse the asyncio event loop (one per thread) when running on Windows, similar to how a long-running application would manage its loop. I am hoping this prevents the constant teardown and recreation of socket pairs.

--

I am also running some tests where I just try to change the type of event loop directly in the snakefile.

```python
import asyncio
import sys

# Workaround for random hangs on Windows with ProactorEventLoop
if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
```

Maybe this could also be done directly in snakemake, and maybe that would be better than reusing the event loop as this PR proposes. I am not really an expert on this, so any suggestions would be greatly appreciated.

QC

  • The PR contains a test case for the changes or the changes are already covered by an existing test case.
  • The documentation (docs/) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).

Summary by CodeRabbit

  • Bug Fixes
    • Improved Windows reliability when running multiple workflows in sequence to prevent port exhaustion and reduce unexpected failures.
    • Enhanced error handling on Windows so workflow execution failures are surfaced consistently, improving stability and diagnostics.



coderabbitai bot commented Jan 7, 2026

📝 Walkthrough


Adds Windows-specific event-loop handling to async_run: introduces a module-level thread-local (_thread_local) that lazily creates and reuses a per-thread event loop and uses run_until_complete with error handling on Windows; non-Windows path continues to call asyncio.run(coroutine).

Changes

  • Windows asyncio event-loop reuse — src/snakemake/common/__init__.py: Adds a module-level thread-local _thread_local and updates async_run to lazily create/reuse a per-thread event loop on Windows, call run_until_complete, close the coroutine and raise WorkflowError on RuntimeError. Keeps the non-Windows asyncio.run(coroutine) path.
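A rough, self-contained sketch of the change described above (this is not the actual diff; WorkflowError is swapped for RuntimeError here so the snippet runs standalone):

```python
import asyncio
import sys
import threading

ON_WINDOWS = sys.platform == "win32"
_thread_local = threading.local()  # holds one event loop per thread

def async_run(coroutine):
    """Run a coroutine; on Windows, reuse a lazily created per-thread loop."""
    if ON_WINDOWS:
        if not hasattr(_thread_local, "loop") or _thread_local.loop.is_closed():
            _thread_local.loop = asyncio.new_event_loop()
            asyncio.set_event_loop(_thread_local.loop)
        try:
            return _thread_local.loop.run_until_complete(coroutine)
        except RuntimeError as e:
            coroutine.close()
            # the real code raises snakemake's WorkflowError here
            raise RuntimeError("Error running coroutine in event loop.") from e
    # non-Windows path is unchanged
    return asyncio.run(coroutine)

async def answer():
    return 42

print(async_run(answer()))  # 42 on any platform
```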

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 50.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Title check ✅ Passed — The title accurately summarizes the main fix: addressing Windows hangs caused by asyncio port exhaustion through event loop reuse.
  • Description check ✅ Passed — The pull request description is comprehensive and well-structured, clearly explaining the Windows-specific issue, the root cause hypothesis, and the proposed solution with appropriate context.




github-actions bot commented Jan 7, 2026

Please format your code with `pixi run format`


melund commented Jan 7, 2026

@coderabbitai review


coderabbitai bot commented Jan 7, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@melund melund changed the title Fix Windows hang/freeze due to asyncio port exhaustion fix: Windows hang/freeze due to asyncio port exhaustion Jan 7, 2026
@melund melund marked this pull request as ready for review January 7, 2026 19:51
@melund melund marked this pull request as draft January 7, 2026 19:55
Repeated calls to asyncio.run() on Windows create new ProactorEventLoops,
which I think establishes a new TCP socket pair every time. I think this uses up all the ports and hangs long execution runs.
This patch replaces asyncio.run() with a persistent thread-local event loop
on Windows.
@melund melund marked this pull request as ready for review January 7, 2026 20:39

melund commented Jan 7, 2026

@coderabbitai review


coderabbitai bot commented Jan 7, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/snakemake/common/__init__.py (1)

120-122: Consider using explicit exception chaining for better tracebacks.

The static analysis correctly notes that raise ... from e provides clearer tracebacks. However, the existing non-Windows path (lines 126-134) uses the same pattern of passing e as a constructor argument to WorkflowError, so this appears intentional for consistency with how WorkflowError is designed.

If you do want to improve this, both paths could be updated:

♻️ Optional: Use explicit exception chaining
```diff
         try:
             return _thread_local.loop.run_until_complete(coroutine)
         except RuntimeError as e:
             coroutine.close()
-            raise WorkflowError("Error running coroutine in event loop.", e)
+            raise WorkflowError("Error running coroutine in event loop.") from e
```
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7172fc9 and e75e75d.

📒 Files selected for processing (1)
  • src/snakemake/common/__init__.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

⚙️ CodeRabbit configuration file

**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

Files:

  • src/snakemake/common/__init__.py
🪛 Ruff (0.14.10)
src/snakemake/common/__init__.py

122-122: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


122-122: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (2)
src/snakemake/common/__init__.py (2)

96-100: LGTM — Conditional import and thread-local setup is appropriate.

Using threading.local() at module level ensures each thread gets its own event loop, which correctly isolates concurrent execution contexts while avoiding the repeated loop creation/teardown that causes port exhaustion.


109-119: Sound approach for mitigating Windows port exhaustion.

The implementation correctly:

  • Lazily creates one event loop per thread and reuses it
  • Checks both for missing loop (not hasattr) and closed loop (is_closed())
  • Calls set_event_loop to register the newly created loop with asyncio's thread-local state

This avoids the repeated ProactorEventLoop creation/teardown that was exhausting ephemeral ports.


@johanneskoester johanneskoester left a comment


Looks good to me. Wouldn't the same make sense on Unix-like OSes as well?

@johanneskoester

Also, I have recently spotted that btop reports large numbers of threads for the main snakemake process on Linux. Although they all seem to sleep, it seems to me that this might be related. Maybe the async spawns are not getting closed by Python?


coroa commented Jan 8, 2026

Just wanted to chime in quickly that I debugged the same issue last week and came to a different fix by persisting an async_runner on the workflow object so that the job scheduler can just reuse the same async event loop: #3911.

The solution here is also quite elegant, I find.

IMHO, ideally one would instead rewrite quite a bit of the snakemake inner logic to be async, and then just keep within the same async loop. I started doing that too, but did not find a good approach for the eval statements in eval_resource_expression.


melund commented Jan 8, 2026

@johanneskoester

Looks good to me. Wouldn't the same make sense on Unix-like OSes as well?

I am not sure. The stalling/hanging is specific to Windows, where Python implements socket pairs with real TCP connections, which have a "cooling off" period before they are freed. I think Linux uses some sort of pipe to communicate, which is freed immediately.

Also, using asyncio.run() guarantees a completely clean slate for every call. There is no risk of hanging tasks or unclosed resources attached to the loop.
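That "clean slate" property is observable directly: asyncio.run closes its loop on exit, so nothing can remain attached to it afterwards:

```python
import asyncio

async def grab_loop():
    return asyncio.get_running_loop()

loop = asyncio.run(grab_loop())
# asyncio.run() tears the loop down on exit, so no tasks or resources linger
print(loop.is_closed())  # True
```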

So I think it is safest to leave this fix Windows-only.


melund commented Jan 8, 2026

@coroa Was your problem also on Windows?


melund commented Jan 8, 2026

@johanneskoester

Also, I have recently spotted that btop reports large numbers of threads for the main snakemake process on Linux. Although they all seem to sleep, it seems to me that this might be related. Maybe the async spawns are not getting closed by Python?

Yes. They are related to the same underlying pattern: calling asyncio.run() inside the parallel job scheduler.

When we run concurrently (e.g., -j 20), the scheduler invokes asyncio.run for each worker thread, which creates a new event loop each time.

I believe every new event loop creates its own default ThreadPoolExecutor (usually ~cpu_count workers) if it needs to run blocking code.

On Linux: This results in a "pool of pools." 20 jobs × ~20 default threads = ~400 idle threads. They sleep (as you saw) so they don't consume CPU, but they do bloat the thread count.

On Windows: We hit a harder limit. Each new loop also creates a TCP socket pair. 20 jobs × rapid scheduling cycles = thousands of TCP connections, which eventually exhaust the port range and crash the process.
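The "pool of pools" claim can be spot-checked: each fresh loop lazily grows its own default executor the first time blocking work is dispatched (this peeks at CPython's private _default_executor attribute, purely for illustration):

```python
import asyncio

async def touch_default_executor():
    loop = asyncio.get_running_loop()
    # run_in_executor(None, ...) lazily creates this loop's default ThreadPoolExecutor
    await loop.run_in_executor(None, lambda: None)
    return loop._default_executor  # private CPython attribute, illustration only

first = asyncio.run(touch_default_executor())
second = asyncio.run(touch_default_executor())
print(first is second)  # False: every fresh loop grows its own executor pool
```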

This PR fixes the crash on Windows by reusing the loop. On Linux, the high thread count is "safe" (if messy), but applying this same fix there would likely reduce that thread bloat as well.

I kept the ON_WINDOWS guard to minimize the effect of my change, but I am happy to enable it globally.


coroa commented Jan 8, 2026

@coroa Was your problem also on Windows?

Yes, the actual issue only surfaced on Windows in the proactor. Same stack trace as your py-spy one, but I have observed hangs on Linux too (which might be unrelated).

```
    async_run(postprocess())
  File "C:\Users\jonas\open-tyndp\.pixi\envs\open-tyndp\Lib\site-packages\snakemake\common\__init__.py", line 100, in async_run
    return asyncio.run(coroutine)
  File "C:\Users\jonas\open-tyndp\.pixi\envs\open-tyndp\Lib\asyncio\runners.py", line 194, in run
    with Runner(debug=debug, loop_factory=loop_factory) as runner:
  File "C:\Users\jonas\open-tyndp\.pixi\envs\open-tyndp\Lib\asyncio\runners.py", line 58, in __enter__
    self._lazy_init()
  File "C:\Users\jonas\open-tyndp\.pixi\envs\open-tyndp\Lib\asyncio\runners.py", line 137, in _lazy_init
    self._loop = events.new_event_loop()
  File "C:\Users\jonas\open-tyndp\.pixi\envs\open-tyndp\Lib\asyncio\events.py", line 823, in new_event_loop
    return get_event_loop_policy().new_event_loop()
  File "C:\Users\jonas\open-tyndp\.pixi\envs\open-tyndp\Lib\asyncio\events.py", line 720, in new_event_loop
    return self._loop_factory()
  File "C:\Users\jonas\open-tyndp\.pixi\envs\open-tyndp\Lib\asyncio\windows_events.py", line 316, in __init__
    super().__init__(proactor)
  File "C:\Users\jonas\open-tyndp\.pixi\envs\open-tyndp\Lib\asyncio\proactor_events.py", line 640, in __init__
    self._make_self_pipe()
  File "C:\Users\jonas\open-tyndp\.pixi\envs\open-tyndp\Lib\asyncio\proactor_events.py", line 787, in _make_self_pipe
    self._ssock, self._csock = socket.socketpair()
  File "C:\Users\jonas\open-tyndp\.pixi\envs\open-tyndp\Lib\socket.py", line 627, in _fallback_socketpair
    ssock, _ = lsock.accept()
  File "C:\Users\jonas\open-tyndp\.pixi\envs\open-tyndp\Lib\socket.py", line 295, in accept
    fd, addr = self._accept()
```


melund commented Jan 8, 2026

@johanneskoester, I could make a fix like this for all platforms:

```python
import asyncio
import concurrent.futures
import threading

# WorkflowError is imported from snakemake.exceptions in the real module
_thread_local = threading.local()


def async_run(coroutine):
    """Attaches to running event loop or creates a new one to execute a
    coroutine.
    .. seealso::
         https://github.com/snakemake/snakemake/issues/1105
         https://stackoverflow.com/a/65696398
    """
    # We reuse a thread-local loop to avoid:
    # 1. On Windows: Port exhaustion (ProactorEventLoop creates TCP sockets)
    # 2. On Linux: Thread bloat (new loops spawn new ThreadPoolExecutors)
    if not hasattr(_thread_local, "loop") or _thread_local.loop.is_closed():
        _thread_local.loop = asyncio.new_event_loop()
        asyncio.set_event_loop(_thread_local.loop)
        # Limit inner threads to 1, effectively flattening the "pool of pools"
        _thread_local.loop.set_default_executor(
            concurrent.futures.ThreadPoolExecutor(max_workers=1)
        )
    try:
        return _thread_local.loop.run_until_complete(coroutine)
    except RuntimeError as e:
        coroutine.close()
        raise WorkflowError("Error running coroutine in event loop.", e)
```

Crucially, to fix the bloat, I can also limit the executor size on these internal loops. Since the code is already running inside a parallel Snakemake worker, it shouldn't need a massive thread pool of its own.

@johanneskoester / @coroa, do you think there are any side effects of doing this?

@johanneskoester

@melund, this looks good to me as well. Now I am unsure which PR is preferable, to be honest. @melund and @coroa, what are your opinions?


melund commented Jan 8, 2026

@johanneskoester, just merge this Windows PR right away.

I also made a PR (#3912) with the extended version for all OSes. But I have no way to really test this on Linux, so I would think that the extended PR should only be merged once someone has used it for some real-world applications.


coroa commented Jan 8, 2026

@melund and @coroa, what are your opinions?

asyncio.run is not intended to be called repeatedly. For example, the asyncio.run docs still have this comment: "This function should be used as a main entry point for asyncio programs, and should ideally only be called once" (that it became part of the loop_factory argument is a docs issue; refer to the original 3.7 docs).

I actually was also able to produce infrequent lock-ups on Linux when using asyncio.to_thread in a storage plugin (PyPSA/snakemake-storage-plugin-cached-http@838c28d).

So I would not make it a Windows-only fix, but would suggest either #3912 or #3911.

Which one? Puh. I originally thought the unsolved issues in #3911 (not thread-safe, and resource expressions need to spawn a new async runner) made the global-variable approach here a clear winner, but then I found solutions for both.

The main differences i see are:

#3911 uses the high-level asyncio.Runner interface (which was introduced in 3.11, which might be an issue) and cleans up the event loops when the runner is closed on tear down of the Workflow object, but it needs to jump through several hoops to make it available everywhere consistently.

#3912 uses the low-level event loop management interfaces to ensure that each thread reuses the same event loop, so the final effect is the same, except that the event loops are not closed at the end, which might hide exceptions in asyncio tasks (which snakemake is typically not using) and reuse the same event loops between several workflow instances (which snakemake typically does not have). But, it is a very local code change and will practically achieve the same outcome.

So, 🤷 .

I would not use set_default_executor. My understanding is that the executor is only instantiated when you want to make something synchronous behave well within an async loop via the high-level asyncio.to_thread or the low-level loop.run_in_executor. As mentioned in the run_in_executor docs, the default executor is lazily instantiated, so there is a high chance that instantiating one and setting it as default actually increases the number of threads rather than decreasing it, with the negative side effect that the utility of asyncio.to_thread is limited, I think.


melund commented Jan 9, 2026

@coroa You mentioned: "there is a high chance instantiating one ... does actually increase the number of threads rather than decrease it"

Actually, the motivation for set_default_executor(max_workers=1) is specifically because @johanneskoester reported seeing massive thread counts on Linux. I thought it indicated that something is indeed lazily instantiating these executors.

The default ThreadPoolExecutor spawns lots of workers, and this happens for all the loops.

By forcing max_workers=1, we ensure that when the executor is instantiated, it caps at 1 thread. I was thinking it would solve the "Pool of Pools" explosion while keeping the change local and simple.

I think someone needs to actually try to run snakemake with the change in #3912, and see what it does with the thread bloat on Linux. Until then I vote for merging this PR to fix the critical issue on Windows with minimal code.


coroa commented Jan 9, 2026

@coroa You mentioned: "there is a high chance instantiating one ... does actually increase the number of threads rather than decrease it"

Actually, the motivation for set_default_executor(max_workers=1) is specifically because @johanneskoester reported seeing massive thread counts on Linux. I thought it indicated that something is indeed lazily instantiating these executors.

I understand that, and your fix w/o the set_default_executor(ThreadPoolExecutor(max_workers=1)) will also already fix the thread explosion, since you limit the number of async event loops to the number of workers plus the one on the main thread.

What I am unclear/worried about is: if you don't set the default executor, the event loop will only instantiate one WHEN it is needed, and I think it is not needed in most cases. If you instantiate an executor to set it as the default, the additional single thread is there regardless of whether it is used.


melund commented Jan 9, 2026

Yes. You are right. In this case it would always instantiate a single worker-thread for every concurrent job. But I suspect that there could be something that is indeed lazily instantiating these executors (likely aiofiles, implicit run_in_executor calls, or plugins).

The default ThreadPoolExecutor spawns min(32, os.cpu_count() + 4) worker-threads. With 20 concurrent jobs --> 20 loops -> 20 Default Executors × ~20 worker-threads = 400+ threads.
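That default sizing matches CPython's documented behavior and can be checked quickly (this peeks at the private _max_workers attribute purely for illustration; newer CPython uses os.process_cpu_count where available):

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Default sizing: min(32, cpu_count + 4); newer CPython uses process_cpu_count
cpus = getattr(os, "process_cpu_count", os.cpu_count)() or 1
expected = min(32, cpus + 4)

pool = ThreadPoolExecutor()  # no max_workers given -> default sizing
print(pool._max_workers == expected)  # True
pool.shutdown()
```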

My first PR, and yours, does nothing to fix that issue (I think 😅). It just prevents them from being constantly closed and created again.

Of course, if you could remove the places that instantiate the ThreadPoolExecutors, you could maybe solve the problem in another way.


melund commented Jan 9, 2026

@johanneskoester I am fine with @coroa's solution to the issues on Windows. I haven't run my large pipelines with his PR yet, but I assume it would solve the problem as well.

So if you like that solution better, just merge that. I am looking forward to a new version of snakemake which doesn't fail randomly on Windows.


coroa commented Jan 12, 2026

But I suspect that there could be something that is indeed lazily instantiating these executors (likely aiofiles, implicit run_in_executor calls, or plugins).

OK, I did a very loose review of how often run_in_executor appears in base libraries, and you are right, it's more often than I thought. It's used by aiofiles as well as aiohttp. So this might launch it more often.

What do you think about sharing a default ThreadPoolExecutor across all asyncio loops?
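A hypothetical sketch of that idea: hand one shared pool to every loop via set_default_executor, so run_in_executor(None, ...) from any loop funnels into the same threads (the "shared" thread-name prefix is an assumption of this sketch, used only to verify where the work lands):

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor

# One pool shared by all event loops, instead of one lazy pool per loop
shared_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="shared")

def new_loop_with_shared_pool():
    loop = asyncio.new_event_loop()
    loop.set_default_executor(shared_pool)  # blocking work funnels into one pool
    return loop

async def worker_thread():
    # run_in_executor(None, ...) now uses the shared pool, not a fresh default one
    return await asyncio.get_running_loop().run_in_executor(
        None, threading.current_thread
    )

loop_a, loop_b = new_loop_with_shared_pool(), new_loop_with_shared_pool()
thread_a = loop_a.run_until_complete(worker_thread())
thread_b = loop_b.run_until_complete(worker_thread())
# Both loops dispatched into threads of the shared pool
print(thread_a.name.startswith("shared") and thread_b.name.startswith("shared"))
loop_a.close()
loop_b.close()
```

One caveat: loop.close() shuts down the loop's default executor, so closing any loop shuts down the shared pool for all of them; a real implementation would need to manage the pool's lifetime separately.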


melund commented Jan 12, 2026

What do you think about sharing a default ThreadPoolExecutor across all asyncio loops?

I don't see any problems with that, but that doesn't count for much since I am not well versed in asyncio and how the different parts play together. I think that part could also come in a follow-up PR - I am really hoping for a new snakemake release soon 😅

@johanneskoester

Closing in favor of #3911.

@johanneskoester johanneskoester deleted the win-loop branch January 15, 2026 22:17
johanneskoester pushed a commit that referenced this pull request Jan 15, 2026
Interestingly, I debugged the same issue as reported in #3909 last week,
and came to a slightly different implementation for reusing the
asynchronous loop.

The benefit is that it is more explicit than the implicit reuse. The
drawback is that you still have multiple async runners, since the
`async_run` in
[`InputFiles._predicated_size_files`](https://github.com/coroa/snakemake/blob/401c86946e72f4e2112ae393ae745dcccc2ebe91/src/snakemake/io/__init__.py#L1927-L1937)
cannot access the workflow object, when it is being used through
[`eval_resource_expression`](https://github.com/coroa/snakemake/blob/401c86946e72f4e2112ae393ae745dcccc2ebe91/src/snakemake/resources.py#L526-L571).

### QC

* [x] The PR contains a test case for the changes or the changes are
already covered by an existing test case.
* [x] The documentation (`docs/`) is updated to reflect the changes or
this is not necessary (e.g. if the change does neither modify the
language nor the behavior or functionalities of Snakemake).


## Summary by CodeRabbit

* **Refactor**
* Centralized asynchronous dispatch into per-workflow runners with
orderly creation and teardown, replacing scattered external async calls.

* **New Features**
* Workflow-managed async execution now drives scheduling, execution,
caching, postprocessing and touch operations.
* Resource/input/benchmark evaluation callables and input helpers can
accept and propagate an async execution handle.

* **Bug Fixes**
* Fixed resource iteration extraction and simplified an IO-file error
message for clearer output.
