Skip to content

Clean up the leftover connection for finished threads in pinned thread mode#471

Merged
HyukjinKwon merged 1 commit intopy4j:masterfrom
HyukjinKwon:cleanup-connection
Mar 14, 2022
Merged

Clean up the leftover connection for finished threads in pinned thread mode#471
HyukjinKwon merged 1 commit intopy4j:masterfrom
HyukjinKwon:cleanup-connection

Conversation

@HyukjinKwon
Copy link
Copy Markdown
Member

@HyukjinKwon HyukjinKwon commented Mar 7, 2022

What is the problem?

Correctly, there is resource leak when using the pinned thread mode (see also apache/spark#24898).

For example, if you repeat the codes below multiple times to create Py4J connections in multiple threads,

# PySpark application
import threading

def print_prop():
    # Py4J connection is used under the hood.
    print(spark.sparkContext.getLocalProperty("a"))

threading.Thread(target=print_prop).start()

the number of leftover connections grows:

spark._jvm._gateway_client.deque
deque([<py4j.clientserver.ClientServerConnection object at 0x7fdc60170940>, <py4j.clientserver.ClientServerConnection object at 0x7fdca011e760>, <py4j.clientserver.ClientServerConnection object at 0x7fdcb01acdc0>, <py4j.clientserver.ClientServerConnection object at 0x7fdc60170100>, <py4j.clientserver.ClientServerConnection object at 0x7fdcb0232d30>])

In the environment where multiple threads are used without a pool, it easily causes "Too many files open" due to the lack of file descriptors (as they are all occupied by unclosed sockets).

How do you fix?

This PR adds another variable to thread local that cleans up the connection right before the thread is finished. We need it as a separate thread local because connection is NOT cleaned because the reference is being held at JavaClient.deque.

See also 50fe45e for more details.

@HyukjinKwon
Copy link
Copy Markdown
Member Author

cc @bartdag I badly wanted to make this fix. I am happy that I am finally able to submit this PR.

@WeichenXu123
Copy link
Copy Markdown

@HyukjinKwon awesome work to fix this!

@HyukjinKwon HyukjinKwon force-pushed the cleanup-connection branch 9 times, most recently from 2060594 to c093d12 Compare March 8, 2022 06:16
@HyukjinKwon HyukjinKwon marked this pull request as draft March 8, 2022 06:49
@HyukjinKwon HyukjinKwon marked this pull request as ready for review March 10, 2022 05:25
@HyukjinKwon
Copy link
Copy Markdown
Member Author

it should be ready for a look now.

Copy link
Copy Markdown
Collaborator

@bartdag bartdag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work. It took me a while to understand what was causing the issue in the first place.

I made a few suggestions, but I do not feel strongly towards them. Feel free to take what you want from them!

Thinking about the root cause, in the future, we may want to add a method to allow threads to explicitly tell when their work is done. With the default gateway approach, you can always call GatewayClient.close() to close all connections, but this does not make sense in the pinned thread model (you want to close your connection, not all connections).

That being said, your solution is the only sane way to stay backward compatible, so kudos!!

@HyukjinKwon HyukjinKwon merged commit 799ac4e into py4j:master Mar 14, 2022
@HyukjinKwon
Copy link
Copy Markdown
Member Author

Merged!

@HyukjinKwon HyukjinKwon added this to the 0.10.9.4 milestone Mar 16, 2022
HyukjinKwon added a commit to apache/spark that referenced this pull request Mar 16, 2022
### What changes were proposed in this pull request?

This PR upgrade Py4J 0.10.9.4, with relevant documentation changes.

### Why are the changes needed?

Py4J 0.10.9.3 has a resource leak issue when pinned thread mode is enabled - it's enabled by default in PySpark at 41af409.
We worked around this by enforcing users to use `InheritableThread` or `inhteritable_thread_target` as a workaround.
After upgrading, we don't need to enforce users anymore because it automatically cleans up, see also py4j/py4j#471

### Does this PR introduce _any_ user-facing change?

Yes, users don't have to use `InheritableThread` or `inhteritable_thread_target` to avoid resource leaking problem anymore.

### How was this patch tested?

CI in this PR should test it out.

Closes #35871 from HyukjinKwon/SPARK-38563.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 8193b40)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon added a commit to apache/spark that referenced this pull request Mar 16, 2022
### What changes were proposed in this pull request?

This PR upgrade Py4J 0.10.9.4, with relevant documentation changes.

### Why are the changes needed?

Py4J 0.10.9.3 has a resource leak issue when pinned thread mode is enabled - it's enabled by default in PySpark at 41af409.
We worked around this by enforcing users to use `InheritableThread` or `inhteritable_thread_target` as a workaround.
After upgrading, we don't need to enforce users anymore because it automatically cleans up, see also py4j/py4j#471

### Does this PR introduce _any_ user-facing change?

Yes, users don't have to use `InheritableThread` or `inhteritable_thread_target` to avoid resource leaking problem anymore.

### How was this patch tested?

CI in this PR should test it out.

Closes #35871 from HyukjinKwon/SPARK-38563.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon added a commit to apache/spark that referenced this pull request Mar 16, 2022
This PR upgrade Py4J 0.10.9.4, with relevant documentation changes.

Py4J 0.10.9.3 has a resource leak issue when pinned thread mode is enabled - it's enabled by default in PySpark at 41af409.
We worked around this by enforcing users to use `InheritableThread` or `inhteritable_thread_target` as a workaround.
After upgrading, we don't need to enforce users anymore because it automatically cleans up, see also py4j/py4j#471

Yes, users don't have to use `InheritableThread` or `inhteritable_thread_target` to avoid resource leaking problem anymore.

CI in this PR should test it out.

Closes #35871 from HyukjinKwon/SPARK-38563.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 8193b40)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants