[APP] ROCM RPC #1155

Merged
tqchen merged 1 commit into apache:master from tqchen:master
May 11, 2018

@tqchen tqchen commented May 11, 2018

cc @adityaatluri

The eager initialization of ROCm (specifically hip_hcc) can cause other problems; in particular, it makes ROCm incompatible with multiprocessing in Python.

@tqchen tqchen merged commit 7233ce1 into apache:master May 11, 2018
@bensander

Can you say more on which piece of the eager initialization causes issues?

tqchen commented May 11, 2018

Basically, the following code pattern no longer works:

- make no call into the ROCm API before this point
- fork a process using multiprocessing
- use the ROCm API

The above sequence works for CUDA and OpenCL because their state is lazily initialized. Eager initialization rules out some possible usages. In our case, we need to fork a new process to isolate each RPC session. Some deep learning libraries also start by using Python's multiprocessing to launch new processes that load data (and don't need the GPU) and only then use the GPU API; eager initialization prevents us from doing that.
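The sequence described above can be sketched roughly as follows. This is a minimal illustration, not code from the PR: `use_gpu_api`, `session`, and `run_isolated_session` are hypothetical names, and `use_gpu_api` is a placeholder that only returns the process ID instead of calling a real GPU runtime. It assumes a POSIX system where the `fork` start method is available.

```python
import multiprocessing as mp
import os


def use_gpu_api():
    # Hypothetical placeholder for the first real GPU runtime call.
    # It only reports which process made the call.
    return os.getpid()


def session(conn):
    # Step 3: the GPU API is first touched here, inside the forked child.
    conn.send(use_gpu_api())
    conn.close()


def run_isolated_session():
    # Step 1: nothing above this point has called into the GPU runtime.
    ctx = mp.get_context("fork")  # fork start method, POSIX only
    parent_conn, child_conn = ctx.Pipe()
    # Step 2: fork a fresh process to isolate the session.
    proc = ctx.Process(target=session, args=(child_conn,))
    proc.start()
    child_pid = parent_conn.recv()
    proc.join()
    return child_pid, proc.exitcode


if __name__ == "__main__":
    print(run_isolated_session())
```

With a lazily initialized runtime, the placeholder call could be swapped for a real one and the child would set up its own state. With an eagerly initialized runtime, the state already exists before step 2, which is the situation reported here.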

tqchen commented May 11, 2018

@bensander I am not sure which piece it is, but it should be part of the HSA runtime. You should be able to reproduce this by writing a program that forks at the very beginning and then uses the ROCm API.

tqchen commented May 11, 2018

@adityaatluri My guess is that the eager initialization comes from a static global variable or something similar, rather than from a static function that lazily initializes things on the first call.
ref https://isocpp.org/wiki/faq/ctors#static-init-order
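To illustrate the distinction between eager and lazy initialization across a fork, here is a rough Python analogy. It is not how hip_hcc or the HSA runtime is implemented: `FakeRuntime`, `EAGER_RUNTIME`, `lazy_runtime`, and `owners_match_in_child` are all hypothetical names, and the "runtime" only records the PID of the process that created it. A module-level object plays the role of a static global, and a cached function plays the role of initialize-on-first-call.

```python
import functools
import multiprocessing as mp
import os


class FakeRuntime:
    """Hypothetical stand-in for a GPU runtime handle; records its creator."""

    def __init__(self):
        self.owner_pid = os.getpid()


# Eager: built when the module is loaded, similar to a static global object.
EAGER_RUNTIME = FakeRuntime()


@functools.lru_cache(maxsize=None)
def lazy_runtime():
    # Lazy: built on the first call only, then cached for later calls.
    return FakeRuntime()


def _child(conn):
    me = os.getpid()
    # Compare each handle's creator against the current (child) process.
    conn.send((EAGER_RUNTIME.owner_pid == me, lazy_runtime().owner_pid == me))
    conn.close()


def owners_match_in_child():
    ctx = mp.get_context("fork")  # fork start method, POSIX only
    parent_conn, child_conn = ctx.Pipe()
    proc = ctx.Process(target=_child, args=(child_conn,))
    proc.start()
    result = parent_conn.recv()
    proc.join()
    return result


if __name__ == "__main__":
    print(owners_match_in_child())
```

In the child, the eager handle still belongs to the parent, while the lazy handle is created fresh after the fork, provided the parent never called `lazy_runtime()` first.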

tqchen commented May 11, 2018

@bensander

Got it. We have seen this behavior in a few other situations, but it seems to be coming up more often here. IIRC the issue was related to mapping queues into user-space memory, which breaks fork. We had a couple of solutions; let me follow up and see what we can do.

@aditya4d

How can I reproduce the issue?

tqchen added a commit to tqchen/tvm that referenced this pull request Jul 6, 2018
sergei-mironov pushed a commit to sergei-mironov/tvm that referenced this pull request Aug 8, 2018