[PluggableDevice] Add device alias support #43611
Conversation
Adding subdevice type support for resolving kernel registration conflicts between plugged devices and default devices in TensorFlow. For the default devices (GPU, CPU) in TensorFlow, the subdevice type is the same as the device type. For plugged third-party devices, the subdevice type is registered from the plugin.
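For illustration, a minimal sketch of the subdevice-type idea described above, assuming a simple string-keyed registry; `DeviceTypeRegistry`, `RegisterSubdevice`, and `GetSubdevice` are hypothetical names, not the actual TensorFlow API:

```cpp
#include <map>
#include <string>

// Hypothetical process-wide table mapping a device type to its subdevice type.
class DeviceTypeRegistry {
 public:
  static DeviceTypeRegistry& Global() {
    static DeviceTypeRegistry registry;
    return registry;
  }

  // Plugins register a distinct subdevice type for the device type they plug.
  void RegisterSubdevice(const std::string& device_type,
                         const std::string& subdevice_type) {
    subdevices_[device_type] = subdevice_type;
  }

  // Default devices (CPU, GPU) have no entry: subdevice type == device type.
  std::string GetSubdevice(const std::string& device_type) const {
    auto it = subdevices_.find(device_type);
    return it == subdevices_.end() ? device_type : it->second;
  }

 private:
  std::map<std::string, std::string> subdevices_;
};

// A plugin that plugs the "GPU" device type might register:
//   DeviceTypeRegistry::Global().RegisterSubdevice("GPU", "MY_VENDOR_GPU");
// so kernel registration can tell plugin kernels apart from CUDA kernels.
```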
@gbaned Sorry for the late response; we were on a long public holiday last week. Will address all comments soon. Thanks!
…ss device registry to get a subtype for a given device type as a part of kernel registration. Kernel registration is a part of the framework and should not depend on common_runtime, to avoid a circular dependency.
PiperOrigin-RevId: 336975046
Change-Id: I70c8c6594b730e16d6245bf58005b572f0a38ce3
…ss device registry to get a subtype for a given device type as a part of kernel registration. Kernel registration is a part of the framework and should not depend on common_runtime, to avoid a circular dependency.
PiperOrigin-RevId: 337019943
Change-Id: I96a896970e73782b1ba7734b6d508d4d77d1b295
@annarev It seems that device_factory moved back to core/common_runtime and caused the conflict. Should we add the header dependency back? Thanks
…ss device registry to get a subtype for a given device type as a part of kernel registration. Kernel registration is a part of the framework and should not depend on common_runtime, to avoid a circular dependency. This change was originally rolled back due to "CPU Factory not registered. Did you link in threadpool_device?" on Windows. The error happened because both tensorflow.dll and pywrap_tfe.dll contained the device_factory implementation: CPU was registered in tensorflow.dll but queried in pywrap_tfe.dll. This should be fixed now by cleaning up pywrap_tfe.dll dependencies (f5af1da).
PiperOrigin-RevId: 337585576
Change-Id: Idec2c8ad3ad32e02b972ab0438cc3c232b6be532
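A sketch of the dependency-inversion pattern these commit messages describe, under the assumption that the framework layer exposes a settable hook that common_runtime fills in at startup; `SubtypeLookupFn` and `SetSubtypeLookup` are illustrative names, not the real code:

```cpp
#include <functional>
#include <string>
#include <utility>

// framework/ side: no dependency on common_runtime.
using SubtypeLookupFn = std::function<std::string(const std::string&)>;

SubtypeLookupFn& SubtypeLookup() {
  // Default: a device's subtype is its own type, matching first-party devices.
  static SubtypeLookupFn fn = [](const std::string& device_type) {
    return device_type;
  };
  return fn;
}

void SetSubtypeLookup(SubtypeLookupFn fn) { SubtypeLookup() = std::move(fn); }

// common_runtime/ side (which may depend on framework, but not vice versa)
// would install the real lookup at startup, e.g.:
//   SetSubtypeLookup([](const std::string& type) {
//     return LookUpRegisteredSubtype(type);  // query the real device registry
//   });
```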
annarev left a comment
This looks good. Just a few more comments.
Thanks @sanjoy for bringing up the idea and discussion. The alias device type still seems to be the best approach we can see currently; removing it would require some additional complex design, e.g. for eager support (it would need model changes, so it would not be transparent to the user, and a front-end "GPU" device redirect to the pluggable device would also need to be added), and graph optimization execution would also need to change. Any ideas?
Yeah, so if this were to run as part of the import passes then yes, you'd still have GraphDef (well, for SavedModel you would at loading; during execution you have the similar but different tensorflow::Graph). As GraphDef gets used less for internal graph representation and optimizations, we can avoid the extra overhead of converting to GraphDef; IMHO we already convert too often between the two formats, and I'd rather iterate toward a state where we only have GraphDef as the resting input format.
Putting the modular TF Graph pass as the first step of the pipeline won't work for pluggable devices. The pluggable devices need a chance to conduct target-dependent optimization, which may invalidate target-independent Grappler passes.
What's the underlying reason to alias over the existing name? I don't think we encode any particular meaning in the fact that something is a "GPU" vs "TPU" vs "XPU". |
One reason that comes to mind is conditional checks. We might be checking whether something is a GPU to run general GPU logic, or we might be checking whether something is a GPU to run CUDA-specific logic. We could probably differentiate whether something is a pluggable device or not instead (if pluggable, don't run CUDA-specific logic). Another reason is supporting running code only on a specific type of GPU, e.g. having …
And also, aliasing over the existing name will give the pluggable device the opportunity to run legacy models (…
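To illustrate the conditional-check concern above, a minimal sketch in which plugin-provided devices are distinguished from CUDA GPUs by a flag; `DeviceInfo` and its `is_pluggable` field are hypothetical, not an existing TensorFlow API:

```cpp
#include <string>

struct DeviceInfo {
  std::string type;   // e.g. "GPU"
  bool is_pluggable;  // true for a plugin device aliasing the "GPU" name
};

void MaybeRunGpuLogic(const DeviceInfo& d) {
  if (d.type == "GPU") {
    // General GPU logic: intended to be safe for both CUDA GPUs and
    // pluggable devices that alias "GPU".
  }
  if (d.type == "GPU" && !d.is_pluggable) {
    // CUDA-specific logic: must be skipped for pluggable devices that
    // merely alias the "GPU" name.
  }
}
```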
I just want to clarify: is aliasing necessary for custom devices to work at all? Or is this a convenience to support existing models? (I don't think it's required, because things like TPUs work, but please correct me if I'm wrong.) Existing code that checks for "device:GPU" is likely specialized for CUDA GPUs, so I don't think it's correct in general to assume we can alias over it. If it's not a hard requirement, I'd recommend we move ahead with the other PRs related to pluggable devices until we understand this a bit better.
I would say yes: we do see a need to run existing TF applications with "device:GPU" as is. GPUs have a broad market, so we expect more entry-level users, and we want to enable them transparently. If TF's "device:GPU" implies CUDA, that is against the spirit of modular TF, which intends to provide a vendor-neutral interface. We are working on making the CUDA-specific code more general.
I joined the discussion late, so please correct me if I missed some context here. As @Jianhui-Li mentioned, it would be great if we can enable existing TF applications on GPU seamlessly. As we are adding Metal backend support, we noticed this dependency as well, and as part of the pluggable design it gives us an opportunity to straighten it out. @rjpower Is the concern here that the change has a wider impact and will break existing code paths in DEVICE_GPU?
@rjpower Any update on this PR? Please. Thanks!
@kulinseth Sorry for the late reply! I believe the concern here is that whatever we choose here (having a …

@gbaned This PR will be on hold for a while until we can make a decision on how we want to move forward.
@penpornk Thanks for the reply. Makes sense, and since this is not blocking other PRs, we can take a look later and not hold them up.
(Removing the …
@penpornk Any update on this PR? Please. Thanks!

@gbaned It's still on hold. Sorry!

@penpornk Any update on this PR? Please. Thanks!
@gbaned We are exploring an alternative. I'll temporarily close this PR; we can reopen it later if we want to fall back to this one. Sorry for the inconvenience!
Adding device alias support for resolving kernel registration
conflicts between third-party devices and first-party devices in TensorFlow.
For first-party devices (GPU, CPU) in TensorFlow, the alias is empty.
For third-party devices, the alias is registered from the plugin.
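As a rough illustration of how an empty versus plugin-registered alias could keep kernel registrations from colliding, here is a hedged sketch; `KernelKey` and this toy registry are illustrative, not TensorFlow's actual kernel registry:

```cpp
#include <map>
#include <string>
#include <tuple>

// Key: (op name, device type, alias). First-party kernels use an empty alias.
using KernelKey = std::tuple<std::string, std::string, std::string>;

std::map<KernelKey, std::string>& KernelRegistry() {
  static std::map<KernelKey, std::string> registry;
  return registry;
}

void RegisterKernel(const std::string& op, const std::string& device_type,
                    const std::string& alias, const std::string& impl) {
  KernelRegistry()[{op, device_type, alias}] = impl;
}

int main() {
  // First-party GPU kernel: empty alias.
  RegisterKernel("MatMul", "GPU", "", "cuda_matmul");
  // Plugin kernel plugging "GPU": its registered alias keeps the key unique,
  // so both registrations coexist instead of conflicting.
  RegisterKernel("MatMul", "GPU", "MY_PLUGIN", "plugin_matmul");
}
```

The point is only that the alias becomes part of the registration key, so a plugin kernel for "GPU" no longer shadows or conflicts with the first-party CUDA kernel.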