Conversation

@jzhoulon
Contributor

@jzhoulon jzhoulon commented Sep 28, 2020

Adding device alias support for resolving kernel registration
conflicts between third-party devices and first-party devices in TensorFlow.
For first-party devices (GPU, CPU) in TensorFlow, the alias is empty.
For third-party devices, the alias is registered from the plugin.
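The alias mechanism described above can be sketched as a small lookup model. This is plain Python and illustrative only; the device names, kernel names, and `find_kernel` helper are assumptions for the sketch, not the actual TensorFlow C++ registry:

```python
# Hypothetical model of alias-based kernel resolution (not TensorFlow code).

# Registry mapping device type -> alias. First-party devices have an
# empty alias; a plugin registers an alias such as "GPU".
device_alias = {
    "CPU": "",           # first party: no alias
    "GPU": "",           # first party: no alias
    "INTEL_GPU": "GPU",  # third party: plugin registers "GPU" as its alias
}

# Kernels registered per (op, device type).
kernels = {
    ("MatMul", "CPU"): "matmul_cpu_kernel",
    ("MatMul", "INTEL_GPU"): "matmul_plugin_kernel",
}

def find_kernel(op, device_type):
    """Look up a kernel, falling back through the alias relation."""
    if (op, device_type) in kernels:
        return kernels[(op, device_type)]
    # If a plugged device aliases this device type, its kernel can
    # satisfy the request, resolving the registration conflict.
    for plugged, alias in device_alias.items():
        if alias == device_type and (op, plugged) in kernels:
            return kernels[(op, plugged)]
    return None

print(find_kernel("MatMul", "GPU"))  # -> 'matmul_plugin_kernel'
```

Under this model, a request for a "GPU" kernel is served by the plugin's kernel when no first-party one is registered, which is the conflict-resolution behavior the PR description refers to.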

@google-ml-butler google-ml-butler bot added the size:M CL Change Size: Medium label Sep 28, 2020
@jzhoulon
Contributor Author

@annarev @penpornk @yisitu This is the second PR for subdevice type support in TensorFlow. Please help review it. Thanks!

@gbaned gbaned self-assigned this Sep 28, 2020
Adding subdevice type support for resolving kernel registration
conflicts between plugged devices and default devices in TensorFlow.
For default devices (GPU, CPU) in TensorFlow, the subdevice type is
the same as the device type.
For plugged third-party devices, the subdevice type is registered
from the plugin.
@penpornk penpornk requested a review from annarev September 28, 2020 17:43
@gbaned gbaned added the awaiting review Pull request awaiting review label Sep 30, 2020
@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Oct 3, 2020
@gbaned
Contributor

gbaned commented Oct 6, 2020

@jzhoulon Can you please check @annarev's comments and keep us posted? Thanks!

@gbaned gbaned added the stat:awaiting response Status - Awaiting response from author label Oct 6, 2020
@yiqianglee
Contributor

@gbaned, sorry for the late response; we were on a long public holiday last week. We will address all comments soon. Thanks!

@jzhoulon
Contributor Author

jzhoulon commented Oct 9, 2020

@annarev @gbaned All the comments are addressed; please help review. Thanks.

@gbaned gbaned removed the stat:awaiting response Status - Awaiting response from author label Oct 9, 2020
copybara-service bot pushed a commit that referenced this pull request Oct 13, 2020
…ss device registry to get a subtype for a given device type as a part of kernel registration. Kernel registration is a part of framework and should not depend on common_runtime to avoid a circular dependency.

PiperOrigin-RevId: 336975046
Change-Id: I70c8c6594b730e16d6245bf58005b572f0a38ce3
copybara-service bot pushed a commit that referenced this pull request Oct 14, 2020
…ss device registry to get a subtype for a given device type as a part of kernel registration. Kernel registration is a part of framework and should not depend on common_runtime to avoid a circular dependency.

PiperOrigin-RevId: 337019943
Change-Id: I96a896970e73782b1ba7734b6d508d4d77d1b295
@jzhoulon
Contributor Author

jzhoulon commented Oct 14, 2020

@annarev It seems that device_factory moved back to core/common_runtime and caused the conflict. Should we add the header dependency back? Thanks.

copybara-service bot pushed a commit that referenced this pull request Oct 16, 2020
…ss device registry to get a subtype for a given device type as a part of kernel registration. Kernel registration is a part of framework and should not depend on common_runtime to avoid a circular dependency.

This change was originally rolled back due to "CPU Factory not registered. Did you link in threadpool_device?" on Windows. The error happened because both tensorflow.dll and pywrap_tfe.dll contained device_factory implementation. CPU was registered in tensorflow.dll but queried in pywrap_tfe.dll. This should be fixed now by cleaning up pywrap_tfe.dll dependencies (f5af1da).

PiperOrigin-RevId: 337585576
Change-Id: Idec2c8ad3ad32e02b972ab0438cc3c232b6be532
@annarev
Contributor

annarev commented Oct 16, 2020

@annarev It seems that device_factory moved back to core/common_runtime and caused the conflict. Should we add the header dependency back? Thanks.

Yep, sorry for the confusion. I had to roll it back due to a build failure. However, it has been fixed today and re-committed: c60ced5

Contributor

@annarev annarev left a comment


This looks good. Just a few more comments.

@gbaned
Contributor

gbaned commented Oct 19, 2020

@jzhoulon Can you please check @annarev's comments and keep us posted? Thanks!

@gbaned gbaned added the stat:awaiting response Status - Awaiting response from author label Oct 19, 2020
@yiqianglee
Contributor

yiqianglee commented Dec 1, 2020

Thanks @sanjoy for bringing up the idea and discussion. The alias device type still seems like the better approach currently; removing it would require some additional complex design, such as eager support (it would need model changes, would not be transparent to users, and a front-end "GPU" device redirect to the pluggable device would need to be added), and graph optimization execution would also need to change. Any ideas?

@annarev @penpornk Do you have any comments?

@jpienaar
Member

jpienaar commented Dec 1, 2020

The other concern is that since the grappler pass for Modular TF Graph happens at the last step

Separately, have you considered putting the modular TF Graph pass at the first step of the pipeline? That will probably be more future proof -- as we convert more things in the pipeline to use MLIR, it would be harder to have a GraphDef representation of a TF graph that has gone through some stages of optimization. OTOH we will have to support SavedModel for a long time to come anyway, so having a GraphDef representation in the first stage of the optimization pipeline is probably OK.

CC @jpienaar

Yeah, so if this were to run as part of the import passes, then yes, you'd still have GraphDef (well, for SavedModel you would at loading; during execution you have the similar-but-different tensorflow::Graph). As GraphDef gets used less for internal graph representation and optimizations, we can avoid the extra overhead of converting to GraphDef -- IMHO we already convert too often between the two formats, and I'd rather iterate toward having GraphDef only as the resting input format.

@Jianhui-Li

The other concern is that since the grappler pass for Modular TF Graph happens at the last step

Separately, have you considered putting the modular TF Graph pass at the first step of the pipeline? That will probably be more future proof -- as we convert more things in the pipeline to use MLIR, it would be harder to have a GraphDef representation of a TF graph that has gone through some stages of optimization. OTOH we will have to support SavedModel for a long time to come anyway, so having a GraphDef representation in the first stage of the optimization pipeline is probably OK.
CC @jpienaar

Yeah, so if this were to run as part of the import passes, then yes, you'd still have GraphDef (well, for SavedModel you would at loading; during execution you have the similar-but-different tensorflow::Graph). As GraphDef gets used less for internal graph representation and optimizations, we can avoid the extra overhead of converting to GraphDef -- IMHO we already convert too often between the two formats, and I'd rather iterate toward having GraphDef only as the resting input format.

Putting the modular TF Graph pass at the first step of the pipeline won't work for pluggable devices. The pluggable devices need a chance to conduct target-dependent optimization, which may invalidate target-independent grappler passes.

@rjpower
Contributor

rjpower commented Dec 1, 2020

What's the underlying reason to alias over the existing name? I don't think we encode any particular meaning in the fact that something is a "GPU" vs "TPU" vs "XPU".

@annarev
Contributor

annarev commented Dec 1, 2020

What's the underlying reason to alias over the existing name? I don't think we encode any particular meaning in the fact that something is a "GPU" vs "TPU" vs "XPU".

One reason that comes to mind is conditional checks. We might be checking if something is GPU to run general GPU logic, or we might be checking that something is GPU to run CUDA specific logic. We could probably differentiate whether something is a pluggable device or not instead (if pluggable, don't run CUDA-specific logic).

Another reason is supporting running code only on a specific type of GPU. For example, having with tf.device('/INTEL_GPU:0') (say, failing if INTEL_GPU is not available in this case). I don't know if there is a known use case for this, though.
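The conditional-check concern above can be illustrated with a toy dispatch function. This is plain Python; `is_pluggable`, `choose_path`, and the device-type names are hypothetical, not TensorFlow APIs:

```python
# Toy illustration of the two kinds of "is this a GPU?" checks described
# above. All names here are assumptions for the sketch.

PLUGGABLE_TYPES = {"INTEL_GPU", "METAL_GPU"}  # assumed plugin device types

def is_pluggable(device_type):
    """Whether the device type comes from a plugin rather than core TF."""
    return device_type in PLUGGABLE_TYPES

def choose_path(device_type):
    """Return which code path a conditional device check would select."""
    gpu_like = device_type == "GPU" or is_pluggable(device_type)
    if not gpu_like:
        return "cpu_path"
    # General GPU logic is safe for any GPU-like device, but
    # CUDA-specific logic must not run for pluggable devices --
    # this is exactly the distinction aliasing could blur.
    if device_type == "GPU":
        return "cuda_specific_path"
    return "generic_gpu_path"
```

With an explicit pluggable check like this, a plugged device aliased to "GPU" would take the generic path while only the true CUDA device takes the CUDA-specific one.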

@yiqianglee
Contributor

yiqianglee commented Dec 2, 2020

What's the underlying reason to alias over the existing name? I don't think we encode any particular meaning in the fact that something is a "GPU" vs "TPU" vs "XPU".

One reason that comes to mind is conditional checks. We might be checking if something is GPU to run general GPU logic, or we might be checking that something is GPU to run CUDA specific logic. We could probably differentiate whether something is a pluggable device or not instead (if pluggable, don't run CUDA-specific logic).

Another reason is supporting running code only on a specific type of GPU. For example, having with tf.device('/INTEL_GPU:0') (say, failing if INTEL_GPU is not available in this case). I don't know if there is a known use case for this, though.

Also, aliasing over the existing name gives a pluggable device the opportunity to run legacy models (with tf.device('/gpu:0')) without changing any user code.
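The transparency argument above can be sketched as a toy placement model. This is plain Python, not TensorFlow; the alias table, `place` helper, and device names are illustrative assumptions:

```python
# Toy placement model of the legacy-model argument: a request for "GPU"
# is redirected to the plugged device through the alias. Not real TF code.

ALIAS_TO_PLUGGED = {"GPU": "INTEL_GPU"}  # plugin registered "GPU" as its alias

def place(requested, available=frozenset({"CPU", "INTEL_GPU"})):
    """Resolve a requested device type to an available one."""
    if requested in available:
        return requested
    # Legacy scripts using with tf.device('/gpu:0') keep working: the
    # "GPU" request is redirected to the plugged device via the alias.
    redirected = ALIAS_TO_PLUGGED.get(requested)
    if redirected in available:
        return redirected
    raise ValueError(f"no device available for {requested!r}")

print(place("GPU"))  # -> 'INTEL_GPU': legacy code runs unchanged
```

In this sketch the user's device string never changes; only the registry-side alias determines where the op lands.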

@rjpower
Contributor

rjpower commented Dec 4, 2020

I just want to clarify: is aliasing necessary for custom devices to work at all? Or is this a convenience to support existing models? (I don't think it's required, because things like TPUs work, but please correct me if I'm wrong). Existing code that checks for "device:GPU" is likely specialized for CUDA GPUs, so I don't think it's correct in general to assume we can alias over it.

If it's not a hard requirement, I'd recommend we move ahead with the other PRs related to pluggable devices until we understand this a bit better.

@Jianhui-Li

I just want to clarify: is aliasing necessary for custom devices to work at all? Or is this a convenience to support existing models? (I don't think it's required, because things like TPUs work, but please correct me if I'm wrong). Existing code that checks for "device:GPU" is likely specialized for CUDA GPUs, so I don't think it's correct in general to assume we can alias over it.

If it's not a hard requirement, I'd recommend we move ahead with the other PRs related to pluggable devices until we understand this a bit better.

I would say yes; we do see a need to run existing TF applications with "device:GPU" as is. GPUs have a broad market, so we expect more entry-level users, and we want to enable them transparently. If TF's "device:GPU" implies CUDA, that is against the spirit of modular TF, which intends to provide a vendor-neutral interface. We are working on the CUDA-specific code to make it more general.

@kulinseth
Contributor

I just want to clarify: is aliasing necessary for custom devices to work at all? Or is this a convenience to support existing models? (I don't think it's required, because things like TPUs work, but please correct me if I'm wrong). Existing code that checks for "device:GPU" is likely specialized for CUDA GPUs, so I don't think it's correct in general to assume we can alias over it.
If it's not a hard requirement, I'd recommend we move ahead with the other PRs related to pluggable devices until we understand this a bit better.

I would say yes; we do see a need to run existing TF applications with "device:GPU" as is. GPUs have a broad market, so we expect more entry-level users, and we want to enable them transparently. If TF's "device:GPU" implies CUDA, that is against the spirit of modular TF, which intends to provide a vendor-neutral interface. We are working on the CUDA-specific code to make it more general.

I joined the discussion late, so please correct me if I missed some context here. As @Jianhui-Li mentioned, it would be great if we can enable existing TF applications on GPU seamlessly. As we are adding Metal backend support, we noticed this dependency as well, and as part of the pluggable design it gives an opportunity to straighten it out.

@rjpower Is the concern here that the change has a wider impact and will break existing code paths in DEVICE_GPU?

@rohan100jain rohan100jain requested a review from rjpower December 9, 2020 21:31
@gbaned gbaned added the awaiting review Pull request awaiting review label Dec 16, 2020
@gbaned
Contributor

gbaned commented Dec 23, 2020

@rjpower Any update on this PR? Please. Thanks!

@gbaned gbaned added stat:awaiting tensorflower Status - Awaiting response from tensorflower and removed awaiting review Pull request awaiting review labels Dec 23, 2020
@penpornk
Member

@kulinseth Sorry for the late reply! I believe the concern here is that whatever we choose here (having a device_alias or not), it might restrict how TensorFlow does device placement in the future. And there have been several existing issues with device placement to begin with. So some of us wanted to explore the impact on that side more before we proceed here. This PR is not blocking other PluggableDevice PRs and will likely be merged last. (We are still aiming for TF 2.5.)

@gbaned This PR will be on hold for a while until we can make a decision on how we want to move forward.

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Dec 25, 2020
@kulinseth
Contributor

@penpornk Thanks for the reply. Makes sense and since this is not blocking other PRs, we can take a look later and not hold them up.

@penpornk penpornk removed the API review API Review label Jan 14, 2021
@penpornk
Member

(Removing the API review tag for now. Will add it back once the PR is ready for API review again.)

@gbaned
Contributor

gbaned commented Feb 4, 2021

@penpornk Any update on this PR? Please. Thanks!

@penpornk
Member

penpornk commented Feb 4, 2021

@gbaned It's still on hold. Sorry!

@gbaned
Contributor

gbaned commented Apr 21, 2021

@penpornk Any update on this PR? Please. Thanks!

@penpornk
Member

penpornk commented Apr 21, 2021

@gbaned We are exploring an alternative. I'll temporarily close this PR. We can reopen it later if we want to fall back to this one. Sorry for the inconvenience!
