Conversation

@jzhoulon
Contributor

@jzhoulon jzhoulon commented Sep 28, 2020

Adding device alias support for resolving kernel registration
conflicts between third-party devices and first-party devices in TensorFlow.
For first-party devices (GPU, CPU) in TensorFlow, the alias is empty.
For third-party devices, the alias is registered from the plugin.
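The alias mechanism described above can be sketched as a small lookup model. This is plain Python and illustrative only; the device names, kernel names, and `find_kernel` helper are assumptions for the sketch, not the actual TensorFlow C++ registry:

```python
# Hypothetical model of alias-based kernel resolution (not TensorFlow code).

# Registry mapping device type -> alias. First-party devices have an
# empty alias; a plugin registers an alias such as "GPU".
device_alias = {
    "CPU": "",           # first party: no alias
    "GPU": "",           # first party: no alias
    "INTEL_GPU": "GPU",  # third party: plugin registers "GPU" as its alias
}

# Kernels registered per (op, device type).
kernels = {
    ("MatMul", "CPU"): "matmul_cpu_kernel",
    ("MatMul", "INTEL_GPU"): "matmul_plugin_kernel",
}

def find_kernel(op, device_type):
    """Look up a kernel, falling back through the alias relation."""
    if (op, device_type) in kernels:
        return kernels[(op, device_type)]
    # If a plugged device aliases this device type, its kernel can
    # satisfy the request, resolving the registration conflict.
    for plugged, alias in device_alias.items():
        if alias == device_type and (op, plugged) in kernels:
            return kernels[(op, plugged)]
    return None

print(find_kernel("MatMul", "GPU"))  # -> 'matmul_plugin_kernel'
```

Under this model, a request for a "GPU" kernel is served by the plugin's kernel when no first-party one is registered, which is the conflict-resolution behavior the PR description refers to.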

@google-ml-butler google-ml-butler bot added the size:M CL Change Size: Medium label Sep 28, 2020
@jzhoulon
Contributor Author

@annarev @penpornk @yisitu This is the second PR for subdevice type support in TensorFlow. Please help review it. Thanks!

@gbaned gbaned self-assigned this Sep 28, 2020
Adding subdevice type support for resolving kernel registration
conflicts between plugged devices and default devices in TensorFlow.
For default devices (GPU, CPU) in TensorFlow, the subdevice type is
the same as the device type.
For plugged third-party devices, the subdevice type is registered
from the plugin.
@penpornk penpornk requested a review from annarev September 28, 2020 17:43
@gbaned gbaned added the awaiting review Pull request awaiting review label Sep 30, 2020
@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Oct 3, 2020
@gbaned
Contributor

gbaned commented Oct 6, 2020

@jzhoulon Can you please check @annarev's comments and keep us posted? Thanks!

@gbaned gbaned added the stat:awaiting response Status - Awaiting response from author label Oct 6, 2020
@yiqianglee
Contributor

@gbaned, sorry for the late response; we were on a long public holiday last week. We will address all comments soon. Thanks!

@jzhoulon
Contributor Author

jzhoulon commented Oct 9, 2020

@annarev @gbaned All the comments are addressed; please help review. Thanks.

@gbaned gbaned removed the stat:awaiting response Status - Awaiting response from author label Oct 9, 2020
copybara-service bot pushed a commit that referenced this pull request Oct 13, 2020
…ss device registry to get a subtype for a given device type as a part of kernel registration. Kernel registration is a part of framework and should not depend on common_runtime to avoid a circular dependency.

PiperOrigin-RevId: 336975046
Change-Id: I70c8c6594b730e16d6245bf58005b572f0a38ce3
copybara-service bot pushed a commit that referenced this pull request Oct 14, 2020
…ss device registry to get a subtype for a given device type as a part of kernel registration. Kernel registration is a part of framework and should not depend on common_runtime to avoid a circular dependency.

PiperOrigin-RevId: 337019943
Change-Id: I96a896970e73782b1ba7734b6d508d4d77d1b295
@jzhoulon
Contributor Author

jzhoulon commented Oct 14, 2020

@annarev It seems that device_factory moved back to core/common_runtime and caused the conflict. Should we add the header dependency back? Thanks.

copybara-service bot pushed a commit that referenced this pull request Oct 16, 2020
…ss device registry to get a subtype for a given device type as a part of kernel registration. Kernel registration is a part of framework and should not depend on common_runtime to avoid a circular dependency.

This change was originally rolled back due to "CPU Factory not registered. Did you link in threadpool_device?" on Windows. The error happened because both tensorflow.dll and pywrap_tfe.dll contained device_factory implementation. CPU was registered in tensorflow.dll but queried in pywrap_tfe.dll. This should be fixed now by cleaning up pywrap_tfe.dll dependencies (f5af1da).

PiperOrigin-RevId: 337585576
Change-Id: Idec2c8ad3ad32e02b972ab0438cc3c232b6be532
@annarev
Contributor

annarev commented Oct 16, 2020

@annarev It seems that device_factory moved back to core/common_runtime and caused the conflict. Should we add the header dependency back? Thanks.

Yep, sorry for the confusion. I had to roll it back due to a build failure. However, it has been fixed today and re-committed: c60ced5

Contributor

@annarev annarev left a comment


This looks good. Just a few more comments.

@gbaned
Contributor

gbaned commented Oct 19, 2020

@jzhoulon Can you please check @annarev's comments and keep us posted? Thanks!

@gbaned gbaned added the stat:awaiting response Status - Awaiting response from author label Oct 19, 2020
@yiqianglee
Contributor

yiqianglee commented Dec 1, 2020

Thanks @sanjoy for bringing up the idea and discussion. The alias device type still seems like the better approach currently; removing it would require some additional complex design, such as eager support (it would need model changes, would not be transparent to users, and a front-end "GPU" device redirect to the pluggable device would need to be added), and graph optimization execution would also need to change. Any ideas?

@annarev @penpornk Do you have any comments?

@jpienaar
Member

jpienaar commented Dec 1, 2020

The other concern is that since the grappler pass for Modular TF Graph happens at the last step

Separately, have you considered putting the modular TF Graph pass at the first step of the pipeline? That will probably be more future proof -- as we convert more things in the pipeline to use MLIR, it would be harder to have a GraphDef representation of a TF graph that has gone through some stages of optimization. OTOH we will have to support SavedModel for a long time to come anyway, so having a GraphDef representation in the first stage of the optimization pipeline is probably OK.

CC @jpienaar

Yeah, so if this were to run as part of the import passes, then yes, you'd still have GraphDef (well, for SavedModel you would at loading; during execution you have the similar-but-different tensorflow::Graph). As GraphDef gets used less for internal graph representation and optimizations, we can avoid the extra overhead of converting to GraphDef -- IMHO we already convert too often between the two formats, and I'd rather iterate toward having GraphDef only as the resting input format.

@Jianhui-Li

The other concern is that since the grappler pass for Modular TF Graph happens at the last step

Separately, have you considered putting the modular TF Graph pass at the first step of the pipeline? That will probably be more future proof -- as we convert more things in the pipeline to use MLIR, it would be harder to have a GraphDef representation of a TF graph that has gone through some stages of optimization. OTOH we will have to support SavedModel for a long time to come anyway, so having a GraphDef representation in the first stage of the optimization pipeline is probably OK.
CC @jpienaar

Yeah, so if this were to run as part of the import passes, then yes, you'd still have GraphDef (well, for SavedModel you would at loading; during execution you have the similar-but-different tensorflow::Graph). As GraphDef gets used less for internal graph representation and optimizations, we can avoid the extra overhead of converting to GraphDef -- IMHO we already convert too often between the two formats, and I'd rather iterate toward having GraphDef only as the resting input format.

Putting the modular TF Graph pass at the first step of the pipeline won't work for pluggable devices. The pluggable devices need a chance to conduct target-dependent optimization, which may invalidate target-independent grappler passes.

@rjpower
Contributor

rjpower commented Dec 1, 2020

What's the underlying reason to alias over the existing name? I don't think we encode any particular meaning in the fact that something is a "GPU" vs "TPU" vs "XPU".

@annarev
Contributor

annarev commented Dec 1, 2020

What's the underlying reason to alias over the existing name? I don't think we encode any particular meaning in the fact that something is a "GPU" vs "TPU" vs "XPU".

One reason that comes to mind is conditional checks. We might be checking if something is GPU to run general GPU logic, or we might be checking that something is GPU to run CUDA specific logic. We could probably differentiate whether something is a pluggable device or not instead (if pluggable, don't run CUDA-specific logic).

Another reason is supporting running code only on a specific type of GPU. For example, having with tf.device('/INTEL_GPU:0') (say, failing if INTEL_GPU is not available in this case). I don't know if there is a known use case for this, though.
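The conditional-check concern above can be illustrated with a toy dispatch function. This is plain Python; `is_pluggable`, `choose_path`, and the device-type names are hypothetical, not TensorFlow APIs:

```python
# Toy illustration of the two kinds of "is this a GPU?" checks described
# above. All names here are assumptions for the sketch.

PLUGGABLE_TYPES = {"INTEL_GPU", "METAL_GPU"}  # assumed plugin device types

def is_pluggable(device_type):
    """Whether the device type comes from a plugin rather than core TF."""
    return device_type in PLUGGABLE_TYPES

def choose_path(device_type):
    """Return which code path a conditional device check would select."""
    gpu_like = device_type == "GPU" or is_pluggable(device_type)
    if not gpu_like:
        return "cpu_path"
    # General GPU logic is safe for any GPU-like device, but
    # CUDA-specific logic must not run for pluggable devices --
    # this is exactly the distinction aliasing could blur.
    if device_type == "GPU":
        return "cuda_specific_path"
    return "generic_gpu_path"
```

With an explicit pluggable check like this, a plugged device aliased to "GPU" would take the generic path while only the true CUDA device takes the CUDA-specific one.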

@yiqianglee
Contributor

yiqianglee commented Dec 2, 2020

What's the underlying reason to alias over the existing name? I don't think we encode any particular meaning in the fact that something is a "GPU" vs "TPU" vs "XPU".

One reason that comes to mind is conditional checks. We might be checking if something is GPU to run general GPU logic, or we might be checking that something is GPU to run CUDA specific logic. We could probably differentiate whether something is a pluggable device or not instead (if pluggable, don't run CUDA-specific logic).

Another reason is supporting running code only on a specific type of GPU. For example, having with tf.device('/INTEL_GPU:0') (say, failing if INTEL_GPU is not available in this case). I don't know if there is a known use case for this, though.

Also, aliasing over the existing name gives a pluggable device the opportunity to run legacy models (with tf.device('/gpu:0')) without changing any user code.
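The transparency argument above can be sketched as a toy placement model. This is plain Python, not TensorFlow; the alias table, `place` helper, and device names are illustrative assumptions:

```python
# Toy placement model of the legacy-model argument: a request for "GPU"
# is redirected to the plugged device through the alias. Not real TF code.

ALIAS_TO_PLUGGED = {"GPU": "INTEL_GPU"}  # plugin registered "GPU" as its alias

def place(requested, available=frozenset({"CPU", "INTEL_GPU"})):
    """Resolve a requested device type to an available one."""
    if requested in available:
        return requested
    # Legacy scripts using with tf.device('/gpu:0') keep working: the
    # "GPU" request is redirected to the plugged device via the alias.
    redirected = ALIAS_TO_PLUGGED.get(requested)
    if redirected in available:
        return redirected
    raise ValueError(f"no device available for {requested!r}")

print(place("GPU"))  # -> 'INTEL_GPU': legacy code runs unchanged
```

In this sketch the user's device string never changes; only the registry-side alias determines where the op lands.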

@rjpower
Contributor

rjpower commented Dec 4, 2020

I just want to clarify: is aliasing necessary for custom devices to work at all? Or is this a convenience to support existing models? (I don't think it's required, because things like TPUs work, but please correct me if I'm wrong). Existing code that checks for "device:GPU" is likely specialized for CUDA GPUs, so I don't think it's correct in general to assume we can alias over it.

If it's not a hard requirement, I'd recommend we move ahead with the other PRs related to pluggable devices until we understand this a bit better.

@Jianhui-Li

I just want to clarify: is aliasing necessary for custom devices to work at all? Or is this a convenience to support existing models? (I don't think it's required, because things like TPUs work, but please correct me if I'm wrong). Existing code that checks for "device:GPU" is likely specialized for CUDA GPUs, so I don't think it's correct in general to assume we can alias over it.

If it's not a hard requirement, I'd recommend we move ahead with the other PRs related to pluggable devices until we understand this a bit better.

I would say yes; we do see a need to run existing TF applications with "device:GPU" as is. GPUs have a broad market, so we expect more entry-level users, and we want to enable them transparently. If TF's "device:GPU" implies CUDA, that is against the spirit of modular TF, which intends to provide a vendor-neutral interface. We are working on the CUDA-specific code to make it more general.

@kulinseth
Contributor

I just want to clarify: is aliasing necessary for custom devices to work at all? Or is this a convenience to support existing models? (I don't think it's required, because things like TPUs work, but please correct me if I'm wrong). Existing code that checks for "device:GPU" is likely specialized for CUDA GPUs, so I don't think it's correct in general to assume we can alias over it.
If it's not a hard requirement, I'd recommend we move ahead with the other PRs related to pluggable devices until we understand this a bit better.

I would say yes; we do see a need to run existing TF applications with "device:GPU" as is. GPUs have a broad market, so we expect more entry-level users, and we want to enable them transparently. If TF's "device:GPU" implies CUDA, that is against the spirit of modular TF, which intends to provide a vendor-neutral interface. We are working on the CUDA-specific code to make it more general.

I joined the discussion late, so please correct me if I missed some context here. As @Jianhui-Li mentioned, it would be great if we can enable existing TF applications on GPU seamlessly. As we are adding Metal backend support, we noticed this dependency as well, and as part of the pluggable design it gives an opportunity to straighten it out.

@rjpower Is the concern here that the change has a wider impact and will break existing code paths in DEVICE_GPU?

@rohan100jain rohan100jain requested a review from rjpower December 9, 2020 21:31
@gbaned gbaned added the awaiting review Pull request awaiting review label Dec 16, 2020
@gbaned
Contributor

gbaned commented Dec 23, 2020

@rjpower Any update on this PR? Please. Thanks!

@gbaned gbaned added stat:awaiting tensorflower Status - Awaiting response from tensorflower and removed awaiting review Pull request awaiting review labels Dec 23, 2020
@penpornk
Member

@kulinseth Sorry for the late reply! I believe the concern here is that whatever we choose here (having a device_alias or not), it might restrict how TensorFlow does device placement in the future. And there have been several existing issues with device placement to begin with. So some of us wanted to explore the impact on that side more before we proceed here. This PR is not blocking other PluggableDevice PRs and will likely be merged last. (We are still aiming for TF 2.5.)

@gbaned This PR will be on hold for a while until we can make a decision on how we want to move forward.

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Dec 25, 2020
@kulinseth
Contributor

@penpornk Thanks for the reply. Makes sense and since this is not blocking other PRs, we can take a look later and not hold them up.

@penpornk penpornk removed the API review API Review label Jan 14, 2021
@penpornk
Member

(Removing the API review tag for now. Will add it back once the PR is ready for API review again.)

@gbaned
Contributor

gbaned commented Feb 4, 2021

@penpornk Any update on this PR? Please. Thanks!

@penpornk
Member

penpornk commented Feb 4, 2021

@gbaned It's still on hold. Sorry!

@gbaned
Contributor

gbaned commented Apr 21, 2021

@penpornk Any update on this PR? Please. Thanks!

@penpornk
Member

penpornk commented Apr 21, 2021

@gbaned We are exploring an alternative. I'll temporarily close this PR. We can reopen it later if we want to fall back to this one. Sorry for the inconvenience!
