
WIP::Added out-of-tree support for Torch-Vitis FPGA kernels #37378

Closed
dylanbespalko wants to merge 5 commits into pytorch:master from dylanbespalko:fpga_strided_float3

Conversation

Contributor

@dylanbespalko dylanbespalko commented Apr 27, 2020

Out-of-Tree PyTorch-Vitis FPGA support is found here
Xilinx Vitis Webpage is found here
List of supported FPGAs is found here

Modifications in the PR:

  • c10 backends and dispatch keys are modified to include FPGA.
  • gen.py is slightly modified so that I can call it from Torch-Vitis to create FPGAType.h and FPGAType.cpp.
  • I have made a couple changes that should not be done in-tree. These need discussion.
    • aten/src/ATen/native/DispatchStub.h
    • aten/src/ATen/native/Copy.cpp
    • torch/csrc/utils/tensor_layouts.cpp

Eager mode:
The solution uses OpenCL to move data to the FPGA only when calling kernels (i.e. add_stub()), so it reuses CPU code when running in eager mode. If a TORCH_CHECK(self.device().type() == DeviceType::CPU, ...) blocks execution, I simply copy the file into my extension and add an FPGA-specific dispatch in native_functions.yaml.

JIT mode:
This connects multiple FPGA kernels together on the FPGA, bypassing the CPU code entirely. This mode only supports the default function arguments.

I currently do not need to import DispatchStub.h anywhere in my code for the above reasons. I assume I will need to:
1. Create a FPGADispatchStub.h that specializes DispatchStub<rT (*)(Args...), T> somehow.
2. Import FPGADispatchStub.h in my cpp kernel files.

Please let me know how I can make these changes.

@ezyang @anjali411

cc: @tataetae

@dr-ci

dr-ci Bot commented Apr 27, 2020

💊 Build failures summary and remediations

As of commit e8a812e (more details on the Dr. CI page):


  • 1/10 failures possibly* introduced in this PR

    • 1/1 non-CircleCI failure(s)
  • 9/10 broken upstream at merge base b3ada29 on Apr 27 from 2:24pm to 4:19pm PDT (9 commits; 5fab4c3 - 20143e5)

    Please rebase on the viable/strict branch.

    If your commit is newer than viable/strict, you can try basing on an older, stable commit:

    git fetch https://github.com/pytorch/pytorch viable/strict
    git rebase --onto FETCH_HEAD $(git merge-base origin/master HEAD)
    

    If your commit is older than viable/strict:

    git fetch https://github.com/pytorch/pytorch viable/strict
    git rebase FETCH_HEAD
    

    Check out the recency history of this "viable master" tracking branch.


🚧 9 fixed upstream failures:

These were probably caused by upstream breakages that were already fixed:


ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.

@ezyang ezyang self-requested a review April 29, 2020 14:43
Contributor

@gchanan gchanan left a comment

did you mean to have fbgemm changes?

Comment thread: aten/src/ATen/gen.py

    elif env['DeviceType'] == 'CUDA':
        top_env['cuda_type_headers'].append(
            '#include <ATen/{}.h>'.format(env['Type']))
    else:  # Allow gen.py to be called by out-of-tree devices
Contributor

why do you need this?

Contributor Author

  1. I start with a dictionary of kernel_name: fpga_dispatch_function.
  2. I copy the latest native_functions.yaml from PyTorch and add FPGA: fpga_dispatch_function under the dispatch entries. See the mul example:

     - func: mul.Tensor(Tensor self, Tensor other) -> Tensor
       use_c10_dispatcher: full
       variants: function, method
       dispatch:
         FPGA: mul  # I added this before calling your gen.py
         CPU: mul
         CUDA: mul
         SparseCPU: mul_sparse
         SparseCUDA: mul_sparse
         MkldnnCPU: mkldnn_mul

  3. I then call gen.py, which generates FPGAType.h and FPGAType.cpp, and I copy these files to my extension repo.

I can then re-sync with pytorch every time native_functions.yaml changes.

Contributor

all you really need here is a way to call the CPU "mul.Tensor" implementation, right? The rest is just native_function mechanics that don't really matter for your use case AFAICT.

idk if the dispatcher can do that, but I'm sure @smessmer or @ezyang knows.

Contributor

maybe some of the BackendSelect-style "redispatch with key" stuff could work.

Contributor

@dylanbespalko Can you paste an example of generated header/cpp files here?

Contributor

To clarify: if you look at torch/csrc/autograd/generated/RegistrationDeclarations.h (after you build PyTorch), that's all the information XLA needs to register ops in PyTorch. Maybe we could reuse that for FPGA?

Contributor Author

Can you paste an example of generated header/cpp files here?

Attaching files appears to be broken, so I'll link you to the files on gitlab.
FPGAType.cpp
FPGAType.h

I have a script called gen.py that appends FPGA to the list of backends in the PyTorch gen.py, which then generates FPGAType.h and FPGAType.cpp.

To clarify: if you look at torch/csrc/autograd/generated/RegistrationDeclarations.h (after you build PyTorch), that's all the information XLA needs to register ops in PyTorch. Maybe we could reuse that for FPGA?

That file gives me enough information to dispatch at the device level; however, the FPGAs that Xilinx markets toward TensorFlow and Caffe are always CPU + FPGA SoCs. The idea is that the FPGA replaces the logic in the ATen/native/cpu folder, while the CPU is still used for the ATen/native folder. I've been operating under the impression that device-level dispatching was done this way.

Contributor

Hmmm I'm a bit confused. Do you mean FPGA plans to only support ops implemented through TensorIterator?
What happens when we move new ops into ATen/native/cpu folder? Does that break your integration?

Contributor Author

@ailzhang

I'm confused here too. I don't understand what your criteria are for logic that goes into ATen/native vs. ATen/native/cpu, and I'm worried about this. I would use the following to separate the two:

ATen/native/___.cpp

  • Any assert statement
  • Any statement that processes the entire tensor
    • i.e. a call to another at:: function

ATen/native/cpu/___.cpp (or ATen/native/fpga/___.cpp)

  • Only processes tensor data
  • Simple "Bytes-In, Bytes-Out" interface with few optional or keyword arguments
  • Tensor data is consumed:
    • All at once in eager mode
    • Using a streaming interface (the tensor is processed one byte at a time)

    bool is_supported_device(Device device) {
      DeviceType device_type = device.type();
    -  return device_type == kCPU || device_type == kCUDA || device_type == kHIP;
    +  return device_type == kCPU || device_type == kCUDA || device_type == kHIP || device_type == kFPGA;
Contributor

see the XLA comment above -- does that not work?

Contributor Author

This should work. I will do that.

Contributor Author

@gchanan,

Thanks. This worked. I only have to modify DispatchStub.h to make things work now.

    std::atomic<FnPtr> cpu_dispatch_ptr;
    FnPtr cuda_dispatch_ptr;
    FnPtr hip_dispatch_ptr;
    FnPtr fpga_dispatch_ptr;
Contributor

why do you need support in DispatchStub? Are you using TensorIterator?

Contributor Author

@dylanbespalko dylanbespalko Apr 29, 2020

I think this is where I'm really confused. I have replaced the logic in aten/src/ATen/native/cpu with aten/src/ATen/native/fpga. I am still running the code in aten/src/ATen/native to pre-process any optional or keyword arguments because the FPGA can't handle that without reducing performance.

The modern "FPGA" is always a "CPU + FPGA" linked by OpenCL, as the FPGA itself is not flexible enough to pre-process all of the optional keyword arguments to each kernel. For example, I only implement mul_out on the FPGA, not mul and mul_.

void mul_kernel(TensorIterator& iter) {
  AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(iter.dtype(), "mul_out", [&](){
    using vec_t = at::native::ztype<scalar_t>::vec_t;
    fpga_kernel_no_scalar_vec<vec_t>("v__mul", iter);
  });
}

REGISTER_FPGA_DISPATCH(mul_stub, &mul_kernel);

fpga_kernel_no_scalar_vec streams the input data_ptrs from the CPU to the FPGA using OpenCL, and then streams the output data_ptrs from the FPGA back to the CPU.

I am using TensorIterator, though I may not strictly need it. It does make things much easier.

I currently have no workaround when someone adds a device check in aten/src/ATen/native. I could write a parser to import that code and remove those checks locally, but I could use some advice.

Contributor

maybe this becomes partially moot once we implement #29844.

For the device check in aten/src/ATen/native, I'm not sure how XLA gets around that -- whether we widen the device check or XLA just implements their own thing. @ailzhang do you know?

Contributor

@gchanan XLA is fine here since XLA doesn't use TensorIterator so it's dispatched to XLA at Device level (happens before TensorIterator).
@dylanbespalko Sorry I don't have full context here, so maybe asking a few naive questions: Do you have to use TensorIterator btw? Is FPGA a new device? More precisely, does FPGA implement all ops on its own, what'll be the relationship between CPU op and FPGA op in this case?

Comment thread: aten/src/ATen/gen.py

    top_env['cuda_type_headers'].append(
        '#include <ATen/{}.h>'.format(env['Type']))
    else:  # Allow gen.py to be called by out-of-tree devices
        pass
Contributor

Oof, this is gonna be rough to support in the long term. @dylanbespalko, did you try doing this manually without gen.py? What were the barriers here?

Contributor Author

@dylanbespalko dylanbespalko Apr 29, 2020

I knew this would be scary :). This allows me to run gen.py from my out-of-tree extension so that I can update myself whenever native_functions.yaml changes. The solution works for me when building PyTorch, but I will need to verify this with CI.

@dylanbespalko
Contributor Author

did you mean to have fbgemm changes?

Sorry, I goofed. I forgot to sync. I will generate a new PR that references this one.

@dylanbespalko
Contributor Author

@gchanan, @ezyang,

FYI: my CI takes 8 hours to build the FPGA code for the UnaryOpKernels, BinaryOpKernels, ReduceOpKernels, and SpectralOpsKernels for the Float and ComplexFloat dtypes. This achieves a ~20X runtime improvement over the CPU, but the build time is insane.

This is why I need to handle optional and keyword arguments on the CPU.

    // for now, let's look these up by Backend; we could create our own enum in the future.
    registerLayoutObject((THPLayout*)strided_layout, at::Backend::CPU);
    registerLayoutObject((THPLayout*)strided_layout, at::Backend::CUDA);
    registerLayoutObject((THPLayout*)strided_layout, at::Backend::FPGA);
Contributor

just noting I think we should be able to get rid of this so you don't have to change it (obviously it's ideal to minimize your changes). I'll look into this separately.

@dylanbespalko dylanbespalko changed the title Added out-of-tree support for Torch-Vitis FPGA kernels WIP::Added out-of-tree support for Torch-Vitis FPGA kernels Apr 29, 2020

Contributor

@ailzhang ailzhang left a comment

Ehh sorry I clicked approve by mistake....

@ezyang
Contributor

ezyang commented Apr 30, 2020

@dylanbespalko One big question I have for you: how important is it to you to get this merged to master earlier versus later? If there is no rush, we can take the time to do some more cleanups and get things nicer. If you have some reason you need this in earlier, we'll need to ask questions about what is OK to leave a little bit weird, and what we can push further on.

@dylanbespalko
Contributor Author

@ezyang,

No hurry on my side. Applying these patches is very easy and I'm looking into how to fix some of the issues on my end too. Let's use this thread to discuss where the changes need to happen.

@gchanan
Contributor

gchanan commented Apr 30, 2020

Just registering that #37527 merged, so you should be able to remove your changes in tensor_layouts.cpp.

@dylanbespalko
Contributor Author

@ailzhang

Sorry, I somehow missed your messages. In the past couple of days I have figured out the copy_from kernel from XLA, and PyTorch removed the change I needed to make in tensor_layouts.cpp. That just leaves the changes in DispatchStub.h, which uses TensorIterator.

Is FPGA a new device?

No, it's older than me, and I'm 35. It is commonly used by electrical and computer engineers in high-speed wireless/wired communication and real-time image processing. It is a processor consisting of many user-programmable switches (called gates) that are configured to create custom circuits that perform math really fast. It can go beyond the speed of the GPU and TPU, but it is expensive and time-consuming to develop for.

Until 5 years ago you could only program an FPGA using Verilog/VHDL, which is like assembly code. High-Level Synthesis (HLS) allows software engineers to program FPGAs using C++ instead of Verilog and Xilinx Vitis is the latest version of HLS that supports both server FPGAs and embedded FPGAs.

FPGA cores are being built into server processors and embedded devices. Intel and Xilinx are the two manufacturers.

Does FPGA implement all ops on its own?

No, but Xilinx is posting example code at Vitis_Libraries that specifically implements some of the TensorFlow and Caffe math kernels.

I have implemented generic Xilinx FPGA support that produces 20X speedup over CPU, but application specific optimization will get you 60X. Open-Source makes that possible.

Creating Kernel (Ops) Objects on the FPGA:

  1. Vitis compiles objects using "v++ -c". You can compile an unlimited number of objects.
  2. Vitis links objects using "v++ -l". You can only link a finite number of objects: either you run out of memory ports (the link between the CPU and FPGA) or you run out of gates (space) in the FPGA.

Wrapping Kernel Objects on the CPU:
OpenCL is used to create a client (CPU) - server (FPGA) relationship. You can think of the FPGA as a Web Server that stores math kernels instead of HTML files.

  1. The CPU compiles kernel wrappers with "g++ -c".
  2. The CPU links kernel wrappers with "g++ -l".

What'll be the relationship between CPU op and FPGA op in this case?

Here is the process for optimizing an algorithm on the FPGA.

  1. When running in eager execution mode, the CPU checks assertions and processes optional or keyword arguments. THIS IS HUGE, because I've spent 6 months writing the HLS for the FPGA kernels. PyTorch provided an amazing environment for prototyping. Most of the keyword arguments in numpy would never make it into the FPGA kernel implementation.

  2. At this point, you need to refine your algorithms to avoid non-default keyword arguments. For example, you may need to explicitly transpose or normalize data with some extra code.

  3. The final solution exports the PyTorch graph to a Vitis config file, which calls "v++ -l" to link the FPGA objects into a graph on the FPGA. In this mode, only the initial input and final output kernels can be processed by the CPU wrapper. TensorIterator provides the right interface here because it stores information that I need to pass to OpenCL and gives me a common interface for all FPGA kernels.

Do you have to use TensorIterator btw?

TensorIterator made it possible to have a single function fpga_kernel() that dispatches the input and output variables to the FPGA kernel. It works for more cases than cpu_kernel() on the CPU because of the generic client-server relationship. This interface will also make it easier to export the tensor graph to the FPGA.

Having a single interface between the hardware (FPGA) and the software (CPU) made it possible to disable kernels without compile-time errors. I design individual kernels during the day, which take 5-10 minutes to compile. Then I run the entire build overnight, which takes 8 hours. That's huge.

@dylanbespalko
Contributor Author

@ailzhang, @ezyang,

I need to spend the week studying how to stream Tensors between FPGA kernels. I would like to have a phone discussion after that. Are you available for a phone call later this week or early next week?

@ezyang
Contributor

ezyang commented May 4, 2020

I would like to have a phone discussion after that. Are you available for a phone call later this week or early next week?

Sure. About streaming, or something else?

@dylanbespalko
Contributor Author

Topics I would like to discuss:

  • How to deal with the fact that a FPGA is really a CPU + FPGA.
  • Why I haven't used lambda expressions to call FPGA kernels.
  • Should I use TensorIterator or std::tuple<Tensor, Tensor, ...> (C++17) for tensor inputs to FPGA?
  • Streaming data between FPGA kernels using PyTorch Graph Mode.
  • TorchScript: Trace vs. Script mode and how to apply this to multiple FPGA kernels.

I'll send you my contact info on Slack.
