
WIP::Added out-of-tree support for Torch-Vitis FPGA kernels #37378

Closed
dylanbespalko wants to merge 5 commits into pytorch:master from dylanbespalko:fpga_strided_float3

Conversation

Contributor

@dylanbespalko dylanbespalko commented Apr 27, 2020

Out-of-Tree PyTorch-Vitis FPGA support is found here
Xilinx Vitis Webpage is found here
List of supported FPGAs is found here

Modifications in the PR:

  • c10 backends and dispatch keys are modified to include FPGA.
  • gen.py is slightly modified so that I can call it from Torch-Vitis to create FPGAType.h and FPGAType.cpp.
  • I have made a couple changes that should not be done in-tree. These need discussion.
    • aten/src/ATen/native/DispatchStub.h
    • aten/src/ATen/native/Copy.cpp
    • torch/csrc/utils/tensor_layouts.cpp

Eager mode:
The solution uses OpenCL to move data to the FPGA only when calling kernels (i.e. add_stub()), so it reuses CPU code when running in eager mode. If a TORCH_CHECK(self.device().type() == DeviceType::CPU, ...) blocks execution, I simply copy the file into my extension and add an FPGA-specific dispatch in native_functions.yaml.

JIT mode:
This connects multiple FPGA kernels together on the FPGA, bypassing the CPU code entirely. This mode only supports the default function arguments.

I currently do not need to import DispatchStub.h anywhere in my code for the above reasons. I assume I will need to:
1. Create a FPGADispatchStub.h that specializes DispatchStub<rT (*)(Args...), T> somehow.
2. Import FPGADispatchStub.h in my cpp kernel files.

Please let me know how I can make these changes.

@ezyang @anjali411

cc: @tataetae

@dr-ci

dr-ci Bot commented Apr 27, 2020

💊 Build failures summary and remediations

As of commit e8a812e (more details on the Dr. CI page):


  • 1/10 failures possibly* introduced in this PR

    • 1/1 non-CircleCI failure(s)
  • 9/10 broken upstream at merge base b3ada29 on Apr 27 from 2:24pm to 4:19pm PDT (9 commits; 5fab4c3 - 20143e5)

    Please rebase on the viable/strict branch.

    If your commit is newer than viable/strict, you can try basing on an older, stable commit:

    git fetch https://github.com/pytorch/pytorch viable/strict
    git rebase --onto FETCH_HEAD $(git merge-base origin/master HEAD)
    

    If your commit is older than viable/strict:

    git fetch https://github.com/pytorch/pytorch viable/strict
    git rebase FETCH_HEAD
    

    Check out the recency history of this "viable master" tracking branch.


🚧 9 fixed upstream failures:

These were probably caused by upstream breakages that were already fixed:


ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.

@ezyang ezyang self-requested a review April 29, 2020 14:43
Contributor

@gchanan gchanan left a comment

did you mean to have fbgemm changes?

Comment thread: aten/src/ATen/gen.py

    elif env['DeviceType'] == 'CUDA':
        top_env['cuda_type_headers'].append(
            '#include <ATen/{}.h>'.format(env['Type']))
    else:  # Allow gen.py to be called by out-of-tree devices
Contributor

why do you need this?

Contributor Author

  1. I start with a dictionary of kernel_name: fpga_dispatch_function.
  2. I copy the latest native_functions.yaml from PyTorch and add FPGA: fpga_dispatch_function under the dispatch entries. See the mul example:

     - func: mul.Tensor(Tensor self, Tensor other) -> Tensor
       use_c10_dispatcher: full
       variants: function, method
       dispatch:
         FPGA: mul  # I added this before calling your gen.py
         CPU: mul
         CUDA: mul
         SparseCPU: mul_sparse
         SparseCUDA: mul_sparse
         MkldnnCPU: mkldnn_mul

  3. I then call gen.py, which generates FPGAType.h and FPGAType.cpp, and I copy these files to my extension repo.

I can then re-sync with pytorch every time native_functions.yaml changes.

Contributor

all you really need here is a way to call the CPU "mul.Tensor" implementation, right? The rest is just native_function mechanics that don't really matter for your use case AFAICT.

idk if the dispatcher can do that, but I'm sure @smessmer or @ezyang knows.

Contributor

maybe some of the BackendSelect-style "redispatch with key" stuff could work.

Contributor

@dylanbespalko Can you paste an example of generated header/cpp files here?

Contributor

To clarify: if you look at torch/csrc/autograd/generated/RegistrationDeclarations.h (after you build PyTorch), that's all the information XLA needs to register ops in PyTorch. Maybe we could reuse that for FPGA?

Contributor Author

Can you paste an example of generated header/cpp files here?

Attaching files appears to be broken, so I'll link you to the files on gitlab.
FPGAType.cpp
FPGAType.h

I have a script called gen.py that appends FPGA to the list of backends in the PyTorch gen.py, which then generates FPGAType.h and FPGAType.cpp.

To clarify: if you look at torch/csrc/autograd/generated/RegistrationDeclarations.h (after you build PyTorch), that's all the information XLA needs to register ops in PyTorch. Maybe we could reuse that for FPGA?

That file gives me enough information to dispatch at the device level; however, the FPGAs that Xilinx markets toward TensorFlow and Caffe are always CPU + FPGA SoCs. The idea is that the FPGA replaces the logic in the ATen/native/cpu folder, while the CPU is still used for the ATen/native folder. I've been operating under the impression that device-level dispatching was done this way.

Contributor

Hmmm I'm a bit confused. Do you mean FPGA plans to only support ops implemented through TensorIterator?
What happens when we move new ops into ATen/native/cpu folder? Does that break your integration?

Contributor Author

@ailzhang

I'm confused here too. I don't understand what your criteria are for logic that goes into ATen/native vs. ATen/native/cpu, and I'm worried about this. I would use the following to separate the two:

ATen/native/___.cpp

  • Any assert statement
  • Any statement that processes the entire tensor
    • i.e. a call to another at:: function

ATen/native/cpu/___.cpp (or ATen/native/fpga/___.cpp)

  • Only processes tensor data
  • Simple "Bytes-In, Bytes-Out" interface with few optional or keyword arguments
  • Tensor data is consumed:
    • All at once in eager mode
    • Using a streaming interface (the tensor is processed one byte at a time)

    bool is_supported_device(Device device) {
      DeviceType device_type = device.type();
    -  return device_type == kCPU || device_type == kCUDA || device_type == kHIP;
    +  return device_type == kCPU || device_type == kCUDA || device_type == kHIP || device_type == kFPGA;
Contributor

see the XLA comment above -- does that not work?

Contributor Author

This should work. I will do that.

Contributor Author

@gchanan,

Thanks. This worked. I only have to modify DispatchStub.h to make things work now.

    std::atomic<FnPtr> cpu_dispatch_ptr;
    FnPtr cuda_dispatch_ptr;
    FnPtr hip_dispatch_ptr;
    FnPtr fpga_dispatch_ptr;
Contributor

why do you need support in DispatchStub? Are you using TensorIterator?

Contributor Author

@dylanbespalko dylanbespalko Apr 29, 2020

I think this is where I'm really confused. I have replaced the logic in aten/src/ATen/native/cpu with aten/src/ATen/native/fpga. I am still running the code in aten/src/ATen/native to pre-process any optional or keyword arguments because the FPGA can't handle that without reducing performance.

The modern "FPGA" is always a "CPU + FPGA" linked by OpenCL, as the FPGA itself is not flexible enough to pre-process all of the optional keyword arguments to each kernel. For example, I only implement mul_out on the FPGA, not mul and mul_.

void mul_kernel(TensorIterator& iter) {
  AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(iter.dtype(), "mul_out", [&](){
    using vec_t = at::native::ztype<scalar_t>::vec_t;
    fpga_kernel_no_scalar_vec<vec_t>("v__mul", iter);
  });
}

REGISTER_FPGA_DISPATCH(mul_stub, &mul_kernel);

fpga_kernel_no_scalar_vec streams the input data_ptrs from the CPU to the FPGA using OpenCL, and then streams the output data_ptrs from the FPGA back to the CPU.

I am using TensorIterator, though I may not strictly need it. It does make things much easier.

I currently have no workaround when someone adds a device check in aten/src/ATen/native. I could write a parser to import that code and remove those checks locally, but I could use some advice.

Contributor

maybe this becomes partially moot once we implement #29844.

For the device check in aten/src/ATen/native, I'm not sure how XLA gets around that -- whether we widen the device check or XLA just implements their own thing. @ailzhang do you know?

Contributor

@gchanan XLA is fine here since XLA doesn't use TensorIterator so it's dispatched to XLA at Device level (happens before TensorIterator).
@dylanbespalko Sorry I don't have full context here, so maybe asking a few naive questions: Do you have to use TensorIterator btw? Is FPGA a new device? More precisely, does FPGA implement all ops on its own, what'll be the relationship between CPU op and FPGA op in this case?

Comment thread: aten/src/ATen/gen.py

    top_env['cuda_type_headers'].append(
        '#include <ATen/{}.h>'.format(env['Type']))
    else:  # Allow gen.py to be called by out-of-tree devices
        pass
Contributor

Oof, this is gonna be rough to support in the long term. @dylanbespalko, did you try doing this manually without gen.py? What were the barriers here?

Contributor Author

@dylanbespalko dylanbespalko Apr 29, 2020

I knew this would be scary :). This allows me to run gen.py from my out-of-tree extension so that I can update myself whenever native_functions.yaml changes. The solution works for me when building PyTorch, but I will need to verify this with CI.

@dylanbespalko
Contributor Author

did you mean to have fbgemm changes?

Sorry, I goofed. I forgot to sync. I will generate a new PR that references this one.

@dylanbespalko
Contributor Author

@gchanan, @ezyang,

FYI: my CI takes 8 hours to build the FPGA code for the UnaryOpKernels, BinaryOpKernels, ReduceOpKernels, and SpectralOpsKernels for the Float and ComplexFloat dtypes. This achieves a ~20X runtime improvement over the CPU, but the build time is insane.

This is why I need to handle optional and keyword arguments on the CPU.

    // for now, let's look these up by Backend; we could create our own enum in the future.
    registerLayoutObject((THPLayout*)strided_layout, at::Backend::CPU);
    registerLayoutObject((THPLayout*)strided_layout, at::Backend::CUDA);
    registerLayoutObject((THPLayout*)strided_layout, at::Backend::FPGA);
Contributor

just noting I think we should be able to get rid of this so you don't have to change it (obviously it's ideal to minimize your changes). I'll look into this separately.

@dylanbespalko dylanbespalko changed the title Added out-of-tree support for Torch-Vitis FPGA kernels WIP::Added out-of-tree support for Torch-Vitis FPGA kernels Apr 29, 2020

Contributor

@ailzhang ailzhang left a comment

Ehh sorry I clicked approve by mistake....

@ezyang
Contributor

ezyang commented Apr 30, 2020

@dylanbespalko One big question I have for you: how important is it to you to get this merged to master earlier versus later? If there is no rush, we can take the time to do some more cleanups and get things nicer. If you have some reason you need this in earlier, we'll need to ask questions about what is OK to leave a little bit weird, and what we can push further on.

@dylanbespalko
Contributor Author

@ezyang,

No hurry on my side. Applying these patches is very easy and I'm looking into how to fix some of the issues on my end too. Let's use this thread to discuss where the changes need to happen.

@gchanan
Contributor

gchanan commented Apr 30, 2020

Just registering that #37527 merged, so you should be able to remove your changes in tensor_layouts.cpp.

@dylanbespalko
Contributor Author

@ailzhang

Sorry, I somehow missed your messages. In the past couple of days I have figured out the copy_from kernel from XLA, and PyTorch removed the change I needed to make in tensor_layouts.cpp. That just leaves the changes in DispatchStub.h, which uses TensorIterator.

Is FPGA a new device?

No, it's older than me, and I'm 35. It is commonly used by electrical and computer engineers in high-speed wireless/wired communication and real-time image processing. It is a processor consisting of many user-programmable switches (called gates) that are configured to create custom circuits that perform math really fast. It can go beyond the speed of the GPU and TPU, but it is expensive and time-consuming to develop for.

Until 5 years ago you could only program an FPGA using Verilog/VHDL, which is like assembly code. High-Level Synthesis (HLS) allows software engineers to program FPGAs using C++ instead of Verilog and Xilinx Vitis is the latest version of HLS that supports both server FPGAs and embedded FPGAs.

FPGA cores are being built into server processors and embedded devices. Intel and Xilinx are the two manufacturers.

Does FPGA implement all ops on its own?

No, but Xilinx is posting example code at Vitis_Libraries that specifically implements some of the TensorFlow and Caffe math kernels.

I have implemented generic Xilinx FPGA support that produces 20X speedup over CPU, but application specific optimization will get you 60X. Open-Source makes that possible.

Creating Kernel (Ops) Objects on the FPGA:

  1. Vitis compiles objects using "v++ -c". You can compile an unlimited number of objects.
  2. Vitis links objects using "v++ -l". You can only link a finite number of objects: either you run out of memory ports (the link between the CPU and FPGA) or you run out of gates (space) in the FPGA.

Wrapping Kernel Objects on the CPU:
OpenCL is used to create a client (CPU) - server (FPGA) relationship. You can think of the FPGA as a Web Server that stores math kernels instead of HTML files.

  1. The CPU compiles kernel wrappers with "g++ -c".
  2. The CPU links kernel wrappers with "g++ -l".

What'll be the relationship between CPU op and FPGA op in this case?

Here is the process for optimizing an algorithm on the FPGA.

  1. When running in eager execution mode, the CPU checks assertions and processes optional or keyword arguments. THIS IS HUGE, because I've spent 6 months writing the HLS for the FPGA kernels. PyTorch provided an amazing environment for prototyping. Most of the keyword arguments in numpy would never make it into the FPGA kernel implementation.

  2. At this point, you need to refine your algorithms to avoid non-default keyword arguments. For example, you may need to explicitly transpose or normalize data with some extra code.

  3. The final solution exports the PyTorch graph to a Vitis config file, which calls "v++ -l" to link the FPGA objects into a graph on the FPGA. In this mode, only the initial input and final output kernels can be processed by the CPU wrapper. TensorIterator provides the right interface here because it stores information that I need to pass to OpenCL and gives me a common interface for all FPGA kernels.

Do you have to use TensorIterator btw?

TensorIterator made it possible to have a single function fpga_kernel() that dispatches the input and output variables to the FPGA kernel. It works for more cases than cpu_kernel() on the CPU because of the generic client-server relationship. This interface will also make it easier to export the tensor graph to the FPGA.

Having a single interface between the hardware (FPGA) and the software (CPU) made it possible to disable kernels without compile-time errors. I design individual kernels during the day, which take 5-10 minutes to compile. Then I run the entire build overnight, which takes 8 hours. That's huge.

@dylanbespalko
Contributor Author

@ailzhang, @ezyang,

I need to spend the week studying how to stream Tensors between FPGA kernels. I would like to have a phone discussion after that. Are you available for a phone call later this week or early next week?

@ezyang
Contributor

ezyang commented May 4, 2020

I would like to have a phone discussion after that. Are you available for a phone call later this week or early next week?

Sure. About streaming, or something else?

@dylanbespalko
Contributor Author

Topics I would like to discuss:

  • How to deal with the fact that a FPGA is really a CPU + FPGA.
  • Why I haven't used lambda expressions to call FPGA kernels.
  • Should I use TensorIterator or std::tuple<Tensor, Tensor, ...> (C++17) for tensor inputs to FPGA?
  • Streaming data between FPGA kernels using PyTorch Graph Mode.
  • TorchScript: Trace vs. Script mode and how to apply this to multiple FPGA kernels.

I'll send you my contact info on Slack.
