WIP: Added out-of-tree support for Torch-Vitis FPGA kernels #37378
dylanbespalko wants to merge 5 commits into pytorch:master
Conversation
💊 Build failures summary and remediations (Dr. CI, as of commit e8a812e):

🚧 9 fixed upstream failures, probably caused by upstream breakages that were already fixed.
ci.pytorch.org: 1 failed.

This comment was automatically generated by Dr. CI.
Force-pushed from 4014599 to e8a812e.
gchanan left a comment
did you mean to have fbgemm changes?
```diff
 elif env['DeviceType'] == 'CUDA':
     top_env['cuda_type_headers'].append(
         '#include <ATen/{}.h>'.format(env['Type']))
+else:  # Allow gen.py to be called by out-of-tree devices
```
- I start with a dictionary of `kernel_name: fpga_dispatch_function`.
- I copy the latest native_functions.yaml from PyTorch and add `FPGA: fpga_dispatch_function` under the `dispatch` entries. See the mul example:
```yaml
- func: mul.Tensor(Tensor self, Tensor other) -> Tensor
  use_c10_dispatcher: full
  variants: function, method
  dispatch:
    FPGA: mul        # I added this before calling your gen.py
    CPU: mul
    CUDA: mul
    SparseCPU: mul_sparse
    SparseCUDA: mul_sparse
    MkldnnCPU: mkldnn_mul
```
- I then call gen.py; it generates FPGAType.h and FPGAType.cpp, which I copy to my extension repo.
I can then re-sync with PyTorch every time native_functions.yaml changes.
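For a sense of the output: the generated header has the same shape as the CPUType/CUDAType headers gen.py already emits. A trimmed sketch only; the exact shape depends on the gen.py version and which ops carry an FPGA dispatch entry:

```cpp
// FPGAType.h -- sketch of gen.py output, not the literal generated file.
#pragma once
#include <ATen/Tensor.h>

namespace at {
namespace FPGAType {

// One declaration per native_functions.yaml entry with an FPGA dispatch key.
Tensor mul(const Tensor& self, const Tensor& other);
Tensor& mul_(Tensor& self, const Tensor& other);
Tensor& mul_out(Tensor& out, const Tensor& self, const Tensor& other);

} // namespace FPGAType
} // namespace at
```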
maybe some of the BackendSelect-style "redispatch with key" stuff could work.
@dylanbespalko Can you paste an example of generated header/cpp files here?
To clarify: if you look at torch/csrc/autograd/generated/RegistrationDeclarations.h (after you build PyTorch), that's all the information XLA needs to register ops in PyTorch. Maybe we could reuse that for FPGA?
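RegistrationDeclarations.h is essentially a flat list of C++ declarations with each op's schema in a trailing comment, so an out-of-tree backend can walk it and register every op for its dispatch key. A minimal sketch of one such registration, assuming the c10 op-registration API of this era and an FPGA dispatch key (fpga::mul is a hypothetical placeholder, and the exact API differs across PyTorch versions):

```cpp
// Sketch only: out-of-tree registration in the style XLA uses.
// fpga::mul and DispatchKey::FPGA are assumptions, not existing PyTorch code.
#include <ATen/core/op_registration/op_registration.h>

namespace fpga {
at::Tensor mul(const at::Tensor& self, const at::Tensor& other);  // FPGA kernel wrapper
}

static auto registry = torch::RegisterOperators().op(
    torch::RegisterOperators::options()
        .schema("aten::mul.Tensor(Tensor self, Tensor other) -> Tensor")
        .kernel<decltype(fpga::mul), &fpga::mul>(c10::DispatchKey::FPGA));
```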
> Can you paste an example of generated header/cpp files here?
Attaching files appears to be broken, so I'll link you to the files on GitLab:
- FPGAType.cpp
- FPGAType.h
I have a script called gen.py that appends FPGA to the list of backends in the PyTorch gen.py, which then generates FPGAType.h and FPGAType.cpp.
> To clarify: if you look at torch/csrc/autograd/generated/RegistrationDeclarations.h (after you build PyTorch), that's all the information XLA needs to register ops in PyTorch. Maybe we could reuse that for FPGA?
That file gives me enough information to dispatch at the device level; however, the FPGAs that Xilinx markets toward TensorFlow and Caffe are always CPU + FPGA SoCs. The idea is that the FPGA replaces the logic in the ATen/native/cpu folder, while the CPU is still used for the ATen/native folder. I've been operating under the impression that device-level dispatching was done this way.
Hmmm I'm a bit confused. Do you mean FPGA plans to only support ops implemented through TensorIterator?
What happens when we move new ops into ATen/native/cpu folder? Does that break your integration?
I'm confused here too. I don't understand what your criteria are for logic that goes into ATen/native vs. ATen/native/cpu, and I'm worried about this.

I would use the following to separate these two (see the sketch after the list):
ATen/native/___.cpp:
- Any assert statement.
- Any statement that processes the entire tensor (i.e. a call to another at:: function).

ATen/native/cpu/___.cpp (or ATen/native/fpga/___.cpp):
- Only processes tensor data.
- Simple "bytes-in, bytes-out" interface with few optional or keyword arguments.
- Tensor data is consumed either:
  - all at once in eager mode, or
  - via a streaming interface (the tensor is processed one byte at a time).
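Concretely, the existing mul plumbing already has this shape. A trimmed sketch modeled on aten/src/ATen/native/BinaryOps.cpp and the cpu kernels (details elided; not the literal source):

```cpp
// ATen/native: device-agnostic argument handling and whole-tensor logic.
Tensor mul(const Tensor& self, const Tensor& other) {
  Tensor result;
  auto iter = TensorIterator::binary_op(result, self, other);
  mul_stub(iter.device_type(), iter);  // dispatch to the per-device kernel
  return iter.output();
}

// ATen/native/cpu (or native/fpga): touches tensor data only, via the iterator.
void mul_kernel(TensorIterator& iter) {
  // element-wise "bytes-in, bytes-out" loop over iter
}
REGISTER_DISPATCH(mul_stub, &mul_kernel);
```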
```diff
 bool is_supported_device(Device device) {
   DeviceType device_type = device.type();
-  return device_type == kCPU || device_type == kCUDA || device_type == kHIP;
+  return device_type == kCPU || device_type == kCUDA || device_type == kHIP || device_type == kFPGA;
```
see the XLA comment above -- does that not work?
This should work. I will do that.
Thanks. This worked. I only have to modify DispatchStub.h to make things work now.
```diff
 std::atomic<FnPtr> cpu_dispatch_ptr;
 FnPtr cuda_dispatch_ptr;
 FnPtr hip_dispatch_ptr;
+FnPtr fpga_dispatch_ptr;
```
why do you need support in DispatchStub? Are you using TensorIterator?
I think this is where I'm really confused. I have replaced the logic in aten/src/ATen/native/cpu with aten/src/ATen/native/fpga. I am still running the code in aten/src/ATen/native to pre-process any optional or keyword arguments because the FPGA can't handle that without reducing performance.
The modern "FPGA" is always a "CPU + FPGA" linked by OpenCL, as the FPGA itself is not flexible enough to pre-process all of the optional keyword arguments to each kernel. For example, I only implement mul_out on the FPGA, not mul and mul_.
```cpp
void mul_kernel(TensorIterator& iter) {
  AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES(iter.dtype(), "mul_out", [&]() {
    using vec_t = at::native::ztype<scalar_t>::vec_t;
    fpga_kernel_no_scalar_vec<vec_t>("v__mul", iter);
  });
}
REGISTER_FPGA_DISPATCH(mul_stub, &mul_kernel);
```

fpga_kernel_no_scalar_vec streams input data_ptrs from the CPU -> FPGA using OpenCL, and then streams output data_ptrs from the FPGA -> CPU.
I am using TensorIterator and I maybe don't need it. It does make things much easier though.
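To make the streaming part concrete, here is a hypothetical shape for fpga_kernel_no_scalar_vec; the enqueue_* calls below stand in for the real OpenCL buffer/kernel plumbing and are not an actual API:

```cpp
// Hypothetical sketch: stream TensorIterator operands through a pre-built
// FPGA kernel. enqueue_write/enqueue_kernel/enqueue_read are stand-ins.
template <typename vec_t>
void fpga_kernel_no_scalar_vec(const char* name, at::TensorIterator& iter) {
  // TensorIterator orders operands as outputs first, then inputs.
  for (int i = iter.noutputs(); i < iter.ntensors(); i++) {
    enqueue_write(name, iter.data_ptr(i), iter.numel() * sizeof(vec_t));  // CPU -> FPGA
  }
  enqueue_kernel(name);  // run the kernel on the FPGA
  for (int i = 0; i < iter.noutputs(); i++) {
    enqueue_read(name, iter.data_ptr(i), iter.numel() * sizeof(vec_t));   // FPGA -> CPU
  }
}
```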
I currently have no workaround when someone adds a device check in aten/src/ATen/native. I could write a parser to import that code and remove those checks locally, but I could use some advice.
@gchanan XLA is fine here since XLA doesn't use TensorIterator so it's dispatched to XLA at Device level (happens before TensorIterator).
@dylanbespalko Sorry I don't have full context here, so maybe asking a few naive questions: do you have to use TensorIterator, btw? Is FPGA a new device? More precisely, does FPGA implement all ops on its own? What'll be the relationship between the CPU op and the FPGA op in this case?
```diff
     top_env['cuda_type_headers'].append(
         '#include <ATen/{}.h>'.format(env['Type']))
+else:  # Allow gen.py to be called by out-of-tree devices
+    pass
```
Oof, this is gonna be rough to support in the long term. @dylanbespalko, did you try doing this manually without gen.py? What were the barriers here?
I knew this would be scary :). This allows me to run gen.py from my out-of-tree extension so that I can update myself whenever native_functions.yaml changes. When building pytorch, the solution works for me, but I will need to verify this with CI.
Sorry, I goofed. I forgot to sync. I will generate a new PR that references this one.

FYI, it takes my CI 8 hours to build the FPGA code for the UnaryOpKernels, BinaryOpKernels, ReduceOpKernels, and SpectralOpsKernels for Float and ComplexFloat dtypes. This can achieve a ~20X runtime improvement over the CPU, but the build time is insane. This is why I need to handle optional and keyword arguments on the CPU.
```diff
 // for now, let's look these up by Backend; we could create our own enum in the future.
 registerLayoutObject((THPLayout*)strided_layout, at::Backend::CPU);
 registerLayoutObject((THPLayout*)strided_layout, at::Backend::CUDA);
+registerLayoutObject((THPLayout*)strided_layout, at::Backend::FPGA);
```
Just noting I think we should be able to get rid of this so you don't have to change it (obviously it's ideal to be able to minimize your changes). I'll look into this separately.
ailzhang left a comment
Ehh sorry I clicked approve by mistake....
@dylanbespalko One big question I have for you: how important is it to you to get this merged to master earlier versus later? If there is no rush, we can take the time to do some more cleanups and get things nicer. If you have some reason you need this earlier, we'll need to ask questions about what is OK to leave a little bit weird, and what we can push further on.
No hurry on my side. Applying these patches is very easy and I'm looking into how to fix some of the issues on my end too. Let's use this thread to discuss where the changes need to happen.
Just registering that #37527 merged, so you should be able to remove your changes in tensor_layouts.cpp.
Sorry, I somehow missed your messages. In the past couple of days I have figured out the
No, it's older than me, and I'm 35. It is commonly used by electrical and computer engineers in high-speed wireless/wired communication and real-time image processing. It is a processor that consists of many user-programmable switches (called gates) that are configured to create custom circuits that perform math really fast. It can go beyond the speed of the GPU and TPU, but it is expensive and time-consuming to develop for. Until 5 years ago you could only program an FPGA using Verilog/VHDL, which is like assembly code. High-Level Synthesis (HLS) allows software engineers to program FPGAs using C++ instead of Verilog, and Xilinx Vitis is the latest version of HLS that supports both server FPGAs and embedded FPGAs. FPGA cores are being built into server processors and embedded devices. Intel and Xilinx are the two manufacturers.
No, but Xilinx is posting example code at Vitis_Libraries that specifically implements some of the TensorFlow and Caffe math kernels. I have implemented generic Xilinx FPGA support that produces a 20X speedup over the CPU, but application-specific optimization will get you 60X. Open source makes that possible. Creating Kernel (Ops) Objects on the FPGA:
Wrapping Kernel Objects on the CPU:
Here is the process for optimizing an algorithm on the FPGA:
TensorIterator made it possible to have a single function. Having a single interface between the hw (FPGA) and the sw (CPU) made it possible to disable kernels without compile-time errors. I design individual kernels during the day, which can take 5-10 mins to compile. Then I run the entire build overnight, which takes 8 hours. That's huge.
Sure. About streaming, or something else?
Topics I would like to discuss:
I'll send you my contact info on Slack.
Out-of-Tree PyTorch-Vitis FPGA support is found here
Xilinx Vitis Webpage is found here
List of supported FPGAs is found here
Modifications in the PR:

Eager mode:
The solution uses OpenCL to move data to the FPGA only when calling kernels (i.e. `add_stub()`), therefore it re-uses CPU code when running in eager mode. If there is a `TORCH_CHECK(self.device().type() == DeviceType::CPU, ...)` blocking execution, I simply copy the file into my extension and add an FPGA-specific dispatch in `native_functions.yaml`.

JIT mode:
This connects multiple FPGA kernels together on the FPGA, thus bypassing the CPU code. This mode only supports the default function arguments.
I currently do not need to import `DispatchStub.h` anywhere in my code for the above reasons. I assume I will need to:

1. Create an `FPGADispatchStub.h` that specializes `DispatchStub<rT (*)(Args...), T>` somehow.
2. Import `FPGADispatchStub.h` in my cpp kernel files.

Please let me know how I can make these changes.
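A minimal sketch of what step 1 might look like, modeled on how REGISTER_CUDA_DISPATCH stores its pointer today. The member name follows the fpga_dispatch_ptr diff above; the helper and macro names are assumptions:

```cpp
// FPGADispatchStub.h -- sketch, not a definitive implementation.
#include <ATen/native/DispatchStub.h>

namespace at { namespace native {

// Stores an FPGA kernel pointer into a stub at static-initialization time,
// mirroring the CUDA/HIP registration helpers.
template <typename DispatchStubT>
struct RegisterFPGADispatch {
  RegisterFPGADispatch(DispatchStubT& stub, typename DispatchStubT::FnPtr value) {
    stub.fpga_dispatch_ptr = value;  // assumes the member added in the diff above
  }
};

#define REGISTER_FPGA_DISPATCH(name, fn) \
  static RegisterFPGADispatch<decltype(name)> name##__register(name, fn);

}} // namespace at::native
```

With that in place, the `REGISTER_FPGA_DISPATCH(mul_stub, &mul_kernel);` line from the kernel above would work unchanged.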
@ezyang @anjali411
cc: @tataetae