Conversation
If the percentage of kernel time in the profile result is minor, I actually think that adding a CUDA kernel for Sign is much simpler, as it only requires a few lines of change in the unary elementwise kernels, and it also helps ORT run inference or a forward graph with Sign on CUDA in the future.
Force-pushed 2ea9aa9 to 9220980 (…baijumeswani/abs-grad)
Makes sense. I was initially contemplating whether I should add the Sign CUDA kernel or the AbsGrad CUDA kernel. Made the change now to add the Sign CUDA kernel.
Thank you for the review @er3x3 @hariharans29 |
Cherry-pick PRs: #18026 #17912 #17901 ("2 lines added whitespace errors when cherry-picking") #17293 #17364 #17505 #17885. This PR contains all the cherry-picks for the patch release except: 1. The PRs marked with sdxl_llama 2. #17772, which has a merge conflict.
---------
Co-authored-by: Chi Lo <Chi.Lo@microsoft.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
Co-authored-by: Kaz Nishimura <kazssym@linuxfront.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
l1_loss is defined as:
mean(abs(y1 - y2))
If y = abs(x), then dy/dx = sign(x).
In onnxruntime, `Sign` does not have a CUDA kernel. As a result, the execution graph looks like: `MemcpyToHost -> Sign -> MemcpyFromHost`. This PR implements the `Sign` CUDA kernel so as to avoid the memcpy nodes.