Replace empty_affine_quantizer with direct dispatch to at::native::empty_affine… #36814

kimishpatel wants to merge 9 commits into gh/kimishpatel/3/base
Conversation
Differential Revision: [D21093840](https://our.internmc.facebook.com/intern/diff/D21093840/) [ghstack-poisoned]
💊 Build failures summary and remediations — as of commit 729b708 (Dr. CI): 💚 Looks good so far! There are no failures yet.
Can you update the commit message to explain the motivation for this change?

Updated.
```diff
   const auto b_scale = qb_contig.q_scale();
-  Tensor qy = at::_empty_affine_quantized(
+  Tensor qy = at::new_qtensor_cpu(
```
Could you use `at::native::empty_affine_quantized`?

Why do you think that would be better? Does it get around the dispatch overhead?

It is better because it has the same API as `at::_empty_affine_quantized`. Yes, it will get around the dispatch overhead.

Sure. Let me try.

This is still slower, but not by a whole lot. So I think it can make for a decent compromise.
From the flamegraph, it seems we are spending 40% of the time going through the dispatch stack. I think in a quantized model, where compute can take less time, such overheads become noticeable. Differential Revision: [D21093840](https://our.internmc.facebook.com/intern/diff/D21093840/) [ghstack-poisoned]
```cpp
#include <ATen/native/quantized/cpu/quantized_ops.h>
#include <ATen/native/quantized/cpu/init_qnnpack.h>
#include <ATen/native/quantized/cpu/qnnpack_utils.h>
#include <c10/core/TensorOptions.h>
```

can these includes be removed?
```cpp
#include <ATen/quantized/Quantizer.h>
#include <ATen/native/quantized/cpu/fbgemm_utils.h>
#include <ATen/native/quantized/cpu/qnnpack_utils.h>
#include <c10/core/TensorOptions.h>
```

can these includes be removed?

Yes, I was gonna do that.
|
This pull request has been merged in 1510bdd.
Summary: Pull Request resolved: pytorch#36814. ghstack-source-id: 103218412. From the flamegraph, it seems we are spending 40% of the time going through the dispatch stack. I think in a quantized model, where compute can take less time, such overheads become noticeable. Test Plan: Quantized op tests. Reviewed By: jerryzh168. Differential Revision: D21093840. fbshipit-source-id: 1b98b57eae403353596fc31171069d2f43b13385